# Transcript Guide
## How Transcripts Are Generated
`hyperframes transcribe` handles both transcription and format conversion:
```bash
# Transcribe audio/video (uses whisper.cpp locally, no API key needed)
npx hyperframes transcribe audio.mp3
# Use a larger model for better accuracy
npx hyperframes transcribe audio.mp3 --model medium.en
# Filter to English only (skips non-English speech)
npx hyperframes transcribe audio.mp3 --language en
# Import an existing transcript from another tool
npx hyperframes transcribe captions.srt
npx hyperframes transcribe captions.vtt
npx hyperframes transcribe openai-response.json
```
## Supported Input Formats
The CLI auto-detects and normalizes these formats:
| Format | Extension | Source | Word-level? |
| --------------------- | --------- | --------------------------------------------------------------------------- | ----------------- |
| whisper.cpp JSON | `.json` | `hyperframes init --video`, `hyperframes transcribe` | Yes |
| OpenAI Whisper API | `.json` | `openai.audio.transcriptions.create({ timestamp_granularities: ["word"] })` | Yes |
| SRT subtitles | `.srt` | Video editors, subtitle tools, YouTube | No (phrase-level) |
| VTT subtitles | `.vtt` | Web players, YouTube, transcription services | No (phrase-level) |
| Normalized word array | `.json` | Pre-processed by any tool | Yes |
**Word-level timestamps produce better captions.** SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.
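For reference, a normalized word array is just a flat JSON list of timed words. A minimal sketch of the shape assumed by the cleaning snippet later in this guide (`text`, `start`, `end` in seconds; the actual schema hyperframes emits may carry extra fields):

```json
[
  { "text": "Hello", "start": 0.32, "end": 0.61 },
  { "text": "world", "start": 0.66, "end": 1.04 }
]
```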
## Whisper Model Guide
The default model (`small.en`) balances accuracy and speed. The table lists the multilingual base models; `tiny` through `medium` also come in English-only `.en` variants. For better results, use a larger model:
| Model | Size | Speed | Accuracy | When to use |
| ---------- | ------ | -------- | --------- | ------------------------------------- |
| `tiny` | 75 MB | Fastest | Low | Quick previews, testing pipeline |
| `base` | 142 MB | Fast | Fair | Short clips, clear audio |
| `small` | 466 MB | Moderate | Good | **Default** — good for most content |
| `medium` | 1.5 GB | Slow | Very good | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Best | Production quality |
**Critical: `.en` models translate non-English audio into English** instead of transcribing it. Only add the `.en` suffix when the user explicitly says the audio is English; `.en` models are slightly more accurate on English speech. If the audio might not be English, use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) with no `--language` flag and whisper will auto-detect.
```bash
# Spanish audio
npx hyperframes transcribe audio.mp3 --model small --language es
# Unknown language — let whisper auto-detect
npx hyperframes transcribe audio.mp3 --model small
```
**Music and vocals over instrumentation**: `small.en` will misidentify lyrics; use `medium.en` as the minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks. For music videos, providing known lyrics as an SRT/VTT and importing them with `hyperframes transcribe lyrics.srt` will always beat automated transcription (see the example below).
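SRT is plain text and easy to author by hand; a minimal lyrics file looks like this (cue timings are illustrative):

```srt
1
00:00:12,000 --> 00:00:15,500
First line of the chorus

2
00:00:15,900 --> 00:00:19,200
Second line of the chorus
```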
## Transcript Quality Check (Mandatory)
After every transcription, **read the transcript and check for quality issues before proceeding.** Bad transcripts produce nonsensical captions. Never skip this step.
### What to look for
| Signal | Example | Cause |
| ---------------------------- | -------------------------------------- | ---------------------------------------------------------------------------- |
| Music note tokens (`♪`, `<60>`) | `{ "text": "♪" }` or `{ "text": "<22>" }` | Whisper detected music, not speech |
| Garbled / nonsense words | "Do a chin", "Get so gay", "huh" | Model misheard lyrics or background noise |
| Long gaps with no words | 20+ seconds of only `♪` tokens | Instrumental section — expected, but high ratio means speech is being missed |
| Repeated filler | Many "huh", "uh", "oh" entries | Model is hallucinating on music |
| Very short word spans | Words with `end - start < 0.05` | Unreliable timestamp alignment |
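The first two signals can be checked programmatically. A sketch, assuming the `{ text, start, end }` entry shape used by the cleaning snippet below and the 20% threshold from the retry rules:

```js
// Fraction of transcript entries that are music tokens (note glyphs or <NN> markers).
const isMusicToken = (text) =>
  /^[\u266a-\u266f]+$/.test(text.trim()) || /^<\d+>$/.test(text.trim());

const entries = JSON.parse(transcriptJson);
const musicRatio = entries.filter((w) => isMusicToken(w.text ?? "")).length / entries.length;
if (musicRatio > 0.2) {
  console.warn("Transcription likely failed: retry with a larger model");
}
```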
### Automatic retry rules
**If more than 20% of entries are `♪`/`<60>` tokens, or the transcript contains obvious nonsense words, the transcription failed.** Do not proceed with the bad transcript. Instead:
1. **Retry with `medium.en`** if the original used `small.en` or smaller:
   ```bash
   npx hyperframes transcribe audio.mp3 --model medium.en
   ```
2. **If `medium.en` also fails** (still >20% music tokens or garbled), tell the user the audio is too noisy for local transcription and suggest:
   - Providing lyrics manually as an SRT/VTT file
   - Using an external API (OpenAI or Groq Whisper — see below)
3. **Always clean the transcript** before building captions — filter out `♪`/`<60>` tokens and entries where `text` is a single non-word character. Only real words should reach the caption composition.
### Cleaning a transcript
After transcription (even with a good model), strip non-word entries:
```js
const raw = JSON.parse(transcriptJson);
const words = raw.filter((w) => {
  const text = (w.text ?? "").trim();
  if (text.length === 0) return false;
  // Numeric music markers such as <22> or <60>
  if (/^<\d+>$/.test(text)) return false;
  // Music notes (♪, ♫, ...) and any other token with no letters or digits
  if (/^[^\p{L}\p{N}]+$/u.test(text)) return false;
  // Filler hallucinations with implausibly short timestamps
  if (/^(huh|uh|um|ah|oh)$/i.test(text) && w.end - w.start < 0.1) return false;
  return true;
});
```
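To run the cleanup as a standalone step, wrap it in file I/O. The file names here are illustrative, not a hyperframes convention:

```js
import { readFileSync, writeFileSync } from "node:fs";

const transcriptJson = readFileSync("transcript.json", "utf8");
// ...apply the filter above to produce `words`...
writeFileSync("transcript-clean.json", JSON.stringify(words, null, 2));
```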
### When to use which model (decision tree)
1. **Is this speech over silence/light background?** → `small.en` is fine
2. **Is this speech over music, or music with vocals?** → Start with `medium.en`
3. **Is this a produced music track (vocals + full instrumentation)?** → Start with `medium.en`, expect to need manual lyrics or an external API
4. **Is this multilingual?** → Use `medium` or `large-v3` (no `.en` suffix)
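The same tree as a tiny helper function, a sketch only (the category labels are this guide's, not CLI flags):

```js
// Map a content category to a starting whisper model, mirroring the tree above.
function pickModel(content) {
  switch (content) {
    case "speech":        return "small.en";  // clear speech, known English
    case "speech+music":  return "medium.en"; // speech over music or vocals
    case "music":         return "medium.en"; // expect manual lyrics or an API
    case "multilingual":  return "medium";    // no .en suffix
    default:              return "small";     // unknown language: auto-detect
  }
}
```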
## Using External Transcription APIs
For the best accuracy, use an external API and import the result:
**OpenAI Whisper API** (recommended for quality):
```bash
# Generate with word timestamps, then import
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-1 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-openai.json
npx hyperframes transcribe transcript-openai.json
```
**Groq Whisper API** (fast, free tier available):
```bash
curl https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-groq.json
npx hyperframes transcribe transcript-groq.json
```
## If No Transcript Exists
1. Check the project root for `transcript.json`, `.srt`, or `.vtt` files
2. If none found, run transcription — pick the starting model based on the content type:
   - Speech/voiceover → `small.en`
   - Music with vocals → `medium.en`

   ```bash
   npx hyperframes transcribe <audio-or-video-file> --model medium.en
   ```
3. **Read the transcript and run the quality check** (see above). If it fails, retry with a larger model or suggest manual lyrics.