# Transcript Guide

## How Transcripts Are Generated
`hyperframes transcribe` handles both transcription and format conversion:

```bash
# Transcribe audio/video (uses whisper.cpp locally, no API key needed)
npx hyperframes transcribe audio.mp3

# Use a larger model for better accuracy
npx hyperframes transcribe audio.mp3 --model medium.en

# Filter to English only (skips non-English speech)
npx hyperframes transcribe audio.mp3 --language en

# Import an existing transcript from another tool
npx hyperframes transcribe captions.srt
npx hyperframes transcribe captions.vtt
npx hyperframes transcribe openai-response.json
```
## Supported Input Formats

The CLI auto-detects and normalizes these formats:
| Format | Extension | Source | Word-level? |
|---|---|---|---|
| whisper.cpp JSON | `.json` | `hyperframes init --video`, `hyperframes transcribe` | Yes |
| OpenAI Whisper API | `.json` | `openai.audio.transcriptions.create({ timestamp_granularities: ["word"] })` | Yes |
| SRT subtitles | `.srt` | Video editors, subtitle tools, YouTube | No (phrase-level) |
| VTT subtitles | `.vtt` | Web players, YouTube, transcription services | No (phrase-level) |
| Normalized word array | `.json` | Pre-processed by any tool | Yes |
Word-level timestamps produce better captions. SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.
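For reference, a normalized word array is a flat JSON list of word objects. The exact key names below (`text`, `start`, `end`, with times in seconds) are inferred from the cleaning snippet later in this guide, not from a formal schema:

```json
[
  { "text": "hello", "start": 0.12, "end": 0.38 },
  { "text": "world", "start": 0.41, "end": 0.77 }
]
```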
## Whisper Model Guide

The default model (`small.en`) balances accuracy and speed. For better results, use a larger model:
| Model | Size | Speed | Accuracy | When to use |
|---|---|---|---|---|
| `tiny` | 75 MB | Fastest | Low | Quick previews, testing the pipeline |
| `base` | 142 MB | Fast | Fair | Short clips, clear audio |
| `small` | 466 MB | Moderate | Good | Default; good for most content |
| `medium` | 1.5 GB | Slow | Very good | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Best | Production quality |
**Critical:** add the `.en` suffix only when the user explicitly says the audio is English. `.en` models are slightly more accurate for English, but they translate non-English audio into English instead of transcribing it. If the audio might not be English, always use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) without `--language`; whisper will auto-detect.
```bash
# Spanish audio
npx hyperframes transcribe audio.mp3 --model small --language es

# Unknown language: let whisper auto-detect
npx hyperframes transcribe audio.mp3 --model small
```
**Music and vocals over instrumentation:** `small.en` will misidentify lyrics; use `medium.en` as the minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks; for music videos, providing known lyrics as an SRT/VTT file and importing them with `hyperframes transcribe lyrics.srt` will always beat automated transcription.
## Transcript Quality Check (Mandatory)
After every transcription, read the transcript and check for quality issues before proceeding. Bad transcripts produce nonsensical captions. Never skip this step.
### What to look for
| Signal | Example | Cause |
|---|---|---|
| Music note tokens (♪, �) | `{ "text": "♪" }` or `{ "text": "�" }` | Whisper detected music, not speech |
| Garbled / nonsense words | "Do a chin", "Get so gay", "huh" | Model misheard lyrics or background noise |
| Long gaps with no words | 20+ seconds of only ♪ tokens | Instrumental section; expected, but a high ratio means speech is being missed |
| Repeated filler | Many "huh", "uh", "oh" entries | Model is hallucinating on music |
| Very short word spans | Words with `end - start < 0.05` (seconds) | Unreliable timestamp alignment |
### Automatic retry rules

If more than 20% of entries are ♪/� tokens, or the transcript contains obvious nonsense words, the transcription failed (a sketch of this ratio check follows this list). Do not proceed with the bad transcript. Instead:

- Retry with `medium.en` if the original used `small.en` or smaller: `npx hyperframes transcribe audio.mp3 --model medium.en`
- If `medium.en` also fails (still >20% music tokens or garbled words), tell the user the audio is too noisy for local transcription and suggest:
  - Providing lyrics manually as an SRT/VTT file
  - Using an external API (OpenAI or Groq Whisper; see below)
- Always clean the transcript before building captions: filter out ♪/� tokens and entries where `text` is a single non-word character. Only real words should reach the caption composition.
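A minimal sketch of the 20% ratio check, assuming the normalized word-array shape (`{ text, start, end }`) used elsewhere in this guide; the `transcript.json` path and the warning text are illustrative:

```js
const fs = require("node:fs");

const entries = JSON.parse(fs.readFileSync("transcript.json", "utf8"));
// Count music-note (U+266A..U+266F), replacement-character, and angle-bracket debris tokens
const junk = entries.filter((w) => /^[\u266a-\u266f\uFFFD<>]+$/.test(w.text || "")).length;

if (entries.length === 0 || junk / entries.length > 0.2) {
  // Over the 20% threshold: treat the run as failed and retry with a larger model
  console.warn("Quality check failed; retry with --model medium.en or larger");
}
```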
### Cleaning a transcript

After transcription (even with a good model), strip non-word entries. The snippet below assumes the normalized transcript lives at `transcript.json`:
```js
const fs = require("node:fs");

const raw = JSON.parse(fs.readFileSync("transcript.json", "utf8"));
const words = raw.filter(function (w) {
  // Drop empty entries
  if (!w.text || w.text.trim().length === 0) return false;
  // Drop music-note (U+266A..U+266F), replacement-character, and angle-bracket tokens
  if (/^[\u266a-\u266f\uFFFD<>]+$/.test(w.text)) return false;
  // Drop sub-100 ms fillers, which are usually hallucinated over music
  if (/^(huh|uh|um|ah|oh)$/i.test(w.text) && w.end - w.start < 0.1) return false;
  return true;
});
```
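Continuing from the snippet above, persist the cleaned array so the caption step reads only real words; the output filename is just an example:

```js
// Hypothetical output path; point this wherever your caption step expects
fs.writeFileSync("transcript.clean.json", JSON.stringify(words, null, 2));
```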
### When to use which model (decision tree)

- Is this speech over silence/light background? → `small.en` is fine
- Is this speech over music, or music with vocals? → Start with `medium.en`
- Is this a produced music track (vocals + full instrumentation)? → Start with `medium.en`; expect to need manual lyrics or an external API
- Is this multilingual? → Use `medium` or `large-v3` (no `.en` suffix)
## Using External Transcription APIs

For the best accuracy, use an external API and import the result.

**OpenAI Whisper API** (recommended for quality):
```bash
# Generate with word timestamps, then import
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-1 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-openai.json

npx hyperframes transcribe transcript-openai.json
```
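The same request can be made with the official `openai` Node SDK; a sketch, assuming `OPENAI_API_KEY` is set in the environment:

```js
const fs = require("node:fs");
const OpenAI = require("openai");

const client = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function main() {
  const transcript = await client.audio.transcriptions.create({
    file: fs.createReadStream("audio.mp3"),
    model: "whisper-1",
    response_format: "verbose_json",
    timestamp_granularities: ["word"],
  });
  // Save the verbose JSON, then import: npx hyperframes transcribe transcript-openai.json
  fs.writeFileSync("transcript-openai.json", JSON.stringify(transcript));
}

main();
```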
**Groq Whisper API** (fast, free tier available):

```bash
curl https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-groq.json

npx hyperframes transcribe transcript-groq.json
```
## If No Transcript Exists

- Check the project root for `transcript.json`, `.srt`, or `.vtt` files (see the directory-scan sketch below)
- If none found, run transcription; pick the starting model based on the content type:
  - Speech/voiceover → `small.en`
  - Music with vocals → `medium.en`: `npx hyperframes transcribe <audio-or-video-file> --model medium.en`
- Read the transcript and run the quality check (see above). If it fails, retry with a larger model or suggest manual lyrics.
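A quick directory scan for existing transcripts, as a sketch (the candidate names mirror the list above):

```js
const fs = require("node:fs");

// Look for transcript.json or any .srt/.vtt file in the project root
const candidates = fs
  .readdirSync(".")
  .filter((f) => f === "transcript.json" || /\.(srt|vtt)$/i.test(f));

console.log(candidates.length ? candidates : "No transcript found; run transcription");
```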