
# Transcript Guide

## How Transcripts Are Generated

`hyperframes transcribe` handles both transcription and format conversion:

```bash
# Transcribe audio/video (uses whisper.cpp locally, no API key needed)
npx hyperframes transcribe audio.mp3

# Use a larger model for better accuracy
npx hyperframes transcribe audio.mp3 --model medium.en

# Filter to English only (skips non-English speech)
npx hyperframes transcribe audio.mp3 --language en

# Import an existing transcript from another tool
npx hyperframes transcribe captions.srt
npx hyperframes transcribe captions.vtt
npx hyperframes transcribe openai-response.json
```

## Supported Input Formats

The CLI auto-detects and normalizes these formats:

| Format | Extension | Source | Word-level? |
| --- | --- | --- | --- |
| whisper.cpp JSON | `.json` | `hyperframes init --video`, `hyperframes transcribe` | Yes |
| OpenAI Whisper API | `.json` | `openai.audio.transcriptions.create({ timestamp_granularities: ["word"] })` | Yes |
| SRT subtitles | `.srt` | Video editors, subtitle tools, YouTube | No (phrase-level) |
| VTT subtitles | `.vtt` | Web players, YouTube, transcription services | No (phrase-level) |
| Normalized word array | `.json` | Pre-processed by any tool | Yes |

Word-level timestamps produce better captions. SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.
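
The normalized word array schema isn't spelled out above, but judging from the fields the cleaning snippet later in this guide reads (`text` plus `start`/`end` in seconds), a minimal file plausibly looks like this (values illustrative):

```json
[
  { "text": "welcome", "start": 0.0, "end": 0.42 },
  { "text": "back", "start": 0.42, "end": 0.71 }
]
```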

## Whisper Model Guide

The default model (`small.en`) balances accuracy and speed. For better results, use a larger model:

| Model | Size | Speed | Accuracy | When to use |
| --- | --- | --- | --- | --- |
| `tiny` | 75 MB | Fastest | Low | Quick previews, testing the pipeline |
| `base` | 142 MB | Fast | Fair | Short clips, clear audio |
| `small` | 466 MB | Moderate | Good | Default; good for most content |
| `medium` | 1.5 GB | Slow | Very good | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Best | Production quality |

**Critical:** only add the `.en` suffix when the user explicitly says the audio is English. `.en` models are slightly more accurate on English, but they translate non-English audio into English instead of transcribing it. If the audio might not be English, always use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) without `--language`; whisper will auto-detect.

```bash
# Spanish audio
npx hyperframes transcribe audio.mp3 --model small --language es

# Unknown language: let whisper auto-detect
npx hyperframes transcribe audio.mp3 --model small
```

Music and vocals over instrumentation: `small.en` will misidentify lyrics, so use `medium.en` at minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks; for music videos, providing known lyrics as an SRT/VTT and importing them with `hyperframes transcribe lyrics.srt` will always beat automated transcription.
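
For reference, SRT is plain text: a cue number, a `start --> end` timecode line, then the text. A hand-written lyrics file (timecodes illustrative) looks like:

```
1
00:00:12,000 --> 00:00:15,500
First lyric line

2
00:00:15,500 --> 00:00:19,000
Second lyric line
```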

## Transcript Quality Check (Mandatory)

After every transcription, read the transcript and check for quality issues before proceeding. Bad transcripts produce nonsensical captions. Never skip this step.

### What to look for

| Signal | Example | Cause |
| --- | --- | --- |
| Music note tokens (`♪`, `�`) | `{ "text": "♪" }` or `{ "text": "�" }` | Whisper detected music, not speech |
| Garbled / nonsense words | "Do a chin", "Get so gay", "huh" | Model misheard lyrics or background noise |
| Long gaps with no words | 20+ seconds of only `♪` tokens | Instrumental section; expected, but a high ratio means speech is being missed |
| Repeated filler | Many "huh", "uh", "oh" entries | Model is hallucinating on music |
| Very short word spans | Words with `end - start < 0.05` | Unreliable timestamp alignment |

### Automatic retry rules

If more than 20% of entries are ♪/� tokens, or the transcript contains obvious nonsense words, the transcription failed (a programmatic version of this check is sketched after the list below). Do not proceed with the bad transcript. Instead:

1. Retry with `medium.en` if the original used `small.en` or smaller:

   ```bash
   npx hyperframes transcribe audio.mp3 --model medium.en
   ```

2. If `medium.en` also fails (still >20% music tokens or garbled), tell the user the audio is too noisy for local transcription and suggest:
   - Providing lyrics manually as an SRT/VTT file
   - Using an external API (OpenAI or Groq Whisper; see below)
3. Always clean the transcript before building captions: filter out ♪/� tokens and entries where `text` is a single non-word character. Only real words should reach the caption composition.
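
A minimal sketch of the 20% check (the transcript path is an assumption; adjust to your project):

```js
const fs = require("fs");

// Count entries that are only music notes or replacement characters.
const entries = JSON.parse(fs.readFileSync("transcript.json", "utf8"));
const junk = entries.filter((w) =>
  /^[\u266a\u266b\u266c\ufffd]+$/.test((w.text || "").trim())
).length;

if (junk / entries.length > 0.2) {
  console.error("More than 20% music/garbage tokens: retry with a larger model.");
}
```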

### Cleaning a transcript

After transcription (even with a good model), strip non-word entries:

```js
const fs = require("fs");

// Load the normalized word array (the path is assumed; adjust to your project).
const raw = JSON.parse(fs.readFileSync("transcript.json", "utf8"));
const words = raw.filter((w) => {
  if (!w.text || w.text.trim().length === 0) return false; // empty entries
  if (/^[\u266a\u266b\u266c\u266d\u266e\u266f\ufffd]+$/.test(w.text)) return false; // music notes / � tokens
  if (/^(huh|uh|um|ah|oh)$/i.test(w.text) && w.end - w.start < 0.1) return false; // hallucinated filler
  return true;
});
```
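
Continuing the snippet above, write the cleaned array back out so downstream caption steps read only real words (the output filename is illustrative):

```js
// Persist the cleaned word list; caption composition should read this file.
fs.writeFileSync("transcript.clean.json", JSON.stringify(words, null, 2));
```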

## When to use which model (decision tree)

1. Is this speech over silence/light background? → `small.en` is fine
2. Is this speech over music, or music with vocals? → Start with `medium.en`
3. Is this a produced music track (vocals + full instrumentation)? → Start with `medium.en`, expect to need manual lyrics or an external API
4. Is this multilingual? → Use `medium` or `large-v3` (no `.en` suffix)
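
The same tree as a tiny helper (a sketch only; the flags are hypothetical inputs, not CLI options):

```js
// Pick a starting whisper model from what is known about the audio.
function pickModel({ englishConfirmed, hasMusic, isProducedTrack, multilingual }) {
  if (multilingual) return "medium"; // or "large-v3"; never a .en model
  const suffix = englishConfirmed ? ".en" : ""; // .en only when English is confirmed
  if (isProducedTrack || hasMusic) return "medium" + suffix; // expect manual lyrics for produced tracks
  return "small" + suffix; // plain speech over silence/light background
}
```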

## Using External Transcription APIs

For the best accuracy, use an external API and import the result:

**OpenAI Whisper API** (recommended for quality):

```bash
# Generate with word timestamps, then import
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-1 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-openai.json

npx hyperframes transcribe transcript-openai.json
```
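
With `timestamp_granularities[]=word`, the `verbose_json` response carries a `words` array alongside the full text, roughly in this shape (values illustrative); this is what the importer reads:

```json
{
  "text": "Welcome back to the channel",
  "words": [
    { "word": "Welcome", "start": 0.0, "end": 0.38 },
    { "word": "back", "start": 0.38, "end": 0.66 }
  ]
}
```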

**Groq Whisper API** (fast, free tier available):

```bash
curl https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-groq.json

npx hyperframes transcribe transcript-groq.json
```

## If No Transcript Exists

1. Check the project root for `transcript.json`, `.srt`, or `.vtt` files
2. If none found, run transcription, picking the starting model by content type (a helper sketch follows this list):
   - Speech/voiceover → `small.en`
   - Music with vocals → `medium.en`

   ```bash
   npx hyperframes transcribe <audio-or-video-file> --model medium.en
   ```

3. Read the transcript and run the quality check (see above). If it fails, retry with a larger model or suggest manual lyrics.
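
Steps 1 and 2 as a small Node helper (a sketch; the input filename and the music flag are placeholders you'd fill in):

```js
const fs = require("fs");
const { execSync } = require("child_process");

// Step 1: look for an existing transcript in the project root.
const existing = fs
  .readdirSync(".")
  .find((f) => f === "transcript.json" || /\.(srt|vtt)$/i.test(f));

if (!existing) {
  // Step 2: none found; transcribe, picking the model by content type.
  const musicWithVocals = true; // placeholder: set from what you know about the audio
  const model = musicWithVocals ? "medium.en" : "small.en";
  execSync(`npx hyperframes transcribe input.mp4 --model ${model}`, { stdio: "inherit" });
}
```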