# Transcript Guide
## How Transcripts Are Generated
`hyperframes transcribe` handles both transcription and format conversion:
```bash
# Transcribe audio/video (uses whisper.cpp locally, no API key needed)
npx hyperframes transcribe audio.mp3
# Use a larger model for better accuracy
npx hyperframes transcribe audio.mp3 --model medium.en
# Filter to English only (skips non-English speech)
npx hyperframes transcribe audio.mp3 --language en
# Import an existing transcript from another tool
npx hyperframes transcribe captions.srt
npx hyperframes transcribe captions.vtt
npx hyperframes transcribe openai-response.json
```
## Supported Input Formats
The CLI auto-detects and normalizes these formats:
| Format | Extension | Source | Word-level? |
| --------------------- | --------- | --------------------------------------------------------------------------- | ----------------- |
| whisper.cpp JSON | `.json` | `hyperframes init --video`, `hyperframes transcribe` | Yes |
| OpenAI Whisper API | `.json` | `openai.audio.transcriptions.create({ timestamp_granularities: ["word"] })` | Yes |
| SRT subtitles | `.srt` | Video editors, subtitle tools, YouTube | No (phrase-level) |
| VTT subtitles | `.vtt` | Web players, YouTube, transcription services | No (phrase-level) |
| Normalized word array | `.json` | Pre-processed by any tool | Yes |
**Word-level timestamps produce better captions.** SRT/VTT give phrase-level timing, which works but can't do per-word animation effects.
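For reference, a normalized word array is just a flat JSON list of timed words. A minimal sketch of the shape assumed by the cleaning snippet later in this guide (`text`, `start`, `end` in seconds; the actual schema hyperframes emits may carry extra fields):

```json
[
  { "text": "Hello", "start": 0.32, "end": 0.61 },
  { "text": "world", "start": 0.66, "end": 1.04 }
]
```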
## Whisper Model Guide
The default model (`small.en`) balances accuracy and speed. The table lists the multilingual base models; `tiny` through `medium` also come in English-only `.en` variants. For better results, use a larger model:
| Model | Size | Speed | Accuracy | When to use |
| ---------- | ------ | -------- | --------- | ------------------------------------- |
| `tiny` | 75 MB | Fastest | Low | Quick previews, testing pipeline |
| `base` | 142 MB | Fast | Fair | Short clips, clear audio |
| `small` | 466 MB | Moderate | Good | **Default** — good for most content |
| `medium` | 1.5 GB | Slow | Very good | Important content, noisy audio, music |
| `large-v3` | 3.1 GB | Slowest | Best | Production quality |
**Critical: `.en` models translate non-English audio into English** instead of transcribing it. Only add the `.en` suffix when the user explicitly says the audio is English; `.en` models are slightly more accurate on English speech. If the audio might not be English, use a model without the `.en` suffix and pass `--language` to specify the source language. If you're unsure of the language, use `small` (not `small.en`) with no `--language` flag and whisper will auto-detect.
```bash
# Spanish audio
npx hyperframes transcribe audio.mp3 --model small --language es
# Unknown language — let whisper auto-detect
npx hyperframes transcribe audio.mp3 --model small
```
**Music and vocals over instrumentation**: `small.en` will misidentify lyrics; use `medium.en` as the minimum, or import lyrics manually. Even `medium.en` struggles with heavily produced tracks. For music videos, providing known lyrics as an SRT/VTT and importing them with `hyperframes transcribe lyrics.srt` will always beat automated transcription (see the example below).
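SRT is plain text and easy to author by hand; a minimal lyrics file looks like this (cue timings are illustrative):

```srt
1
00:00:12,000 --> 00:00:15,500
First line of the chorus

2
00:00:15,900 --> 00:00:19,200
Second line of the chorus
```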
## Transcript Quality Check (Mandatory)
After every transcription, **read the transcript and check for quality issues before proceeding.** Bad transcripts produce nonsensical captions. Never skip this step.
### What to look for
| Signal | Example | Cause |
| ---------------------------- | -------------------------------------- | ---------------------------------------------------------------------------- |
| Music note tokens (`♪`, `<60>`) | `{ "text": "♪" }` or `{ "text": "<22>" }` | Whisper detected music, not speech |
| Garbled / nonsense words | "Do a chin", "Get so gay", "huh" | Model misheard lyrics or background noise |
| Long gaps with no words | 20+ seconds of only `♪` tokens | Instrumental section — expected, but high ratio means speech is being missed |
| Repeated filler | Many "huh", "uh", "oh" entries | Model is hallucinating on music |
| Very short word spans | Words with `end - start < 0.05` | Unreliable timestamp alignment |
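The first two signals can be checked programmatically. A sketch, assuming the `{ text, start, end }` entry shape used by the cleaning snippet below and the 20% threshold from the retry rules:

```js
// Fraction of transcript entries that are music tokens (note glyphs or <NN> markers).
const isMusicToken = (text) =>
  /^[\u266a-\u266f]+$/.test(text.trim()) || /^<\d+>$/.test(text.trim());

const entries = JSON.parse(transcriptJson);
const musicRatio = entries.filter((w) => isMusicToken(w.text ?? "")).length / entries.length;
if (musicRatio > 0.2) {
  console.warn("Transcription likely failed: retry with a larger model");
}
```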
### Automatic retry rules
**If more than 20% of entries are `♪`/`<60>` tokens, or the transcript contains obvious nonsense words, the transcription failed.** Do not proceed with the bad transcript. Instead:
1. **Retry with `medium.en`** if the original used `small.en` or smaller:
   ```bash
   npx hyperframes transcribe audio.mp3 --model medium.en
   ```
2. **If `medium.en` also fails** (still >20% music tokens or garbled), tell the user the audio is too noisy for local transcription and suggest:
   - Providing lyrics manually as an SRT/VTT file
   - Using an external API (OpenAI or Groq Whisper — see below)
3. **Always clean the transcript** before building captions — filter out `♪`/`<60>` tokens and entries where `text` is a single non-word character. Only real words should reach the caption composition.
### Cleaning a transcript
After transcription (even with a good model), strip non-word entries:
```js
const raw = JSON.parse(transcriptJson);
const words = raw.filter((w) => {
  const text = (w.text ?? "").trim();
  if (text.length === 0) return false;
  // Numeric music markers such as <22> or <60>
  if (/^<\d+>$/.test(text)) return false;
  // Music notes (♪, ♫, ...) and any other token with no letters or digits
  if (/^[^\p{L}\p{N}]+$/u.test(text)) return false;
  // Filler hallucinations with implausibly short timestamps
  if (/^(huh|uh|um|ah|oh)$/i.test(text) && w.end - w.start < 0.1) return false;
  return true;
});
```
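To run the cleanup as a standalone step, wrap it in file I/O. The file names here are illustrative, not a hyperframes convention:

```js
import { readFileSync, writeFileSync } from "node:fs";

const transcriptJson = readFileSync("transcript.json", "utf8");
// ...apply the filter above to produce `words`...
writeFileSync("transcript-clean.json", JSON.stringify(words, null, 2));
```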
### When to use which model (decision tree)
1. **Is this speech over silence/light background?** → `small.en` is fine
2. **Is this speech over music, or music with vocals?** → Start with `medium.en`
3. **Is this a produced music track (vocals + full instrumentation)?** → Start with `medium.en`, expect to need manual lyrics or an external API
4. **Is this multilingual?** → Use `medium` or `large-v3` (no `.en` suffix)
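The same tree as a tiny helper function, a sketch only (the category labels are this guide's, not CLI flags):

```js
// Map a content category to a starting whisper model, mirroring the tree above.
function pickModel(content) {
  switch (content) {
    case "speech":        return "small.en";  // clear speech, known English
    case "speech+music":  return "medium.en"; // speech over music or vocals
    case "music":         return "medium.en"; // expect manual lyrics or an API
    case "multilingual":  return "medium";    // no .en suffix
    default:              return "small";     // unknown language: auto-detect
  }
}
```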
## Using External Transcription APIs
For the best accuracy, use an external API and import the result:
**OpenAI Whisper API** (recommended for quality):
```bash
# Generate with word timestamps, then import
curl https://api.openai.com/v1/audio/transcriptions \
  -H "Authorization: Bearer $OPENAI_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-1 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-openai.json
npx hyperframes transcribe transcript-openai.json
```
**Groq Whisper API** (fast, free tier available):
```bash
curl https://api.groq.com/openai/v1/audio/transcriptions \
  -H "Authorization: Bearer $GROQ_API_KEY" \
  -F file=@audio.mp3 -F model=whisper-large-v3 \
  -F response_format=verbose_json \
  -F "timestamp_granularities[]=word" \
  -o transcript-groq.json
npx hyperframes transcribe transcript-groq.json
```
## If No Transcript Exists
1. Check the project root for `transcript.json`, `.srt`, or `.vtt` files
2. If none found, run transcription — pick the starting model based on the content type:
   - Speech/voiceover → `small.en`
   - Music with vocals → `medium.en`

   ```bash
   npx hyperframes transcribe <audio-or-video-file> --model medium.en
   ```
3. **Read the transcript and run the quality check** (see above). If it fails, retry with a larger model or suggest manual lyrics.