# Font Discipline for PPTX Exports

Companion to `layout-discipline.md`. The rail / cursor primitives in that
file catch geometric drift; this file catches the typography drift that
geometry can't see — variable-font traps, missing CJK slots, fake italic
on Han characters. These are the bugs that pass `verify_layout.py` and
still look wrong.

Read this when:

- The audit table has 🟡 entries about italic / em / font fallback.
- PowerPoint silently swaps to Calibri / Arial / Microsoft JhengHei /
  Georgia after you specified a different family.
- `unzip pptx | grep typeface` shows a face that isn't in your design system.

## Layer 1 — Font mapping in the export script

Walk each CSS class used by the source HTML and confirm the export
script maps it to the **same** font family.

⚠️ **Trap:** the visual category your eye reads is not always the
class's semantic category. Editorial decks routinely bind `.lead`,
`.callout`, or `.q-big` to a serif face, not the sans-serif you'd guess
from "lead". Open the HTML's CSS, read the `font-family` declaration
for each class, and copy the literal family name into the export's
font table.

Don't rely on visual intuition; rely on grep.

> **Coverage gap for Latin-slot scripts (Cyrillic / Greek / Vietnamese).**
> Russian / Ukrainian / Greek runs go through `<a:latin>`, not `<a:ea>` —
> they use the Latin slot. Many display fonts (Playfair Display, Source
> Serif 4) ship with weak or missing Cyrillic / Greek glyphs, and most
> drop Vietnamese Extended diacritics (ếẫỡỗ). PowerPoint silently falls
> back to Calibri / Times New Roman per missing glyph, producing
> mid-paragraph face shifts that look like a styling bug.
>
> When mapping a CSS class to a Latin font, check the font actually
> covers your scripts:
>
> ```bash
> # macOS / Linux: list the unicode blocks a font supports
> fc-query -f '%{charset}\n' "$(fc-match -f '%{file}\n' 'Playfair Display')" | head
> ```
>
> ```powershell
> # Windows: PowerShell + System.Drawing reads the registered family list
> [System.Reflection.Assembly]::LoadWithPartialName("System.Drawing") | Out-Null
> $f = New-Object System.Drawing.Text.PrivateFontCollection
> # Coverage detail (Unicode ranges) is best read in fontforge:
> # File → Open → pick the .ttf / .otf → Element → Font Info → OS/2 → Unicode Ranges.
> ```
>
> Cross-platform fallback: open the font in fontforge → Element → Font Info → OS/2 → Unicode Ranges.
>
> If coverage is missing, either swap to a face that has it (e.g.
> Inter / IBM Plex Sans for Cyrillic; Be Vietnam Pro for Vietnamese) or
> set a different `<a:latin>` per language run.

## Layer 2 — Font presence on the rendering machine

PowerPoint uses the OS font cache. If the family name in your XML isn't
installed, PowerPoint silently falls back. Check:

```bash
fc-list | grep -i "noto serif"            # Linux / WSL
mdfind "kMDItemFSName == '*NotoSerif*'"   # macOS
```

```powershell
# Windows (PowerShell)
Get-ChildItem -Path "$env:WINDIR\Fonts","$env:LOCALAPPDATA\Microsoft\Windows\Fonts" `
  -Filter "*NotoSerif*" -ErrorAction SilentlyContinue
```

Install missing families:

```bash
brew install --cask \
  font-noto-serif-tc \
  font-playfair-display \
  font-source-serif-4 \
  font-ibm-plex-mono
```

The `verify_layout.py` script can't see this — it only checks
geometry. A standalone font audit step is required.

## Layer 3 — Variable fonts vs. static families ← most common trap

Modern fonts often ship as a **single variable file** containing all
weights (`NotoSerifTC[wght].ttf`). Looks elegant, but PowerPoint Mac /
Windows have spotty support:

- macOS reports the variable font's family name as its **default static
  instance** — usually ExtraLight or Regular.
- PowerPoint asks the OS for "Noto Serif TC, weight 700"; the OS
  reports the family as `Noto Serif TC ExtraLight`; PowerPoint can't
  match → falls back to a system serif.

Diagnose:

```bash
ls -la ~/Library/Fonts/ | grep -i NotoSerif
```

| What you see                           | Verdict                                 |
| -------------------------------------- | --------------------------------------- |
| One `*[wght].ttf` file                 | Variable. PowerPoint may not match.     |
| Multiple `*-Regular.otf`, `*-Bold.otf` | Static family. Safe.                    |

Fix by using the static family equivalent:

| Don't use (variable)        | Use instead (static)              |
| --------------------------- | --------------------------------- |
| `Noto Serif TC` (variable)  | `Noto Serif CJK TC`               |
| `Source Serif 4` (variable) | `Source Serif Pro` / `Source Serif 4` static instances |
| `Inter` (variable)          | Per-weight `Inter Regular` / `Inter Bold` |

After fixing the export, re-run `extract_pptx.py` and confirm the
`font` field matches the static name.

## Layer 4 — PPTX XML's three-language slots

PowerPoint chooses a typeface per run by language script. Each run can
declare three:

| Attribute               | Used for                         |
| ----------------------- | -------------------------------- |
| `<a:latin typeface=…>`  | Latin script (a-z, A-Z, digits)  |
| `<a:ea typeface=…>`     | East Asian (CJK) — **Chinese / Japanese / Korean go here** |
| `<a:cs typeface=…>`     | Complex script (Arabic, Hebrew, Thai) |

Audit a file:

```bash
unzip -o /path/to/deck.pptx -d /tmp/audit
grep -h -oE 'typeface="[^"]+"' /tmp/audit/ppt/slides/slide*.xml | sort -u
```

Expected output: only the design-system fonts. If you see
`Microsoft JhengHei`, `Calibri`, `Arial`, `Georgia`, `Consolas`,
something has fallen back.

**Common defect:** export script writes `<a:latin>` only. Chinese runs
have no `<a:ea>` directive → PowerPoint picks the OS default
(Microsoft JhengHei on Windows, Hiragino Sans on Mac). Result: Chinese
characters in the wrong serif/sans family.

Fix: when adding a run with mixed-language content, set all three
attributes that apply.

```python
from pptx.oxml.ns import qn

def set_run_fonts(run, latin: str | None = None, ea: str | None = None, cs: str | None = None):
    rPr = run._r.get_or_add_rPr()
    if latin:
        el = rPr.find(qn('a:latin'))
        if el is None:
            el = rPr.makeelement(qn('a:latin'), {})
            rPr.append(el)
        el.set('typeface', latin)
    if ea:
        el = rPr.find(qn('a:ea'))
        if el is None:
            el = rPr.makeelement(qn('a:ea'), {})
            rPr.append(el)
        el.set('typeface', ea)
    if cs:
        el = rPr.find(qn('a:cs'))
        if el is None:
            el = rPr.makeelement(qn('a:cs'), {})
            rPr.append(el)
        el.set('typeface', cs)
```

PptxGenJS sets all three by default; raw XML injection or python-pptx
without explicit `ea` slot does not.

## Layer 5 — Italic + script interaction

🚨 **`italic=True` is a Latin-script feature.** Apply it only to runs
whose characters belong to scripts where italic is part of the writing
tradition (Latin, Cyrillic, Greek). For everything else — CJK, Arabic,
Hebrew, Devanagari, Thai, Khmer — PowerPoint synthesizes a slanted
bitmap that looks mechanically deformed. The chain of failures, using
CJK as the canonical example:

1. `<a:latin>` slot has Playfair Display Italic (a Latin-only font).
2. The CJK characters in the run have no glyph in Playfair → PowerPoint
   substitutes a system CJK font.
3. The substituted CJK font is forced into `italic=True` → since no
   real CJK italic exists, PowerPoint synthesizes a slanted bitmap →
   characters look mechanically deformed.

The same pattern triggers for Arabic, Hebrew, Devanagari, and Thai —
none of these scripts has an italic tradition, and faking it produces
a slant that's visually broken.

**Rule:** italic only applies to runs whose primary script supports it
(Latin / Cyrillic / Greek). Indicate emphasis on other scripts via:

- color tone (`COLOR_INK_60` for muted, full ink for emphasis)
- weight contrast (Regular 400 vs. Bold 700)
- a script-native italic variant **only if one actually ships** — most
  don't

Practical implementation:

```python
# Unicode ranges where italic should be suppressed.
# Principle: include scripts whose writing tradition has no italic style.
# Synthesized italic on these scripts produces a slanted bitmap that looks
# mechanically deformed.
NO_ITALIC_RANGES = (
    (0x3400, 0x9FFF),    # CJK Unified Ideographs
    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
    (0x3040, 0x30FF),    # Hiragana + Katakana
    (0xAC00, 0xD7AF),    # Hangul Syllables
    (0x0590, 0x05FF),    # Hebrew
    (0x0600, 0x06FF),    # Arabic
    (0x0750, 0x077F),    # Arabic Supplement
    # Indic scripts — none have an italic tradition; PowerPoint synthesizes
    # a fake slant on all of them. Add new ranges here when the deck mixes
    # in additional scripts (e.g. Sinhala U+0D80–U+0DFF).
    (0x0900, 0x097F),    # Devanagari (Hindi, Marathi, Sanskrit)
    (0x0980, 0x09FF),    # Bengali
    (0x0A00, 0x0A7F),    # Gurmukhi (Punjabi)
    (0x0A80, 0x0AFF),    # Gujarati
    (0x0B00, 0x0B7F),    # Oriya
    (0x0B80, 0x0BFF),    # Tamil
    (0x0C00, 0x0C7F),    # Telugu
    (0x0C80, 0x0CFF),    # Kannada
    (0x0D00, 0x0D7F),    # Malayalam
    # Southeast Asian
    (0x0E00, 0x0E7F),    # Thai
    (0x0E80, 0x0EFF),    # Lao
    (0x1780, 0x17FF),    # Khmer
)


def has_no_italic_script(text: str) -> bool:
    return any(
        any(lo <= ord(c) <= hi for lo, hi in NO_ITALIC_RANGES)
        for c in text
    )


def add_run_with_italic_safety(p, text, *, latin_face: str, ea_face: str,
                               cs_face: str | None, size_pt: int,
                               italic: bool, **kwargs):
    """Drop italic if the run contains characters from scripts without italic tradition.

    Args:
        latin_face: Font for Latin / Cyrillic / Greek runs (a:latin slot).
        ea_face: Font for CJK runs (a:ea slot).
        cs_face: Font for complex scripts — Arabic, Hebrew, Devanagari,
            Thai, etc. (a:cs slot). Pass None when the run contains no
            complex-script characters; set_run_fonts skips the slot.
    """
    r = p.add_run()
    r.text = text
    r.font.size = Pt(size_pt)
    r.font.italic = italic and not has_no_italic_script(text)
    set_run_fonts(r, latin=latin_face, ea=ea_face, cs=cs_face)
    return r
```

For mixed-script runs (e.g. `"In <em>2026</em> 開始"`), split into
multiple runs at language boundaries so the italic attribute can apply
to the Latin run only.

## Beyond CJK — other scripts

The five layers above are written in CJK examples because that's the
most common pairing in Open Design today, but the same machinery
applies to other scripts. Quick reference:

| Script family            | XML slot   | Italic OK? | Most common defect                                                                  | Recommended faces                                |
| ------------------------ | ---------- | ---------- | ----------------------------------------------------------------------------------- | ------------------------------------------------ |
| Latin (en, de, es, vi…)  | `a:latin`  | ✅          | Vietnamese Extended diacritics dropped → fallback Calibri mid-paragraph             | Be Vietnam Pro, IBM Plex Sans, Source Sans 3     |
| Cyrillic (ru, uk, bg)    | `a:latin`  | ✅          | Display fonts (Playfair, Source Serif) lack Cyrillic → fallback Calibri             | Inter, IBM Plex Sans, Roboto                     |
| Greek (el)               | `a:latin`  | ✅          | Same as Cyrillic — display faces missing Greek → fallback                           | Inter, IBM Plex Sans                             |
| CJK (zh, ja, ko)         | `a:ea`     | ❌          | Variable-font trap (Layer 3); missing `a:ea` slot → fallback Microsoft JhengHei     | Noto Sans CJK *, Source Han Sans, IBM Plex Sans JP |
| Arabic / Hebrew / Persian | `a:cs`    | ❌          | `<a:rtl val="1"/>` not set → text direction breaks; kashida changes width           | Noto Naskh Arabic, IBM Plex Sans Arabic, Amiri   |
| Devanagari / Bengali     | `a:cs`     | ❌          | PowerPoint defaults to Mangal/Vrinda (low fidelity); cluster shaping bumps line height | Noto Sans Devanagari, Mukta, Hind             |
| Thai / Lao / Khmer       | `a:cs`     | ❌          | No inter-word spaces → PowerPoint's break engine produces poor wraps; tone marks bump line height | Noto Sans Thai, Sarabun, Noto Sans Khmer  |

For RTL scripts (Arabic / Hebrew / Persian), set both `<a:cs typeface=…>`
and `<a:rtl val="1"/>` on the run's `rPr`. Right-alignment, bidi text
flow, and chrome / footer mirroring are out of scope for `verify_layout.py`
today and need manual review — see the Tier 2 follow-up note in the
audit checklist.

> **RTL discipline scope.** Full RTL support is roughly 15–20% of the
> font + layout discipline surface area: Unicode TR9 bidi resolution,
> chrome / footer / page-number mirroring, kashida (Arabic
> elongation) interaction with line-fill, and right-anchored
> alignment. This skill covers the typeface + slot mechanics only;
> bidi and mirroring are flagged for a Tier 2 `rtl-discipline.md`
> follow-up when fa / ar / he usage volume justifies the investment.

## Line height per script

The `Cursor.take(gap=Inches(0.12))` default suits 14pt Latin body copy.
Other scripts need more vertical headroom because of stacked diacritics,
matras, or tone marks:

| Script                                   | Recommended `gap` at 14pt body |
| ---------------------------------------- | ------------------------------ |
| Latin (no Vietnamese Extended)           | `Inches(0.12)` (default)       |
| Latin (with Vietnamese Extended ếẫỗ)     | `Inches(0.14)`                 |
| CJK                                      | `Inches(0.14–0.16)`            |
| Devanagari / Bengali (matras / conjuncts)| `Inches(0.16–0.18)`            |
| Thai / Lao / Khmer (tone marks above)    | `Inches(0.16–0.18)`            |
| Arabic / Hebrew                          | `Inches(0.13)`                 |

When the deck mixes scripts, take the max — line breathing-room is
visual, an under-spaced Thai run in an otherwise Latin deck reads as
"the Thai slide is broken".

> **Source for these numbers.** Measured against Noto Sans / Noto
> Serif / IBM Plex line-height at 14pt body with full diacritic stacks
> (e.g. Devanagari conjuncts ष्ट्र, Thai 4-mark sequences ก़ํ้, stacked
> Vietnamese ỗ). Adjust downward for condensed faces (Inter Condensed,
> Noto Sans Condensed) and upward for display sizes ≥ 24pt where
> diacritic ratios grow.

## Audit checklist

After re-export, confirm all five layers:

- [ ] Layer 1: Each CSS class in the HTML maps to the intended family
      in the export script's font table.
- [ ] Layer 2: All declared families exist on the rendering machine
      (`fc-list | grep`).
- [ ] Layer 3: No variable-font filename pretending to be a static
      family. `~/Library/Fonts/` shows multi-file static families for
      every face used.
- [ ] Layer 4: `unzip + grep typeface` returns only the design-system
      fonts. No `Microsoft JhengHei` / `Calibri` / `Arial` / `Georgia`
      / `Consolas` residue.
- [ ] Layer 5: No run from a no-italic script (CJK / Arabic / Hebrew /
      Devanagari / Thai) has `italic=True` set with a Latin italic
      face in the `<a:latin>` slot.
- [ ] **Beyond CJK:** RTL slides set `<a:rtl val="1"/>` on the
      paragraph's `pPr` — verify with:

      ```bash
      unzip -o deck.pptx -d /tmp/audit
      grep -h '<a:rtl' /tmp/audit/ppt/slides/*.xml | sort -u
      # Expect a hit for every fa / ar / he slide; empty output on
      # an RTL deck means the directionality wasn't propagated.
      ```

      Cursor `gap` is bumped per the line-height table above when the
      deck includes Vietnamese, Devanagari, Thai, or Khmer content.

If all five pass and the user still reports "the type looks wrong",
ask for a screenshot pointing at the specific glyph or word — the
remaining bugs are usually license-restricted fonts not embedded into
the file (see `SKILL.md` Step 5 verification).