first-commit

2026-05-04 14:58:14 -04:00
commit a46764fb1b
1210 changed files with 233231 additions and 0 deletions
@@ -0,0 +1,363 @@
+# Font Discipline for PPTX Exports
+
+Companion to `layout-discipline.md`. The rail / cursor primitives in that
+file catch geometric drift; this file catches the typography drift that
+geometry can't see — variable-font traps, missing CJK slots, fake italic
+on Han characters. These are the bugs that pass `verify_layout.py` and
+still look wrong.
+
+Read this when:
+
+- The audit table has 🟡 entries about italic / em / font fallback.
+- PowerPoint silently swaps to Calibri / Arial / Microsoft JhengHei /
+  Georgia after you specified a different family.
+- The `unzip` + `grep typeface` audit in Layer 4 shows a face that
+  isn't in your design system.
+
+## Layer 1 — Font mapping in the export script
+
+Walk each CSS class used by the source HTML and confirm the export
+script maps it to the **same** font family.
+
+⚠️ **Trap:** the visual category your eye reads is not always the
+class's semantic category. Editorial decks routinely bind `.lead`,
+`.callout`, or `.q-big` to a serif face, not the sans-serif you'd guess
+from "lead". Open the HTML's CSS, read the `font-family` declaration
+for each class, and copy the literal family name into the export's
+font table.
+
+Don't rely on visual intuition; rely on grep.
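+
+The grep step can be scripted. A minimal sketch, assuming the deck's
+CSS uses simple `.class { ... }` rules; `class_font_families` is a
+hypothetical helper, not part of the export script:
+
+```python
+import re
+
+# Matches simple `.class { ... }` rules. Nested rules, grouped selectors
+# and @media blocks are not handled in this sketch.
+RULE_RE = re.compile(r'\.([\w-]+)\s*\{([^}]*)\}')
+FAMILY_RE = re.compile(r'font-family\s*:\s*([^;]+)')
+
+
+def class_font_families(css: str) -> dict[str, str]:
+    """Map each class to the first family in its font-family declaration."""
+    result = {}
+    for cls, body in RULE_RE.findall(css):
+        m = FAMILY_RE.search(body)
+        if m:
+            first = m.group(1).split(',')[0].strip().strip('"\'')
+            result[cls] = first
+    return result
+
+
+css = ".lead { font-family: 'Playfair Display', serif; } .body { font-family: Inter, sans-serif; }"
+print(class_font_families(css))
+```
+
+Comparing this mapping against the export script's font table makes a
+serif-bound `.lead` hard to miss.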
+
+> **Coverage gap for Latin-slot scripts (Cyrillic / Greek / Vietnamese).**
+> Russian / Ukrainian / Greek runs go through `<a:latin>`, not `<a:ea>` —
+> they use the Latin slot. Many display fonts (Playfair Display, Source
+> Serif 4) ship with weak or missing Cyrillic / Greek glyphs, and most
+> drop Vietnamese Extended diacritics (ếẫỡỗ). PowerPoint silently falls
+> back to Calibri / Times New Roman per missing glyph, producing
+> mid-paragraph face shifts that look like a styling bug.
+>
+> When mapping a CSS class to a Latin font, check the font actually
+> covers your scripts:
+>
+> ```bash
+> # macOS / Linux (needs fontconfig): list the code point ranges a font covers
+> fc-query -f '%{charset}\n' "$(fc-match -f '%{file}\n' 'Playfair Display')" | head
+> ```
+>
+> ```powershell
+> # Windows: PowerShell + System.Drawing reads the installed family list.
+> # This confirms the family is registered; it does not report coverage.
+> Add-Type -AssemblyName System.Drawing
+> $fonts = New-Object System.Drawing.Text.InstalledFontCollection
+> $fonts.Families | Where-Object { $_.Name -like '*Playfair*' }
+> # Coverage detail (Unicode ranges) is best read in fontforge:
+> # File → Open → pick the .ttf / .otf → Element → Font Info → OS/2 → Unicode Ranges.
+> ```
+>
+> Cross-platform fallback: open the font in fontforge → Element → Font Info → OS/2 → Unicode Ranges.
+>
+> If coverage is missing, either swap to a face that has it (e.g.
+> Inter / IBM Plex Sans for Cyrillic; Be Vietnam Pro for Vietnamese) or
+> set a different `<a:latin>` per language run.
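+
+The coverage check in the note above can also be run per string. A
+minimal sketch: `missing_chars` is a hypothetical helper that compares
+text against a set of covered code points. How that set is obtained is
+left open; one option, assuming fontTools is installed, is
+`set(TTFont(path).getBestCmap())`.
+
+```python
+def missing_chars(text: str, covered: set[int]) -> list[str]:
+    """Return the distinct non-whitespace characters of `text` not in `covered`."""
+    seen = []
+    for ch in text:
+        if ch.isspace() or ord(ch) in covered or ch in seen:
+            continue
+        seen.append(ch)
+    return seen
+
+
+# Toy coverage set: Basic Latin only, a stand-in for a real font's cmap.
+basic_latin = set(range(0x20, 0x7F))
+print(missing_chars("Привет, world", basic_latin))
+```
+
+A non-empty result for the deck's actual copy means PowerPoint will
+fall back for those characters.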
+
+## Layer 2 — Font presence on the rendering machine
+
+PowerPoint uses the OS font cache. If the family name in your XML isn't
+installed, PowerPoint silently falls back. Check:
+
+```bash
+fc-list | grep -i "noto serif"            # Linux / WSL
+mdfind "kMDItemFSName == '*NotoSerif*'"   # macOS
+```
+
+```powershell
+# Windows (PowerShell)
+Get-ChildItem -Path "$env:WINDIR\Fonts","$env:LOCALAPPDATA\Microsoft\Windows\Fonts" `
+  -Filter "*NotoSerif*" -ErrorAction SilentlyContinue
+```
+
+Install missing families:
+
+```bash
+# macOS (Homebrew)
+brew install --cask \
+  font-noto-serif-tc \
+  font-playfair-display \
+  font-source-serif-4 \
+  font-ibm-plex-mono
+```
+
+The `verify_layout.py` script can't see this — it only checks
+geometry. A standalone font audit step is required.
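+
+One possible shape for that audit step, as a sketch. `find_font_files`
+is a hypothetical helper that matches on filenames only, which is a
+heuristic: the family name PowerPoint sees lives inside the font's name
+table, not in the filename.
+
+```python
+from pathlib import Path
+
+FONT_SUFFIXES = {'.ttf', '.otf', '.ttc'}
+
+
+def find_font_files(fragment: str, font_dirs: list[Path]) -> list[Path]:
+    """Return font files whose name contains `fragment` (case and spaces ignored)."""
+    needle = fragment.replace(' ', '').lower()
+    hits = []
+    for d in font_dirs:
+        if not d.is_dir():
+            continue  # skip directories that don't exist on this platform
+        for f in sorted(d.rglob('*')):
+            name = f.name.replace(' ', '').lower()
+            if f.suffix.lower() in FONT_SUFFIXES and needle in name:
+                hits.append(f)
+    return hits
+```
+
+On macOS a call might look like
+`find_font_files('Noto Serif', [Path.home() / 'Library/Fonts', Path('/Library/Fonts')])`;
+an empty list means the family needs installing before export.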
+
+## Layer 3 — Variable fonts vs. static families ← most common trap
+
+Modern fonts often ship as a **single variable file** containing all
+weights (`NotoSerifTC[wght].ttf`). It looks elegant, but PowerPoint on
+macOS and Windows has spotty support for it:
+
+- macOS reports the variable font's family name as its **default static
+  instance** — usually ExtraLight or Regular.
+- PowerPoint asks the OS for "Noto Serif TC, weight 700"; the OS
+  reports the family as `Noto Serif TC ExtraLight`; PowerPoint can't
+  match → falls back to a system serif.
+
+Diagnose:
+
+```bash
+ls -la ~/Library/Fonts/ | grep -i NotoSerif
+```
+
+| What you see                           | Verdict                                 |
+| -------------------------------------- | --------------------------------------- |
+| One `*[wght].ttf` file                 | Variable. PowerPoint may not match.     |
+| Multiple `*-Regular.otf`, `*-Bold.otf` | Static family. Safe.                    |
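+
+The same verdict can be sketched in code. `classify_font_file` is a
+hypothetical helper and the filename patterns are assumptions based on
+common naming; the authoritative signal is whether the font file
+contains an `fvar` table.
+
+```python
+import re
+
+# Bracketed axis list in the filename, e.g. "[wght]" or "[opsz,wght]".
+AXIS_TAG_RE = re.compile(r'\[[A-Za-z0-9,]+\]')
+
+
+def classify_font_file(filename: str) -> str:
+    """Return 'variable' or 'static' from filename conventions alone."""
+    if AXIS_TAG_RE.search(filename) or 'variablefont' in filename.lower():
+        return 'variable'
+    return 'static'
+
+
+for name in ('NotoSerifTC[wght].ttf', 'NotoSerifCJKtc-Bold.otf'):
+    print(name, '->', classify_font_file(name))
+```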
+
+Fix by using the static family equivalent:
+
+| Don't use (variable)        | Use instead (static)              |
+| --------------------------- | --------------------------------- |
+| `Noto Serif TC` (variable)  | `Noto Serif CJK TC`               |
+| `Source Serif 4` (variable) | `Source Serif Pro` / `Source Serif 4` static instances |
+| `Inter` (variable)          | Per-weight `Inter Regular` / `Inter Bold` |
+
+After fixing the export, re-run `extract_pptx.py` and confirm the
+`font` field matches the static name.
+
+## Layer 4 — PPTX XML's three-language slots
+
+PowerPoint chooses a typeface per run by script. Each run can declare
+three font elements:
+
+| Element                 | Used for                         |
+| ----------------------- | -------------------------------- |
+| `<a:latin typeface=…>`  | Latin script (a-z, A-Z, digits), plus Cyrillic and Greek |
+| `<a:ea typeface=…>`     | East Asian (CJK): **Chinese / Japanese / Korean go here** |
+| `<a:cs typeface=…>`     | Complex script (Arabic, Hebrew, Thai, Devanagari) |
+
+Audit a file:
+
+```bash
+unzip -o /path/to/deck.pptx -d /tmp/audit
+grep -h -oE 'typeface="[^"]+"' /tmp/audit/ppt/slides/slide*.xml | sort -u
+```
+
+Expected output: only the design-system fonts. If you see
+`Microsoft JhengHei`, `Calibri`, `Arial`, `Georgia`, `Consolas`,
+something has fallen back.
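+
+The same audit without leaving Python, as a sketch. `audit_typefaces`
+and the `allowed` set are hypothetical names; theme references such as
+`+mn-lt` will also show up as offenders unless you add them to the
+allow-list.
+
+```python
+import re
+import zipfile
+
+TYPEFACE_RE = re.compile(r'typeface="([^"]+)"')
+
+
+def audit_typefaces(pptx_path, allowed: set[str]) -> set[str]:
+    """Return typefaces used in slide XML that are not in `allowed`."""
+    found = set()
+    with zipfile.ZipFile(pptx_path) as z:
+        for name in z.namelist():
+            # Same scope as the grep above: slide parts only.
+            if name.startswith('ppt/slides/slide') and name.endswith('.xml'):
+                xml = z.read(name).decode('utf-8')
+                found.update(TYPEFACE_RE.findall(xml))
+    return found - allowed
+```
+
+An empty set is the passing result; anything else names the faces that
+leaked in.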
+
+**Common defect:** export script writes `<a:latin>` only. Chinese runs
+have no `<a:ea>` directive → PowerPoint picks the OS default
+(Microsoft JhengHei on Windows, Hiragino Sans on Mac). Result: Chinese
+characters in the wrong serif/sans family.
+
+Fix: when adding a run with mixed-language content, set every slot
+(`<a:latin>`, `<a:ea>`, `<a:cs>`) that applies.
+
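+```python
+from pptx.oxml.ns import qn
+
+# Font slots in the order the schema requires inside <a:rPr>, followed by
+# the elements that must come after them.
+FONT_SLOTS = ('a:latin', 'a:ea', 'a:cs')
+AFTER_FONT_SLOTS = ('a:sym', 'a:hlinkClick', 'a:hlinkMouseOver', 'a:rtl', 'a:extLst')
+
+
+def set_run_fonts(run, latin: str | None = None, ea: str | None = None, cs: str | None = None):
+    """Set the typeface of each font slot that is passed; skip the rest."""
+    rPr = run._r.get_or_add_rPr()
+    for i, face in enumerate((latin, ea, cs)):
+        if not face:
+            continue
+        el = rPr.find(qn(FONT_SLOTS[i]))
+        if el is None:
+            el = rPr.makeelement(qn(FONT_SLOTS[i]), {})
+            # Insert before the first later-ordered sibling; a plain append
+            # can leave the slot after e.g. <a:hlinkClick>, out of schema order.
+            later = {qn(t) for t in FONT_SLOTS[i + 1:] + AFTER_FONT_SLOTS}
+            nxt = next((child for child in rPr if child.tag in later), None)
+            if nxt is None:
+                rPr.append(el)
+            else:
+                nxt.addprevious(el)
+        el.set('typeface', face)
+```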
+
+PptxGenJS sets all three by default. python-pptx's `font.name` writes
+only `<a:latin>`, and raw XML injection writes only what you emit, so
+both need an explicit `ea` slot.
+
+## Layer 5 — Italic + script interaction
+
+🚨 **`italic=True` only suits scripts with an italic tradition.** Apply
+it only to runs whose characters belong to such scripts (Latin,
+Cyrillic, Greek). For everything else (CJK, Arabic, Hebrew,
+Devanagari, Thai, Khmer) PowerPoint synthesizes a fake slant that
+looks mechanically deformed. The chain of failures, using CJK as the
+canonical example:
+
+1. `<a:latin>` slot has Playfair Display Italic (a Latin-only font).
+2. The CJK characters in the run have no glyph in Playfair → PowerPoint
+   substitutes a system CJK font.
+3. The substituted CJK font is forced into `italic=True` → since no
+   real CJK italic exists, PowerPoint synthesizes a fake slant →
+   characters look mechanically deformed.
+
+The same pattern triggers for Arabic, Hebrew, Devanagari, and Thai —
+none of these scripts has an italic tradition, and faking it produces
+a slant that's visually broken.
+
+**Rule:** italic only applies to runs whose primary script supports it
+(Latin / Cyrillic / Greek). Indicate emphasis on other scripts via:
+
+- color tone (`COLOR_INK_60` for muted, full ink for emphasis)
+- weight contrast (Regular 400 vs. Bold 700)
+- a script-native italic variant **only if one actually ships** — most
+  don't
+
+Practical implementation:
+
+```python
+from pptx.util import Pt
+
+# set_run_fonts() is the Layer 4 helper defined above.
+
+# Unicode ranges where italic should be suppressed.
+# Principle: include scripts whose writing tradition has no italic style.
+# Synthesized italic on these scripts produces a fake slant that looks
+# mechanically deformed.
+NO_ITALIC_RANGES = (
+    (0x3400, 0x9FFF),    # CJK Ext. A + CJK Unified Ideographs
+    (0xF900, 0xFAFF),    # CJK Compatibility Ideographs
+    (0x3040, 0x30FF),    # Hiragana + Katakana
+    (0xAC00, 0xD7AF),    # Hangul Syllables
+    (0x0590, 0x05FF),    # Hebrew
+    (0x0600, 0x06FF),    # Arabic
+    (0x0750, 0x077F),    # Arabic Supplement
+    # Indic scripts — none have an italic tradition; PowerPoint synthesizes
+    # a fake slant on all of them. Add new ranges here when the deck mixes
+    # in additional scripts (e.g. Sinhala U+0D80–U+0DFF).
+    (0x0900, 0x097F),    # Devanagari (Hindi, Marathi, Sanskrit)
+    (0x0980, 0x09FF),    # Bengali
+    (0x0A00, 0x0A7F),    # Gurmukhi (Punjabi)
+    (0x0A80, 0x0AFF),    # Gujarati
+    (0x0B00, 0x0B7F),    # Oriya
+    (0x0B80, 0x0BFF),    # Tamil
+    (0x0C00, 0x0C7F),    # Telugu
+    (0x0C80, 0x0CFF),    # Kannada
+    (0x0D00, 0x0D7F),    # Malayalam
+    # Southeast Asian
+    (0x0E00, 0x0E7F),    # Thai
+    (0x0E80, 0x0EFF),    # Lao
+    (0x1780, 0x17FF),    # Khmer
+)
+
+
+def has_no_italic_script(text: str) -> bool:
+    return any(
+        any(lo <= ord(c) <= hi for lo, hi in NO_ITALIC_RANGES)
+        for c in text
+    )
+
+
+def add_run_with_italic_safety(p, text, *, latin_face: str, ea_face: str,
+                               cs_face: str | None, size_pt: int,
+                               italic: bool, **kwargs):
+    """Drop italic if the run contains characters from scripts without italic tradition.
+
+    Args:
+        latin_face: Font for Latin / Cyrillic / Greek runs (a:latin slot).
+        ea_face: Font for CJK runs (a:ea slot).
+        cs_face: Font for complex scripts — Arabic, Hebrew, Devanagari,
+            Thai, etc. (a:cs slot). Pass None when the run contains no
+            complex-script characters; set_run_fonts skips the slot.
+    """
+    r = p.add_run()
+    r.text = text
+    r.font.size = Pt(size_pt)
+    r.font.italic = italic and not has_no_italic_script(text)
+    set_run_fonts(r, latin=latin_face, ea=ea_face, cs=cs_face)
+    return r
+```
+
+For mixed-script runs (e.g. `"In <em>2026</em> 開始"`), split into
+multiple runs at language boundaries so the italic attribute can apply
+to the Latin run only.
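+
+A sketch of that split. `split_by_italic_support` is a hypothetical
+helper, and its range table is an abbreviated stand-in for
+`NO_ITALIC_RANGES`; whitespace is kept with the segment before it.
+
+```python
+# Abbreviated stand-in for NO_ITALIC_RANGES (kana, CJK, Hangul only).
+NO_ITALIC = ((0x3040, 0x30FF), (0x3400, 0x9FFF), (0xAC00, 0xD7AF))
+
+
+def _no_italic(ch: str) -> bool:
+    return any(lo <= ord(ch) <= hi for lo, hi in NO_ITALIC)
+
+
+def split_by_italic_support(text: str) -> list[tuple[str, bool]]:
+    """Split text into (segment, italic_allowed) pairs at script boundaries."""
+    segments = []
+    for ch in text:
+        allowed = not _no_italic(ch)
+        if segments and (ch.isspace() or segments[-1][1] == allowed):
+            # Same script class, or whitespace: extend the current segment.
+            segments[-1] = (segments[-1][0] + ch, segments[-1][1])
+        else:
+            segments.append((ch, allowed))
+    return segments
+
+
+print(split_by_italic_support("In 2026 開始"))
+```
+
+Each pair then becomes one `add_run_with_italic_safety` call, with
+italic requested only where the second element is `True`.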
+
+## Beyond CJK — other scripts
+
+The five layers above are written in CJK examples because that's the
+most common pairing in Open Design today, but the same machinery
+applies to other scripts. Quick reference:
+
+| Script family            | XML slot   | Italic OK? | Most common defect                                                                  | Recommended faces                                |
+| ------------------------ | ---------- | ---------- | ----------------------------------------------------------------------------------- | ------------------------------------------------ |
+| Latin (en, de, es, vi…)  | `a:latin`  | ✅          | Vietnamese Extended diacritics dropped → fallback Calibri mid-paragraph             | Be Vietnam Pro, IBM Plex Sans, Source Sans 3     |
+| Cyrillic (ru, uk, bg)    | `a:latin`  | ✅          | Display fonts (Playfair, Source Serif) lack Cyrillic → fallback Calibri             | Inter, IBM Plex Sans, Roboto                     |
+| Greek (el)               | `a:latin`  | ✅          | Same as Cyrillic — display faces missing Greek → fallback                           | Inter, IBM Plex Sans                             |
+| CJK (zh, ja, ko)         | `a:ea`     | ❌          | Variable-font trap (Layer 3); missing `a:ea` slot → fallback Microsoft JhengHei     | Noto Sans CJK *, Source Han Sans, IBM Plex Sans JP |
+| Arabic / Hebrew / Persian | `a:cs`    | ❌          | `<a:rtl val="1"/>` not set → text direction breaks; kashida changes width           | Noto Naskh Arabic, IBM Plex Sans Arabic, Amiri   |
+| Devanagari / Bengali     | `a:cs`     | ❌          | PowerPoint defaults to Mangal/Vrinda (low fidelity); cluster shaping bumps line height | Noto Sans Devanagari, Mukta, Hind             |
+| Thai / Lao / Khmer       | `a:cs`     | ❌          | No inter-word spaces → PowerPoint's break engine produces poor wraps; tone marks bump line height | Noto Sans Thai, Sarabun, Noto Sans Khmer  |
+
+For RTL scripts (Arabic / Hebrew / Persian), set both `<a:cs typeface=…>`
+and `<a:rtl val="1"/>` on the run's `rPr`. Right-alignment, bidi text
+flow, and chrome / footer mirroring are out of scope for `verify_layout.py`
+today and need manual review; see the RTL discipline scope note
+below.
+
+> **RTL discipline scope.** Full RTL support is roughly 15–20% of the
+> font + layout discipline surface area: Unicode UAX #9 bidi resolution,
+> chrome / footer / page-number mirroring, kashida (Arabic
+> elongation) interaction with line-fill, and right-anchored
+> alignment. This skill covers the typeface + slot mechanics only;
+> bidi and mirroring are flagged for a Tier 2 `rtl-discipline.md`
+> follow-up when fa / ar / he usage volume justifies the investment.
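+
+Until that follow-up exists, a coarse check can at least flag slides
+that carry RTL text with no direction marker. A sketch, assuming slide
+XML stores the text as literal UTF-8 characters; `rtl_missing` is a
+hypothetical helper, and it accepts either the run-level `<a:rtl>`
+element or the paragraph-level `rtl="1"` attribute on `<a:pPr>`.
+
+```python
+import re
+
+# Hebrew and Arabic blocks, mirroring the ranges in NO_ITALIC_RANGES.
+RTL_CHAR_RE = re.compile('[\u0590-\u05FF\u0600-\u06FF\u0750-\u077F]')
+# Run-level <a:rtl .../> element or paragraph-level rtl="1" attribute.
+RTL_MARK_RE = re.compile(r'<a:rtl\b|<a:pPr\b[^>]*\brtl="1"')
+
+
+def rtl_missing(slide_xml: str) -> bool:
+    """True if the slide holds Hebrew / Arabic text but no RTL marker at all."""
+    has_rtl_text = bool(RTL_CHAR_RE.search(slide_xml))
+    return has_rtl_text and not RTL_MARK_RE.search(slide_xml)
+```
+
+This only detects a total absence of markers; it does not confirm that
+every RTL run is marked, and `<a:rtl val="0"/>` still counts as a hit.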
+
+## Line height per script
+
+The `Cursor.take(gap=Inches(0.12))` default suits 14pt Latin body copy.
+Other scripts need more vertical headroom because of stacked diacritics,
+matras, or tone marks:
+
+| Script                                   | Recommended `gap` at 14pt body   |
+| ---------------------------------------- | -------------------------------- |
+| Latin (no Vietnamese Extended)           | `Inches(0.12)` (default)         |
+| Latin (with Vietnamese Extended ếẫỗ)     | `Inches(0.14)`                   |
+| CJK                                      | `Inches(0.14)` to `Inches(0.16)` |
+| Devanagari / Bengali (matras / conjuncts)| `Inches(0.16)` to `Inches(0.18)` |
+| Thai / Lao / Khmer (tone marks above)    | `Inches(0.16)` to `Inches(0.18)` |
+| Arabic / Hebrew                          | `Inches(0.13)`                   |
+
+When the deck mixes scripts, take the largest gap among them. Line
+breathing-room is visual: an under-spaced Thai run in an otherwise
+Latin deck reads as "the Thai slide is broken".
+
+> **Source for these numbers.** Measured against Noto Sans / Noto
+> Serif / IBM Plex line-height at 14pt body with full diacritic stacks
+> (e.g. Devanagari conjuncts ष्ट्र, Thai 4-mark sequences ก़ํ้, stacked
+> Vietnamese ỗ). Adjust downward for condensed faces (Inter Condensed,
+> Noto Sans Condensed) and upward for display sizes ≥ 24pt where
+> diacritic ratios grow.
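+
+Picking the gap for a mixed deck can be sketched as a lookup plus
+`max`. The script keys and `gap_for_scripts` are hypothetical names,
+and where the table gives a range this sketch assumes the upper bound.
+
+```python
+# Gap in inches at 14pt body, from the table above. Ranges are resolved
+# to their upper bound, which is an assumption of this sketch.
+GAP_BY_SCRIPT = {
+    'latin': 0.12,
+    'vietnamese': 0.14,
+    'cjk': 0.16,
+    'devanagari': 0.18,
+    'bengali': 0.18,
+    'thai': 0.18,
+    'lao': 0.18,
+    'khmer': 0.18,
+    'arabic': 0.13,
+    'hebrew': 0.13,
+}
+
+
+def gap_for_scripts(scripts: set[str]) -> float:
+    """Return the largest recommended gap (inches) among the scripts present."""
+    if not scripts:
+        return GAP_BY_SCRIPT['latin']
+    return max(GAP_BY_SCRIPT[s] for s in scripts)
+
+
+print(gap_for_scripts({'latin', 'thai'}))
+```
+
+The result would be wrapped in `Inches(...)` before being passed as the
+`gap` argument of `Cursor.take`.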
+
+## Audit checklist
+
+After re-export, confirm all five layers:
+
+- [ ] Layer 1: Each CSS class in the HTML maps to the intended family
+      in the export script's font table.
+- [ ] Layer 2: All declared families exist on the rendering machine
+      (`fc-list | grep`).
+- [ ] Layer 3: No variable-font filename pretending to be a static
+      family. `~/Library/Fonts/` shows multi-file static families for
+      every face used.
+- [ ] Layer 4: `unzip + grep typeface` returns only the design-system
+      fonts. No `Microsoft JhengHei` / `Calibri` / `Arial` / `Georgia`
+      / `Consolas` residue.
+- [ ] Layer 5: No run from a no-italic script (CJK / Arabic / Hebrew /
+      Devanagari / Thai) has `italic=True` set with a Latin italic
+      face in the `<a:latin>` slot.
+- [ ] **Beyond CJK:** RTL slides set `<a:rtl val="1"/>` on each RTL
+      run's `rPr` (paragraph-level direction is the `rtl="1"`
+      attribute on `<a:pPr>`). Verify with:
+
+      ```bash
+      unzip -o deck.pptx -d /tmp/audit
+      grep -l -E '<a:rtl|<a:pPr[^>]* rtl="1"' /tmp/audit/ppt/slides/*.xml
+      # Expect every fa / ar / he slide to be listed; empty output on
+      # an RTL deck means the directionality wasn't propagated.
+      ```
+
+- [ ] **Line height:** Cursor `gap` is bumped per the line-height
+      table above when the deck includes Vietnamese, Devanagari, Thai,
+      or Khmer content.
+
+If all five pass and the user still reports "the type looks wrong",
+ask for a screenshot pointing at the specific glyph or word — the
+remaining bugs are usually license-restricted fonts not embedded into
+the file (see `SKILL.md` Step 5 verification).