Why do Arabic numerals reverse in my spatial PDF output?

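The Unicode Bidirectional Algorithm classifies ASCII digits as weak characters (European Number), so their placement depends on the surrounding strong characters and on the base direction of the paragraph. When the base direction is wrong or unset, numerals embedded in RTL text reorder incorrectly. Set the base direction explicitly on your layout templates (direction: rtl with unicode-bidi: isolate) and set the HarfBuzz buffer direction for each shaped run. Avoid the override values (bidi-override, isolate-override) on text that contains numerals: they force every character, digits included, into the declared direction.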

What font families work best for spatial reports with mixed scripts?

Google Noto (including Noto Sans Arabic, Noto Sans SC/JP/KR, Noto Sans Devanagari) provides harmonised x-heights, consistent optical sizing, and near-complete Unicode coverage, making it the most reliable choice for automated spatial reporting pipelines that produce multi-language PDFs.

How do I reduce PDF file size when embedding multi-script fonts?

Use font subsetting so only glyphs actually used in the document are embedded. WeasyPrint subsets fonts automatically, and ReportLab subsets TrueType fonts registered through TTFont, so neither needs an extra flag. Caching subsetted font binaries across pipeline runs prevents repeated file-system reads during batch generation.

Typography Mapping for Multi-Language Spatial Data

Automated spatial reporting pipelines fail silently when attribute tables contain mixed-script place names, administrative labels, or multilingual metadata. Without an explicit script-to-font resolution layer, spatial PDFs and print outputs accumulate missing glyphs, reversed text flow, and unpredictable line breaks that corrupt downstream layout engines. This guide builds a production-ready typography mapping pipeline — from Unicode classification through shaped glyph delivery — that sits within the broader Document Architecture & Layout Rules for Spatial Reports framework and guarantees deterministic rendering across Latin, Cyrillic, Arabic, CJK, and Indic scripts.

Prerequisites

Python 3.10+ with uharfbuzz, python-bidi, fonttools, and PyMuPDF installed
A layout renderer that supports programmatic font dictionaries: WeasyPrint, ReportLab, or a Cairo-based stack
Spatial datasets stored in UTF-8, normalised to NFC form; mixed normalisation causes silent glyph substitution failures in shaping engines
Each text field tagged with a BCP 47 language tag (ar, zh-Hans, ru, ja), which combines an ISO 639 language code with an optional ISO 15924 script subtag; if tags are absent, a Unicode block classifier must precede the pipeline
Licensed font families with comprehensive Unicode coverage — Google Noto is the baseline for this guide

Bash

pip install uharfbuzz python-bidi fonttools PyMuPDF weasyprint reportlab
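
The NFC requirement above can be checked with the standard library before any data enters the pipeline. This is a minimal sketch, assuming attribute values are plain Python strings; to_nfc is an illustrative helper name, not part of any library.

```python
import unicodedata


def to_nfc(value: str) -> str:
    """Return the NFC-normalised form of a string."""
    return unicodedata.normalize("NFC", value)


# The same word written two ways: a precomposed "é" versus "e" plus a
# combining acute accent. They look identical but compare unequal.
precomposed = "caf\u00e9"
decomposed = "cafe\u0301"

print(precomposed == decomposed)                     # False
print(to_nfc(precomposed) == to_nfc(decomposed))     # True
print(unicodedata.is_normalized("NFC", decomposed))  # False
```

Running is_normalized over a column first gives a cheap count of how many values would change, which is useful to log before rewriting the data.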

Assumed prior knowledge: how CSS Grid Systems for Report Layouts control column widths that typography overflow can violate, and how Print-Ready Page Sizing Standards for GIS Reports constrain the available text box area across A4, Letter, and custom map sheet formats.

Pipeline Architecture

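The diagram below shows the end-to-end data flow from raw spatial attributes through to embedded font glyphs in the final document. In outline, attributes are classified by script and direction (Step 1), matched to a font through a fallback chain (Step 2), shaped and given an explicit direction (Step 3), rendered by the layout engine (Step 4), and audited for missing glyphs (Step 5).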

Step-by-Step Implementation

Step 1 — Inventory and Classify Language Attributes

Extract every text-bearing field from your spatial dataset (GeoPackage, Shapefile, PostGIS, or flat CSV) and build a classification table that links each feature to its primary script, fallback script, and base directionality (LTR or RTL). This table becomes the routing layer for every subsequent pipeline stage.

Python

import unicodedata
from pathlib import Path
from collections import Counter
import geopandas as gpd

SCRIPT_BLOCK_MAP: dict[str, str] = {
    "LATIN": "Latin",
    "ARABIC": "Arabic",
    "CJK": "CJK",
    "HANGUL": "CJK",
    "HIRAGANA": "CJK",
    "KATAKANA": "CJK",
    "DEVANAGARI": "Devanagari",
    "CYRILLIC": "Cyrillic",
    "HEBREW": "Hebrew",
    "THAI": "Thai",
}

RTL_SCRIPTS: frozenset[str] = frozenset({"Arabic", "Hebrew"})


def detect_primary_script(text: str) -> str:
    """Return the dominant Unicode script for a string."""
    counts: Counter[str] = Counter()
    for ch in text:
        if ch.isspace():
            continue
        block = unicodedata.name(ch, "").split(" ")[0]
        script = SCRIPT_BLOCK_MAP.get(block, "Latin")
        counts[script] += 1
    return counts.most_common(1)[0][0] if counts else "Latin"


def classify_gdf(gdf: gpd.GeoDataFrame, text_cols: list[str]) -> gpd.GeoDataFrame:
    """Add primary_script, fallback_script, and direction columns."""
    all_text = gdf[text_cols].fillna("").apply(
        lambda row: " ".join(str(v) for v in row), axis=1
    )
    gdf = gdf.copy()
    gdf["primary_script"] = all_text.map(detect_primary_script)
    gdf["direction"] = gdf["primary_script"].map(
        lambda s: "rtl" if s in RTL_SCRIPTS else "ltr"
    )
    gdf["fallback_script"] = "Latin"  # default; override per-project if needed
    return gdf

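Attribute strings that exceed expected character counts affect column widths and wrapping behaviour, and the effect differs by script. Recording script and direction at this stage lets later stages pick fonts and wrapping rules before layout, which helps avoid overflow that violates the constraints described in Margin and Bleed Alignment in Automated PDFs.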
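
The classifier above keys on the first word of each character's Unicode name. The short sketch below prints that word for a few sample characters so the behaviour is easier to reason about; name_prefix is an illustrative helper, not a function from the pipeline.

```python
import unicodedata


def name_prefix(ch: str) -> str:
    """Return the first word of a character's Unicode name, or "" if it has none."""
    return unicodedata.name(ch, "").split(" ")[0]


for ch in ["A", "Я", "ب", "北", "1", "/"]:
    print(repr(ch), name_prefix(ch))
# 'A' LATIN
# 'Я' CYRILLIC
# 'ب' ARABIC
# '北' CJK
# '1' DIGIT
# '/' SOLIDUS
```

Digits and punctuation have prefixes that are not in the lookup table, so a classifier that defaults unknown prefixes to Latin counts them as Latin. For short labels that are mostly numerals, such as a road number next to a two-letter Arabic word, it may be worth skipping those characters instead of counting them.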

Step 2 — Build Script-Aware Font Fallback Chains

Font fallback is an ordered resolution strategy, not a single lookup. Each script maps to a primary typeface, secondary fallbacks, and a system-safe default. Define these chains centrally so every pipeline stage draws from the same authoritative source.

Python

from dataclasses import dataclass, field

@dataclass
class FontChain:
    primary: str
    fallbacks: list[str] = field(default_factory=list)
    system_default: str = "sans-serif"

FONT_CHAINS: dict[str, FontChain] = {
    "Latin":      FontChain("Inter",              ["NotoSans-Regular",    "Arial"]),
    "Cyrillic":   FontChain("Inter",              ["NotoSans-Regular",    "Arial"]),
    "Arabic":     FontChain("NotoSansArabic",     ["Arial Unicode MS"]),
    "Hebrew":     FontChain("NotoSansHebrew",     ["Arial Unicode MS"]),
    "CJK":        FontChain("NotoSansSC",         ["NotoSansJP", "NotoSansKR", "SimSun"]),
    "Devanagari": FontChain("NotoSansDevanagari", ["Arial Unicode MS"]),
    "Thai":       FontChain("NotoSansThai",       ["Arial Unicode MS"]),
}


def resolve_font(script: str, glyph: str, font_dir: Path) -> str:
    """Walk the fallback chain until a font covers the required glyph."""
    from fontTools.ttLib import TTFont  # pip package "fonttools", import package "fontTools"

    chain = FONT_CHAINS.get(script, FONT_CHAINS["Latin"])
    candidates = [chain.primary] + chain.fallbacks

    codepoint = ord(glyph)
    for name in candidates:
        font_path = font_dir / f"{name}.ttf"
        if not font_path.exists():
            continue
        tt = TTFont(str(font_path))
        cmap = tt.getBestCmap()
        if cmap and codepoint in cmap:
            return name
    return chain.system_default

In CSS-driven renderers like WeasyPrint, express the same logic with @font-face rules and unicode-range descriptors. Where the renderer honours unicode-range, the engine can skip loading a fallback font when no character on the page falls inside its range, which cuts memory overhead during batch generation. Support for the descriptor varies between renderers and versions, so confirm it against the WeasyPrint release you deploy.
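
One way to keep unicode-range values in step with the data is to derive them from the code points that actually occur in a dataset. The sketch below is an assumption about how that could be done, not part of the pipeline above; to_unicode_range is a hypothetical helper name.

```python
def to_unicode_range(codepoints: set[int]) -> str:
    """Collapse code points into a CSS unicode-range value."""
    ranges: list[tuple[int, int]] = []
    for cp in sorted(codepoints):
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current run
        else:
            ranges.append((cp, cp))           # start a new run
    parts = [
        f"U+{lo:04X}" if lo == hi else f"U+{lo:04X}-{hi:04X}"
        for lo, hi in ranges
    ]
    return ", ".join(parts)


used = {ord(ch) for ch in "ABC" + "\u0628"}
print(to_unicode_range(used))  # U+0041-0043, U+0628
```

Feeding the function the set of characters collected in Step 1 yields a descriptor no wider than the data requires.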

Step 3 — Implement Text Shaping and Directionality

Raw Unicode code points are not rendered text. Shaping engines apply script-specific rules: Arabic requires initial, medial, final, and isolated forms; Devanagari requires conjunct consonants and vowel matras; CJK requires proportional spacing and ideographic punctuation handling. Integrate uharfbuzz to process each classified string before it reaches the layout engine.

Python

import uharfbuzz as hb

# RTL_SCRIPTS is the frozenset defined in Step 1.

SCRIPT_HB_MAP: dict[str, str] = {
    "Arabic":     "arab",
    "Hebrew":     "hebr",
    "Devanagari": "deva",
    "CJK":        "hani",
    "Thai":       "thai",
    "Latin":      "latn",
    "Cyrillic":   "cyrl",
}


def shape_text(text: str, script: str, font_path: str) -> list[tuple[int, tuple[int, int]]]:
    """
    Shape a string with HarfBuzz and return (glyph_id, (x_advance, y_advance)) pairs.

    HarfBuzz expects text in logical (typing) order and emits glyphs in visual
    order for RTL runs. The string is therefore NOT passed through python-bidi's
    get_display() first: pre-reversing it would flip the run twice and break
    Arabic joining. Split mixed-direction strings into single-direction runs
    before calling this function.
    """
    blob = hb.Blob.from_file_path(font_path)
    face = hb.Face(blob)
    font = hb.Font(face)

    buf = hb.Buffer()
    buf.add_str(text)
    buf.guess_segment_properties()
    # uharfbuzz buffer properties take plain strings (ISO 15924 tag, "ltr"/"rtl").
    buf.script = SCRIPT_HB_MAP.get(script, "latn")
    buf.direction = "rtl" if script in RTL_SCRIPTS else "ltr"

    hb.shape(font, buf)

    infos = buf.glyph_infos
    positions = buf.glyph_positions
    # After shaping, info.codepoint holds the glyph ID, not the Unicode code point.
    return [(info.codepoint, (pos.x_advance, pos.y_advance)) for info, pos in zip(infos, positions)]

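Directionality errors cascade into page geometry miscalculations. When RTL text shifts unexpectedly it can push tables into bleed zones or overlap map legends. Set direction explicitly and use unicode-bidi: isolate in your layout templates so each field resolves against its own base direction. Reserve bidi-override and isolate-override for text that is already in visual order, because they also force digits into the declared direction.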
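
The bidirectional class that Unicode assigns to each character explains why numerals behave differently from letters. The following sketch prints those classes with the standard library; the sample characters are chosen only for illustration.

```python
import unicodedata

samples = {
    "ASCII digit": "7",
    "Arabic-Indic digit": "\u0667",
    "Arabic letter": "\u0628",
    "Hebrew letter": "\u05d0",
    "Latin letter": "A",
}

for label, ch in samples.items():
    print(f"{label:20} {unicodedata.bidirectional(ch)}")
# ASCII digit          EN
# Arabic-Indic digit   AN
# Arabic letter        AL
# Hebrew letter        R
# Latin letter         L
```

EN (European Number) and AN (Arabic Number) are weak types, so the algorithm places them according to the neighbouring strong characters (L, R, AL) and the base direction of the paragraph.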

Step 4 — Integrate with Layout and Rendering Engines

Once strings are classified, shaped, and directionally resolved, pass them to the rendering framework. The configuration decisions differ by engine.

WeasyPrint (CSS-driven):

CSS

/* typography-pipeline.css */
@font-face {
  font-family: "SpatialReport";
  src: url("fonts/Inter.woff2") format("woff2");
  unicode-range: U+0000-024F;           /* Latin + Latin Extended */
}
@font-face {
  font-family: "SpatialReport";
  src: url("fonts/NotoSansArabic.woff2") format("woff2");
  unicode-range: U+0600-06FF, U+0750-077F; /* Arabic blocks */
}
@font-face {
  font-family: "SpatialReport";
  src: url("fonts/NotoSansSC.woff2") format("woff2");
  unicode-range: U+4E00-9FFF;           /* CJK Unified Ideographs */
}

[lang="ar"], [dir="rtl"] {
  font-family: "SpatialReport", sans-serif;
  direction: rtl;
  unicode-bidi: isolate;
  text-align: right;
  text-justify: auto;            /* kashida elongation where supported */
}

[lang|="zh"], [lang|="ja"], [lang|="ko"] {
  font-family: "SpatialReport", sans-serif;
  word-break: keep-all;
  line-break: strict;
  text-align: left;              /* avoid CJK character stretching */
}
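
The selectors above match on lang and dir attributes, so the HTML handed to WeasyPrint has to carry them. A minimal sketch of a wrapper that adds them, assuming each value already has a language tag and a direction from Step 1; to_html_span is a hypothetical helper name.

```python
from html import escape


def to_html_span(text: str, lang: str, direction: str) -> str:
    """Wrap a value in a span carrying the lang and dir attributes the CSS selects on."""
    if direction not in ("ltr", "rtl"):
        raise ValueError(f"unexpected direction: {direction!r}")
    return (
        f'<span lang="{escape(lang, quote=True)}" dir="{direction}">'
        f"{escape(text)}</span>"
    )


print(to_html_span("القاهرة 2024", "ar", "rtl"))
print(to_html_span("北京 & Beijing", "zh-Hans", "ltr"))
# <span lang="ar" dir="rtl">القاهرة 2024</span>
# <span lang="zh-Hans" dir="ltr">北京 &amp; Beijing</span>
```

Escaping the text matters for attribute data, where characters such as & and < are common in notes fields and would otherwise corrupt the markup.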

ReportLab (programmatic):

Python

from reportlab.lib.styles import ParagraphStyle
from reportlab.lib.enums import TA_RIGHT, TA_LEFT, TA_JUSTIFY
from reportlab.pdfbase import pdfmetrics
from reportlab.pdfbase.ttfonts import TTFont
from pathlib import Path


def register_font_chain(font_dir: Path) -> None:
    """Register Noto fonts with subsetting enabled."""
    registrations = [
        ("NotoSansArabic", "NotoSansArabic-Regular.ttf"),
        ("NotoSansSC",     "NotoSansSC-Regular.ttf"),
        ("NotoSansDevanagari", "NotoSansDevanagari-Regular.ttf"),
    ]
    for alias, filename in registrations:
        path = font_dir / filename
        if path.exists():
            pdfmetrics.registerFont(TTFont(alias, str(path)))


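def make_paragraph_style(script: str) -> ParagraphStyle:
    """Return a ParagraphStyle tuned for the given script."""
    font_name = FONT_CHAINS.get(script, FONT_CHAINS["Latin"]).primary
    if font_name not in pdfmetrics.getRegisteredFontNames():
        # register_font_chain() only registers the fonts it finds on disk.
        # Fall back to a built-in font rather than failing at build time on an
        # unknown name; Helvetica has no non-Latin glyphs, so the audit in
        # Step 5 will still flag the gap.
        font_name = "Helvetica"
    base = {
        "fontName": font_name,
        "fontSize": 10,
        "leading": 14,
        "spaceAfter": 4,
    }
    if script in RTL_SCRIPTS:
        return ParagraphStyle("RTL", alignment=TA_RIGHT, wordWrap="RTL", **base)
    if script == "CJK":
        return ParagraphStyle("CJK", alignment=TA_LEFT, wordWrap="CJK", **base)
    return ParagraphStyle("Latin", alignment=TA_JUSTIFY, **base)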

Font subsetting embeds only the glyphs present in the document, reducing PDF file size by 60–80% for large multilingual datasets. Both engines handle it without extra configuration: WeasyPrint subsets automatically, and ReportLab subsets TrueType fonts registered through TTFont.

Step 5 — Validate and Gate in CI/CD

Typography mapping requires automated validation before deployment to production.

Python

import fitz  # PyMuPDF
from pathlib import Path
import logging

log = logging.getLogger(__name__)


def audit_glyph_coverage(pdf_path: Path, min_coverage: float = 0.995) -> bool:
    """
    Parse an output PDF and flag any .notdef (tofu) glyphs.
    Returns True when coverage meets the threshold.
    """
    doc = fitz.open(str(pdf_path))
    total = 0
    missing = 0

    for page in doc:
        blocks = page.get_text("rawdict")["blocks"]
        for block in blocks:
            for line in block.get("lines", []):
                for span in line.get("spans", []):
                    for char in span.get("chars", []):
                        total += 1
                        # U+FFFD (replacement character), a literal box, or an empty string.
                        if char.get("c", "") in ("\ufffd", "□", ""):
                            missing += 1
                            log.warning("Tofu glyph on page %d: %r", page.number + 1, char)

    doc.close()
    coverage = (total - missing) / max(total, 1)
    log.info("Glyph coverage: %.2f%% (%d/%d)", coverage * 100, total - missing, total)
    return coverage >= min_coverage

Add audit_glyph_coverage as a CI step that blocks merges when coverage drops below 99.5%. Pair it with a BiDi visual-order check: render a set of canonical test strings (mixed LTR/RTL, numerals embedded in Arabic, Latin headers above CJK body text) and compare against reference screenshots using pixelmatch or Pillow diff.

Production-Ready Script

Python

#!/usr/bin/env python3
"""
typography_pipeline.py — end-to-end typography mapping for multilingual spatial PDFs.
Usage: python typography_pipeline.py --input data/regions.gpkg --output report.pdf
"""
import argparse
import logging
import sys
from pathlib import Path

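import geopandas as gpd
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer
from reportlab.lib.pagesizes import A4

# classify_gdf (Step 1), register_font_chain and make_paragraph_style (Step 4),
# and audit_glyph_coverage (Step 5) must be defined in or imported into this module.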

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger(__name__)

TEXT_COLS = ["name", "admin_label", "notes"]
FONT_DIR = Path("fonts")
OUTPUT_MARGIN = 50


def run(input_path: Path, output_path: Path) -> None:
    gdf = gpd.read_file(str(input_path))
    gdf = classify_gdf(gdf, TEXT_COLS)
    register_font_chain(FONT_DIR)

    doc = SimpleDocTemplate(
        str(output_path),
        pagesize=A4,
        leftMargin=OUTPUT_MARGIN,
        rightMargin=OUTPUT_MARGIN,
        topMargin=OUTPUT_MARGIN,
        bottomMargin=OUTPUT_MARGIN,
    )
    story = []

    for _, row in gdf.iterrows():
        script: str = row.get("primary_script", "Latin")
        style = make_paragraph_style(script)
        for col in TEXT_COLS:
            val = str(row.get(col, "") or "")
            if val.strip():
                story.append(Paragraph(val, style))
                story.append(Spacer(1, 4))

    doc.build(story)
    log.info("Written: %s", output_path)

    if not audit_glyph_coverage(output_path):
        log.error("Glyph coverage below threshold — review font chains")
        sys.exit(1)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Multilingual spatial typography pipeline")
    parser.add_argument("--input",  type=Path, required=True, help="GeoPackage or Shapefile path")
    parser.add_argument("--output", type=Path, default=Path("report.pdf"))
    args = parser.parse_args()
    run(args.input, args.output)

Edge Cases and Advanced Configuration

Null and Empty Attributes

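Some features will have None or empty string values in text columns. Guard every string operation with str(val or "") before passing to the classifier, and treat float NaN as empty too: pandas uses NaN for missing values, and because NaN is truthy the or "" guard lets it through as the string "nan". Log nulls by feature ID so spatial analysts can investigate whether the gap is a data quality issue or an expected sparse attribute.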
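
A minimal sketch of such a guard, assuming missing values may arrive as either None or float NaN; clean_text is a hypothetical helper name.

```python
import math


def clean_text(val: object) -> str:
    """Return a stripped string, mapping None and float NaN to ""."""
    if val is None:
        return ""
    if isinstance(val, float) and math.isnan(val):
        return ""
    return str(val).strip()


print(repr(str(float("nan") or "")))   # 'nan'
print(repr(clean_text(float("nan"))))  # ''
print(repr(clean_text(None)))          # ''
print(repr(clean_text("  Kyiv ")))     # 'Kyiv'
```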

Mixed-Script Strings Within a Single Cell

A single place name may mix scripts — for example a bilingual label 北京 / Beijing. Run character-level classification and apply <span> wrappers with per-segment lang attributes in HTML renderers. In ReportLab, split on script boundaries and render each segment as a separate Paragraph with the appropriate style, then join them in a KeepTogether flowable.

Large Batch Processing

For datasets with 10,000+ features, cache fonttools TTFont objects across rows, because parsing a font file is expensive. Store resolved (glyph, font_name) pairs in an lru_cache or an in-process SQLite table. This reduces per-feature overhead from ~50 ms to under 1 ms on a warm cache.

Headless Environments

Docker containers used in CI/CD pipelines often lack system fonts. Add a COPY fonts/ /usr/share/fonts/truetype/spatial/ step to your Dockerfile and run fc-cache -f -v to register fonts with Fontconfig. WeasyPrint discovers fonts through Fontconfig on Linux; ReportLab requires explicit path registration via pdfmetrics.registerFont.

DOCKERFILE

FROM python:3.11-slim
RUN apt-get update && apt-get install -y libpango-1.0-0 libpangoft2-1.0-0 && rm -rf /var/lib/apt/lists/*
COPY fonts/ /usr/share/fonts/truetype/spatial/
RUN fc-cache -f -v
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . /app
WORKDIR /app

Troubleshooting

Symptom	Likely Cause	Resolution
Tofu boxes (`□`) in output	Primary font missing the glyph; fallback chain not triggered	Verify `unicode-range` coverage; reorder fallback priority; run `hb-view` to inspect HarfBuzz output directly
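Reversed numerals in Arabic text	Base direction unset, or a bidi override applied; ASCII digits are weak characters whose placement depends on context	Set `direction: rtl` with `unicode-bidi: isolate`; remove `bidi-override` / `isolate-override` from text containing numerals; pass logical-order text to HarfBuzz instead of pre-reversing it with `python-bidi`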
CJK line breaks splitting mid-word	Default Latin hyphenation rules active	Set `line-break: strict` and `word-break: keep-all`; disable automatic hyphenation for CJK Unicode blocks
Inconsistent x-height across scripts	Optical sizing mismatch between font families	Switch to unified Noto Sans family; adjust `font-size-adjust` or scale `cap-height` per script
PDF file size bloat	Full font embedding instead of subsetting	Enable subsetting flag in renderer; cache subsetted font binaries across pipeline runs
Diacritics clipped at line tops	Line-height set for Latin metrics only	Increase `leading` by 20% for Arabic/Devanagari; set `line-height: 1.8` in CSS for diacritic-heavy scripts

Detailed Guides in This Section

Embedding CJK Fonts for Multilingual Map Labels — install and declare Noto Sans CJK / Source Han via @font-face so Chinese, Japanese and Korean place names render correctly in WeasyPrint output.
Subsetting WOFF2 Fonts to Shrink Report PDFs — cut embedded font weight with fonttools pyftsubset and unicode-range subsetting without losing required glyphs.

CSS Grid Systems for Report Layouts — column and row definitions that typography overflow can violate
Margin and Bleed Alignment in Automated PDFs — bleed-zone constraints that RTL text flow errors push text into
Print-Ready Page Sizing Standards for GIS Reports — page geometry that constrains the available text box for multilingual labels
Dynamic Legend Injection for Variable Datasets — legend text that shares the same font resolution pipeline
Table Pagination Strategies for Large Attribute Tables — pagination logic that depends on accurate text metrics produced by shaping

Embedding a deterministic typography mapping layer into your reporting architecture means that place names, administrative boundaries, and metadata remain legible and publication-ready regardless of the script mix in incoming spatial data. Combined with the page geometry and bleed controls defined in the parent Document Architecture & Layout Rules for Spatial Reports section, this pipeline guarantees that every multilingual attribute surfaces correctly in print and digital outputs.
