Why do my table rows get clipped at PDF page boundaries?

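ReportLab's Table breaks only between rows by default (splitByRow=1), so a clipped row usually means the row is taller than the space that was measured for it, typically because long plain-string cells are not wrapped. Wrap long values in Paragraph objects, pass explicit colWidths, and size each chunk from measured row heights. Do not set splitByRow=0: splitting by column is not implemented, and ReportLab raises NotImplementedError as soon as such a table has to split.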

How do I repeat column headers on every page in ReportLab?

Pass repeatRows=1 to the Table constructor. This tells Platypus to re-draw the first row at the top of every page the table splits onto. An explicit PageBreak ends the table, so a table that follows one needs its own header row.

What causes MemoryError when paginating large GIS tables?

Loading the full DataFrame into memory before chunking is the primary cause. Use pandas.read_csv with chunksize, a database cursor, or a generator-based pipeline to stream rows into the rendering loop.

Table Pagination Strategies for Large Attribute Tables

Attribute tables in automated spatial reports routinely exceed single-page constraints. Environmental compliance dossiers, infrastructure inventories, and cadastral audits frequently contain thousands of records that must be rendered alongside cartographic outputs — and the layout engine must get every page boundary exactly right. Without deterministic pagination, footers drift, column headers vanish after the first page, and row data is sliced mid-cell, breaking coordinate strings and numeric references that downstream auditors rely on.

This page covers production-tested Python patterns that solve the pagination problem once and keep it solved as datasets scale. If you are new to how these tables fit into a broader pipeline, see Dynamic Map & Data Embedding Workflows for the architecture context.

Prerequisites

Requirement	Minimum version	Notes
Python	3.10+	Type-hint syntax used throughout
`pandas`	2.0.0	DataFrame normalization and chunking
`geopandas`	0.14	Spatial joins and CRS metadata
`reportlab`	3.6.0	Platypus layout engine (primary target)
`fpdf2`	2.7.0	Lightweight alternative for simpler layouts
`weasyprint`	59.0	HTML/CSS paged-media fallback
`libcairo2`	system	Required on Linux for vector rendering backends

Bash

pip install pandas geopandas reportlab fpdf2

Assumed prior knowledge: familiarity with pandas.DataFrame operations, Python type hints, and the concept of a PDF layout engine’s document-flow model. If your pipeline also embeds cartographic outputs, see Automated Static Map Generation from GeoJSON for viewport and CRS normalization patterns that must be resolved before the pagination layer runs.

Pipeline Architecture

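The diagram below shows the end-to-end data flow from raw attribute data to a paginated PDF: schema normalisation, row-height measurement, height-based chunking, per-chunk table construction, and layout assembly. Each stage is a discrete, testable unit, and this separation is what makes the pipeline deterministic.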

Step-by-Step Implementation

Step 1 — Schema Normalisation and Type Casting

Flatten any nested JSON or GeoJSON attributes into a flat pandas.DataFrame. Drop the geometry column (or serialize it to WKT if it is required in the table). Replace missing values with an empty string, then cast every column to str; casting first would turn NaN into the literal string "nan", which fillna can no longer catch. This prevents type-coercion errors during the PDF text-measurement pass that follows.

Python

import geopandas as gpd
import pandas as pd

def load_attribute_table(path: str) -> pd.DataFrame:
    """Load a spatial file and return a flat, string-typed DataFrame."""
    gdf = gpd.read_file(path)
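    # Drop the active geometry column, whatever it is named.
    df = pd.DataFrame(gdf.drop(columns=gdf.geometry.name))
    # Blank out nulls before casting; astype(str) alone would turn NaN into "nan".
    return df.astype(object).where(df.notna(), "").astype(str).reset_index(drop=True)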

Complex nested objects (lists, dicts, arrays) inside attribute columns must be serialised with json.dumps or excluded. Embedding raw Python repr strings into a PDF cell is a common source of layout crashes.
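
A minimal sketch of that serialisation step, assuming cells arrive as arbitrary Python objects. The helper name serialise_cell is hypothetical and not part of the pipeline above.

```python
import json
import math


def serialise_cell(value: object) -> str:
    """Return a PDF-safe string for one attribute value."""
    # Missing values become empty strings rather than "None" or "nan".
    if value is None or (isinstance(value, float) and math.isnan(value)):
        return ""
    # Containers are written as compact JSON instead of a Python repr.
    if isinstance(value, (list, tuple, dict)):
        return json.dumps(value, ensure_ascii=False, separators=(",", ":"), default=str)
    return str(value)


row = [{"zone": "R1", "far": 0.5}, ["a", "b"], None, 42]
print([serialise_cell(v) for v in row])
```

Applied per column, for example with df[col].map(serialise_cell), this could replace the blanket string cast for columns known to hold nested values.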

Step 2 — Font Metrics and Row-Height Calculation

ReportLab's Table sizes its rows automatically at build time, but it does not wrap plain-string cells and it does not expose those heights before layout. To decide page breaks up front, measure every cell explicitly using pdfmetrics.stringWidth() and derive the number of wrapped lines for the expected column width.

Python

from reportlab.pdfbase import pdfmetrics

def compute_row_heights(
    data: list[list[str]],
    font_name: str,
    font_size: int,
    col_width: float,
    v_padding: int = 8,  # TOPPADDING + BOTTOMPADDING in the table style (4 + 4)
) -> list[float]:
    """Return per-row heights in points."""
    heights: list[float] = []
    for row in data:
        max_h = float(font_size + v_padding)
        for cell in row:
            w = pdfmetrics.stringWidth(str(cell), font_name, font_size)
            lines = max(1, int(w / col_width) + 1)
            cell_h = float((font_size + 2) * lines + v_padding)
            max_h = max(max_h, cell_h)
        heights.append(max_h)
    return heights

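The col_width parameter should match the actual column width allocated by your Table object, so pass the same value through colWidths when you build the table. Mismatching these values is the most common cause of text overflowing cell boundaries. ReportLab does not wrap plain strings inside a cell; wrap long values in Paragraph objects if the estimated line breaks should actually appear in the output.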
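
As a worked example of the arithmetic, the sketch below assumes a row is sized as line height times line count plus vertical padding, and that the usable text width is the column width minus horizontal padding. The function name and the default leading and padding values are illustrative assumptions; substitute the values from your own TableStyle.

```python
import math


def wrapped_row_height(
    text_width: float,
    col_width: float,
    leading: float = 11.0,    # assumed line height in points (font size 9 + 2)
    h_padding: float = 12.0,  # assumed LEFTPADDING + RIGHTPADDING (6 + 6)
    v_padding: float = 8.0,   # assumed TOPPADDING + BOTTOMPADDING (4 + 4)
) -> float:
    """Estimate the height in points of a cell whose text is text_width wide."""
    usable = col_width - h_padding
    if usable <= 0:
        raise ValueError("col_width must exceed the horizontal padding")
    # Round up: a partially filled last line still occupies a full line.
    lines = max(1, math.ceil(text_width / usable))
    return leading * lines + v_padding


print(wrapped_row_height(text_width=80.0, col_width=120.0))   # fits on one line
print(wrapped_row_height(text_width=250.0, col_width=120.0))  # needs three lines
```

Wrapping at word boundaries can need one more line than this estimate, so treat the result as approximate and leave some slack in the page budget.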

Step 3 — Chunking and Boundary Detection

Iterate through the pre-computed row heights and accumulate a running total. When the running total would exceed the available vertical space on the page, record the boundary index and start a new chunk. This guarantees deterministic, measurement-based page breaks rather than engine heuristics.

The diagram below illustrates how the boundary detection algorithm maps row heights to page slices, with each chunk fitting within the available vertical space.

Python

from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch

def chunk_by_height(
    data: list[list[str]],
    row_heights: list[float],
    top_margin: float = 1.0 * inch,
    bottom_margin: float = 1.0 * inch,
    header_height: float = 20.0,
) -> list[list[list[str]]]:
    """Split data into page-sized chunks."""
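    page_h = letter[1]
    # SimpleDocTemplate's default frame adds 6 pt of padding at top and bottom.
    frame_padding = 12.0
    available = page_h - top_margin - bottom_margin - frame_padding - header_height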

    chunks: list[list[list[str]]] = []
    current: list[list[str]] = []
    running = 0.0

    for row, h in zip(data, row_heights):
        if running + h > available and current:
            chunks.append(current)
            current = []
            running = 0.0
        current.append(row)
        running += h

    if current:
        chunks.append(current)
    return chunks
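
The same greedy rule can also be expressed as boundary indices rather than row slices, which is convenient for logging or for keeping other page content in step with the table. This is a sketch under assumed values: the function name page_boundaries, the 19 pt row height, and the 600 pt available height are illustrative only.

```python
def page_boundaries(row_heights: list[float], available: float) -> list[int]:
    """Return the index of the first row on each page."""
    starts = [0] if row_heights else []
    running = 0.0
    for i, h in enumerate(row_heights):
        # Start a new page when this row would overflow a non-empty page.
        if running + h > available and i > starts[-1]:
            starts.append(i)
            running = 0.0
        running += h
    return starts


heights = [19.0] * 70  # 70 single-line rows, 19 pt each (illustrative)
print(page_boundaries(heights, available=600.0))
```

A row taller than the available height still gets a page of its own here, so oversized rows are worth flagging separately before rendering.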

Step 4 — Header Repetition and Context Preservation

When tables span multiple pages, column headers must reappear at the top of each new page. In ReportLab, repeatRows=1 on the Table constructor is the authoritative mechanism — it is not a style rule and is not set via TableStyle. Spatial metadata (CRS string, map extent, report date) belongs in page footers so downstream reviewers always have coordinate context alongside the data rows. This footer-injection pattern applies the same discipline as Dynamic Legend Injection for Variable Datasets, where legend keys must persist across pages alongside their corresponding spatial content.

Python

from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

def build_chunk_table(
    headers: list[str],
    chunk_data: list[list[str]],
    font_name: str = "Helvetica",
    font_size: int = 9,
) -> Table:
    table_data = [headers] + chunk_data
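    # splitByRow stays at its default of 1: rows are never cut in half, and
    # splitByRow=0 raises NotImplementedError if the table ever has to split.
    t = Table(table_data, repeatRows=1)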
    t.setStyle(TableStyle([
        ("FONTNAME",      (0, 0), (-1, 0),  font_name + "-Bold"),
        ("FONTSIZE",      (0, 0), (-1, -1), font_size),
        ("LEADING",       (0, 0), (-1, -1), font_size + 2),  # line height assumed by compute_row_heights
        ("BACKGROUND",    (0, 0), (-1, 0),  colors.HexColor("#1a3a4a")),
        ("TEXTCOLOR",     (0, 0), (-1, 0),  colors.white),
        ("ROWBACKGROUNDS",(0, 1), (-1, -1), [colors.white, colors.HexColor("#f4f7fa")]),
        ("GRID",          (0, 0), (-1, -1), 0.4, colors.HexColor("#c0c8d0")),
        ("VALIGN",        (0, 0), (-1, -1), "TOP"),
        ("TOPPADDING",    (0, 0), (-1, -1), 4),
        ("BOTTOMPADDING", (0, 0), (-1, -1), 4),
        ("LEFTPADDING",   (0, 0), (-1, -1), 6),
        ("RIGHTPADDING",  (0, 0), (-1, -1), 6),
    ]))
    return t
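
The footer side of this step can be sketched as a page callback. The factory below is an illustration under assumptions: make_footer, its parameters, the 7 pt font size, and the 36 pt baseline are hypothetical choices, and the callback relies only on the canvas and document objects that ReportLab passes to page callbacks.

```python
from datetime import date


def make_footer(crs: str, extent: tuple[float, float, float, float], report_date: date):
    """Build a page callback that stamps spatial metadata into the footer."""
    minx, miny, maxx, maxy = extent
    label = (
        f"CRS: {crs} | Extent: {minx:.4f}, {miny:.4f}, {maxx:.4f}, {maxy:.4f} | "
        f"Report date: {report_date.isoformat()}"
    )

    def draw_footer(canvas, doc) -> None:
        # ReportLab calls this with the page canvas and the document template.
        canvas.saveState()
        canvas.setFont("Helvetica", 7)
        # 36 pt above the page bottom, inside a 1-inch bottom margin.
        canvas.drawString(doc.leftMargin, 36, f"{label} | Page {canvas.getPageNumber()}")
        canvas.restoreState()

    return draw_footer
```

Passing the result as both onFirstPage and onLaterPages to doc.build(elements, onFirstPage=footer, onLaterPages=footer) draws the same footer on every page, independent of where the table chunks break.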

Step 5 — Layout Assembly and Flow Control

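Render chunks sequentially. Insert a PageBreak between chunks but not after the final one, since a trailing PageBreak can leave a blank final page that fails automated PDF validation tools. Each chunk is sized to fit one page, so its Table should never need to split. Leave splitByRow at its default of 1 so that, if a measurement is off, ReportLab breaks between rows and repeats the header; splitByRow=0 raises NotImplementedError when a table has to split. For a hard guarantee, wrap each Table in a KeepInFrame.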

Python

from reportlab.lib.pagesizes import letter
from reportlab.platypus import SimpleDocTemplate, PageBreak

def assemble_pdf(
    chunks: list[list[list[str]]],
    headers: list[str],
    output_path: str,
    font_name: str = "Helvetica",
    font_size: int = 9,
) -> str:
    doc = SimpleDocTemplate(output_path, pagesize=letter)
    elements = []
    for idx, chunk_data in enumerate(chunks):
        elements.append(build_chunk_table(headers, chunk_data, font_name, font_size))
        if idx < len(chunks) - 1:
            elements.append(PageBreak())
    doc.build(elements)
    return output_path

Production-Ready Script

The script below integrates all five stages with logging, configurable parameters, and graceful error handling. It is suitable for drop-in use in a CI/CD pipeline or a cron-driven batch job.

Python

"""
paginate_attribute_table.py
----------------------------
Paginate a GIS attribute table (shapefile / GeoJSON / GeoPackage) into a
multi-page PDF with deterministic page boundaries and repeating column headers.

Usage:
    python paginate_attribute_table.py \
        --input parcels.geojson \
        --output report.pdf \
        --font-size 9 \
        --col-width 120
"""

import argparse
import logging
import sys
from pathlib import Path

import geopandas as gpd
import pandas as pd
from reportlab.lib import colors
from reportlab.lib.pagesizes import letter
from reportlab.lib.units import inch
from reportlab.pdfbase import pdfmetrics
from reportlab.platypus import PageBreak, SimpleDocTemplate, Table, TableStyle

logging.basicConfig(level=logging.INFO, format="%(levelname)s %(message)s")
log = logging.getLogger(__name__)


# ---------------------------------------------------------------------------
# Stage 1: load and normalise
# ---------------------------------------------------------------------------

def load_attribute_table(path: str) -> pd.DataFrame:
    gdf = gpd.read_file(path)
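    # Drop the active geometry column, whatever it is named.
    df = pd.DataFrame(gdf.drop(columns=gdf.geometry.name))
    # Blank out nulls before casting; astype(str) alone would turn NaN into "nan".
    return df.astype(object).where(df.notna(), "").astype(str).reset_index(drop=True)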


# ---------------------------------------------------------------------------
# Stage 2: row-height measurement
# ---------------------------------------------------------------------------

def compute_row_heights(
    data: list[list[str]],
    font_name: str,
    font_size: int,
    col_width: float,
    v_padding: int = 8,  # TOPPADDING + BOTTOMPADDING in the table style (4 + 4)
) -> list[float]:
    heights: list[float] = []
    for row in data:
        max_h = float(font_size + v_padding)
        for cell in row:
            w = pdfmetrics.stringWidth(str(cell), font_name, font_size)
            lines = max(1, int(w / col_width) + 1)
            cell_h = float((font_size + 2) * lines + v_padding)
            max_h = max(max_h, cell_h)
        heights.append(max_h)
    return heights


# ---------------------------------------------------------------------------
# Stage 3: chunking
# ---------------------------------------------------------------------------

def chunk_by_height(
    data: list[list[str]],
    row_heights: list[float],
    top_margin: float,
    bottom_margin: float,
    header_height: float = 20.0,
) -> list[list[list[str]]]:
    # SimpleDocTemplate's default frame adds 6 pt of padding at top and bottom (12 pt total).
    available = letter[1] - top_margin - bottom_margin - 12.0 - header_height
    chunks: list[list[list[str]]] = []
    current: list[list[str]] = []
    running = 0.0
    for row, h in zip(data, row_heights):
        if running + h > available and current:
            chunks.append(current)
            current = []
            running = 0.0
        current.append(row)
        running += h
    if current:
        chunks.append(current)
    return chunks


# ---------------------------------------------------------------------------
# Stages 4 & 5: table building and PDF assembly
# ---------------------------------------------------------------------------

def build_chunk_table(
    headers: list[str],
    chunk_data: list[list[str]],
    font_name: str,
    font_size: int,
) -> Table:
    table_data = [headers] + chunk_data
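    # splitByRow stays at its default of 1: rows are never cut in half, and
    # splitByRow=0 raises NotImplementedError if the table ever has to split.
    t = Table(table_data, repeatRows=1)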
    t.setStyle(TableStyle([
        ("FONTNAME",       (0, 0), (-1, 0),  font_name + "-Bold"),
        ("FONTSIZE",       (0, 0), (-1, -1), font_size),
        ("LEADING",        (0, 0), (-1, -1), font_size + 2),  # line height assumed by compute_row_heights
        ("BACKGROUND",     (0, 0), (-1, 0),  colors.HexColor("#1a3a4a")),
        ("TEXTCOLOR",      (0, 0), (-1, 0),  colors.white),
        ("ROWBACKGROUNDS", (0, 1), (-1, -1), [colors.white, colors.HexColor("#f4f7fa")]),
        ("GRID",           (0, 0), (-1, -1), 0.4, colors.HexColor("#c0c8d0")),
        ("VALIGN",         (0, 0), (-1, -1), "TOP"),
        ("TOPPADDING",     (0, 0), (-1, -1), 4),
        ("BOTTOMPADDING",  (0, 0), (-1, -1), 4),
        ("LEFTPADDING",    (0, 0), (-1, -1), 6),
        ("RIGHTPADDING",   (0, 0), (-1, -1), 6),
    ]))
    return t


def paginate_attribute_table(
    input_path: str,
    output_path: str,
    font_name: str = "Helvetica",
    font_size: int = 9,
    col_width: float = 120.0,
    top_margin: float = 1.0 * inch,
    bottom_margin: float = 1.0 * inch,
) -> str:
    log.info("Loading %s", input_path)
    df = load_attribute_table(input_path)
    headers = df.columns.tolist()
    data = df.values.tolist()
    log.info("Rows: %d  Columns: %d", len(data), len(headers))

    row_heights = compute_row_heights(data, font_name, font_size, col_width)
    chunks = chunk_by_height(data, row_heights, top_margin, bottom_margin)
    log.info("Pages: %d", len(chunks))

    doc = SimpleDocTemplate(output_path, pagesize=letter,
                            topMargin=top_margin, bottomMargin=bottom_margin)
    elements = []
    for idx, chunk_data in enumerate(chunks):
        elements.append(build_chunk_table(headers, chunk_data, font_name, font_size))
        if idx < len(chunks) - 1:
            elements.append(PageBreak())

    doc.build(elements)
    log.info("Written: %s", output_path)
    return output_path


# ---------------------------------------------------------------------------
# CLI
# ---------------------------------------------------------------------------

def _cli() -> None:
    parser = argparse.ArgumentParser(description="Paginate a GIS attribute table to PDF.")
    parser.add_argument("--input",      required=True,  help="Path to GeoJSON / shapefile / GPKG")
    parser.add_argument("--output",     required=True,  help="Output PDF path")
    parser.add_argument("--font-size",  type=int, default=9)
    parser.add_argument("--col-width",  type=float, default=120.0,
                        help="Estimated column width in points (for wrap calculation)")
    args = parser.parse_args()

    if not Path(args.input).exists():
        log.error("Input file not found: %s", args.input)
        sys.exit(1)

    paginate_attribute_table(
        input_path=args.input,
        output_path=args.output,
        font_size=args.font_size,
        col_width=args.col_width,
    )


if __name__ == "__main__":
    _cli()

Edge Cases and Advanced Configuration

Null and sparse attribute data

Spatial data commonly contains columns with very high null rates — e.g., optional administrative fields populated only in certain jurisdictions. Cast nulls to "" at Stage 1 (already done above) but also apply column-level filtering before measurement: drop columns where more than 95 % of rows are empty, or replace them with a placeholder such as "—". This prevents column-width calculations from being dominated by a handful of pathologically long strings in an otherwise sparse column.
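
A minimal sketch of the column-level filter, operating on the same headers and list-of-rows structure that the measurement stage consumes. The function name drop_sparse_columns and its max_empty_ratio parameter are hypothetical; the 0.95 default mirrors the threshold mentioned above.

```python
def drop_sparse_columns(
    headers: list[str],
    rows: list[list[str]],
    max_empty_ratio: float = 0.95,
) -> tuple[list[str], list[list[str]]]:
    """Remove columns whose share of empty cells exceeds max_empty_ratio."""
    if not rows:
        return headers, rows
    keep = []
    for col in range(len(headers)):
        # Whitespace-only cells count as empty.
        empty = sum(1 for row in rows if row[col].strip() == "")
        if empty / len(rows) <= max_empty_ratio:
            keep.append(col)
    return [headers[c] for c in keep], [[row[c] for c in keep] for row in rows]


# 100 rows where "note" is filled in only twice (98% empty).
rows = [[str(i), "W1", "x" if i < 2 else ""] for i in range(100)]
headers, rows = drop_sparse_columns(["id", "ward", "note"], rows)
print(headers)
```

Running the filter before compute_row_heights keeps the dropped columns out of both the width budget and the height estimate.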

Memory management for 100k+ row datasets

Materialising a full 500k-row DataFrame and casting every column to Python strings before chunking can exhaust memory on constrained machines and raise MemoryError. Replace batch loading with streaming:

Python

import pandas as pd

# Stream a CSV-backed attribute export in 5,000-row pages
CHUNK = 5_000
with pd.read_csv("attributes.csv", chunksize=CHUNK) as reader:
    for batch_df in reader:
        # Fill nulls before casting; astype(str) alone would turn NaN into "nan".
        batch_df = batch_df.fillna("").astype(str)
        # measure + chunk + render each batch independently

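For database-backed pipelines (PostGIS, SQLite, GeoPackage), fetch rows in batches through a database cursor, for example with cursor.fetchmany(), pandas.read_sql with chunksize, or SQLAlchemy's stream_results execution option, so the full result set is never materialised. LIMIT / OFFSET paging also works, but it rescans the skipped rows on every page and slows down as the offset grows.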
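
For the SQLite case (a GeoPackage is an SQLite container), cursor-based streaming could look like the sketch below. It uses only the standard-library sqlite3 module; the parcels table, its columns, and the stream_rows helper are hypothetical stand-ins for a real attribute table.

```python
import sqlite3
from collections.abc import Iterator


def stream_rows(
    conn: sqlite3.Connection, query: str, batch_size: int = 5_000
) -> Iterator[list[list[str]]]:
    """Yield batches of string-typed rows without loading the full result set."""
    cursor = conn.execute(query)
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        # Nulls become empty strings; everything else is cast to str.
        yield [["" if v is None else str(v) for v in row] for row in batch]


# Demo data: an in-memory table with a null owner on every odd id.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE parcels (id INTEGER, owner TEXT)")
conn.executemany(
    "INSERT INTO parcels VALUES (?, ?)",
    [(i, None if i % 2 else f"owner-{i}") for i in range(12)],
)
for batch in stream_rows(conn, "SELECT id, owner FROM parcels ORDER BY id", batch_size=5):
    print(len(batch), batch[0])
```

Each batch can then go through measurement, chunking, and rendering on its own, so peak memory is bounded by the batch size rather than the table size.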

Parallel preprocessing, single-threaded rendering

Font-metric measurement (compute_row_heights) is CPU-bound and embarrassingly parallel. Offload it to concurrent.futures.ProcessPoolExecutor with the dataset split across workers. However, SimpleDocTemplate.build() must run on a single thread — ReportLab’s canvas and Platypus flowable system are not thread-safe.

Headless CI/CD environments

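On headless Linux agents, confirm that the libcairo2 and libpango-1.0-0 system packages are installed before the job runs: the WeasyPrint fallback needs Pango, and Cairo-backed renderers need libcairo2. A missing font that is silently replaced by Courier invalidates all pre-computed stringWidth measurements, because Courier has different character metrics than Helvetica. ReportLab measures Helvetica from its bundled metrics and does not embed the standard PDF fonts, so substitution can also happen in the viewer; registering and embedding a TrueType font removes that risk. Add a pre-flight assertion: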

Python

from reportlab.pdfbase import pdfmetrics

assert pdfmetrics.getFont("Helvetica"), "Helvetica metrics not registered"

Multi-format outputs

When the same attribute dataset must appear in both PDF and HTML formats (e.g., for interactive web reports alongside archival PDFs), separate the chunking logic from the rendering layer. The chunk_by_height function operates on raw row heights, which are PDF-specific. For HTML output, feed the same normalised DataFrame into a Jinja2 template with CSS break-inside: avoid on <tr> elements — this delegates pagination to the browser’s print engine. See Loop Mapping for Dynamic Attribute Tables for the Jinja2 iteration patterns that complement this approach.
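
A dependency-free sketch of the HTML side is shown below. Plain string formatting stands in for the Jinja2 template to keep the example self-contained; the render_html_table function and the PRINT_CSS constant are hypothetical names, and the CSS rules are the relevant part.

```python
from html import escape

PRINT_CSS = """
table { border-collapse: collapse; width: 100%; }
thead { display: table-header-group; }  /* repeat headers on each printed page */
tr { break-inside: avoid; page-break-inside: avoid; }  /* keep rows atomic */
"""


def render_html_table(headers: list[str], rows: list[list[str]]) -> str:
    """Render an attribute table as print-ready HTML."""
    # Escape every value so attribute text cannot break the markup.
    head = "".join(f"<th>{escape(h)}</th>" for h in headers)
    body = "\n".join(
        "<tr>" + "".join(f"<td>{escape(cell)}</td>" for cell in row) + "</tr>"
        for row in rows
    )
    return (
        f"<style>{PRINT_CSS}</style>\n"
        f"<table>\n<thead><tr>{head}</tr></thead>\n<tbody>\n{body}\n</tbody>\n</table>"
    )


print(render_html_table(["id", "owner"], [["1", "Smith & Sons"]]))
```

The legacy page-break-inside property is included alongside break-inside as a fallback for older print engines.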

Troubleshooting

Symptom	Likely cause	Resolution
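Column headers missing after page 1	`repeatRows` not set on `Table` constructor	Pass `repeatRows=1` to the `Table` constructor, e.g. `Table(table_data, repeatRows=1)`
Rows sliced mid-cell at page boundary	Row heights under-measured, or long plain-string cells that ReportLab does not wrap	Keep `splitByRow=1` (the default; `splitByRow=0` raises `NotImplementedError` when a table must split), wrap long cells in `Paragraph`, and pass explicit `colWidths`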
Text overflows cell boundaries	`col_width` in measurement mismatches actual table column width	Pass the exact allocated column width in points; measure with `table._colWidths` after layout pass
Double grid lines at page seam	Two separate `Table` objects each draw their outer border	Apply `('LINEBELOW', (0,-1), (-1,-1), 0, colors.white)` to the last row of non-final chunks
MemoryError on large datasets	Full DataFrame materialised before chunking	Stream via `pandas.read_csv(chunksize=N)` or database cursor
Misaligned columns after font fallback	Missing system font; Courier substituted silently	Embed target font with `pdfmetrics.registerFont(TTFont(...))` before measurement

Detailed Guides in This Section

Preventing Table Row Splits Across PDF Page Breaks — exact splitByRow and CSS break-inside configurations that force atomic row rendering across every supported engine.

Dynamic Legend Injection for Variable Datasets — keep legend keys and tabular summaries synchronised across pages using the same chunk-aware rendering approach.
Automated Static Map Generation from GeoJSON — viewport extent and CRS normalisation that must complete before a map panel is embedded alongside the paginated table.
Loop Mapping for Dynamic Attribute Tables — Jinja2 iteration patterns for HTML-format attribute tables that complement the PDF pagination pipeline.
Chart-to-PDF Sync with Matplotlib — synchronise chart viewports and data slices with the same page boundaries used for attribute table chunks.

Deterministic chunking, explicit font measurement, and strict flow control transform attribute table pagination from a layout debugging exercise into a predictable engineering task. Applied consistently across a Dynamic Map & Data Embedding Workflows pipeline, these patterns ensure that compliance reports, infrastructure inventories, and cadastral audits render correctly at enterprise scale — first time, every time.
