Iterating Through Shapefile Attributes in ReportLab

Iterating through shapefile attributes in ReportLab requires a two-stage pipeline: first parse the .shp/.dbf vector data using a dedicated GIS library, then feed the cleaned Python-native records into ReportLab’s Table or LongTable flowables. This technique is the foundational data-extraction step inside the broader Loop Mapping for Dynamic Attribute Tables workflow, sitting between raw vector storage and the layout engine. ReportLab has no native geospatial format support, so all attribute iteration, type coercion, and encoding normalization must happen in memory before the data reaches the PDF rendering stage.

Prerequisites

Python 3.10+: zip() with the strict flag, built-in generic type hints such as list[str], and stable pathlib integration.
pyshp 2.3.0+: install with pip install pyshp. The encoding keyword on shapefile.Reader (available throughout the 2.x series) prevents character corruption on legacy DBF files.
reportlab 3.6.12+: install with pip install reportlab. The Table pagination fix for datasets over 5,000 rows landed in this release; earlier builds can leak memory.
geopandas 0.14+ / fiona 1.9.0+: only needed when CRS reprojection, geometry validation, or spatial joins must occur before table generation. For pure attribute extraction pyshp is lighter.
A shapefile set (.shp, .dbf, .prj) on a local or mounted path. Remote files should be fetched and written to disk before parsing.

No Jinja2 dependency is required at this stage; the sanitized Python list-of-lists is passed directly to ReportLab. If your pipeline also drives an HTML intermediate layer, see Jinja2 Templating & Theme Logic for the template-engine integration point.

Step-by-Step Implementation

Step 1 — Open the shapefile with explicit encoding

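pyshp’s shapefile.Reader accepts an encoding keyword. Always declare it explicitly; the default utf-8 assumption either raises UnicodeDecodeError or mis-decodes fields stored in cp1252 or latin1, both of which are common in shapefiles produced by Windows GIS tooling. The Reader also accepts encodingErrors (for example "replace") to control what happens when a byte sequence cannot be decoded.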

Python

import shapefile

with shapefile.Reader("data/municipalities.shp", encoding="cp1252") as sf:
    # sf.fields contains the DBF field descriptors
    # sf.iterRecords() yields one list-like Record per feature row
    print(sf.fields)  # [['DeletionFlag', 'C', 1, 0], ['NAME', 'C', 80, 0], ...]

The context manager guarantees the file handle is closed after the with block even when iteration raises an exception mid-dataset.
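To see why the explicit encoding in Step 1 matters, the following sketch decodes the same cp1252 bytes three ways using only the standard library. The place name is an illustrative value, not taken from any real dataset, and the snippet does not involve pyshp itself; it only shows what a mismatched codec does to a byte string.

```python
# Bytes as they would be stored in a cp1252-encoded DBF text field
# (illustrative value only).
raw = "Zürich".encode("cp1252")

correct = raw.decode("cp1252")                    # declared encoding matches
replaced = raw.decode("utf-8", errors="replace")  # mismatch, U+FFFD inserted
ignored = raw.decode("utf-8", errors="ignore")    # mismatch, character dropped

try:
    raw.decode("utf-8")                           # strict mode raises instead
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

print(repr(correct), repr(replaced), repr(ignored), strict_failed)
```

Only the first call preserves the original text; the other modes either alter it or fail, which is why the encoding should be declared rather than guessed.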

Step 2 — Extract field names from the DBF metadata

sf.fields always starts with the mandatory DeletionFlag descriptor at index 0. Slice it away before building the header row:

Python

fields: list[str] = [f[0] for f in sf.fields[1:]]
# Example result: ['NAME', 'AREA_HA', 'ZONE_CODE', 'POP_2021']

Each element in sf.fields is a four-element list: [name, type_code, length, decimal_count]. The type code (C, N, F, D, L) tells you what Python type pyshp will return — strings, integers/floats, dates, or booleans. Knowing this lets you apply targeted type coercion in the next step.
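As a rough illustration of type-aware coercion, the sketch below formats a value according to its field descriptor. The function name format_by_type and the SURVEYED and ACTIVE fields are hypothetical, and the formatting choices (fixed decimals, ISO dates, Yes/No) are assumptions rather than anything pyshp or ReportLab prescribes.

```python
from datetime import date

def format_by_type(value, descriptor):
    """Format one attribute value using its DBF field descriptor.

    descriptor is [name, type_code, length, decimal_count], the same
    four-element layout used by sf.fields.
    """
    _name, type_code, _length, decimals = descriptor
    if value is None:
        return ""
    if type_code in ("N", "F") and isinstance(value, (int, float)):
        # Fixed decimals keep numeric columns visually consistent.
        return f"{value:.{decimals}f}" if decimals else str(int(value))
    if type_code == "D" and isinstance(value, date):
        return value.isoformat()
    if type_code == "L" and isinstance(value, bool):
        return "Yes" if value else "No"
    return str(value).strip()

print(format_by_type(12.5, ["AREA_HA", "N", 12, 2]))
print(format_by_type(date(2021, 3, 4), ["SURVEYED", "D", 8, 0]))
print(format_by_type(True, ["ACTIVE", "L", 1, 0]))
```

Because the helper takes the descriptor as a plain list, it can be paired with the records by zipping each row against sf.fields[1:].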

Step 3 — Iterate records and sanitize values

ReportLab’s text engine expects clean Unicode strings. Raw bytes objects, None, and numeric types must all be normalized before they reach the Table constructor. Use a small helper function applied in a single list comprehension:

Python

import shapefile

def safe_str(val: object, encoding: str = "cp1252") -> str:
    """Convert any GIS attribute value to a clean Unicode string."""
    if val is None:
        return ""
    if isinstance(val, bytes):
        # Decode stray bytes with the same encoding declared on the Reader
        return val.decode(encoding, errors="replace").strip()
    return str(val).strip()

with shapefile.Reader("data/municipalities.shp", encoding="cp1252") as sf:
    fields: list[str] = [f[0] for f in sf.fields[1:]]  # skip DeletionFlag
    table_data: list[list[str]] = [fields]  # header row first
    for i, record in enumerate(sf.iterRecords()):
        if i >= 5000:          # guard against unbounded memory growth
            break
        table_data.append([safe_str(v) for v in record])

sf.iterRecords() is a generator — it reads one row at a time without loading the full .dbf into memory, which matters for large municipal or cadastral datasets.
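Because the record source is a generator, the row cap can also be expressed with itertools.islice instead of an enumerate/break pair. The sketch below uses a stand-in generator, fake_records, in place of sf.iterRecords() so that it runs without a shapefile; both helper names are hypothetical.

```python
from itertools import islice

def capped_rows(records, max_rows):
    """Lazily yield at most max_rows items from any record iterable."""
    return islice(records, max_rows)

def fake_records(n):
    """Stand-in for sf.iterRecords(): yields n list-like rows on demand."""
    for i in range(n):
        yield [f"Town {i}", i * 10]

# Only the first three rows are ever produced by the generator.
rows = list(capped_rows(fake_records(10_000), 3))
print(rows)
```

islice stops pulling from the underlying generator once the cap is reached, so the remaining rows are never read.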

Step 4 — Construct the ReportLab Table flowable

Pass the complete list-of-lists to reportlab.platypus.Table. Set repeatRows=1 so the header row reprints on every page break. Column widths are estimated from header label length as a reasonable fallback; production pipelines should measure actual data widths or read from a config:

Python

from reportlab.platypus import Table, TableStyle
from reportlab.lib import colors

# Estimate column widths from header label length (units: points; 1 mm ≈ 2.83 pt)
col_widths: list[float] = [min(len(f) * 4 + 20, 120) for f in fields]

table = Table(table_data, colWidths=col_widths, repeatRows=1)

table.setStyle(TableStyle([
    # Header row styling
    ("BACKGROUND",    (0, 0), (-1,  0), colors.HexColor("#2C3E50")),
    ("TEXTCOLOR",     (0, 0), (-1,  0), colors.white),
    ("ALIGN",         (0, 0), (-1,  0), "CENTER"),
    ("FONTNAME",      (0, 0), (-1,  0), "Helvetica-Bold"),
    ("BOTTOMPADDING", (0, 0), (-1,  0), 6),
    ("TOPPADDING",    (0, 0), (-1,  0), 6),
    # All cells (header and body)
    ("FONTSIZE",      (0, 0), (-1, -1), 9),
    ("VALIGN",        (0, 0), (-1, -1), "MIDDLE"),
    ("GRID",          (0, 0), (-1, -1), 0.5, colors.grey),
    # Alternating row backgrounds
    ("ROWBACKGROUNDS", (0, 1), (-1, -1),
     [colors.white, colors.HexColor("#F1F3F5")]),
]))

The ROWBACKGROUNDS command replaces per-row BACKGROUND directives — it is significantly faster for datasets over a few hundred rows.
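Step 4 notes that production pipelines should measure actual data widths rather than header labels. A minimal sketch of that idea follows, assuming an average glyph width of 4.5 pt (a guess for 9 pt Helvetica, not a measured value) and hypothetical min/max bounds. For exact measurements, reportlab.pdfbase.pdfmetrics.stringWidth could replace the character count.

```python
def estimate_col_widths(table_data, char_pt=4.5, padding=12.0,
                        min_width=30.0, max_width=120.0):
    """Estimate per-column widths in points from the longest cell per column.

    char_pt is an assumed average glyph width, not a measured font metric.
    """
    n_cols = len(table_data[0])
    widths = []
    for col in range(n_cols):
        longest = max(len(str(row[col])) for row in table_data)
        raw_width = longest * char_pt + padding
        # Clamp so one very long value cannot blow out the page width.
        widths.append(max(min_width, min(raw_width, max_width)))
    return widths

sample = [
    ["NAME", "ZONE"],
    ["Saint-Germain-en-Laye", "R1"],
    ["Ulm", "C"],
]
print(estimate_col_widths(sample))
```

The result can be passed as colWidths in place of the header-based estimate; the clamping bounds should be tuned to the page size and column count.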

Step 5 — Assemble the document and write the PDF

Attach the table to a SimpleDocTemplate flow. ReportLab handles page breaks and margin constraints automatically:

Python

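import shapefile
from reportlab.lib import colors
from reportlab.lib.pagesizes import A4
from reportlab.lib.units import mm
from reportlab.lib.styles import getSampleStyleSheet
from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer, Table, TableStyle

# safe_str() is the helper defined in Step 3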

def build_shapefile_report(shp_path: str, output_pdf: str, max_rows: int = 5000) -> None:
    """Parse shapefile attributes and write a paginated PDF attribute table."""
    with shapefile.Reader(shp_path, encoding="cp1252") as sf:
        fields = [f[0] for f in sf.fields[1:]]
        table_data = [fields]
        for i, record in enumerate(sf.iterRecords()):
            if i >= max_rows:
                break
            table_data.append([safe_str(v) for v in record])

    col_widths = [min(len(f) * 4 + 20, 120) for f in fields]
    table = Table(table_data, colWidths=col_widths, repeatRows=1)
    table.setStyle(TableStyle([
        ("BACKGROUND",     (0, 0), (-1,  0), colors.HexColor("#2C3E50")),
        ("TEXTCOLOR",      (0, 0), (-1,  0), colors.white),
        ("ALIGN",          (0, 0), (-1,  0), "CENTER"),
        ("FONTNAME",       (0, 0), (-1,  0), "Helvetica-Bold"),
        ("FONTSIZE",       (0, 0), (-1, -1), 9),
        ("BOTTOMPADDING",  (0, 0), (-1,  0), 6),
        ("TOPPADDING",     (0, 0), (-1,  0), 6),
        ("VALIGN",         (0, 0), (-1, -1), "MIDDLE"),
        ("GRID",           (0, 0), (-1, -1), 0.5, colors.grey),
        ("ROWBACKGROUNDS", (0, 1), (-1, -1),
         [colors.white, colors.HexColor("#F1F3F5")]),
    ]))

    doc = SimpleDocTemplate(
        output_pdf,
        pagesize=A4,
        rightMargin=15 * mm,
        leftMargin=15 * mm,
        topMargin=15 * mm,
        bottomMargin=15 * mm,
    )
    styles = getSampleStyleSheet()
    doc.build([
        Paragraph("Shapefile Attribute Report", styles["Title"]),
        Spacer(1 * mm, 6 * mm),
        table,
    ])

# Usage (the output/ directory must already exist)
build_shapefile_report("data/municipalities.shp", "output/report.pdf")

Key Parameters / Configuration Reference

Parameter	Type	Default	Effect
`encoding`	`str`	`"utf-8"`	DBF character encoding. Use `"cp1252"` for Windows-origin files.
`max_rows`	`int`	`5000`	Hard cap on rows loaded into memory. Raise for completeness; lower for previews.
`repeatRows`	`int`	`0`	Number of leading rows reprinted on each new page. Set to `1` for a sticky header.
`colWidths`	`list[float]`	auto	Per-column width in points. `None` lets ReportLab auto-size, which is slower on wide tables.
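`splitByRow`	`bool`	`True`	Split the table between rows when it does not fit the remaining frame. Column-wise splitting is not implemented, so leave this at the default. Splitting inside a single row is controlled separately by `splitInRow` in recent ReportLab releases.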
`errors` (decode)	`str`	`"replace"`	Byte-decode error mode. `"replace"` inserts U+FFFD; `"ignore"` drops corrupted characters.

Common Pitfalls

UnicodeEncodeError during doc.build(): ReportLab’s PDF string encoder rejects raw bytes objects. This happens when safe_str() is not applied to every cell value, or when a field that pyshp typed as numeric is passed without str() conversion. Apply the helper to every value in every row without exception.
Silent field truncation at 10 characters: The DBF format caps field names at 10 characters (the 11th byte of the name slot is the terminator). pyshp returns the name as stored; if you control the source data, use descriptive abbreviations. When displaying headers in the table, consider a separate display-name lookup dict mapped from the short DBF field name.
Blank pages at the end of the PDF: This occurs when the last Spacer or Paragraph in the flowable list pushes past a page boundary with no content following it. Remove trailing Spacer elements or add a KeepTogether around the final group.
Table constructor hanging on wide datasets: If colWidths is None and the table has many columns, ReportLab measures every cell to determine widths. Pre-computing widths from the header (as shown above) or setting a fixed uniform width eliminates this bottleneck entirely.
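The display-name lookup suggested for truncated field names can be as small as a dict with a fallback. The mapping and the display_headers helper below are hypothetical; the field names reuse the example result from Step 2.

```python
# Hypothetical mapping from short DBF field names to readable headers.
DISPLAY_NAMES = {
    "AREA_HA": "Area (ha)",
    "ZONE_CODE": "Zone code",
    "POP_2021": "Population (2021)",
}

def display_headers(fields, labels=DISPLAY_NAMES):
    """Return header labels, falling back to the raw DBF name when unmapped."""
    return [labels.get(name, name) for name in fields]

fields = ["NAME", "AREA_HA", "ZONE_CODE", "POP_2021"]
print(display_headers(fields))
```

Swapping the header row for display_headers(fields) leaves the data rows untouched, so the short names remain available for any logic keyed on the DBF schema.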

Verification

Run this assertion block after writing the PDF to confirm the output is structurally valid before committing it to a delivery pipeline:

Python

import os

output = "output/report.pdf"
assert os.path.exists(output), "PDF was not created"
assert os.path.getsize(output) > 1024, "PDF is suspiciously small — likely empty"

# Optional: confirm page count using pypdf (pip install pypdf)
from pypdf import PdfReader
reader = PdfReader(output)
row_count = len(table_data) - 1  # subtract header
expected_min_pages = max(1, row_count // 60)  # rough estimate at ~60 rows/page
assert len(reader.pages) >= expected_min_pages, (
    f"Expected at least {expected_min_pages} page(s), got {len(reader.pages)}"
)
print(f"OK — {len(reader.pages)} page(s), {row_count} data rows")

For CI environments, add this check as a post-build step alongside the validation pattern from Table Pagination Strategies for Large Attribute Tables to catch regressions before artifacts are published.

FAQ

Why does pyshp return bytes instead of strings for some fields?

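Legacy .dbf files store text in cp1252 or latin1 without a byte-order mark, so the reader cannot detect the encoding on its own. Undecoded bytes typically surface with older pyshp releases or when a value bypasses the declared encoding; with its default strict error mode, pyshp 2.x otherwise raises UnicodeDecodeError on a mismatch. Always pass encoding="cp1252" explicitly, optionally with encodingErrors="replace" on the Reader, and keep errors="replace" in your sanitizer as a backstop.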

When should I use LongTable instead of Table?

Consider reportlab.platypus.LongTable for datasets exceeding roughly 5,000–10,000 rows. LongTable is a Table subclass that uses a greedy column-width algorithm intended for long tables where layout speed matters. It does not stream rows by itself, so peak memory still scales with the rows you pass in; pair it with sf.iterRecords() and a generator pipeline that emits rows in bounded chunks to keep the parse stage streaming.
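One possible shape for such a generator pipeline is a helper that batches rows into fixed-size chunks, each prefixed with the header so it can be handed to its own Table or LongTable. The chunk_rows name and the chunk size are assumptions for illustration, and the row source here is a plain generator standing in for sanitized shapefile records.

```python
from itertools import islice

def chunk_rows(header, rows, chunk_size):
    """Yield list-of-lists chunks, each starting with the header row."""
    iterator = iter(rows)
    while True:
        body = list(islice(iterator, chunk_size))
        if not body:
            return
        yield [header] + body

# Stand-in for sanitized records: five single-column rows.
row_source = ([str(i)] for i in range(5))
chunks = list(chunk_rows(["ID"], row_source, 2))
print([len(chunk) for chunk in chunks])
```

Only one chunk of parsed rows is held at a time while iterating, although the flowables built from each chunk still accumulate in the story passed to doc.build().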

Can I use geopandas instead of pyshp to feed ReportLab?

Yes, but geopandas loads full geometry objects into memory even when you only need .dbf fields. For pure attribute extraction pyshp is lighter. Use geopandas when CRS reprojection, spatial joins, or geometry validation must precede table generation; the Dynamic Map Data Embedding Workflows section covers pipelines where geometry and attributes are processed together.

Loop Mapping for Dynamic Attribute Tables — parent guide covering the full template-binding workflow of which this attribute-extraction step is stage one
Preventing Table Row Splits Across PDF Page Breaks — companion technique for controlling how ReportLab breaks large attribute tables across pages
Table Pagination Strategies for Large Attribute Tables — covers LongTable, chunked rendering, and memory budgeting for high-row-count datasets
Using Jinja2 if-else Blocks to Hide Empty GIS Layers — conditional rendering pattern to suppress table sections when attribute data is absent