Extracting Nested Tables from DOCX with python-docx

In legal document redaction and compliance automation, deterministic parsing of nested tables in DOCX files is a critical control surface. Contracts, discovery exhibits, and regulatory schedules routinely embed child tables within parent cells, creating hierarchical XML structures that standard extraction pipelines often flatten or truncate. For SRE teams, compliance officers, and document automation engineers, failing to resolve these nested boundaries compromises redaction accuracy, introduces privilege-log gaps, and triggers downstream pipeline failures. A production-grade extraction strategy must prioritize deterministic traversal, memory-bound recursion, and immutable audit logging over heuristic or layout-based parsing.

Root-Cause Analysis: The Flat-Iteration Failure Mode

The primary failure vector in nested table extraction traces directly to python-docx’s default iteration model. document.tables lists only the tables at the top level of the document body and does not follow the underlying OpenXML containment hierarchy, so a table nested inside a cell never appears in that list. When a legal document contains a table within a table, the ECMA-376 specification places the child <tbl> element inside a parent <tc> (table cell) node. Standard row-by-row iteration does not descend into that node: nested payloads are dropped during serialization, or, in pipelines that read all descendant text of a cell, bleed into the parent cell and its neighbors.

This structural misalignment is a common source of redaction errors in automated engines. When coordinate mapping drifts because child tables are unaccounted for, privileged text is either masked incorrectly or, more critically, left exposed. Secure remediation requires a recursive traversal pattern that explicitly inspects each <tc> for child <tbl> nodes before advancing to the next cell, ensuring strict parent-child isolation during payload extraction.
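
The traversal pattern can be sketched without python-docx at all. The example below is a minimal illustration, assuming the standard-library xml.etree.ElementTree parser and a hand-written WordprocessingML fragment; SAMPLE, cell_text, and walk_table are hypothetical names, not part of python-docx. It attaches each child table to the cell that contains it instead of folding it into the parent's text.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}

# Hypothetical fragment: one parent table whose only cell holds a child table.
SAMPLE = f"""
<w:body xmlns:w="{W}">
  <w:tbl>
    <w:tr>
      <w:tc>
        <w:p><w:r><w:t>Parent cell</w:t></w:r></w:p>
        <w:tbl>
          <w:tr><w:tc><w:p><w:r><w:t>Child cell</w:t></w:r></w:p></w:tc></w:tr>
        </w:tbl>
        <w:p/>
      </w:tc>
    </w:tr>
  </w:tbl>
</w:body>
"""


def cell_text(tc):
    """Join the text of paragraphs that are direct children of this cell."""
    paragraphs = []
    for p in tc.findall("w:p", NS):
        paragraphs.append("".join(t.text or "" for t in p.iter(f"{{{W}}}t")))
    return "\n".join(paragraphs).strip()


def walk_table(tbl, depth=0):
    """Return one node per table, with child tables attached to their cell."""
    cells = []
    for r_idx, tr in enumerate(tbl.findall("w:tr", NS)):
        for c_idx, tc in enumerate(tr.findall("w:tc", NS)):
            cells.append({
                "row": r_idx,
                "col": c_idx,
                "text": cell_text(tc),
                # Only direct <w:tbl> children belong to this cell.
                "children": [walk_table(child, depth + 1)
                             for child in tc.findall("w:tbl", NS)],
            })
    return {"depth": depth, "cells": cells}


body = ET.fromstring(SAMPLE)
top_level = body.findall("w:tbl", NS)   # nested tables are not direct children
tree = [walk_table(tbl) for tbl in top_level]
```

Because cell_text reads only the cell's own paragraphs, the child table's text stays in its own node and cannot leak into the parent record.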

Grid Reconstruction & Boundary Isolation

Legal documents frequently use merged cells (the gridSpan and vMerge elements), hidden formatting marks, and inline run-level styling that disrupt naive cell boundary detection. A robust extraction pipeline must parse the tblGrid and tcPr elements to reconstruct the logical grid before mapping text payloads. Integrating this approach into a broader python-docx Element Extraction workflow ensures that cell-level metadata, including merge states and nested depth, is preserved alongside raw text. Without explicit grid reconstruction, compliance officers cannot verify that redaction masks align precisely with the intended data boundaries, creating unacceptable liability exposure.

The reconstruction process follows three deterministic steps:

1. Column Width Resolution: Extract <gridCol w:w="..."/> values from <tblGrid> to establish baseline column boundaries.
2. Merge State Evaluation: Parse tcPr for <gridSpan w:val="n"/> (horizontal merge) and <vMerge w:val="restart|continue"/> (vertical merge) to map logical cell spans. A <vMerge/> with no w:val attribute counts as continue.
3. Child Table Isolation: Before serializing cell text, recursively scan the <tc> XML subtree for <tbl> descendants. If found, extract them into a separate hierarchical node rather than concatenating them into the parent cell’s string buffer.
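
The first two steps can be sketched as follows. This is a minimal illustration, assuming the standard-library xml.etree.ElementTree parser and a hand-written table fragment; grid_widths, merge_state, and logical_cells are hypothetical helper names. It also assumes rows carry no gridBefore or gridAfter offsets, which a production parser would have to handle.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
VAL = f"{{{W}}}val"
WIDTH = f"{{{W}}}w"

# Hypothetical 3-column table: row 0 has a 2-column span and starts a
# vertical merge in column 2; row 1 continues that vertical merge.
SAMPLE = f"""
<w:tbl xmlns:w="{W}">
  <w:tblGrid>
    <w:gridCol w:w="2000"/><w:gridCol w:w="3000"/><w:gridCol w:w="1000"/>
  </w:tblGrid>
  <w:tr>
    <w:tc><w:tcPr><w:gridSpan w:val="2"/></w:tcPr><w:p/></w:tc>
    <w:tc><w:tcPr><w:vMerge w:val="restart"/></w:tcPr><w:p/></w:tc>
  </w:tr>
  <w:tr>
    <w:tc><w:p/></w:tc>
    <w:tc><w:p/></w:tc>
    <w:tc><w:tcPr><w:vMerge/></w:tcPr><w:p/></w:tc>
  </w:tr>
</w:tbl>
"""


def grid_widths(tbl):
    """Step 1: baseline column widths from <w:tblGrid>."""
    return [int(c.get(WIDTH, "0")) for c in tbl.findall("w:tblGrid/w:gridCol", NS)]


def merge_state(tc):
    """Step 2: horizontal span and vertical merge state from <w:tcPr>."""
    span = tc.find("w:tcPr/w:gridSpan", NS)
    v_merge = tc.find("w:tcPr/w:vMerge", NS)
    return {
        "grid_span": int(span.get(VAL, "1")) if span is not None else 1,
        # A <w:vMerge/> without w:val means "continue".
        "v_merge": v_merge.get(VAL, "continue") if v_merge is not None else None,
    }


def logical_cells(tbl):
    """Map each physical <w:tc> to its starting grid column and total width."""
    widths = grid_widths(tbl)
    cells = []
    for r_idx, tr in enumerate(tbl.findall("w:tr", NS)):
        col = 0
        for tc in tr.findall("w:tc", NS):
            state = merge_state(tc)
            span = state["grid_span"]
            cells.append({"row": r_idx, "col": col,
                          "width": sum(widths[col:col + span]), **state})
            col += span  # the next cell starts after the spanned columns
    return cells


for cell in logical_cells(ET.fromstring(SAMPLE)):
    print(cell)
```

Advancing the column counter by the span, rather than by one, is what keeps the cell after a horizontal merge from landing in the wrong logical column.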

Memory-Bound Recursion & Async Pipeline Integration

Unbounded recursion over deeply nested tables can exhaust worker memory in async batch processing pipelines, while coordinate drift between the DOCX logical layout and downstream OCR layers introduces spatial misalignment. Mitigating memory exhaustion requires bounding the recursion depth, triggering explicit garbage collection between passes, and enforcing strict payload size limits per extraction pass. Coordinate drift is a separate problem that has to be handled at the format boundary.
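
One way to apply both bounds is sketched below. It is a minimal illustration under stated assumptions: it uses the standard-library xml.etree.ElementTree parser, and it swaps recursion for an explicit stack so that nesting depth never touches the interpreter's call stack. bounded_cell_texts, build_nested, and PayloadLimitExceeded are hypothetical names introduced for this example.

```python
import xml.etree.ElementTree as ET

W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W}
T = f"{{{W}}}t"


class PayloadLimitExceeded(Exception):
    """Raised when extracted text exceeds the configured byte ceiling."""


def build_nested(levels):
    """Test fixture: a one-cell table nested `levels` deep below the root."""
    inner = ""
    for level in range(levels, -1, -1):
        inner = (f"<w:tbl><w:tr><w:tc><w:p><w:r><w:t>level {level}</w:t></w:r></w:p>"
                 f"{inner}</w:tc></w:tr></w:tbl>")
    # Declare the namespace on the outermost table only.
    return ET.fromstring(inner.replace("<w:tbl>", f'<w:tbl xmlns:w="{W}">', 1))


def bounded_cell_texts(tbl, max_depth=10, max_payload_bytes=500_000):
    """Collect cell text with a depth cap and a running byte budget."""
    records, truncated, used = [], 0, 0
    stack = [(tbl, 0)]  # explicit stack instead of recursive calls
    while stack:
        table, depth = stack.pop()
        if depth > max_depth:
            truncated += 1  # count skipped tables so the gap can be reported
            continue
        for tc in table.findall("w:tr/w:tc", NS):
            text = "".join(t.text or ""
                           for p in tc.findall("w:p", NS)
                           for t in p.iter(T))
            used += len(text.encode("utf-8"))
            if used > max_payload_bytes:
                raise PayloadLimitExceeded(f"{used} bytes > {max_payload_bytes}")
            records.append({"depth": depth, "text": text})
            stack.extend((child, depth + 1) for child in tc.findall("w:tbl", NS))
    return records, truncated


records, truncated = bounded_cell_texts(build_nested(3), max_depth=1)
```

Returning the count of truncated tables alongside the records lets the caller log that a depth limit was hit instead of silently dropping content, which matters when the output feeds a redaction decision.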

When integrating with PDF and DOCX Parsing & Extraction Workflows, coordinate mapping must remain consistent across file formats. DOCX logical coordinates do not translate directly to PDF absolute positioning. To prevent spatial misalignment in hybrid redaction pipelines, the extraction engine must emit both logical grid coordinates and normalized bounding boxes, allowing downstream OCR layers to apply masks with sub-millimeter tolerance.
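
The horizontal half of that mapping can be sketched from the grid alone. The example below is a minimal illustration, assuming gridCol widths are given in twentieths of a point and that extents are normalized against the table's own width; normalized_column_extents, cell_extent, and twips_to_points are hypothetical helpers. Vertical extents are not derivable this way, because row heights depend on how the document is rendered.

```python
TWIPS_PER_POINT = 20  # table grid widths are expressed in twentieths of a point


def twips_to_points(twips):
    """Convert a width in twentieths of a point to points."""
    return twips / TWIPS_PER_POINT


def normalized_column_extents(grid_widths_twips):
    """Return (left, right) per grid column as fractions of the table width."""
    total = sum(grid_widths_twips)
    if total <= 0:
        return []
    extents = []
    left = 0
    for width in grid_widths_twips:
        right = left + width
        extents.append((left / total, right / total))
        left = right
    return extents


def cell_extent(extents, col, grid_span=1):
    """Horizontal extent of a cell starting at `col` and spanning `grid_span` columns."""
    return (extents[col][0], extents[col + grid_span - 1][1])


extents = normalized_column_extents([2000, 3000, 5000])
merged = cell_extent(extents, col=0, grid_span=2)
```

A downstream layer can scale these fractions by the rendered table width on the PDF page, while the logical (row, col) pair remains the key that ties the mask back to the extracted cell.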

Production Implementation

The following implementation demonstrates a memory-safe, audit-logged recursive extractor. It leverages python-docx’s underlying lxml interface for direct XML traversal, enforces a configurable recursion depth, and reconstructs the logical grid before payload serialization.

import logging
from typing import Dict, List, Optional, Tuple
from docx import Document
from docx.oxml.ns import qn
from docx.oxml.table import CT_Tbl, CT_Tc
import hashlib
import gc

logger = logging.getLogger(__name__)

class NestedTableExtractor:
    def __init__(self, max_depth: int = 10, max_payload_bytes: int = 500_000):
        self.max_depth = max_depth
        self.max_payload_bytes = max_payload_bytes
        self.audit_log: List[Dict] = []

    def extract_document(self, docx_path: str) -> List[Dict]:
        doc = Document(docx_path)
        results = []
        for table in doc.tables:
            results.extend(self._extract_table(table, depth=0, parent_coords=None))
        self._flush_audit()
        gc.collect()
        return results

    def _extract_table(self, table, depth: int, parent_coords: Optional[Tuple[int, int]]) -> List[Dict]:
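        if depth > self.max_depth:
            logger.warning("Recursion depth limit (%d) exceeded. Truncating nested table.", self.max_depth)
            # Record the truncation so the gap is visible in the audit trail.
            self.audit_log.append({
                "event": "depth_limit_exceeded",
                "depth": depth,
                "parent_coords": parent_coords
            })
            return []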

        table_xml = table._tbl
        grid_cols = self._resolve_grid_widths(table_xml)
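        rows_data = []
        seen_cells = set()  # <w:tc> elements already emitted for this table
        payload_bytes = 0   # running text size, checked against max_payload_bytes

        for r_idx, row in enumerate(table.rows):
            row_data = []
            # row.cells repeats a merged cell once per grid column it spans (and
            # on each vertically continued row), so c_idx is a grid column index.
            for c_idx, cell in enumerate(row.cells):
                cell_xml = cell._tc
                if cell_xml in seen_cells:
                    continue  # emit each physical cell once, nested tables included
                seen_cells.add(cell_xml)
                merge_info = self._parse_merge_state(cell_xml)
                # Concatenate text from all paragraphs in the cell
                cell_text = "\n".join(para.text for para in cell.paragraphs)

                # Fail closed if this table's text exceeds the payload ceiling.
                payload_bytes += len(cell_text.encode("utf-8"))
                if payload_bytes > self.max_payload_bytes:
                    raise ValueError(
                        f"Payload limit ({self.max_payload_bytes} bytes) exceeded at depth {depth}."
                    )

                # _Cell.tables returns properly-parented Table proxies for
                # each child <w:tbl> element inside this cell.
                nested_payloads = []
                for child_table in cell.tables:
                    nested_payloads.extend(
                        self._extract_table(child_table, depth + 1, (r_idx, c_idx))
                    )

                # A spanned cell covers gridSpan consecutive grid columns.
                span = int(merge_info["gridSpan"] or 1)
                cell_record = {
                    "row": r_idx,
                    "col": c_idx,
                    "text": cell_text.strip(),
                    "merge": merge_info,
                    "grid_width": sum(grid_cols[c_idx:c_idx + span]) if c_idx < len(grid_cols) else None,
                    "depth": depth,
                    "nested_tables": nested_payloads
                }
                row_data.append(cell_record)
            rows_data.append(row_data)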

        table_hash = hashlib.sha256(str(rows_data).encode()).hexdigest()
        self.audit_log.append({
            "event": "table_extracted",
            "table_hash": table_hash,
            "depth": depth,
            "rows": len(rows_data),
            "cols": len(grid_cols)
        })
        return rows_data

    def _resolve_grid_widths(self, table_xml: CT_Tbl) -> List[int]:
        grid = table_xml.find(qn('w:tblGrid'))
        if grid is None:
            return []
        return [int(col.get(qn('w:w'), 0)) for col in grid.findall(qn('w:gridCol'))]

    def _parse_merge_state(self, cell_xml: CT_Tc) -> Dict[str, Optional[str]]:
        tc_pr = cell_xml.find(qn('w:tcPr'))
        if tc_pr is None:
            return {"gridSpan": None, "vMerge": None}

        grid_span = tc_pr.find(qn('w:gridSpan'))
        v_merge = tc_pr.find(qn('w:vMerge'))

        return {
            "gridSpan": grid_span.get(qn('w:val')) if grid_span is not None else None,
            # A bare <w:vMerge/> with no w:val attribute means "continue".
            "vMerge": v_merge.get(qn('w:val'), "continue") if v_merge is not None else None
        }

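    def _flush_audit(self):
        # Emit every event before clearing so no audit record is silently discarded.
        for event in self.audit_log:
            logger.info("audit_event %s", event)
        logger.info("Audit log flushed. %d extraction events recorded.", len(self.audit_log))
        self.audit_log.clear()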

Compliance Validation & Audit Logging

In regulated environments, extraction is only half the control framework. The remaining half is verifiable auditability. Every extraction pass must emit cryptographic hashes of the parsed structure, enabling compliance officers to validate that redaction masks applied in downstream systems correspond exactly to the extracted logical grid. Immutable audit logs should capture recursion depth, merge states, and payload sizes to satisfy internal review requirements and external regulatory examinations.
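
A hash is only useful for validation if the same structure always serializes to the same bytes. The sketch below is a minimal illustration, assuming canonical JSON (sorted keys, fixed separators) as the serialization and SHA-256 as the digest. Chaining each record to the previous record's hash is an added assumption about how tamper evidence might be achieved, not something the extractor above does. canonical_bytes, structure_hash, and audit_record are hypothetical names.

```python
import hashlib
import json


def canonical_bytes(structure):
    """Serialize with sorted keys and fixed separators so output is stable."""
    return json.dumps(structure, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")


def structure_hash(structure):
    """SHA-256 hex digest of the canonical serialization."""
    return hashlib.sha256(canonical_bytes(structure)).hexdigest()


def audit_record(rows, depth, previous_hash=""):
    """Build one audit entry that also commits to the previous entry's hash."""
    record = {
        "event": "table_extracted",
        "depth": depth,
        "rows": len(rows),
        "payload_bytes": len(canonical_bytes(rows)),
        "table_hash": structure_hash(rows),
        "previous_hash": previous_hash,
    }
    # Hash the record itself so later edits to any field are detectable.
    record["record_hash"] = structure_hash(record)
    return record


# Hypothetical extracted structure: one row containing one cell.
rows = [[{"row": 0, "col": 0, "text": "Exhibit A",
          "merge": {"gridSpan": None, "vMerge": None}}]]
first = audit_record(rows, depth=0)
second = audit_record(rows, depth=1, previous_hash=first["record_hash"])
```

A reviewer can recompute table_hash from the extracted grid and compare it with the value stored downstream; a mismatch means the structure that was masked is not the structure that was extracted.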

When deploying this extractor in production, enforce strict schema validation against the Office Open XML File Formats specification to reject malformed or intentionally obfuscated documents. Pair the extraction pipeline with automated coordinate verification against rendered PDF outputs, ensuring that DOCX logical boundaries translate accurately to physical redaction zones. For teams scaling across high-volume discovery workflows, consult the official python-docx documentation for XML namespace handling and memory optimization patterns.

Deterministic nested table extraction is not a parsing convenience; it is a compliance prerequisite. By enforcing recursive boundary isolation, reconstructing logical grids, and bounding memory consumption, legal technology teams can eliminate coordinate drift, close privilege-log gaps, and maintain audit-ready extraction pipelines at scale.

Frequently asked questions

Why does iterating document.tables flatten nested tables?

document.tables lists only the top-level tables in the document body and does not follow the OpenXML containment hierarchy, so a child <tbl> living inside a parent <tc> (table cell) is either skipped or, in pipelines that read all descendant text, bleeds into the parent cell. The fix is to recurse explicitly: for each <tc>, scan the XML subtree for descendant <tbl> nodes and extract them into their own hierarchical node before serializing the parent cell’s text.

How do I reconstruct merged cells from gridSpan and vMerge?

Read the layout from tcPr before mapping any text. A <gridSpan w:val="n"/> marks a horizontal merge spanning n logical columns, and <vMerge w:val="restart"/> begins a vertical merge that subsequent <vMerge w:val="continue"/> cells extend (a <vMerge/> with no w:val attribute also means continue). Resolve baseline column widths from the <gridCol> entries in <tblGrid> first, then apply the span and merge states so each payload lands in the correct logical coordinate rather than drifting into an adjacent column.

How can I stop deep recursion from exhausting worker memory?

Bound it on three axes: cap the recursion depth with a configurable limit, enforce a per-pass payload size ceiling, and trigger explicit gc.collect() between extraction passes so intermediate lxml trees are reclaimed promptly. In async batch workers this keeps a single pathologically nested exhibit from blowing the heap and starving the other concurrent extraction tasks.

Why emit both logical grid coordinates and normalized bounding boxes from lxml?

DOCX logical layout does not translate directly to PDF absolute positioning, so a redaction mask derived from grid coordinates alone drifts spatially once the document is rendered. Emitting the logical grid position and a normalized bounding box lets downstream OCR and PDF layers align masks with sub-millimeter tolerance, which is what keeps nested-table redactions provably correct across the format boundary.
