Handling Scanned PDFs with OCR

Scanned legal documents arrive with no native text layer, so the redaction engine downstream has nothing to target: there are no glyph coordinates to black out, no selectable spans to detect, only pixels. Optical character recognition (OCR) reconstructs a searchable, coordinate-aligned text layer over the raster image, and in a regulated pipeline that reconstruction must be deterministic, audit-traceable, and confidence-gated before a single byte is released. This guide specifies how to rasterize, preprocess, OCR, and validate image-based PDFs so they enter the redaction stage with the same defensibility as native files.

[Figure: Scanned-PDF OCR redaction pipeline]

This work sits inside the broader PDF and DOCX Parsing & Extraction Workflows extraction layer. Native PDFs and DOCX payloads expose their own glyph positions; scanned PDFs do not, so OCR becomes the only source of the bounding boxes that redaction, detection, and audit all depend on. An engineer reaches this sub-problem the moment a deposition exhibit, a faxed court filing, or a microfilm-era contract enters the ingestion queue as a flat image — at which point trusting the OCR output blindly is the difference between a clean production and an unredacted-PII breach.

Where OCR sits in the pipeline

OCR is not a single call; it is a deterministic sequence of stages, each of which must be reproducible from the archived source bytes. The recommended path rasterizes the page with PyMuPDF (fitz), preprocesses the raster, runs Tesseract for recognition, maps pixel boxes back into PDF point space, and gates every word on a confidence threshold before the searchable layer is written.

1. Rasterize: Render each page to a fixed-DPI pixmap with PyMuPDF for a deterministic pixel grid.
2. Preprocess: Otsu binarization, deskew, and denoise to stabilize recognition on low-contrast scans.
3. Recognize: Tesseract LSTM engine with pinned --psm/--oem flags emits word boxes and per-token confidence.
4. Map & gate: Convert pixel boxes to PDF points; discard low-confidence tokens, flag low-confidence pages.
5. Overlay & sign: Inject an invisible text layer, then hash source and output into the audit manifest.

Five deterministic stages turn a flat raster into a searchable, redaction-ready PDF with a verifiable confidence record.

Configuration reference

OCR defaults are tuned for clean office scans, not for mixed typefaces, court stamps, low-contrast signatures, and handwritten margin notes. Pin every parameter explicitly so output is reproducible across batches and across Tesseract versions.

Parameter	Flag / field	Recommended value	Type	Notes
Render resolution	`dpi`	`300` (up to `400` for fine print)	int	Below 300 DPI, glyph edges blur and confidence collapses.
Page segmentation	`--psm`	`6` (uniform block); `3` (auto layout); `11` (sparse text; `12` adds orientation detection)	enum	Pin per document class; never leave the flag unset for batches, since the engine default is `3`.
Engine mode	`--oem`	`1` (LSTM)	enum	Use `0` only for legacy typewriter or dot-matrix exhibits.
Language packs	`-l`	`eng` (`eng+fra+deu` for multilingual sets)	str	Load jurisdiction-specific packs to cut citation errors.
Custom dictionary	`--user-words`	path to legal glossary	path	Reduces hallucination on case citations and statute numbers.
Word-confidence floor	`MIN_WORD_CONF`	`75`	int	Tokens below this are discarded from the text layer.
Page-review floor	`PAGE_REVIEW_FLOOR`	`60`	int	Pages with a mean below this route to human review.
Text render mode	`render_mode`	`3` (invisible)	int	Selectable, non-printing OCR layer aligned to the raster.
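
One way to keep these values pinned is to build the Tesseract config string from an explicit per-class map instead of typing it by hand. The sketch below is illustrative only: the document-class names and the `build_tess_config` helper are assumptions, not part of any library, while the flags themselves are the ones listed in the table.

```python
from typing import Optional, Sequence

# Hypothetical document classes mapped to a pinned --psm value.
PSM_BY_CLASS = {
    "uniform_block": 6,   # single uniform block of text
    "auto_layout": 3,     # fully automatic page segmentation
}


def build_tess_config(
    doc_class: str,
    langs: Sequence[str] = ("eng",),
    oem: int = 1,
    user_words: Optional[str] = None,
) -> str:
    """Return a fully pinned Tesseract config string for one document class."""
    psm = PSM_BY_CLASS[doc_class]          # KeyError on an unknown class is intentional
    parts = [f"--oem {oem}", f"--psm {psm}", "-l " + "+".join(langs)]
    if user_words:
        parts.append(f"--user-words {user_words}")
    return " ".join(parts)


if __name__ == "__main__":
    print(build_tess_config("uniform_block"))
    print(build_tess_config("auto_layout", langs=("eng", "fra", "deu")))
```

Failing loudly on an unknown class is a deliberate choice here: a batch with an unclassified document should stop rather than fall back to an unpinned default.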

The pixel-to-point conversion is the load-bearing piece of the whole stage. A page rendered at dpi produces a pixmap scaled by zoom = dpi / 72, so every pixel coordinate maps back to page point space by the inverse factor scale = 72 / dpi. Apply that single factor to left, top, width, and height and the OCR box lands where the glyph sits on the original scan, provided the page carries no /Rotate value and the recognized raster has not been rotated by deskewing.
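
As a worked example of that conversion, the sketch below maps a Tesseract-style pixel box to a box in points. The `Box` type and the `pixel_box_to_points` name are illustrative assumptions; the arithmetic is the 72 / dpi factor described above and it assumes an unrotated page.

```python
from typing import NamedTuple


class Box(NamedTuple):
    x0: float
    y0: float
    x1: float
    y1: float


def pixel_box_to_points(left: int, top: int, width: int, height: int,
                        dpi: int = 300) -> Box:
    """Scale a pixel box (top-left origin) into points (top-left origin)."""
    def to_pt(value: float) -> float:
        # Multiply before dividing so whole-number results stay exact.
        return value * 72.0 / dpi

    return Box(to_pt(left), to_pt(top), to_pt(left + width), to_pt(top + height))


if __name__ == "__main__":
    # A word found 300 px from the left and 600 px from the top at 300 DPI
    # sits 1 inch (72 pt) from the left and 2 inches (144 pt) from the top.
    print(pixel_box_to_points(300, 600, 150, 50, dpi=300))
    # Box(x0=72.0, y0=144.0, x1=108.0, y1=156.0)
```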

Preprocessing

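Recognition quality is decided before Tesseract runs. Binarize to separate ink from a yellowed or shadowed background, deskew so rows are axis-aligned, and denoise to kill scanner speckle. Keep the transform deterministic: the same input must always yield the same preprocessed raster, or the audit trail cannot be reproduced. Record the deskew angle as well, because boxes recognized on a rotated raster live in the rotated frame and must be mapped back through the inverse rotation before the pixel-to-point factor applies.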

import cv2
import numpy as np


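def preprocess(image: np.ndarray) -> np.ndarray:
    """Deterministic denoise + binarize + deskew for legal scans (OpenCV BGR input)."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

    # A 3x3 median filter removes isolated scanner speckle deterministically.
    gray = cv2.medianBlur(gray, 3)

    # Otsu picks the threshold automatically, isolating text from a
    # low-contrast or discolored scan background.
    _, binary = cv2.threshold(
        gray, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU
    )

    # Estimate skew from the minimum-area rectangle of foreground (ink) pixels.
    coords = np.column_stack(np.where(binary < 255)).astype(np.float32)
    if coords.shape[0] == 0:              # blank page: nothing to deskew
        return binary
    raw_angle = cv2.minAreaRect(coords)[-1]
    # minAreaRect reports the angle in [-90, 0) on older OpenCV builds and in
    # (0, 90] on newer ones; folding it modulo 90 handles both conventions.
    folded = raw_angle % 90.0
    angle = -folded if folded < 45 else 90.0 - folded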

    h, w = binary.shape
    rotation = cv2.getRotationMatrix2D((w // 2, h // 2), angle, 1.0)
    deskewed = cv2.warpAffine(
        binary, rotation, (w, h),
        flags=cv2.INTER_CUBIC,
        borderMode=cv2.BORDER_REPLICATE,
    )
    return deskewed
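
If the deskewed raster is the one handed to Tesseract, the word boxes come back in the rotated frame. The sketch below shows one way to map them back to the original raster before scaling to points. It assumes the rotation was applied with the matrix documented for cv2.getRotationMatrix2D(center, angle, 1.0), as in preprocess above; the helper names are hypothetical, and the code is pure Python so the geometry can be checked without OpenCV.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def rotate_about(point: Point, center: Point, angle_deg: float) -> Point:
    """Apply the affine map documented for getRotationMatrix2D(center, angle_deg, 1.0)."""
    a = math.cos(math.radians(angle_deg))
    b = math.sin(math.radians(angle_deg))
    x, y = point
    cx, cy = center
    return (
        a * x + b * y + (1 - a) * cx - b * cy,
        -b * x + a * y + b * cx + (1 - a) * cy,
    )


def deskewed_box_to_original(left: float, top: float, width: float, height: float,
                             center: Point, angle_deg: float) -> List[Point]:
    """Map the four corners of a box found on the deskewed raster back onto
    the original raster by applying the inverse rotation (-angle_deg)."""
    corners = [
        (left, top),
        (left + width, top),
        (left + width, top + height),
        (left, top + height),
    ]
    return [rotate_about(corner, center, -angle_deg) for corner in corners]


if __name__ == "__main__":
    quad = deskewed_box_to_original(100, 200, 80, 20, center=(1240, 1754), angle_deg=1.5)
    for corner in quad:
        print(tuple(round(v, 2) for v in corner))
```

The result is a quadrilateral rather than an axis-aligned rectangle. A redaction stage would typically take its enclosing rectangle so the whole glyph stays covered.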

Verified implementation

The core routine renders a page, runs Tesseract with image_to_data to capture word-level boxes and confidence, converts each box into PDF point space, and injects an invisible text layer. Every word below the confidence floor is dropped from the layer rather than silently trusted.

import io

import fitz  # PyMuPDF >= 1.23
import pytesseract
from PIL import Image
from pytesseract import Output

DPI = 300
MIN_WORD_CONF = 75               # discard tokens below this confidence
TESS_CONFIG = "--oem 1 --psm 6 -l eng"


def ocr_overlay_page(page: "fitz.Page", dpi: int = DPI) -> dict:
    """Render one page, OCR it, and inject a coordinate-aligned invisible
    text layer. Returns per-word metrics for the audit manifest."""
    zoom = dpi / 72.0
    scale = 72.0 / dpi           # pixel -> PDF point conversion factor
    pix = page.get_pixmap(matrix=fitz.Matrix(zoom, zoom), alpha=False)
    image = Image.open(io.BytesIO(pix.tobytes("png")))

    data = pytesseract.image_to_data(
        image, config=TESS_CONFIG, output_type=Output.DICT
    )

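    placed, dropped, confidences = 0, 0, []
    for i in range(len(data["text"])):
        text = data["text"][i].strip()
        conf = float(data["conf"][i])
        if not text or conf < 0:          # skip empty cells and -1 sentinels
            continue
        # Record every recognized word, kept or dropped. Averaging only the
        # kept words would pin the page mean at or above MIN_WORD_CONF, so the
        # page-review floor could never trigger on a poorly recognized page.
        confidences.append(conf)
        if conf < MIN_WORD_CONF:
            dropped += 1
            continue

        # Pixel box -> page box; PyMuPDF page space has a top-left origin,
        # like the raster.
        x0 = data["left"][i] * scale
        y0 = data["top"][i] * scale
        box_h = data["height"][i] * scale
        # Approximation: the bottom of the word box stands in for the baseline
        # and the box height for the font size.
        baseline = fitz.Point(x0, y0 + box_h)

        page.insert_text(
            baseline,
            text,
            fontsize=box_h,
            render_mode=3,                # 3 = invisible (selectable, not drawn)
            color=(0, 0, 0),
        )
        placed += 1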

    mean_conf = round(sum(confidences) / len(confidences), 2) if confidences else 0.0
    return {
        "page": page.number + 1,
        "words_placed": placed,
        "words_dropped": dropped,
        "mean_confidence": mean_conf,
    }

The driver walks the document, records a SHA-256 of the source bytes before processing and of the output after, and flags any page whose mean confidence falls under the review floor. The manifest it returns is the chain-of-custody record for the OCR stage.

import hashlib
from pathlib import Path

PAGE_REVIEW_FLOOR = 60           # mean page confidence below this => human review


def build_searchable_pdf(src: Path, dst: Path) -> dict:
    raw = src.read_bytes()
    manifest = {
        "source_sha256": hashlib.sha256(raw).hexdigest(),
        "engine": str(pytesseract.get_tesseract_version()),
        "dpi": DPI,
        "tess_config": TESS_CONFIG,
        "pages": [],
        "flagged_pages": [],
    }

    doc = fitz.open(stream=raw, filetype="pdf")
    for page in doc:
        metrics = ocr_overlay_page(page)
        manifest["pages"].append(metrics)
        if metrics["mean_confidence"] < PAGE_REVIEW_FLOOR:
            manifest["flagged_pages"].append(metrics["page"])

    doc.save(str(dst), garbage=4, deflate=True)
    doc.close()

    manifest["output_sha256"] = hashlib.sha256(dst.read_bytes()).hexdigest()
    return manifest

Word-level confidence thresholds here mirror the gating logic detailed in Confidence Threshold Configuration for the detection layer — the OCR floor and the entity floor are separate gates, and a token must clear both before it reaches the released document. Once the searchable layer exists, named-entity detection with spaCy NER for PII Detection can run against the reconstructed text exactly as it would against a native parse.

Compliance alignment

The manifest emitted above is what makes an OCR run defensible. Logging the source hash, engine version, DPI, segmentation flags, and per-page confidence supports the audit-logging expectations set out in NIST SP 800-88 Compliance Mapping: before content is overwritten under a “Clear” action, the pipeline needs evidence that the logical content was correctly identified, and a recorded confidence distribution is that evidence. The dual source/output SHA-256 pair supports the AU-10 non-repudiation control, which is defined in NIST SP 800-53 rather than SP 800-88. An auditor can re-run OCR months later against the archived source, with the same pinned engine version and flags, and compare the resulting text layer, boxes, and hashes against the manifest.
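
The simplest audit check is to confirm that the archived files still match the hashes in the manifest. The sketch below assumes the manifest keys written by build_searchable_pdf (source_sha256 and output_sha256); the sha256_file and verify_manifest helpers are hypothetical names introduced for illustration.

```python
import hashlib
from pathlib import Path


def sha256_file(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large scans are never loaded whole."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(manifest: dict, src: Path, dst: Path) -> dict:
    """Compare the archived source and output files against the manifest."""
    return {
        "source_ok": sha256_file(src) == manifest.get("source_sha256"),
        "output_ok": sha256_file(dst) == manifest.get("output_sha256"),
    }
```

A False in either field means the archived bytes no longer match what the OCR stage recorded, which should halt any downstream release until the discrepancy is explained.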

Because redaction targets the coordinates OCR produced, the bounding boxes must map 1:1 to the original raster, and the underlying image data must be removed at the content-stream level rather than merely covered — the same irreversible-erasure requirement that governs every stage of the document lifecycle security boundaries. A black rectangle drawn over a scanned signature is not redaction if the source pixels survive underneath.

Performance and scale

At 300 DPI on a 4 vCPU worker, the stage is OCR-bound, not I/O-bound: PyMuPDF rasterization runs at roughly 30–60 pages/second, while Tesseract LSTM recognition dominates at 0.6–1.5 seconds per page depending on text density and --psm mode. A single worker therefore sustains around 2,500–5,000 pages/hour; a four-worker pool clears a 10,000-page discovery set in roughly 45–90 minutes. Memory is the constraint to watch — a 300 DPI A4 pixmap is ~25 MB decoded, and long-running Tesseract subprocesses leak Leptonica buffers, so cap each worker’s lifetime and recycle it. For high-volume ingestion, fan the per-page work out through the async batch processing pipelines so rasterization and recognition run as isolated, retryable tasks rather than a single blocking loop.
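
The memory figure can be sanity-checked with simple arithmetic, and worker recycling can be approximated in Python with the maxtasksperchild option of multiprocessing.Pool. This is a sketch under stated assumptions: ocr_job is a hypothetical stand-in for the real per-page task, and the limit of 200 tasks per worker is an arbitrary illustrative value, not a measured recommendation.

```python
import multiprocessing as mp


def estimate_pixmap_bytes(width_pt: float, height_pt: float,
                          dpi: int, channels: int = 3) -> int:
    """Approximate decoded size of an 8-bit pixmap for a page sized in points."""
    width_px = round(width_pt * dpi / 72.0)
    height_px = round(height_pt * dpi / 72.0)
    return width_px * height_px * channels


def ocr_job(job: tuple) -> int:
    """Hypothetical stand-in for the real per-page OCR task."""
    _path, page_index = job
    return page_index


def run_recycled_pool(jobs: list, workers: int = 4, max_tasks: int = 200) -> list:
    """Run jobs in a pool whose worker processes are replaced after
    max_tasks tasks each, which bounds any per-process memory growth."""
    with mp.Pool(processes=workers, maxtasksperchild=max_tasks) as pool:
        return sorted(pool.imap_unordered(ocr_job, jobs))


if __name__ == "__main__":
    size = estimate_pixmap_bytes(595, 842, 300)   # A4 in points, RGB, no alpha
    print(f"{size / 2**20:.1f} MiB")
    print(run_recycled_pool([("exhibit.pdf", i) for i in range(8)]))
```

For an A4 page at 300 DPI the estimate comes out just under 25 MiB, in line with the figure quoted above. At 400 DPI the same page needs roughly 1.8 times as much.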

Engine and resolution choices move these numbers materially. Dropping from 400 to 300 DPI nearly halves recognition time at a small accuracy cost on fine print; --psm 6 is faster than --psm 3 when you already know the page is a single text block. Benchmark the rasterization path itself against the figures in pdfplumber vs PyMuPDF Performance before committing to a renderer, since the raster source feeds every downstream metric here.

Edge cases and failure handling

OCR fails quietly far more often than it fails loudly, which is exactly why each failure mode needs an explicit, auditable outcome rather than a best-effort guess.

Mean page confidence below the review floor. The driver flags the page number in flagged_pages and the document must halt before automated redaction. Route it through Automated Fallback Routing for High-Risk Files so a reviewer adjudicates rather than the pipeline shipping an under-confident page. The human verdict syncs back through Human-in-the-Loop Override Sync.
Coordinate drift from rotation or crop metadata. If the source page carries a non-zero /Rotate value or a cropbox offset, pixel boxes map to the wrong PDF region and redaction misses. Normalize rotation with page.set_rotation(0) before rasterizing, or apply the page’s rotation matrix to the mapped coordinates so the box lands on the visible glyph.
Corrupted or non-image raster. A page that is neither extractable text nor a decodable image (truncated JPEG, zero-byte XObject) raises inside get_pixmap. Catch it, write a failure entry to the manifest with the source hash, and send the file to a secure fallback queue — see setting up secure fallback queues for failed redactions — rather than emitting a partial document.
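Redaction-bypass risk from overlay-only removal. Even with perfect boxes, drawing opaque shapes leaves the source raster intact. Validation must confirm that the underlying image data was actually altered at save time, not merely masked, before the file leaves the stage. In PyMuPDF that means calling apply_redactions with an images option that either blanks the covered pixels (PDF_REDACT_IMAGE_PIXELS) or removes the overlapped image (PDF_REDACT_IMAGE_REMOVE); on a full-page scan, removal discards the entire page image.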
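
The outcomes above can be collapsed into one explicit routing decision so that no document reaches automated redaction by default. The sketch below is a minimal illustration: the flagged_pages key comes from the manifest built earlier, while the failures key and the three route names are assumptions introduced here, not fields the driver currently writes.

```python
def route_document(manifest: dict) -> str:
    """Pick the next stage for a document from its OCR manifest.

    Order matters: a hard failure outranks a low-confidence page, and
    automated redaction is reached only when neither is present.
    """
    if manifest.get("failures"):          # hypothetical list of failure entries
        return "fallback_queue"
    if manifest.get("flagged_pages"):     # pages below the review floor
        return "human_review"
    return "automated_redaction"


if __name__ == "__main__":
    print(route_document({"flagged_pages": [], "failures": []}))
    print(route_document({"flagged_pages": [3, 7]}))
    print(route_document({"failures": [{"page": 2, "error": "decode"}]}))
```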

Related guides

For engine-level configuration (DPI standardization, --psm/--oem matrices, hOCR parsing, and circuit breakers), work through Integrating Tesseract OCR into Redaction Pipelines, the detailed companion to this overview. To validate the rasterization path before you commit to it, compare engines in pdfplumber vs PyMuPDF Performance, and for the hybrid sets that mix scans with native Word files, align coordinate handling with python-docx Element Extraction. When OCR confidence is too low to trust, Automated Fallback Routing for High-Risk Files defines where the document goes next.

Frequently asked questions

Why generate an invisible text layer instead of replacing the scanned image?

The original raster is the document of record — replacing it would destroy evidentiary fidelity and break the chain of custody. An invisible text layer (render mode 3) keeps the scanned pixels visible while making the recognized words selectable and addressable. Detection and redaction operate on that layer’s coordinates, but the image an auditor sees is still the untouched source until redaction explicitly strips bytes from it.

How do pixel coordinates from Tesseract map to PDF redaction boxes?

Render the page at a fixed DPI, which scales the pixmap by zoom = dpi / 72. Tesseract reports left, top, width, and height in that pixel space. Multiply each by the inverse factor scale = 72 / dpi to land in PyMuPDF's page coordinate space, which is measured in points from the top-left corner like the raster (native PDF user space puts its origin at the bottom-left). On an unrotated page, applying that single factor keeps the OCR box aligned to the glyph, so the redaction box covers the right region.

What confidence thresholds should gate an OCR run?

Use two separate floors. Discard individual words below ~75% confidence so garbage tokens never enter the searchable layer. Separately, compute each page’s mean confidence and flag any page under ~60% for human review rather than automated redaction. Both numbers, plus the per-word distribution, belong in the audit manifest so the gating decision is reproducible and defensible.

What happens when a scanned page fails OCR or scores too low?

It is never partially redacted and released. The driver records the source SHA-256 and the failure (exception or low mean confidence) in the manifest, then routes the file to a secure fallback queue for human adjudication. The reviewer’s decision syncs back into the pipeline, and the original raster is preserved in write-protected storage so nothing is lost.
