spaCy NER for PII Detection

Deploying statistical NER for legal document processing requires deterministic pipeline orchestration, auditable confidence thresholds, and strict separation between extraction and redaction layers. Within the broader PII Detection & Automated Redaction Patterns framework, spaCy NER for PII Detection serves as the foundational entity recognition engine. Production readiness, however, depends on explicit span resolution, threshold gating, and compliance-mapped output routing rather than raw model output.

spaCy NER detection flow

Pipeline Architecture & Deterministic Extraction

A legal-grade pipeline must isolate PII extraction from downstream formatting to maintain idempotency and auditability. Standard spaCy models expose entity types through ent.label_, but the default English models (en_core_web_sm/md/lg) only recognize general-purpose labels such as PERSON, ORG, DATE, GPE, and CARDINAL. Legal-specific labels like ATTORNEY_NAME, CASE_NUMBER, or SSN require either a domain-fine-tuned model or a custom pipeline component that handles those types separately (e.g., via regex). The architecture below uses a custom component that applies structured scoring to whatever entities the NER component produces, routing them through a compliance-aware schema.

1. Document intake: Sanitized text loaded into the spaCy Doc object.
2. NER extraction: Custom component attaches confidence-scored PII spans.
3. Span resolution: Overlapping entities merged deterministically.
4. Confidence gating: Low-probability matches routed to human review.
5. Redaction: Character-level masking applied to resolved spans.

Extraction stays separate from redaction, which keeps the pipeline idempotent and auditable.

import spacy
from spacy.tokens import Doc
from typing import List, Dict, Any

# Register custom extension for structured PII storage
if not Doc.has_extension("pii_spans"):
    Doc.set_extension("pii_spans", default=[])

# Load base model. In production, swap for a domain-fine-tuned model that
# exposes legal-specific entity labels (ATTORNEY_NAME, CASE_NUMBER, etc.).
# Standard en_core_web_sm recognizes PERSON, ORG, DATE, GPE, CARDINAL, etc.
nlp = spacy.load("en_core_web_sm")

# Labels from the loaded model that map to PII categories.
# Adjust this set when using a domain-fine-tuned model with legal labels.
PII_ENTITY_LABELS = {"PERSON", "ORG", "DATE", "GPE"}

@spacy.Language.component("pii_extractor")
def extract_pii(doc: Doc) -> Doc:
    """
    Extract PII spans from NER output, attach deterministic scores, and
    route to doc._.pii_spans.

    spaCy's built-in transition-based `ner` component does not attach a
    per-entity probability to doc.ents. For production confidence scores, use
    a component that exposes them, such as a trained `spancat` pipe (scores
    are stored in doc.spans[key].attrs["scores"]), or a custom model wrapper.
    This implementation assigns a placeholder score of 1.0 to every NER match;
    replace it with real probabilities before relying on threshold gating.
    """
    extracted = []
    for ent in doc.ents:
        if ent.label_ not in PII_ENTITY_LABELS:
            continue

        extracted.append({
            "text": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char,
            # Placeholder: replace with real logit-derived probability when
            # using a fine-tuned model that exposes per-entity scores.
            "confidence": 1.0,
            "source": "statistical_ner"
        })

    doc._.pii_spans = extracted
    return doc

nlp.add_pipe("pii_extractor", after="ner")

This component attaches doc._.pii_spans to the document object, enabling downstream redaction modules to consume structured character offsets without re-parsing. Because extraction never modifies the text, each run can be replayed and audited independently of redaction, which supports the PII-handling guidance in NIST SP 800-122.
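Because downstream modules rely on the character offsets alone, it can help to check that each stored span still matches the text it came from before anything is masked. The following is a minimal sketch rather than part of spaCy: `validate_spans` is a hypothetical helper, and it assumes span dictionaries with the `text`, `start_char`, and `end_char` keys used above.

```python
from typing import Any, Dict, List


def validate_spans(text: str, spans: List[Dict[str, Any]]) -> List[str]:
    """Return a list of problems; an empty list means every span is consistent."""
    problems = []
    for i, span in enumerate(spans):
        start, end = span["start_char"], span["end_char"]
        if not (0 <= start < end <= len(text)):
            problems.append(f"span {i}: offsets {start}-{end} out of range")
        elif text[start:end] != span["text"]:
            problems.append(f"span {i}: offsets {start}-{end} do not match stored text")
    return problems


sample_text = "Jane Doe met counsel in Boston."
sample_spans = [
    {"text": "Jane Doe", "label": "PERSON", "start_char": 0, "end_char": 8},
    {"text": "Boston", "label": "GPE", "start_char": 24, "end_char": 30},
]
print(validate_spans(sample_text, sample_spans))
```

Running a check like this at the boundary between extraction and redaction catches offset drift, for example when text is normalized after the spans were computed.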

Span Resolution & Overlap Handling

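Legal documents frequently contain nested or overlapping entities (e.g., a law firm name containing an attorney’s name). spaCy's doc.ents never overlap, so conflicts arise once spans from several sources, such as NER and regex rules, are combined. Partial redaction artifacts violate chain-of-custody standards and trigger compliance failures. Overlapping spans must therefore be resolved deterministically with a priority-weighted strategy: the higher-confidence span wins, and ties go to the longer span.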

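import logging

logger = logging.getLogger(__name__)

def resolve_overlaps(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Resolve overlapping spans deterministically. Highest confidence wins; ties favor longer spans."""
    if not spans:
        return []

    def length(span: Dict[str, Any]) -> int:
        return span["end_char"] - span["start_char"]

    # Sort by start, then descending confidence, then descending length, then label,
    # so the result does not depend on the order of the input list.
    sorted_spans = sorted(
        spans, key=lambda x: (x["start_char"], -x["confidence"], -length(x), x["label"])
    )
    resolved = [sorted_spans[0]]

    for current in sorted_spans[1:]:
        last = resolved[-1]
        if current["start_char"] < last["end_char"]:
            # Overlap: higher confidence wins; equal confidence keeps the longer span
            current_wins = current["confidence"] > last["confidence"] or (
                current["confidence"] == last["confidence"] and length(current) > length(last)
            )
            winner, discarded = (current, last) if current_wins else (last, current)
            resolved[-1] = winner
            # Log label and offsets only (not the text) so the audit trail holds no PII
            logger.info("Discarded overlapping %s span at %d-%d",
                        discarded["label"], discarded["start_char"], discarded["end_char"])
        else:
            resolved.append(current)

    return resolved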

The resolution logic produces the same output for identical inputs. When spans conflict, the discarded entity should be logged for audit review, preserving traceability for compliance officers.
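One caveat with a winner-takes-all rule is that a short, high-confidence span can displace a longer one and leave part of the longer span unmasked. Where redaction coverage matters more than keeping a single label, a conservative alternative is to take the union of overlapping ranges. The sketch below shows that alternative, not the strategy used by `resolve_overlaps`; `merge_union` and the `labels` key are hypothetical, and the choice to keep the lowest confidence of the merged spans is an assumption made so that uncertain regions still reach human review.

```python
from typing import Any, Dict, List


def merge_union(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Collapse overlapping spans into one region covering all of them."""
    ordered = sorted(spans, key=lambda s: (s["start_char"], s["end_char"]))
    merged: List[Dict[str, Any]] = []
    for span in ordered:
        if merged and span["start_char"] < merged[-1]["end_char"]:
            last = merged[-1]
            # Extend the current region and remember every contributing label
            last["end_char"] = max(last["end_char"], span["end_char"])
            last["labels"] = sorted(set(last["labels"]) | {span["label"]})
            last["confidence"] = min(last["confidence"], span["confidence"])
        else:
            # Start a new region; build a fresh dict so the input is not mutated
            merged.append({
                "start_char": span["start_char"],
                "end_char": span["end_char"],
                "labels": [span["label"]],
                "confidence": span["confidence"],
            })
    return merged


candidates = [
    {"label": "ORG", "start_char": 0, "end_char": 17, "confidence": 0.9},
    {"label": "PERSON", "start_char": 0, "end_char": 5, "confidence": 0.95},
    {"label": "DATE", "start_char": 30, "end_char": 40, "confidence": 0.8},
]
regions = merge_union(candidates)
print(regions)
```

The merged regions keep `start_char`, `end_char`, and `confidence`, so they can still be passed to offset-based gating and masking; they no longer carry a single `label` or `text` field.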

Confidence Gating & Threshold Routing

Raw statistical outputs require strict gating before reaching redaction layers. Entities below a defined confidence floor must bypass automated masking and route to human review queues. Implementing Confidence Threshold Configuration ensures that low-probability matches do not trigger false redactions that could compromise discoverability or privilege claims.

def apply_threshold_gating(
    spans: List[Dict[str, Any]], threshold: float = 0.85
) -> Dict[str, List[Dict[str, Any]]]:
    """Split spans into auto-redact and human-review buckets based on confidence."""
    auto_redact = []
    human_review = []

    for span in spans:
        if span["confidence"] >= threshold:
            auto_redact.append(span)
        else:
            # Copy before annotating so the caller's span dicts are not mutated
            reason = f"Confidence {span['confidence']:.2f} < threshold {threshold}"
            human_review.append({**span, "review_reason": reason})

    return {"auto_redact": auto_redact, "human_review": human_review}

This routing mechanism aligns with Human-in-the-Loop Override Sync workflows, allowing compliance teams to adjust thresholds dynamically without redeploying the core pipeline.
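One way to make thresholds adjustable without redeploying is to read them from configuration at call time and pass the result as the `threshold` argument. The sketch below assumes a JSON document with a `default` value and optional per-label `overrides`; those key names and the `load_thresholds` and `threshold_for` helpers are hypothetical, not part of spaCy or of the pipeline above.

```python
import json
from typing import Dict


def load_thresholds(config_text: str) -> Dict[str, float]:
    """Parse a JSON threshold config and check that every value lies in [0, 1]."""
    raw = json.loads(config_text)
    thresholds = {"default": float(raw["default"])}
    for label, value in raw.get("overrides", {}).items():
        thresholds[label] = float(value)
    for name, value in thresholds.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"threshold for {name!r} must be in [0, 1], got {value}")
    return thresholds


def threshold_for(label: str, thresholds: Dict[str, float]) -> float:
    """Return the label-specific threshold, falling back to the default."""
    return thresholds.get(label, thresholds["default"])


config_text = '{"default": 0.85, "overrides": {"PERSON": 0.9}}'
thresholds = load_thresholds(config_text)
print(threshold_for("PERSON", thresholds), threshold_for("GPE", thresholds))
```

Validating the values on load means a malformed config fails loudly instead of silently routing every span to one bucket.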

Compliance Mapping & Output Routing

Extracted spans must map to jurisdiction-specific redaction patterns. While statistical NER handles unstructured variation, deterministic regex fallbacks capture rigid formats like SSNs, docket numbers, or tax IDs. Integrating Regex Rule Optimization for Legal Entities alongside spaCy outputs creates a hybrid detection layer that satisfies both recall and precision mandates.
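A regex fallback can emit spans in the same dictionary shape as the NER component so that both sources flow through the same resolution and gating steps. The following is a minimal sketch: `find_ssn_spans` is a hypothetical helper, the pattern only matches the common NNN-NN-NNNN layout, and treating a rule match as confidence 1.0 is an assumption.

```python
import re
from typing import Any, Dict, List

# Illustrative pattern only: it matches the NNN-NN-NNNN layout and does not
# validate whether the number could actually have been issued.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def find_ssn_spans(text: str) -> List[Dict[str, Any]]:
    """Return SSN-shaped matches using the same keys as the NER-derived spans."""
    return [
        {
            "text": match.group(0),
            "label": "SSN",
            "start_char": match.start(),
            "end_char": match.end(),
            "confidence": 1.0,  # assumption: a rule match is treated as certain
            "source": "regex",
        }
        for match in SSN_PATTERN.finditer(text)
    ]


filing_text = "SSN 123-45-6789 was listed on the filing."
ssn_spans = find_ssn_spans(filing_text)
print(ssn_spans)
```

Tagging each span with its `source` keeps the audit trail able to distinguish rule-based matches from statistical ones after the two lists are combined.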

The final redaction step applies character-level masking based on resolved, threshold-gated spans:

def apply_redaction(text: str, spans: List[Dict[str, Any]], mask_char: str = "█") -> str:
    """Apply deterministic redaction using resolved character offsets."""
    if not spans:
        return text

    # Build a boolean mask for O(n) application
    mask = [False] * len(text)
    for span in spans:
        start = max(0, span["start_char"])
        end = min(len(text), span["end_char"])
        for i in range(start, end):
            mask[i] = True

    # Reconstruct string with applied mask
    redacted_chars = [mask_char if m else c for c, m in zip(text, mask)]
    return "".join(redacted_chars)

Production Hardening & Model Governance

Deploying this architecture at scale requires continuous evaluation against privilege preservation mandates. Standard en_core_web_sm weights lack legal domain specificity; organizations should transition to domain-adapted models trained on annotated discovery sets. For workflows involving privileged communications, Fine-Tuning spaCy Models for Attorney-Client Privilege provides the necessary annotation schema and training loop adjustments.

Pipeline orchestration should follow official spaCy processing pipeline standards, with custom components registered via @spacy.Language.component and version-controlled alongside model weights. Audit logs must capture:

- Input document hash
- Extracted span offsets and confidence scores
- Threshold routing decisions
- Final redacted output hash

This telemetry enables compliance officers to reconstruct redaction decisions during regulatory audits or eDiscovery disputes.
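The audit fields listed above could be assembled into one record per document along the following lines. This is a sketch under stated assumptions: `build_audit_record` and its field names are hypothetical, SHA-256 is one reasonable hash choice rather than a mandated one, and the span text is deliberately left out so the log itself does not store PII.

```python
import hashlib
import json
from typing import Any, Dict, List


def sha256_hex(text: str) -> str:
    """Hash text as UTF-8 so the digest is stable across platforms."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def _public_fields(span: Dict[str, Any]) -> Dict[str, Any]:
    # Keep offsets, label, and score; drop the matched text itself
    return {k: span[k] for k in ("label", "start_char", "end_char", "confidence")}


def build_audit_record(
    original: str,
    redacted: str,
    auto_redact: List[Dict[str, Any]],
    human_review: List[Dict[str, Any]],
    threshold: float,
) -> str:
    """Serialize one audit entry as a JSON line with sorted keys."""
    record = {
        "input_sha256": sha256_hex(original),
        "output_sha256": sha256_hex(redacted),
        "threshold": threshold,
        "auto_redact": [_public_fields(s) for s in auto_redact],
        "human_review": [_public_fields(s) for s in human_review],
    }
    return json.dumps(record, sort_keys=True)


original = "Jane Doe met counsel in Boston."
redacted = "████████ met counsel in Boston."
auto = [{"text": "Jane Doe", "label": "PERSON", "start_char": 0,
         "end_char": 8, "confidence": 0.97}]
review = [{"text": "Boston", "label": "GPE", "start_char": 24,
           "end_char": 30, "confidence": 0.6}]
audit_line = build_audit_record(original, redacted, auto, review, 0.85)
print(audit_line)
```

Sorting the keys makes the serialized record byte-identical for identical inputs, which simplifies comparing log entries across reruns.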
