spaCy NER for PII Detection
Deploying statistical NER for legal document processing requires deterministic pipeline orchestration, auditable confidence thresholds, and strict separation between extraction and redaction layers. Within the broader PII Detection & Automated Redaction Patterns framework, spaCy NER for PII Detection serves as the foundational entity recognition engine. Production readiness, however, depends on explicit span resolution, threshold gating, and compliance-mapped output routing rather than raw model output.
Pipeline Architecture & Deterministic Extraction
A legal-grade pipeline must isolate PII extraction from downstream formatting to maintain idempotency and auditability. spaCy's default ner component does not attach a confidence score to the entities in doc.ents, so a custom component has to supply scores from another source (for example, token-level probabilities written by a separate scoring step) or an explicit fallback value, while attaching structured spans to the Doc object.
1. Document intake: Sanitized text loaded into the spaCy Doc object.
2. NER extraction: Custom component attaches confidence-scored PII spans.
3. Span resolution: Overlapping entities merged deterministically.
4. Confidence gating: Low-probability matches routed to human review.
5. Redaction: Character-level masking applied to resolved spans.
import numpy as np
import spacy
from spacy.language import Language
from spacy.tokens import Doc, Token
from typing import Any, Dict, List

# Register custom extension for structured PII storage
if not Doc.has_extension("pii_spans"):
    Doc.set_extension("pii_spans", default=[])

# Load base model; swap with domain-fine-tuned weights in production
nlp = spacy.load("en_core_web_sm")

# Of these labels, en_core_web_sm only predicts PERSON; the rest assume
# fine-tuned weights or rule-based components upstream.
PII_ENTITIES = {"PERSON", "EMAIL", "PHONE", "SSN", "CASE_NUMBER", "ATTORNEY_NAME"}


@Language.component("pii_extractor")
def extract_pii(doc: Doc) -> Doc:
    """Extract PII spans, attach confidence scores, and store them in doc._.pii_spans."""
    extracted: List[Dict[str, Any]] = []
    # token_prob is a custom Token extension that a separate scoring component
    # must register and fill; spaCy's stock ner component does not provide it.
    has_scores = Token.has_extension("token_prob")
    for ent in doc.ents:
        if ent.label_ not in PII_ENTITIES:
            continue
        if has_scores:
            # Span confidence = mean of the per-token probabilities
            confidence = float(np.mean([t._.token_prob for t in ent]))
        else:
            # No score available: use 0.0 so the confidence gate routes the
            # span to human review instead of auto-redacting it.
            confidence = 0.0
        extracted.append({
            "text": ent.text,
            "label": ent.label_,
            "start_char": ent.start_char,
            "end_char": ent.end_char,
            "confidence": round(confidence, 4),
            "source": "statistical_ner"
        })
    doc._.pii_spans = extracted
    return doc


nlp.add_pipe("pii_extractor", after="ner")
This component attaches doc._.pii_spans to the document object, enabling downstream redaction modules to consume structured character offsets without re-parsing. Keeping extraction stateless makes every run reproducible and auditable, which supports PII-handling guidance such as NIST SP 800-122.
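Because downstream modules trust these character offsets, it can be worth verifying them before any masking happens. The following is a minimal sketch under assumptions: validate_offsets is a hypothetical helper (not part of spaCy or the pipeline above) that expects the same dict layout as doc._.pii_spans and reports records whose offsets do not reproduce the recorded text.

```python
from typing import Any, Dict, List


def validate_offsets(text: str, spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Return the spans whose offsets do not reproduce their recorded text.

    Hypothetical helper: assumes each span has text, start_char and end_char keys.
    """
    invalid = []
    for span in spans:
        start, end = span["start_char"], span["end_char"]
        in_bounds = 0 <= start < end <= len(text)
        if not in_bounds or text[start:end] != span["text"]:
            invalid.append(span)
    return invalid


sample_text = "Contact Jane Doe at the firm."
sample_spans = [
    {"text": "Jane Doe", "label": "PERSON", "start_char": 8, "end_char": 16},
    # Offsets shifted by one character, as might happen after a text rewrite
    {"text": "Jane Doe", "label": "PERSON", "start_char": 9, "end_char": 17},
]
bad = validate_offsets(sample_text, sample_spans)
```

Here only the second record is reported, since its offsets no longer line up with the source text. A pipeline could refuse to redact, or route the document to review, whenever this list is non-empty.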
Span Resolution & Overlap Handling
Legal documents frequently contain nested or overlapping entities (e.g., a law firm name containing an attorney's name). If both spans are handled independently, the result can be a partial redaction that leaks fragments of the protected value and can cause compliance failures. Overlapping spans must therefore be resolved deterministically using a priority-weighted merge strategy.
def resolve_overlaps(spans: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Merge overlapping spans deterministically. Highest confidence wins; ties favor longer spans."""
    if not spans:
        return []
    # Sort by start position, then by descending confidence
    sorted_spans = sorted(spans, key=lambda x: (x["start_char"], -x["confidence"]))
    resolved = [sorted_spans[0]]
    for current in sorted_spans[1:]:
        last = resolved[-1]
        # Check for overlap
        if current["start_char"] < last["end_char"]:
            # Priority: higher confidence replaces lower; equal confidence keeps longer span
            current_len = current["end_char"] - current["start_char"]
            last_len = last["end_char"] - last["start_char"]
            if current["confidence"] > last["confidence"]:
                resolved[-1] = current
            elif current["confidence"] == last["confidence"] and current_len > last_len:
                resolved[-1] = current
        else:
            resolved.append(current)
    return resolved
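Because the sort is stable and the tie-breaks are explicit, the resolution logic produces identical output for identical inputs. The function itself drops the losing span silently, so a production pipeline should also log each discarded entity for audit review, preserving traceability for compliance officers. Note that when a shorter, higher-confidence span replaces a longer one, the characters covered only by the discarded span stay unmasked, which is a further reason to surface discards to reviewers.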
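One way to make discarded entities auditable is to return them alongside the kept spans. The sketch below is an illustration under assumptions, not the pipeline's actual logging: resolve_with_audit and the logger name pii.overlap_audit are hypothetical, and the function applies the same priority rule as resolve_overlaps (confidence first, then span length). It logs only the label and offsets so the log itself does not carry PII.

```python
import logging
from typing import Any, Dict, List, Tuple

audit_log = logging.getLogger("pii.overlap_audit")  # hypothetical logger name

SpanDict = Dict[str, Any]


def _rank(span: SpanDict) -> Tuple[float, int]:
    """Priority key: confidence first, then span length."""
    return (span["confidence"], span["end_char"] - span["start_char"])


def resolve_with_audit(spans: List[SpanDict]) -> Tuple[List[SpanDict], List[SpanDict]]:
    """Resolve overlaps and also return every span that lost a conflict."""
    kept: List[SpanDict] = []
    discarded: List[SpanDict] = []
    ordered = sorted(spans, key=lambda s: (s["start_char"], -s["confidence"]))
    for current in ordered:
        if kept and current["start_char"] < kept[-1]["end_char"]:
            if _rank(current) > _rank(kept[-1]):
                loser, kept[-1] = kept[-1], current
            else:
                loser = current
            discarded.append(loser)
            # Log label and offsets only; never the matched text
            audit_log.info(
                "discarded %s [%d:%d]",
                loser["label"], loser["start_char"], loser["end_char"],
            )
        else:
            kept.append(current)
    return kept, discarded


# Illustrative input: a firm-name span that contains an attorney name
spans = [
    {"label": "ORG", "start_char": 0, "end_char": 20, "confidence": 0.80},
    {"label": "PERSON", "start_char": 0, "end_char": 10, "confidence": 0.95},
    {"label": "PERSON", "start_char": 30, "end_char": 40, "confidence": 0.90},
]
kept, discarded = resolve_with_audit(spans)
```

In this example the ORG span loses to the higher-confidence PERSON span and ends up in discarded, where a reviewer can see that characters 10 to 20 were left unmasked.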
Confidence Gating & Threshold Routing
Raw statistical outputs require strict gating before reaching redaction layers. Entities below a defined confidence floor must bypass automated masking and route to human review queues. Implementing Confidence Threshold Configuration ensures that low-probability matches do not trigger false redactions that could compromise discoverability or privilege claims.
def apply_threshold_gating(spans: List[Dict[str, Any]], threshold: float = 0.85) -> Dict[str, List[Dict[str, Any]]]:
    """Split spans into auto-redact and human-review buckets based on confidence."""
    auto_redact = []
    human_review = []
    for span in spans:
        if span["confidence"] >= threshold:
            auto_redact.append(span)
        else:
            # Copy before annotating so the caller's span dicts are not mutated
            flagged = dict(span)
            flagged["review_reason"] = f"Confidence {span['confidence']:.4f} < threshold {threshold}"
            human_review.append(flagged)
    return {"auto_redact": auto_redact, "human_review": human_review}
This routing mechanism aligns with Human-in-the-Loop Override Sync workflows, allowing compliance teams to adjust thresholds dynamically without redeploying the core pipeline.
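Adjusting thresholds without a redeploy usually means reading them from configuration at run time. The sketch below assumes a JSON mapping of label to threshold; load_thresholds, gate_by_label, and the config format are hypothetical names chosen for illustration, and labels missing from the config fall back to the 0.85 default used above.

```python
import json
from typing import Any, Dict, List

DEFAULT_THRESHOLD = 0.85  # mirrors the default of apply_threshold_gating


def load_thresholds(raw_json: str) -> Dict[str, float]:
    """Parse a {"LABEL": threshold} mapping and reject values outside [0, 1]."""
    thresholds = {label: float(value) for label, value in json.loads(raw_json).items()}
    for label, value in thresholds.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"Threshold for {label} must be in [0, 1], got {value}")
    return thresholds


def gate_by_label(
    spans: List[Dict[str, Any]], thresholds: Dict[str, float]
) -> Dict[str, List[Dict[str, Any]]]:
    """Route each span using its label's threshold, or the default if unset."""
    buckets: Dict[str, List[Dict[str, Any]]] = {"auto_redact": [], "human_review": []}
    for span in spans:
        floor = thresholds.get(span["label"], DEFAULT_THRESHOLD)
        if span["confidence"] >= floor:
            buckets["auto_redact"].append(span)
        else:
            buckets["human_review"].append({**span, "review_reason": f"below {floor}"})
    return buckets


# Hypothetical config, e.g. read from a file that compliance staff can edit
config = '{"SSN": 0.99, "PERSON": 0.80}'
thresholds = load_thresholds(config)
routed = gate_by_label(
    [
        {"label": "SSN", "confidence": 0.95},
        {"label": "PERSON", "confidence": 0.82},
        {"label": "EMAIL", "confidence": 0.70},
    ],
    thresholds,
)
```

With this config only the PERSON span is auto-redacted; the SSN span misses its stricter floor and the EMAIL span misses the default, so both go to human review.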
Compliance Mapping & Output Routing
Extracted spans must map to jurisdiction-specific redaction patterns. While statistical NER handles unstructured variation, deterministic regex fallbacks capture rigid formats like SSNs, docket numbers, or tax IDs. Integrating Regex Rule Optimization for Legal Entities alongside spaCy outputs creates a hybrid detection layer that satisfies both recall and precision mandates.
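A regex fallback can emit spans in the same dict layout as the statistical extractor so that both sources flow through the same resolution and gating steps. The patterns below are deliberately simple illustrations, not production rules, and regex_pii_spans, the "regex_rule" source tag, and the fixed confidence of 1.0 are assumptions made for this sketch.

```python
import re
from typing import Any, Dict, List

# Illustrative patterns only; real rules need jurisdiction-specific tuning.
REGEX_RULES = {
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
}


def regex_pii_spans(text: str) -> List[Dict[str, Any]]:
    """Return rule-based PII spans in the same layout as doc._.pii_spans."""
    spans = []
    for label, pattern in REGEX_RULES.items():
        for match in pattern.finditer(text):
            spans.append({
                "text": match.group(),
                "label": label,
                "start_char": match.start(),
                "end_char": match.end(),
                "confidence": 1.0,  # assumed: deterministic matches treated as certain
                "source": "regex_rule",
            })
    return sorted(spans, key=lambda s: s["start_char"])


found = regex_pii_spans("SSN 123-45-6789, contact jdoe@example.com.")
```

The resulting list can be concatenated with the statistical spans before overlap resolution, so a rule match with confidence 1.0 wins over a weaker statistical span covering the same characters.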
The final redaction step applies character-level masking based on resolved, threshold-gated spans:
def apply_redaction(text: str, spans: List[Dict[str, Any]], mask_char: str = "█") -> str:
    """Apply deterministic redaction using resolved character offsets."""
    if not spans:
        return text
    # Build a boolean mask for O(n) application
    mask = [False] * len(text)
    for span in spans:
        start = max(0, span["start_char"])
        end = min(len(text), span["end_char"])
        for i in range(start, end):
            mask[i] = True
    # Reconstruct string with applied mask
    redacted_chars = [mask_char if m else c for c, m in zip(text, mask)]
    return "".join(redacted_chars)
Production Hardening & Model Governance
Deploying this architecture at scale requires continuous evaluation against false positive reduction strategies and privilege preservation mandates. Standard en_core_web_sm weights lack legal domain specificity; organizations should transition to domain-adapted models trained on annotated discovery sets. For workflows involving privileged communications, Fine-Tuning spaCy Models for Attorney-Client Privilege provides the necessary annotation schema and training loop adjustments.
Pipeline orchestration should follow official spaCy processing pipeline standards, with custom components registered via @spacy.Language.component and version-controlled alongside model weights. Audit logs must capture:
- Input document hash
- Extracted span offsets and confidence scores
- Threshold routing decisions
- Final redacted output hash
This telemetry enables compliance officers to reconstruct redaction decisions during regulatory audits or eDiscovery disputes.
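The audit fields listed above can be captured as one JSON line per document. The following sketch is an assumption about how such a record might look: build_audit_record and its field names are hypothetical, SHA-256 is chosen here as the hash, and the matched text is deliberately left out so the log does not store PII.

```python
import hashlib
import json
from typing import Any, Dict, List


def sha256_hex(text: str) -> str:
    """Hex SHA-256 digest of the UTF-8 encoded text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def build_audit_record(
    source_text: str,
    redacted_text: str,
    routed: Dict[str, List[Dict[str, Any]]],
    threshold: float,
) -> str:
    """Serialize hashes, offsets, scores and routing decisions as one JSON line."""
    record = {
        "input_sha256": sha256_hex(source_text),
        "output_sha256": sha256_hex(redacted_text),
        "threshold": threshold,
        "decisions": [
            {
                "bucket": bucket,
                "label": span["label"],
                "start_char": span["start_char"],
                "end_char": span["end_char"],
                "confidence": span["confidence"],
            }
            for bucket in sorted(routed)
            for span in routed[bucket]
        ],
    }
    # sort_keys keeps the serialized form stable across runs
    return json.dumps(record, sort_keys=True)


routed = {
    "auto_redact": [
        {"label": "PERSON", "start_char": 8, "end_char": 16, "confidence": 0.97}
    ],
    "human_review": [],
}
line = build_audit_record("Contact Jane Doe.", "Contact ████████.", routed, 0.85)
```

Because the record holds hashes and offsets rather than text, an auditor with access to the original document can recompute the input hash and replay each decision without the log becoming a second copy of the sensitive data.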