31 Mar 2026

Building a Privacy Gateway for German Lawyers

By Michael Mueller

GDPR

privacy

NLP

LLMs

legal tech

infrastructure

31 Mar 2026

German lawyers are bound by strict professional secrecy rules (BRAO §43a) and GDPR. They can't use cloud LLMs because their work — contracts, court filings, client communications — is full of personal data that can't leave their infrastructure. We built a self-hosted gateway that strips all PII from legal text before it reaches the cloud, then restores it in the response. The lawyer gets the full benefit of a frontier LLM without any client data crossing the network boundary.

This post covers the architecture, the PII detection stack, and how the whole thing runs on a single NVIDIA DGX Spark.

The problem

A lawyer pastes a Schriftsatz (client communication or similar) into a chat UI. The text contains names, birthdates, tax IDs, court case numbers, addresses, bank account details. If that text goes to a cloud API as-is, the lawyer has a compliance problem. Manual redaction is tedious and error-prone. The alternative is not using LLMs at all, which is increasingly impractical.

How it works

The pipeline has two phases. Phase one runs entirely on the local machine and handles anonymization. Phase two sends only the cleaned text to the cloud.

User input (German legal text)
  → Presidio + Flair NER (detect PII)
  → Anonymizer (replace with tokens: <PERSON_1>, <DE_STEUER_ID_1>, ...)
  → LLM Guard prompt injection check
  → User reviews anonymized text, can edit
  → Gemini API (sees only tokens, never real data)
  → De-anonymizer (restore original values in response)
  → User sees the answer with real names back in place

The mapping table (<PERSON_1> → "Thomas Müller") never leaves the server. It's stored in PostgreSQL, encrypted at rest. The cloud LLM only ever sees placeholder tokens.

PII detection: Flair beats spaCy for German legal text

We started with spaCy's de_core_news_lg model for named entity recognition. It missed names with titles ("Dr. Christian Schmidt"), names in formal letter headers, and compound names common in legal correspondence. Switching to Flair's ner-german-large model fixed most of these. Flair consistently scores 1.0 confidence on German person names that spaCy missed entirely.

The tradeoff is speed — Flair runs at about 200ms per sentence versus 50ms for spaCy. On the DGX Spark hardware, that's not noticeable in practice.

On top of Flair, we run 12 regex-based recognizers through Presidio for patterns that NER models don't catch:

Aktenzeichen — court case file numbers like 3 O 123/24 or VII ZR 45/23
Steuer-ID — 11-digit German tax identification numbers
Sozialversicherungsnummer — social security numbers
Handelsregister — commercial register entries (HRB, HRA)
Personalausweis — ID card numbers
Geburtsdatum — context-aware birthdate detection (only dates near keywords like "geb." or "Geburtsdatum" — contract dates pass through untouched)
Rechtsanwalt — lawyer registration numbers
Kontonummer, BIC, Vertragsnummer, Kundennummer — financial identifiers for bank statements and credit contracts

Each recognizer uses Presidio's context-boosting: a low base score that gets raised when relevant keywords appear nearby. This keeps false positives low. "12345678901" alone won't trigger the Steuer-ID recognizer, but "Steuer-ID: 12345678901" will.

Overlapping entities

One problem we hit early: Presidio's birthdate recognizer and the date-time recognizer both fire on the same text span. "geb. 15.03.1978" triggers DE_BIRTHDATE at score 0.9 and DATE_TIME at score 0.6. If you anonymize both, you get mangled tokens like <DE_BI<DATE_TIME_1>.

The fix is a deduplication pass before anonymization. Sort by start position, then by score descending. If two spans overlap, keep the higher-scored one.

def _deduplicate_overlapping(results):
    sorted_results = sorted(results, key=lambda r: (r.start, -r.score))
    deduplicated = []
    last_end = -1
    for result in sorted_results:
        if result.start >= last_end:
            deduplicated.append(result)
            last_end = result.end
    return deduplicated

Consistent tokens across conversation turns

The anonymizer maintains a session-scoped mapping table. If "Thomas Müller" appears in message 1 and gets assigned <PERSON_1>, it keeps that same token in message 3. This matters because the LLM needs to see a coherent conversation — if the same person gets a different token each turn, the model can't track who's who.

def get_or_create_token(original, entity_type, mapping):
    for token, value in mapping.items():
        if value == original and token.startswith(f"<{entity_type}_"):
            return token
    count = sum(1 for t in mapping if t.startswith(f"<{entity_type}_"))
    return f"<{entity_type}_{count + 1}>"

Prompt injection protection

We added LLM Guard's PromptInjection scanner to catch attempts at manipulating the LLM through crafted inputs. It runs a DeBERTa v3 classifier on the anonymized text and flags anything above a 0.92 confidence threshold. In testing, "Ignore all previous instructions" scores 1.0 and gets blocked. Normal legal text scores below 0.1.

The scanner is lazy-loaded — the model downloads on first use and stays in memory. It adds about 200ms to each request.

OCR for scanned documents

German law firms still deal with a lot of paper. Scanned PDFs are common — court filings, notarized documents, older contracts. We run Baidu's Qianfan-OCR (4B parameter vision-language model) through vLLM as a separate k8s pod. When PyPDF2 can't extract text from a PDF page, the page gets rendered as an image at 200 DPI via PyMuPDF and sent to the OCR service through its OpenAI-compatible API.

The OCR pod uses about 8GB of GPU memory at 30% utilization, leaving plenty of room on the DGX Spark's 128GB unified memory.

Legal source lookup

When the anonymized text contains references to German law — paragraph numbers like "§ 1605 BGB", case numbers, or legal keywords — the system searches for the actual legal text and injects it as context for Gemini. Sources:

gesetze-im-internet.de — we download and parse the XML exports for the 20 most common German laws (BGB, StGB, ZPO, FamFG, etc.) at startup. Paragraph lookups are instant from the in-memory index.
openlegaldata.io — REST API for court decisions. Searched by Aktenzeichen or keywords.
dejure.org — citation redirect endpoint for generating clickable links to court decisions. No content scraping (their ToS prohibit it).

The retrieved sources get prepended to the prompt with numbered references, and Gemini is instructed to cite them. The frontend renders a collapsible "Quellen" panel below each response with clickable links to the original sources.

The two-phase UX

We initially built a single-step flow: user sends message, everything runs, response appears. The problem was that Presidio sometimes gets things wrong — it might miss a name or tag a legal term as a person. Sending that to the cloud without human review defeats the purpose.

The current flow splits into two phases:

User sends text. Presidio runs (~1 second). A review card appears showing the anonymized text with a mapping table. The user can remove false positives (click X on an entity pill) or select missed text and add it manually through a floating popover.
User clicks "An LLM senden" with an optional prompt ("Fasse die Haftungsklauseln zusammen"). The anonymized text plus prompt go to Gemini. Response comes back de-anonymized.

This gives the lawyer full control over what leaves the machine.

Running on the DGX Spark

The whole stack runs on a single NVIDIA DGX Spark (GB10 Grace Blackwell, 128GB unified memory, arm64) under k3s. Four pods:

Pod	What it does	Resources
backend	FastAPI, Presidio, Flair NER, LLM Guard	CPU + ~2GB for Flair model
frontend	React SPA served by nginx	Minimal
ocr	Qianfan-OCR 4B via vLLM	GPU, ~8GB VRAM
privacy-db	PostgreSQL via CloudNativePG	10GB storage

The backend image is built for linux/arm64 since the DGX Spark runs an Arm CPU. We hit this early — x86 images fail silently on k3s without useful error messages.

All k8s manifests use Kustomize with overlays for dev (CPU fallback, no GPU) and prod (GPU runtime class, DGX-specific settings).

What we'd do differently

The Flair false positive list is manual. We maintain a blocklist of German legal terms that Flair wrongly tags as PERSON — words like "Familienrechtliche" and "Barunterhalts". A better approach would be fine-tuning the NER model on a German legal corpus, but that requires annotated training data we don't have yet.

The vLLM OCR pod takes about 2 minutes to start cold (model download + CUDA graph capture). A readiness probe and some patience on first deploy would save debugging time.

We also considered using LLM Guard's Anonymize scanner to replace Presidio entirely. After digging into the API, we found it doesn't support registering custom EntityRecognizer subclasses — only regex patterns and a fixed NER model slot. So we kept Presidio for PII detection and use LLM Guard only for prompt injection.

Stack

Backend: Python 3.11, FastAPI, Presidio, Flair ner-german-large, LLM Guard
Frontend: Vite, React 19, TypeScript, Tailwind CSS 4
Cloud LLM: Gemini 2.5 Flash via Vertex AI
OCR: Qianfan-OCR 4B via vLLM
Database: PostgreSQL via CloudNativePG
Infrastructure: k3s, Kustomize, NVIDIA DGX Spark (arm64)

Resources

Table of Contents

The problem

How it works

PII detection: Flair beats spaCy for German legal text

Overlapping entities

Consistent tokens across conversation turns

Prompt injection protection

OCR for scanned documents

Legal source lookup

The two-phase UX

Running on the DGX Spark

What we'd do differently

Stack

Resources

Featured Blogs

Lore: Shared Context Infrastructure for Claude Code

Every developer on your team is loading context manually. We built Lore so Claude Code knows your org's conventions and remembers what happened yesterday, without anyone maintaining a giant CLAUDE.md.

Michael Mueller

The Autocompact Cliff: Why Your 200K Context Window Is Lying to You

Your 200K context window is not 200K of usable space. Here is what actually fills it, where the autocompact cliff sits, and how to stay below it.

Michael Czechowski

When Prompts Are Not Enough: A Field Guide to Reliability

What makes an AI agent reliable when the story depends on a past decision? A 35 turn campaign shows why memory, state, and enforcement have to live outside the model.

Loredana Moanga

See all articles

Free Resource

Master the AI Native Transformation

174 patterns, 422 pages — #1 Bestseller From Cloud Native to AI Native is FREE for a limited time

Get it For Free!Get it For Free!

Featured Blogs

Lore: Shared Context Infrastructure for Claude Code

Michael Mueller

The Autocompact Cliff: Why Your 200K Context Window Is Lying to You

Your 200K context window is not 200K of usable space. Here is what actually fills it, where the autocompact cliff sits, and how to stay below it.

Michael Czechowski

When Prompts Are Not Enough: A Field Guide to Reliability

What makes an AI agent reliable when the story depends on a past decision? A 35 turn campaign shows why memory, state, and enforcement have to live outside the model.

Loredana Moanga

See all articles

Free Resource

Master the AI Native Transformation

174 patterns, 422 pages — #1 Bestseller From Cloud Native to AI Native is FREE for a limited time

Get it For Free!Get it For Free!

Continue Exploring

A Pattern Language for Transformation

Browse our interactive library of 119 transformation patterns. Each one describes a specific architectural problem and a tested way to solve it, so your team can talk about real tradeoffs instead of abstract ideas.

Learn MoreLearn More

Free AI Assessment

Take our free diagnostic to see where you stand and get a 90-day plan telling you exactly what to fix first.

Learn MoreLearn More

Join Our Community

We organize and sponsor engineering events across Europe. Come meet the people building this stuff.

Learn MoreLearn More

Building a Privacy Gateway for German Lawyers

By Michael Mueller

The problem

How it works

PII detection: Flair beats spaCy for German legal text

Overlapping entities

Consistent tokens across conversation turns

Prompt injection protection

OCR for scanned documents

Legal source lookup

The two-phase UX

Running on the DGX Spark

What we'd do differently

Stack

Resources

Lore: Shared Context Infrastructure for Claude Code

Michael Mueller

The Autocompact Cliff: Why Your 200K Context Window Is Lying to You

Michael Czechowski

When Prompts Are Not Enough: A Field Guide to Reliability

Loredana Moanga

Master the AI Native Transformation

Lore: Shared Context Infrastructure for Claude Code

Michael Mueller

The Autocompact Cliff: Why Your 200K Context Window Is Lying to You

Michael Czechowski

When Prompts Are Not Enough: A Field Guide to Reliability

Loredana Moanga

Master the AI Native Transformation

You Might Also Like

A Pattern Language for Transformation

Free AI Assessment

Join Our Community

Quick Links

Waves of Innovation