We audited 72 "redacted" PDFs. 17% still leak text — and 31% of federal ones do.

Essex Software research · 2026-06-07 · updated 2026-06-08

PDF redaction has been a solved problem for two decades. Acrobat has a true-removal redact tool. So does every serious legal-tech vendor. So does the open-source toolchain we used for this study. And yet, in 2019 Paul Manafort's sealed court filing leaked because someone drew black rectangles instead of redacting. In 2021 the EU Commission's AstraZeneca contract leaked because someone left the document bookmarks intact. The pattern keeps happening because looking redacted and being redacted are two different things — and almost nobody verifies the difference.

Updated 2026-06-08: Our original detector (published 2026-06-07) only caught near-black redaction rectangles, missing dark-grey overlays with mild anti-aliasing. The updated detector catches uniform-fill rectangles of any color while filtering out small icons, bullet points, and short snippets to suppress false positives. Re-running it on the same 72 PDFs raised the critical-leak rate from 13% to 17%. The federal-agency rate went from 19% to 31%. Methodology note at the bottom describes the change.

TL;DR

We collected 72 publicly indexed PDFs whose titles or URLs contained the word "redacted" — court filings, FOIA releases, government investigation reports, regulatory complaints, contracts.
63 audited successfully. 11 of them (17%) still have selectable text physically present under the redaction rectangles. 558 characters of supposed-to-be-redacted text were recoverable across those 11 documents.
59 of 63 (94%) leaked some form of identifying metadata — author name, document title, software trail, bookmark titles. Lower-severity than text-under-the-bars, but still a leak.
The most-affected publisher category was US federal agencies — nearly 1 in 3 had recoverable text under the rectangles.
Zero of the 63 audited documents had unapplied /Redact annotations or embedded original-source files. The failure mode is overwhelmingly "drew an opaque shape on top of selectable text, didn't actually delete the text."

The headline numbers

72 PDFs collected via public Google search

63 successfully audited (9 returned HTTP errors)

11 had text recoverable under redaction rectangles

94% leaked at least one form of identifying metadata

What "leaked" actually means here

We bucketed findings into three severities. Critical is "selectable text is physically present under a visible opaque rectangle, and an attacker can recover it by selecting and copying." Warning is "identifying metadata that the redactor probably didn't realize was there" — author name in the /Info dictionary, software trail in the Producer field, full XMP metadata streams, bookmark titles that quote the redacted section. Info is "structural features that could carry data but rarely do in the wild" — optional content groups, AcroForm field placeholders.

Failure mode	Documents	% of 63 audited	Severity
Selectable text under redaction rectangles	11	17.5%	Critical
Producer field set (software trail)	49	77.8%	Warning
XMP metadata stream present	56	88.9%	Warning
Document Title field set	42	66.7%	Warning
Creator field set	39	61.9%	Warning
Author field set	38	60.3%	Warning
Document bookmarks present	22	34.9%	Warning
Subject field set	13	20.6%	Warning
Optional Content Groups (layers)	4	6.3%	Info
Keywords field set	3	4.8%	Warning
Unapplied /Redact annotations	0	0%	Critical
Embedded original-source files	0	0%	Critical
XFA form data	0	0%	Info

The critical leaks are concentrated at federal agencies

Splitting the 63 audited documents by publisher category, the rate of "selectable text under a redaction rectangle" leaks is sharply uneven:

Publisher category	n	Critical leaks	Any metadata leak
US federal agency	26	8 (30.8%)	21 (81%)
US state / local government	27	1 (3.7%)	24 (89%)
Media organization	3	1 (33%)	3
US courts (filings)	4	0	2
International courts (ICC, ICTY)	3	0	2
Foreign government	1	0	1
NGO / advocacy	4	0	4
University	3	0	1
Corporate (regulatory filing)	1	0	1

The sample is too small to make confident inferences below ~25 documents per category, but the federal-agency numbers are large enough to flag. Nearly one in three US federal agency documents in our sample had text physically recoverable under the visible redaction rectangles. The federal-leak set spanned the FTC (three separate complaints), DOJ (Office of the Inspector General, Antitrust Division, and Civil settlements), DNI, and one EPA action memo. Court filings (US and international) fared notably better.

Why this keeps happening

A PDF isn't a flat image. It's a structured document that stores text content, graphics, fonts, annotations, metadata, and outlines as separate objects. When a user draws an opaque rectangle in Preview, Acrobat's annotation tool, or any markup app, that rectangle is painted on top and the text underneath is untouched. Anyone who opens the file can select the redacted region, copy it, and paste readable text.
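To make the overlay problem concrete, here is a minimal sketch. It assumes a hand-written, uncompressed page content stream invented for illustration; real streams are usually compressed and may encode text as hex strings or through custom font encodings, none of which this handles. The function names are hypothetical. The rectangle is painted after the text, so it hides the text on screen, but the string passed to the Tj (show text) operator is still in the stream.

```python
import re

# Hypothetical, hand-written page content stream (uncompressed for readability).
# The text is shown first, then a black rectangle is filled over the same area.
CONTENT_STREAM = b"""
BT
/F1 12 Tf
72 700 Td
(Witness name: Jane Example) Tj
ET
0 0 0 rg
70 695 200 16 re
f
"""


def literal_strings_shown(stream: bytes) -> list[str]:
    """Return literal strings passed to the Tj operator.

    Simplification: ignores escaped parentheses, hex strings, and TJ arrays.
    """
    return [m.decode("latin-1") for m in re.findall(rb"\(([^()]*)\)\s*Tj", stream)]


def has_filled_rectangle(stream: bytes) -> bool:
    """True if the stream appends a rectangle (re) and later fills it (f)."""
    return re.search(rb"\bre\b[\s\S]*?\bf\b", stream) is not None


print(literal_strings_shown(CONTENT_STREAM))
print(has_filled_rectangle(CONTENT_STREAM))
```

A true redaction would remove or rewrite the Tj operand itself; adding more drawing operators after it changes only what the page looks like.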

The fix is straightforward in theory: mature PDF editors and redaction libraries offer a "true redaction" operation that physically deletes the text from the content stream. The barriers are mostly practical:

The true redaction tool in Acrobat is in a different menu from the annotation tool, and the icons look similar enough that the wrong one gets used.
Many free PDF tools advertise "redaction" but only draw a black rectangle.
Cheap third-party SaaS redactors do the same overlay trick, then upload to a server, which compounds the privacy problem.
Most workflows ship the PDF straight out without verifying. No "did the redaction actually work?" check happens.

The metadata leaks are even simpler: the redactor stripped the body but forgot the document properties dialog. Title, Author, Subject, and Keywords frequently survive because they live in a separate dictionary that the editing tool doesn't touch.

The pattern across the eleven critical leaks

Across the 11 documents with text recoverable under redaction rectangles, we found that:

The leaks cluster on a small number of documents. Four of the eleven had 5 or more independent rectangle-text leaks; one had 8, another had 7, another had 6. The other seven had 1–2 each. When a multi-page filing got redaction wrong, it tended to get it wrong repeatedly.
The Producer field consistently exposed the software stack. Every document in the critical-leak set named the PDF generator — usually a specific Acrobat or alternative library version. Useful for an attacker fingerprinting the upstream workflow.
XMP metadata was almost universal. 89% of all audited documents carry an XMP stream. Of those, many include identifying strings well beyond what Acrobat's "Document Properties" dialog displays.
Bookmark titles repeatedly quoted redacted section headings. 35% of audited docs had a non-empty document outline. The outline survives even when the body text is genuinely redacted, and bookmark titles often summarize the very content the redactor was trying to hide.

We aren't naming any of the eleven documents. The point of the study isn't to embarrass specific publishers — it's to make clear that the failure rate, on a population of documents whose publishers explicitly believed they had redacted them, is meaningfully nonzero across every category we sampled with enough volume to measure.

Verify your own redactions →

Free, runs entirely in your browser, your PDF never leaves your device.

How to actually redact

If you're producing redacted PDFs and you're not 100% sure your tool is doing content-stream removal vs. drawing rectangles, three things to do right now:

Test the basic case. Open your "redacted" output, hit Cmd-A / Ctrl-A to select all, copy, and paste into a plain text editor. If you see the text you thought you redacted, your tool failed.
Strip metadata as a separate step. Even with a true content-stream redaction, the /Info dictionary and XMP stream survive unless you explicitly clear them. A metadata scrub takes seconds.
Rasterize high-stakes documents. If you're publishing a redacted court filing, FOIA release, or regulatory disclosure, the safest belt-and-suspenders step is a rasterize-and-re-OCR pass that destroys any latent text layer. A plain "print to PDF" is not a substitute, because it usually writes the underlying text back out as selectable text. Some legal-tech tooling does the rasterize pass automatically; most consumer PDF tools don't.
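The select-all-and-paste test can be scripted once the text has been extracted by whatever tool you already trust. The sketch below assumes the extraction has already happened and that you have a list of strings that were supposed to be removed; the function names and sample strings are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line wraps don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def leaked_terms(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in the extracted text."""
    haystack = normalize(extracted_text)
    return [t for t in sensitive_terms if normalize(t) and normalize(t) in haystack]


# Hypothetical example: text copied out of a "redacted" PDF.
pasted = "The complaint names  Jane\nExample as the primary witness."
print(leaked_terms(pasted, ["Jane Example", "Acme Holdings"]))
```

An empty result is weaker evidence than a hit: text that is hyphenated across lines or stored in an unusual encoding can slip past a plain substring match, so this complements a manual check rather than replacing it.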

Methodology

72 PDFs collected on 2026-06-07 via a dozen Google search queries of the form filetype:pdf intitle:redacted, filetype:pdf inurl:redacted, plus variants narrowed by document type (complaint, settlement, deposition, indictment, audit report, etc.) and publisher domain (site:.gov). Only PDFs explicitly titled or URL-marked as redacted versions were included; guidance documents about how to redact were excluded.

Each PDF was downloaded (30 MB cap; HTTP timeouts at 45 seconds) and audited locally using the same PDF engine that powers our in-browser tools. Per-document detection covers:

Uniform-fill rectangle detection via page rasterization at 1×, then a per-4×4-block variance check to identify opaque axis-aligned regions of any color (not only black). Rectangles smaller than 24×12 px are excluded to suppress bullets, table-cell shading, and decorative icons.
PDF annotation walk — /Square, /Circle, /Polygon, /Ink, /Highlight, /FreeText, /Redact, and related subtypes — for cases where the redaction is an annotation rather than a baked-in graphic.
Text-overlap check: for each candidate rectangle, the structured text stream is queried for lines whose bounding box overlaps the rectangle by ≥30% (in either direction). The recovered snippet is sliced to just the portion of the line that falls inside the rectangle. Snippets with fewer than 4 alphanumeric characters are discarded as noise; identical snippets on the same page are deduplicated.
The /Info dictionary, XMP metadata stream, document outline, embedded files, optional content groups, AcroForm/XFA presence — all read directly from the PDF structure.
Only the first 30 pages of each PDF were scanned, so the measured critical-leak rate is a floor, not a ceiling.
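The text-overlap check from the list above can be sketched as follows. This is not the study's code. It reads "≥30% in either direction" as "the intersection covers at least 30% of the text line or at least 30% of the rectangle", and it slices the snippet by assuming evenly spaced characters; both are assumptions, and the Box type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def area(self) -> float:
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection(a: Box, b: Box) -> Box:
    """Overlapping region of two boxes (zero area if they don't overlap)."""
    return Box(max(a.x0, b.x0), max(a.y0, b.y0), min(a.x1, b.x1), min(a.y1, b.y1))


def overlaps(line: Box, rect: Box, threshold: float = 0.30) -> bool:
    """True if the intersection covers >= threshold of the line or of the rectangle."""
    inter = intersection(line, rect).area
    if inter == 0.0:
        return False
    return inter / line.area >= threshold or inter / rect.area >= threshold


def slice_to_rect(text: str, line: Box, rect: Box) -> str:
    """Keep the characters that fall horizontally inside the rectangle.

    Assumption: characters are evenly spaced along the line.
    """
    if not text or line.x1 <= line.x0:
        return ""
    char_w = (line.x1 - line.x0) / len(text)
    start = max(0, round((rect.x0 - line.x0) / char_w))
    end = min(len(text), round((rect.x1 - line.x0) / char_w))
    return text[start:end] if end > start else ""


line_box = Box(0, 0, 200, 10)       # hypothetical text line, 20 characters wide
redaction = Box(80, -2, 200, 12)    # hypothetical rectangle over its right part
if overlaps(line_box, redaction):
    print(slice_to_rect("Witness Jane Example", line_box, redaction))
```

A real implementation would use per-glyph positions from the PDF engine instead of the even-spacing shortcut, which misplaces the cut for proportional fonts.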

The 2026-06-08 methodology update

The original 2026-06-07 detector required R, G, B all < 30 to mark a pixel as "redaction-like" — i.e. only near-pure-black overlays. That missed dark-grey rectangles, anti-aliased edges, and any non-black color a redactor might have used (including the orange and dark-blue annotation rectangles people frequently use in markup tools).
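The gap is easy to see by putting the original per-pixel rule next to a color-agnostic uniformity test of the kind the per-4×4-block variance check under Methodology describes. The sketch below is illustrative only: the variance threshold is an assumed value, the sample pixel values are invented, and the function names are hypothetical.

```python
from statistics import pvariance

Pixel = tuple[int, int, int]


def is_near_black(pixel: Pixel, limit: int = 30) -> bool:
    """Original rule: R, G, and B must all be below the limit."""
    return all(channel < limit for channel in pixel)


def is_uniform_block(block: list[Pixel], max_variance: float = 4.0) -> bool:
    """Color-agnostic rule: every channel has low variance across the block.

    max_variance is an assumed threshold, not a value from the study.
    """
    return all(pvariance([p[c] for p in block]) <= max_variance for c in range(3))


# Invented 4x4 blocks (16 pixels each).
dark_grey = [(62, 62, 62)] * 16
orange = [(230, 120, 20)] * 15 + [(229, 121, 20)]       # mild anti-aliasing noise
text_on_white = [(255, 255, 255)] * 10 + [(0, 0, 0)] * 6

for name, block in [("dark grey", dark_grey), ("orange", orange), ("text", text_on_white)]:
    print(name, all(is_near_black(p) for p in block), is_uniform_block(block))
```

A uniformity test alone also accepts blank page background, so a working detector needs a further step to separate overlays from empty paper; that step is not shown here.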

The updated detector flags any uniform-color block, then filters the candidates with three rules: the rectangle must be at least 24×12 px (kills bullet-point glyphs and icons), the recovered text must contain at least 4 alphanumeric characters (kills snippets that are pure punctuation or whitespace), and identical snippets on the same page are counted once (kills decorative-graphic repeats). The detector was also extended to walk PDF annotations directly, so user-drawn Square / Ink / Highlight overlays in editor-modified files are caught by structural inspection, not pixel detection.
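The three filter rules translate into a short pass over the candidate list. The sketch below assumes "at least 24×12 px" means width of at least 24 and height of at least 12; the Candidate type, field names, and sample data are hypothetical.

```python
from dataclasses import dataclass

MIN_WIDTH_PX, MIN_HEIGHT_PX = 24, 12
MIN_ALNUM_CHARS = 4


@dataclass(frozen=True)
class Candidate:
    page: int
    width_px: int
    height_px: int
    snippet: str


def keep_candidates(candidates: list[Candidate]) -> list[Candidate]:
    """Apply the size, alphanumeric-count, and per-page dedup rules in order."""
    seen: set[tuple[int, str]] = set()
    kept: list[Candidate] = []
    for c in candidates:
        if c.width_px < MIN_WIDTH_PX or c.height_px < MIN_HEIGHT_PX:
            continue  # too small: likely a bullet or icon
        if sum(ch.isalnum() for ch in c.snippet) < MIN_ALNUM_CHARS:
            continue  # mostly punctuation or whitespace
        key = (c.page, c.snippet)
        if key in seen:
            continue  # identical snippet already counted on this page
        seen.add(key)
        kept.append(c)
    return kept


samples = [
    Candidate(1, 120, 14, "Jane Example"),
    Candidate(1, 120, 14, "Jane Example"),   # duplicate on the same page
    Candidate(1, 10, 10, "Section"),         # below the minimum size
    Candidate(2, 80, 14, "-- . --"),         # no alphanumeric characters
    Candidate(2, 120, 14, "Jane Example"),   # same snippet, different page
]
print([(c.page, c.snippet) for c in keep_candidates(samples)])
```

Deduplicating per page rather than per document keeps a name that leaks on several pages counted once per page.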

Re-running the new detector on the same 72 cached PDFs raised the critical-leak rate from 13% (8 docs) to 17% (11 docs). The metadata-leak numbers were essentially unchanged. Per-doc snippets remain in our gitignored audit folder and are not published.

Limitations

Sample size: 72 is enough to flag a pattern, not enough for tight per-category confidence intervals. Rates for publisher categories with fewer than 25 documents should be read as directional only.
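To see how loose the per-category rates are, one option is a Wilson score interval on each proportion. The study did not report intervals; the choice of the Wilson method and the 95% level (z = 1.96) are assumptions made here for illustration, applied to counts taken from the publisher-category table.

```python
from math import sqrt


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Critical-leak counts from the publisher-category table.
rows = [
    ("US federal agency", 8, 26),
    ("US state / local government", 1, 27),
    ("Media organization", 1, 3),
]
for label, k, n in rows:
    low, high = wilson_interval(k, n)
    print(f"{label}: {k}/{n} -> {low:.0%} to {high:.0%}")
```

The intervals come out wide even for the two largest categories, which is why the smaller rows are best read as directional.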

Selection bias: Google-indexed PDFs are not a random sample of all redacted PDFs. Documents from agencies that publish frequently and make their files easy for crawlers to find are over-represented vs. documents that were redacted carefully and never indexed.

Detection bias: our rectangle detector requires a uniform-color axis-aligned block of at least 24×12 pixels. Redactions implemented as rasterized image overlays, irregular shapes (hand-drawn scribbles), or per-character whiteout strokes are not detected. This means the 17% critical-leak rate is a lower bound for the "drew an opaque shape on top of text" failure mode and entirely misses other failure modes.

What's not in scope: we did not audit OCR-layer leaks under image-only redactions, scanned-document redactions where the original soft copy is also published elsewhere, or temporal leaks (earlier draft versions of the same document, embedded thumbnails carrying pre-redaction renders). All of these have happened in real cases and would push the true leak rate higher.

What we did not publish

Per-document URLs, filenames, and identifying details are not included anywhere in this article. Recovered text content from the 11 critical-leak documents is not published or quoted, even partially. Aggregate counts and category rollups are the only data surfaced here.

The raw per-document audit results stayed on the auditor's machine in a gitignored folder. They were used only to compute the aggregate numbers above.

Reproduce this

The detection logic is the same one running in our in-browser unredact tool. Drop any of your own supposedly-redacted PDFs into it and you'll get the same audit your file would have received in this study.