We audited 72 "redacted" PDFs. 17% still leak text — and 31% of federal ones do.
PDF redaction has been a solved problem for two decades. Acrobat has a true-removal redact tool. So does every serious legal-tech vendor. So does the open-source MuPDF stack we used for this study. And yet, in 2019 Paul Manafort's sealed court filing leaked because someone drew black rectangles instead of redacting. In 2021 the EU Commission's AstraZeneca contract leaked because someone left the document bookmarks intact. The pattern keeps happening because looking redacted and being redacted are two different things — and almost nobody verifies the difference.
- We collected 72 publicly indexed PDFs whose titles or URLs contained the word "redacted" — court filings, FOIA releases, government investigation reports, regulatory complaints, contracts.
- 63 audited successfully. 11 of them (17%) still have selectable text physically present under the redaction rectangles. 558 characters of supposed-to-be-redacted text were recoverable across those 11 documents.
- 59 of 63 (94%) leaked some form of identifying metadata — author name, document title, software trail, bookmark titles. Lower-severity than text-under-the-bars, but still a leak.
- The most-affected publisher category was US federal agencies — nearly 1 in 3 had recoverable text under the rectangles.
- Zero of the 63 audited documents had unapplied /Redact annotations or embedded original-source files. The failure mode is overwhelmingly "drew an opaque shape on top of selectable text, didn't actually delete the text."
The headline numbers
What "leaked" actually means here
We bucketed findings into three severities. Critical is "selectable text is physically present under a visible opaque rectangle, and an attacker can recover it by selecting and copying." Warning is "identifying metadata that the redactor probably didn't realize was there" — author name in the /Info dictionary, software trail in the Producer field, full XMP metadata streams, bookmark titles that quote the redacted section. Info is "structural features that could carry data but rarely do in the wild" — optional content groups, AcroForm field placeholders.
| Failure mode | Documents | % of 63 audited | Severity |
|---|---|---|---|
| Selectable text under redaction rectangles | 11 | 17.5% | Critical |
| Producer field set (software trail) | 49 | 77.8% | Warning |
| XMP metadata stream present | 56 | 88.9% | Warning |
| Document Title field set | 42 | 66.7% | Warning |
| Creator field set | 39 | 61.9% | Warning |
| Author field set | 38 | 60.3% | Warning |
| Document bookmarks present | 22 | 34.9% | Warning |
| Subject field set | 13 | 20.6% | Warning |
| Optional Content Groups (layers) | 4 | 6.3% | Info |
| Keywords field set | 3 | 4.8% | Warning |
| Unapplied /Redact annotations | 0 | 0% | Critical |
| Embedded original-source files | 0 | 0% | Critical |
| XFA form data | 0 | 0% | Info |
The critical leaks are concentrated at federal agencies
Splitting the 63 audited documents by publisher category, the rate of "selectable text under a redaction rectangle" leaks is sharply uneven:
| Publisher category | n | Critical leaks | Any metadata leak |
|---|---|---|---|
| US federal agency | 26 | 8 (30.8%) | 21 (81%) |
| US state / local government | 27 | 1 (3.7%) | 24 (89%) |
| Media organization | 3 | 1 (33%) | 3 |
| US courts (filings) | 4 | 0 | 2 |
| International courts (ICC, ICTY) | 3 | 0 | 2 |
| Foreign government | 1 | 0 | 1 |
| NGO / advocacy | 4 | 0 | 4 |
| University | 3 | 0 | 1 |
| Corporate (regulatory filing) | 1 | 0 | 1 |
The sample is too small to make confident inferences below ~25 documents per category, but the federal-agency numbers are large enough to flag. Nearly one in three US federal agency documents in our sample had text physically recoverable under the visible redaction rectangles. The federal-leak set spanned the FTC (three separate complaints), DOJ (Office of the Inspector General, Antitrust Division, and Civil settlements), DNI, and one EPA action memo. Court filings (US and international) fared notably better.
Why this keeps happening
A PDF isn't a flat image. It's a structured document with separate layers for text content, graphics, fonts, annotations, metadata, and outlines. When a user draws an opaque rectangle in Preview, Acrobat's annotation tool, or any markup app, that rectangle goes on top — the text layer is untouched. Anyone who opens the file can select the redacted region, copy it, and paste readable text.
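To make the overlay failure concrete, here is a minimal sketch in Python. Everything in it is illustrative: the content stream is hand-written and hypothetical (the sample string, the coordinates, and the `shown_strings` helper are assumptions, not taken from any audited file). The stream shows one line of text and then paints a filled black rectangle over it. A naive scan for text-showing operators still finds the string, because the rectangle is just one more drawing instruction.

```python
import re

# Hypothetical, hand-written page content stream (not from any audited PDF).
# "BT ... ET" shows a line of text; "re f" then paints a filled black
# rectangle over the same area. Nothing in the rectangle instructions
# removes the text operator that came before it.
CONTENT_STREAM = b"""
BT /F1 12 Tf 72 700 Td (Witness: Jane Example) Tj ET
0 0 0 rg
70 695 160 16 re f
"""

# Simplified pattern: literal strings shown with the Tj operator.
# It handles backslash escapes but not nested parentheses, hex strings,
# TJ arrays, or custom font encodings, all of which real files can use.
TJ_LITERAL = re.compile(rb"\(((?:\\.|[^\\()])*)\)\s*Tj")


def shown_strings(stream: bytes) -> list[str]:
    """Return the literal strings passed to Tj in an uncompressed stream."""
    return [m.group(1).decode("latin-1") for m in TJ_LITERAL.finditer(stream)]


print(shown_strings(CONTENT_STREAM))
```

Real content streams are usually compressed and need a PDF library to decode, so this is only a picture of the mechanism, not an audit tool.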
The fix is straightforward in theory: every modern PDF library has a "true redaction" operation that physically deletes the text from the content stream. The barrier is mostly that:
- The true redaction tool in Acrobat is in a different menu from the annotation tool, and the icons look similar enough that the wrong one gets used.
- Many free PDF tools advertise "redaction" but only draw a black rectangle.
- Cheap third-party SaaS redactors do the same overlay trick, then upload to a server, which compounds the privacy problem.
- Most workflows ship the PDF straight out without verifying. No "did the redaction actually work?" check happens.
The metadata leaks are even simpler — the redactor stripped the body but forgot the document properties dialog. Title, Author, Subject, and Keywords frequently survive because they're a separate dictionary the editing tool doesn't touch.
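As a rough illustration of how exposed these fields can be, the sketch below scans raw bytes for document-information keys stored as plain literal strings. It is a simplification under stated assumptions: the `find_info_fields` helper and the sample bytes are hypothetical, and the scan only works when the dictionary is uncompressed, unencrypted, and uses literal (parenthesized) strings. A real audit should read the fields through a PDF library.

```python
import re

# Standard document-information keys that commonly carry identifying text.
INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def find_info_fields(pdf_bytes: bytes) -> dict[str, str]:
    """Return the first literal-string value found for each key.

    Assumption: values are stored as "(...)" literals in uncompressed bytes.
    Note that "/Title" also appears in outline (bookmark) entries, so a
    Title hit may come from a bookmark rather than the /Info dictionary.
    """
    found: dict[str, str] = {}
    for key in INFO_KEYS:
        pattern = re.compile(
            rb"/" + key.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)"
        )
        match = pattern.search(pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Hypothetical sample object, written by hand for this example.
SAMPLE = (
    b"1 0 obj\n"
    b"<< /Title (Settlement draft v3) /Author (J. Example) "
    b"/Producer (ExamplePDF 1.0) >>\n"
    b"endobj\n"
)

print(find_info_fields(SAMPLE))
```

An empty result from a scan like this proves little, because the same fields may sit in a compressed object stream or in the XMP packet.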
The pattern across the eleven critical leaks
Across the 11 documents with text recoverable under redaction rectangles, we found that:
- The leaks cluster on a small number of documents. Four of the eleven had 5 or more independent rectangle-text leaks; one had 8, another had 7, another had 6. The other seven had 1–2 each. Multi-page filings that messed up redaction usually messed it up many times.
- The Producer field consistently exposed the software stack. Every document in the critical-leak set named the PDF generator — usually a specific Acrobat or alternative library version. Useful for an attacker fingerprinting the upstream workflow.
- XMP metadata was almost universal. 89% of all audited documents carry an XMP stream. Of those, many include identifying strings well beyond what Acrobat's "Document Properties" dialog displays.
- Bookmark titles repeatedly quoted redacted section headings. 35% of audited docs had a non-empty document outline. The outline survives even when the body text is genuinely redacted, and bookmark titles often summarize the very content the redactor was trying to hide.
We aren't naming any of the eleven documents. The point of the study isn't to embarrass specific publishers — it's to make clear that the failure rate, on a population of documents whose publishers explicitly believed they had redacted them, is meaningfully nonzero across every category we sampled with enough volume to measure.
How to actually redact
If you're producing redacted PDFs and you're not 100% sure your tool is doing content-stream removal rather than drawing rectangles, here are three things to do right now:
- Test the basic case. Open your "redacted" output, hit Cmd-A / Ctrl-A to select all, copy, and paste into a plain text editor. If you see the text you thought you redacted, your tool failed.
- Strip metadata as a separate step. Even with a true content-stream redaction, the /Info dictionary and XMP stream survive unless you explicitly clear them. A metadata scrub takes seconds.
- Flatten through a "print to PDF" pass for high-stakes documents. If you're publishing a redacted court filing, FOIA release, or regulatory disclosure, the safest belt-and-suspenders is a rasterize-and-re-OCR pass that destroys any latent text layer. Some legal-tech tooling does this automatically; most consumer PDF tools don't.
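The select-all-and-paste test in the first step can be made a little more systematic. The sketch below is one possible way to do it, assuming you already have the pasted or extracted text as a string and a list of terms you meant to remove. The `leaked_terms` and `normalize` helpers and the sample text are hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so line breaks don't hide a match."""
    return re.sub(r"\s+", " ", text).strip().lower()


def leaked_terms(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in the extracted text."""
    haystack = normalize(extracted_text)
    leaks = []
    for term in sensitive_terms:
        needle = normalize(term)
        if needle and needle in haystack:
            leaks.append(term)
    return leaks


# Hypothetical text pasted from a "redacted" output file.
pasted = "Payment of $4,200 was made to John\nQ. Example on 3 March."

print(leaked_terms(pasted, ["John Q. Example", "Acme Holdings"]))
```

A non-empty result means the redaction failed. An empty result only covers the terms you listed and the text layer you extracted, so it does not replace the metadata and flattening steps.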
Methodology
72 PDFs collected on 2026-06-07 via a dozen Google search queries of the form filetype:pdf intitle:redacted, filetype:pdf inurl:redacted, plus variants narrowed by document type (complaint, settlement, deposition, indictment, audit report, etc.) and publisher domain (site:.gov). Only PDFs explicitly titled or URL-marked as redacted versions were included; guidance documents about how to redact were excluded.
Each PDF was downloaded (30 MB cap; HTTP timeouts at 45 seconds) and audited locally using the MuPDF WebAssembly engine — the same engine that powers our in-browser PDF tools. Per-document detection covers:
- Uniform-fill rectangle detection via page rasterization at 1×, then a per-4×4-block variance check to identify opaque axis-aligned regions of any color (not only black). Rectangles smaller than 24×12 px are excluded to suppress bullets, table-cell shading, and decorative icons.
- PDF annotation walk (/Square, /Circle, /Polygon, /Ink, /Highlight, /FreeText, /Redact, and related subtypes) for cases where the redaction is an annotation rather than a baked-in graphic.
- Text-overlap check: for each candidate rectangle, the structured text stream is queried for lines whose bounding box overlaps the rectangle by ≥30% (in either direction). The recovered snippet is sliced to just the portion of the line that falls inside the rectangle. Snippets with fewer than 4 alphanumeric characters are discarded as noise; identical snippets on the same page are deduplicated.
- The /Info dictionary, XMP metadata stream, document outline, embedded files, optional content groups, and AcroForm/XFA presence, all read directly from the PDF structure.
- The first 30 pages of each PDF were scanned. PDFs longer than 30 pages had only their first 30 audited, so the measured critical-leak rate is a floor, not a ceiling.
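The text-overlap step in the list above can be sketched in a few lines of Python. This is an illustration, not the study's code, and it rests on several assumptions: "≥30% in either direction" is read as intersection area divided by either the line's area or the rectangle's area, and the slice uses a uniform character width where a real engine would use per-character boxes. All names and coordinates are hypothetical.

```python
import math
from dataclasses import dataclass

Box = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class TextLine:
    bbox: Box
    text: str


def area(b: Box) -> float:
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])


def intersection(a: Box, b: Box) -> Box:
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))


def overlaps(rect: Box, line: Box, threshold: float = 0.30) -> bool:
    """True if the shared area is >= threshold of either box (assumed reading)."""
    inter = area(intersection(rect, line))
    if inter == 0.0:
        return False
    return inter / area(line) >= threshold or inter / area(rect) >= threshold


def slice_to_rect(line: TextLine, rect: Box) -> str:
    """Keep the characters under the rectangle, assuming uniform char width."""
    x0, x1 = line.bbox[0], line.bbox[2]
    if x1 <= x0 or not line.text:
        return ""
    per_char = (x1 - x0) / len(line.text)
    start = max(0, math.floor((rect[0] - x0) / per_char))
    end = min(len(line.text), math.ceil((rect[2] - x0) / per_char))
    end = max(start, end)  # guard against a rectangle left of the line
    return line.text[start:end].strip()


def recover_snippets(rects: list[Box], lines: list[TextLine],
                     min_alnum: int = 4) -> list[str]:
    """Snippets for one page: overlap test, slice, noise filter, dedup."""
    seen: set[str] = set()
    snippets: list[str] = []
    for rect in rects:
        for line in lines:
            if not overlaps(rect, line.bbox):
                continue
            snippet = slice_to_rect(line, rect)
            if sum(ch.isalnum() for ch in snippet) < min_alnum:
                continue  # punctuation or whitespace only: treat as noise
            if snippet not in seen:
                seen.add(snippet)
                snippets.append(snippet)
    return snippets


# Hypothetical page: one covered name and one covered run of dots.
page_lines = [
    TextLine((100.0, 700.0, 360.0, 712.0), "Paid to Jane Example today"),
    TextLine((100.0, 650.0, 130.0, 662.0), "..."),
]
page_rects = [(180.0, 698.0, 300.0, 714.0), (95.0, 648.0, 135.0, 664.0)]

print(recover_snippets(page_rects, page_lines))
```

Here the first rectangle yields a snippet and the second is dropped by the 4-alphanumeric-character filter, which is the behavior the list describes.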
The 2026-06-08 methodology update
The original 2026-06-07 detector required R, G, B all < 30 to mark a pixel as "redaction-like" — i.e. only near-pure-black overlays. That missed dark-grey rectangles, anti-aliased edges, and any non-black color a redactor might have used (including the orange and dark-blue annotation rectangles people frequently use in markup tools).
The updated detector flags any uniform-color block, then filters quality with three rules: minimum 24×12 px (kills bullet-point glyphs and icons), recovered text must contain at least 4 alphanumeric characters (kills snippets that are pure punctuation or whitespace), and identical snippets on the same page are counted once (kills decorative-graphic repeats). The detector was also extended to walk PDF annotations directly, so user-drawn Square / Ink / Highlight overlays in editor-modified files are caught by structural inspection, not pixel detection.
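The difference between the two pixel rules can be shown with a small sketch. It is a simplified illustration, not the production detector: the helper names are hypothetical, and the `max_variance` threshold is an assumption because the write-up does not state the value used. A full detector would also need to ignore the page background and merge adjacent blocks into rectangles, and neither step is shown here.

```python
Pixel = tuple[int, int, int]  # (R, G, B), each 0-255


def is_near_black(p: Pixel, limit: int = 30) -> bool:
    """Original rule: R, G, and B must all be below 30."""
    return all(channel < limit for channel in p)


def block_is_uniform(block: list[Pixel], max_variance: float = 4.0) -> bool:
    """Updated rule (sketch): low per-channel variance, any color.

    max_variance is an assumed threshold chosen for this example.
    """
    for ch in range(3):
        values = [p[ch] for p in block]
        mean = sum(values) / len(values)
        variance = sum((v - mean) ** 2 for v in values) / len(values)
        if variance > max_variance:
            return False
    return True


def big_enough(width: int, height: int,
               min_width: int = 24, min_height: int = 12) -> bool:
    """Size filter: rectangles smaller than 24x12 px are excluded."""
    return width >= min_width and height >= min_height


# A 4x4 block (16 pixels) from a hypothetical orange markup rectangle.
orange_block = [(230, 120, 20)] * 16
# A 4x4 block straddling black glyph strokes and white page.
text_block = [(0, 0, 0)] * 8 + [(255, 255, 255)] * 8

print(all(is_near_black(p) for p in orange_block))  # old rule on orange
print(block_is_uniform(orange_block))               # new rule on orange
print(block_is_uniform(text_block))                 # new rule on glyph edges
print(big_enough(10, 10), big_enough(24, 12))
```

The orange block fails the near-black test but passes the uniformity test, which is the kind of overlay the original detector missed.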
Re-running the new detector on the same 72 cached PDFs raised the critical-leak rate from 13% (8 docs) to 17% (11 docs). The metadata-leak numbers were essentially unchanged. Per-doc snippets remain in our gitignored audit folder and are not published.
Limitations
Sample size: 72 is enough to flag a pattern, not enough for tight per-category confidence intervals. Per-publisher-type rates below 25 documents should be read as directional only.
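To see how loose the per-category rates are, one option is a Wilson score interval on the counts from the publisher table. The study itself does not report intervals, so the choice of the Wilson method and the 95% level (z = 1.96) are assumptions made for this sketch.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# Critical-leak counts taken from the publisher-category table.
categories = [
    ("US federal agency", 8, 26),
    ("US state / local government", 1, 27),
    ("Media organization", 1, 3),
]

for label, leaks, n in categories:
    low, high = wilson_interval(leaks, n)
    print(f"{label}: {leaks}/{n} -> {low:.1%} to {high:.1%}")
```

Under these assumptions the federal-agency interval is wide, spanning roughly 17% to 50%, and the three-document media category is wider still, which is why the small categories should be read as directional only.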
Selection bias: Google-indexed PDFs are not a random sample of all redacted PDFs. Documents from agencies that publish frequently and make their files easy for crawlers to find are over-represented vs. documents that were redacted carefully and never indexed.
Detection bias: our rectangle detector requires a uniform-color axis-aligned block of at least 24×12 pixels. Redactions implemented as rasterized image overlays, irregular shapes (hand-drawn scribbles), or per-character whiteout strokes are not detected. This means the 17% critical-leak rate is a lower bound for the "drew an opaque shape on top of text" failure mode and entirely misses other failure modes.
What's not in scope: we did not audit OCR-layer leaks under image-only redactions, scanned-document redactions where the original soft copy is also published elsewhere, or temporal leaks (earlier draft versions of the same document, embedded thumbnails carrying pre-redaction renders). All of these have happened in real cases and would push the true leak rate higher.
What we did not publish
Per-document URLs, filenames, and identifying details are not included anywhere in this article. Recovered text content from the 11 critical-leak documents is not published or quoted, even partially. Aggregate counts and category rollups are the only data surfaced here.
The raw per-document audit results stayed on the auditor's machine in a gitignored folder. They were used only to compute the aggregate numbers above.
Reproduce this
The detection logic is the same one running in our in-browser unredact tool. Drop any of your own supposedly-redacted PDFs into it and you'll see the same audit the file would have received in this study.