We audited 72 "redacted" PDFs. 17% still leak text — and 31% of federal ones do.

Essex Software research · 2026-06-07 · updated 2026-06-08

PDF redaction has been a solved problem for two decades. Acrobat has a true-removal redact tool. So does every serious legal-tech vendor. So does the open-source MuPDF stack we used for this study. And yet, in 2019 Paul Manafort's sealed court filing leaked because someone drew black rectangles instead of redacting. In 2021 the EU Commission's AstraZeneca contract leaked because someone left the document bookmarks intact. The pattern keeps happening because looking redacted and being redacted are two different things — and almost nobody verifies the difference.

Updated 2026-06-08: Our original detector (published 2026-06-07) only caught near-black redaction rectangles, so it missed dark-grey overlays with mild anti-aliasing. The updated detector catches uniform-fill rectangles of any color while filtering out small icons, bullet points, and short snippets to suppress false positives. Re-running it on the same PDFs (63 audited of the 72 collected) raised the critical-leak rate from 13% to 17%. The federal-agency rate went from 19% to 31%. The methodology note at the bottom describes the change.

TL;DR

The headline numbers

72 PDFs collected via public Google search
63 successfully audited (9 returned HTTP errors)
11 had text recoverable under redaction rectangles
94% leaked at least one form of identifying metadata

What "leaked" actually means here

We bucketed findings into three severities. Critical is "selectable text is physically present under a visible opaque rectangle, and an attacker can recover it by selecting and copying." Warning is "identifying metadata that the redactor probably didn't realize was there" — author name in the /Info dictionary, software trail in the Producer field, full XMP metadata streams, bookmark titles that quote the redacted section. Info is "structural features that could carry data but rarely do in the wild" — optional content groups, AcroForm field placeholders.

| Failure mode | Documents | % of 63 audited | Severity |
| --- | --- | --- | --- |
| Selectable text under redaction rectangles | 11 | 17.5% | Critical |
| Producer field set (software trail) | 49 | 77.8% | Warning |
| XMP metadata stream present | 56 | 88.9% | Warning |
| Document Title field set | 42 | 66.7% | Warning |
| Creator field set | 39 | 61.9% | Warning |
| Author field set | 38 | 60.3% | Warning |
| Document bookmarks present | 22 | 34.9% | Warning |
| Subject field set | 13 | 20.6% | Warning |
| Optional Content Groups (layers) | 4 | 6.3% | Info |
| Keywords field set | 3 | 4.8% | Warning |
| Unapplied /Redact annotations | 0 | 0% | Critical |
| Embedded original-source files | 0 | 0% | Critical |
| XFA form data | 0 | 0% | Info |

The critical leaks are concentrated at federal agencies

Splitting the 63 audited documents by publisher category, the rate of "selectable text under a redaction rectangle" leaks is sharply uneven:

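| Publisher category | n | Critical leaks | Any metadata leak |
| --- | --- | --- | --- |
| US federal agency | 26 | 8 (30.8%) | 21 (81%) |
| US state / local government | 27 | 1 (3.7%) | 24 (89%) |
| Media organization | 3 | 1 (33%) | 3 |
| US courts (filings) | 4 | 0 | 2 |
| International courts (ICC, ICTY) | 3 | 0 | 2 |
| Foreign government | 1 | 0 | 1 |
| NGO / advocacy | 4 | 0 | 4 |
| University | 3 | 0 | 1 |
| Corporate (regulatory filing) | 1 | 0 | 1 |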

The sample is too small to make confident inferences below ~25 documents per category, but the federal-agency numbers are large enough to flag. Nearly one in three US federal agency documents in our sample had text physically recoverable under the visible redaction rectangles. The federal-leak set spanned the FTC (three separate complaints), DOJ (Office of the Inspector General, Antitrust Division, and Civil settlements), DNI, and one EPA action memo. Court filings (US and international) fared notably better.

Why this keeps happening

A PDF isn't a flat image. It's a structured document with separate layers for text content, graphics, fonts, annotations, metadata, and outlines. When a user draws an opaque rectangle in Preview, Acrobat's annotation tool, or any markup app, that rectangle goes on top — the text layer is untouched. Anyone who opens the file can select the redacted region, copy it, and paste readable text.
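The overlay problem can be illustrated with a small geometric check. This is a minimal sketch, not the study's detector: the `Rect` and `TextSpan` types, the `min_cover` threshold of 0.5, and the sample coordinates and strings are all hypothetical, and it assumes some PDF library has already extracted text spans and overlay rectangles in the same page coordinate system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rect:
    """Axis-aligned box in page coordinates, from (x0, y0) to (x1, y1)."""
    x0: float
    y0: float
    x1: float
    y1: float

    def area(self) -> float:
        # Degenerate or inverted boxes count as empty.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)

    def intersection(self, other: "Rect") -> "Rect":
        return Rect(max(self.x0, other.x0), max(self.y0, other.y0),
                    min(self.x1, other.x1), min(self.y1, other.y1))


@dataclass(frozen=True)
class TextSpan:
    text: str
    box: Rect


def text_under_overlays(spans, overlays, min_cover=0.5):
    """Return the text of every span that is mostly covered by some overlay."""
    hidden = []
    for span in spans:
        span_area = span.box.area()
        if span_area == 0:
            continue
        for rect in overlays:
            if span.box.intersection(rect).area() / span_area >= min_cover:
                hidden.append(span.text)
                break  # one covering overlay is enough
    return hidden


# Hypothetical page: the second span sits under a drawn rectangle.
spans = [TextSpan("Witness:", Rect(72, 700, 130, 712)),
         TextSpan("Jane Doe", Rect(135, 700, 190, 712))]
overlays = [Rect(133, 698, 192, 714)]
print(text_under_overlays(spans, overlays))
```

If the text layer still yields a span inside the rectangle, the rectangle is cosmetic. A true redaction would leave no span there to find.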

The fix is straightforward in theory: every modern PDF library has a "true redaction" operation that physically deletes the text from the content stream. The barrier is mostly that:

The metadata leaks are even simpler — the redactor stripped the body but forgot the document properties dialog. Title, Author, Subject, and Keywords frequently survive because they're a separate dictionary the editing tool doesn't touch.
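As a rough illustration of how visible these fields can be, the sketch below scans raw PDF bytes for literal-string entries such as `/Author (...)`. It is a heuristic under stated assumptions, not the audit logic used in this study: it only sees uncompressed, unencrypted dictionaries with parenthesized strings, it ignores hex and UTF-16 encoded strings and nested parentheses, and `/Title` can also match bookmark entries. The function name and sample bytes are hypothetical.

```python
import re

INFO_KEYS = ("Title", "Author", "Subject", "Keywords", "Creator", "Producer")


def find_info_fields(pdf_bytes: bytes) -> dict:
    """Heuristically pull literal-string metadata entries out of raw PDF bytes."""
    found = {}
    for key in INFO_KEYS:
        # Matches e.g. b"/Author (jdoe)"; allows backslash escapes inside the string.
        pattern = rb"/" + key.encode("ascii") + rb"\s*\(((?:\\.|[^\\()])*)\)"
        match = re.search(pattern, pdf_bytes)
        if match:
            found[key] = match.group(1).decode("latin-1")
    return found


# Hypothetical fragment of an uncompressed /Info dictionary.
sample = (b"1 0 obj\n<< /Title (Settlement Agreement - REDACTED) "
          b"/Author (jdoe) /Producer (ExampleWriter 1.0) >>\nendobj\n")
print(find_info_fields(sample))
```

An empty result from a scan like this does not prove the file is clean, because metadata may sit in compressed object streams or in the XMP packet. A proper PDF library is the safer way to inspect and clear it.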

The pattern across the eleven critical leaks

Across the 11 documents with text recoverable under redaction rectangles, we found that:

We aren't naming any of the eleven documents. The point of the study isn't to embarrass specific publishers — it's to make clear that the failure rate, on a population of documents whose publishers explicitly believed they had redacted them, is meaningfully nonzero across every category we sampled with enough volume to measure.

Verify your own redactions →

Free, runs entirely in your browser, your PDF never leaves your device.

How to actually redact

If you're producing redacted PDFs and you're not 100% sure your tool is doing content-stream removal vs. drawing rectangles, three things to do right now:

  1. Test the basic case. Open your "redacted" output, hit Cmd-A / Ctrl-A to select all, copy, and paste into a plain text editor. If you see the text you thought you redacted, your tool failed.
  2. Strip metadata as a separate step. Even with a true content-stream redaction, the /Info dictionary and XMP stream survive unless you explicitly clear them. A metadata scrub takes seconds.
  3. Rasterize high-stakes documents rather than trusting a plain "print to PDF" pass. If you're publishing a redacted court filing, FOIA release, or regulatory disclosure, the safest belt-and-suspenders is a rasterize-and-re-OCR pass that destroys any latent text layer. An ordinary "print to PDF" usually keeps selectable text, so confirm the intermediate output is image-only before re-running OCR. Some legal-tech tooling does this automatically; most consumer PDF tools don't.
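The select-all test in step 1 can be partly automated once the pasted text is saved. The sketch below assumes you keep your own list of terms that were supposed to be removed; the function name and the sample strings are hypothetical.

```python
import re


def surviving_terms(extracted_text: str, sensitive_terms: list[str]) -> list[str]:
    """Return the sensitive terms that still appear in text copied out of the PDF."""
    # Collapse whitespace and case so line breaks in the pasted text do not hide a match.
    normalized = re.sub(r"\s+", " ", extracted_text).casefold()
    return [term for term in sensitive_terms
            if re.sub(r"\s+", " ", term).casefold() in normalized]


# Hypothetical paste from a "redacted" output file.
pasted = "Payment of $4,200,000 was made to\nJane   Doe on 3 March."
print(surviving_terms(pasted, ["Jane Doe", "Acme Holdings"]))
```

A non-empty result means the redaction failed. An empty result only covers the terms you listed and the text your extractor could see, so it complements the manual check rather than replacing it.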

Methodology

72 PDFs collected on 2026-06-07 via a dozen Google search queries of the form filetype:pdf intitle:redacted, filetype:pdf inurl:redacted, plus variants narrowed by document type (complaint, settlement, deposition, indictment, audit report, etc.) and publisher domain (site:.gov). Only PDFs explicitly titled or URL-marked as redacted versions were included; guidance documents about how to redact were excluded.

Each PDF was downloaded (30 MB cap; HTTP timeouts at 45 seconds) and audited locally using the MuPDF WebAssembly engine — the same engine that powers our in-browser PDF tools. Per-document detection covers:

The 2026-06-08 methodology update

The original 2026-06-07 detector required R, G, B all < 30 to mark a pixel as "redaction-like" — i.e. only near-pure-black overlays. That missed dark-grey rectangles, anti-aliased edges, and any non-black color a redactor might have used (including the orange and dark-blue annotation rectangles people frequently use in markup tools).
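The original rule, as described, amounts to a per-pixel threshold on all three channels. The sketch below restates it to show why common overlay colors slipped through; the function name and the sample RGB values are illustrative assumptions, not values taken from the audited files.

```python
def is_near_black(rgb, limit=30):
    """Original rule as described: every channel must be below the limit."""
    r, g, b = rgb
    return r < limit and g < limit and b < limit


# Illustrative fill colors a redactor might use.
samples = {
    "pure black": (0, 0, 0),
    "dark grey": (64, 64, 64),
    "orange": (255, 140, 0),
    "dark blue": (0, 0, 139),
}
for name, rgb in samples.items():
    print(f"{name}: {is_near_black(rgb)}")
```

Only the pure-black sample passes, so a rectangle in any of the other fills would never have been examined for text underneath.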

The updated detector flags any uniform-color block, then filters quality with three rules: minimum 24×12 px (kills bullet-point glyphs and icons), recovered text must contain at least 4 alphanumeric characters (kills snippets that are pure punctuation or whitespace), and identical snippets on the same page are counted once (kills decorative-graphic repeats). The detector was also extended to walk PDF annotations directly, so user-drawn Square / Ink / Highlight overlays in editor-modified files are caught by structural inspection, not pixel detection.
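The three filter rules can be sketched as a single pass over candidate blocks. This is an illustration under assumptions rather than the detector's actual code: the `Candidate` type is hypothetical, and reading "24×12 px" as width 24 by height 12 is a guess.

```python
from dataclasses import dataclass

MIN_WIDTH_PX = 24      # assumed to be the width in "24x12 px"
MIN_HEIGHT_PX = 12     # assumed to be the height
MIN_ALNUM_CHARS = 4


@dataclass(frozen=True)
class Candidate:
    page: int
    width_px: int
    height_px: int
    snippet: str  # text recovered from under the uniform-color block


def filter_candidates(candidates):
    """Apply the size, content, and per-page duplicate rules in order."""
    kept = []
    seen = set()
    for c in candidates:
        # Rule 1: drop blocks too small to be a redaction (bullets, icons).
        if c.width_px < MIN_WIDTH_PX or c.height_px < MIN_HEIGHT_PX:
            continue
        # Rule 2: drop snippets that are mostly punctuation or whitespace.
        if sum(ch.isalnum() for ch in c.snippet) < MIN_ALNUM_CHARS:
            continue
        # Rule 3: count identical snippets on the same page once.
        key = (c.page, c.snippet)
        if key in seen:
            continue
        seen.add(key)
        kept.append(c)
    return kept
```

Deduplicating on the pair of page and snippet means the same text under rectangles on two different pages still counts twice, which matches the "same page" wording above.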

Re-running the new detector on the same cached PDFs (the 63 that downloaded successfully, out of 72 collected) raised the critical-leak rate from 13% (8 docs) to 17% (11 docs). The metadata-leak numbers were essentially unchanged. Per-doc snippets remain in our gitignored audit folder and are not published.
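A quick arithmetic check shows which denominator the rounded rates correspond to. The counts come from this article; the helper name is hypothetical.

```python
AUDITED = 63    # documents that downloaded and were audited
COLLECTED = 72  # documents found by the search queries


def pct(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


for label, leaks in (("original detector", 8), ("updated detector", 11)):
    print(f"{label}: {pct(leaks, AUDITED)}% of audited, "
          f"{pct(leaks, COLLECTED)}% of collected")
```

The quoted 13% and 17% match 8 and 11 out of the 63 audited documents (12.7% and 17.5%), not out of the 72 collected.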

Limitations

Sample size: 72 collected documents (63 audited) is enough to flag a pattern, not enough for tight per-category confidence intervals. Rates for publisher categories with fewer than 25 documents should be read as directional only.
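To give a feel for how wide the uncertainty is at these sample sizes, here is a sketch of a Wilson score interval. The study itself does not report intervals; the choice of the Wilson method and of z = 1.96 (roughly 95%) are assumptions made for illustration, applied to counts taken from the tables above.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


for label, k, n in (("federal critical leaks", 8, 26),
                    ("all critical leaks", 11, 63)):
    lo, hi = wilson_interval(k, n)
    print(f"{label}: {k}/{n}, interval {lo:.2f} to {hi:.2f}")
```

For the federal category (8 of 26) the interval runs from roughly 17% to 50%, which is why even the largest categories here support a flag rather than a precise rate.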

Selection bias: Google-indexed PDFs are not a random sample of all redacted PDFs. Documents from agencies that publish frequently and make their files easy for crawlers to find are over-represented vs. documents that were redacted carefully and never indexed.

Detection bias: our rectangle detector requires a uniform-color axis-aligned block of at least 24×12 pixels. Redactions implemented as rasterized image overlays, irregular shapes (hand-drawn scribbles), or per-character whiteout strokes are not detected. This means the 17% critical-leak rate is a lower bound for the "drew an opaque shape on top of text" failure mode and entirely misses other failure modes.

What's not in scope: we did not audit OCR-layer leaks under image-only redactions, scanned-document redactions where the original soft copy is also published elsewhere, or temporal leaks (earlier draft versions of the same document, embedded thumbnails carrying pre-redaction renders). All of these have happened in real cases and would push the true leak rate higher.

What we did not publish

Per-document URLs, filenames, and identifying details are not included anywhere in this article. Recovered text content from the 11 critical-leak documents is not published or quoted, even partially. Aggregate counts and category rollups are the only data surfaced here.

The raw per-document audit results stayed on the auditor's machine in a gitignored folder. They were used only to compute the aggregate numbers above.

Reproduce this

The detection logic is the same one running in our in-browser unredact tool. Drop any of your own supposedly-redacted PDFs into it and you'll see the same audit your file would have received in this study.