feat: Add text removal to auto-redact

Open balazs-szucs opened this issue 4 months ago • 19 comments

Description of Changes

Introduces new redaction code centered on RedactionService, enabling removal of targeted text from PDFs at the token level. It traverses pages, nested content, and graphics patterns, and scrubs semantic properties to prevent recovery, with kerning/spacing compensation to reduce layout shifts. OCR serves as a fallback, as discussed in the comments. The implementation relies on low-level PDFBox methods, which means we inherit some of their unpredictability.

What’s new

  • RedactionService
    • Single, cohesive redaction engine at app/core/src/main/java/stirling/software/SPDF/service/RedactionService.java.
    • Routes to ModerateRedactionService, VisualRedactionService, and AggressiveRedactionService.
  • Token‑level content stream rewriting
    • Parses and rewrites text‑showing operators (Tj, TJ, ', "), removing matched content from streams.
    • Segment model (TextSegment) captures operator, font, size, and stream positions for precise edits.
    • Handles complex TJ arrays with kerning numbers among strings.
  • Nested content and graphics coverage
    • Traverses PDResources to process Form XObjects and PDTilingPattern streams; rewrites nested content.
    • Writes redacted streams back to pages, XObjects, and patterns safely.
  • Removal of Semantic operators
    • Removes semantic text from marked content and properties: ActualText, Alt, TU.
    • Structure tree and annotation scrubbing for alternate/accessible text.
  • Font and encoding
    • Decoding fallbacks for subset/embedded fonts using TextDecodingHelper.
    • Safe string length and byte‑range mapping for precise modifications.
    • Width calculation helpers and guarded kerning to preserve layout where possible.
  • Aggressive removal mode (text deletion)
    • Multi‑sweep algorithm (up to 3 passes) to catch residual and nested matches.
    • Per‑segment decoded‑range deletion with byte‑accurate edits for both Tj and TJ.
    • Residual safety pass that wipes text operators/semantics if any trace remains.
  • Layout‑preserving PDF token removal
    • Kerning compensation after removals to reduce visible gaps.
    • Bounds/thresholds to avoid over‑adjustment; safe fallbacks if metrics are unreliable.
  • Safety
    • Defensive token validation and deep‑copying; emergency fallbacks for edge cases.
    • Gibberish/control‑char normalization to avoid corrupting streams.
    • Guardrails on task volume and precision to keep edits stable.
  • UI update
    • Auto‑Redact page adds a “Remove Text” option explaining that content is deleted and layout may shift.
    • The feature is opt-in: users who do not touch the new UI option get exactly the same redaction as before.
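
To make the TJ handling listed above more concrete, here is a minimal sketch in plain Java. It is not the code from RedactionService and does not use PDFBox: the TJ array is modeled as a list of String (shown text) and Double (kerning number) elements, the class name and the fixed glyph width are hypothetical, and matches that span several array elements are deliberately not handled.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified model of a TJ array: elements are String (shown text) or Double (kerning number). */
final class TjArrayRedactionSketch {

    /** Hypothetical fixed glyph width in 1/1000 text-space units; real code would ask the font. */
    static final double ASSUMED_GLYPH_WIDTH = 500.0;

    /**
     * Removes every occurrence of target found inside a single string element and
     * inserts a negative adjustment in its place so the following text keeps its position.
     */
    static List<Object> redact(List<Object> tjArray, String target) {
        List<Object> out = new ArrayList<>();
        if (target.isEmpty()) {
            out.addAll(tjArray); // nothing to remove
            return out;
        }
        for (Object element : tjArray) {
            if (!(element instanceof String)) {
                out.add(element); // existing kerning numbers pass through unchanged
                continue;
            }
            String text = (String) element;
            int from = 0;
            int hit;
            while ((hit = text.indexOf(target, from)) >= 0) {
                if (hit > from) {
                    out.add(text.substring(from, hit)); // text before the match survives
                }
                // In TJ, numbers are subtracted from the position, so a negative value leaves a gap.
                out.add(-ASSUMED_GLYPH_WIDTH * target.length());
                from = hit + target.length();
            }
            if (from < text.length()) {
                out.add(text.substring(from)); // text after the last match survives
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Object> tj = new ArrayList<>();
        tj.add("Please pay ");
        tj.add(-20.0);
        tj.add("Sales Tax here");
        System.out.println(redact(tj, "Sales Tax"));
    }
}
```

In this model the matched string is replaced by a single number, which corresponds to the layout-preserving variant; dropping the inserted number instead would give plain deletion with the following text shifting left.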

Implementation notes (high level, main "brain" methods)

Core changes centered in RedactionService:

  • Tokenization and redaction: createTokensWithoutTargetText, applyRedactionsToTokens, modifyTokenForRedaction, modifyTJOperator.
  • Semantic scrubbing: wipeAllSemanticTextInTokens, wipeAllSemanticTextInProperties, DefaultSemanticScrubber.
  • Nested/graphics handling: processFormXObject, wipeAllTextInXObjects, wipeAllTextInPatterns.
  • Safety and quality: width calculations, kerning adjustment, decoding fallbacks, multi‑sweep residual handling.
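
As an illustration of the multi-sweep residual handling mentioned above, the following sketch shows why more than one pass can be needed: deleting a match can join the surrounding characters into a new match. It is a simplified model over plain strings, not the PR's implementation; the class and method names are hypothetical, and only the limit of 3 passes is taken from the description.

```java
import java.util.ArrayList;
import java.util.List;

/** Simplified multi-sweep removal over decoded text segments (no PDFBox involved). */
final class MultiSweepSketch {

    /** The PR description mentions up to 3 passes. */
    static final int MAX_SWEEPS = 3;

    static boolean containsTarget(List<String> segments, String target) {
        for (String s : segments) {
            if (s.contains(target)) {
                return true;
            }
        }
        return false;
    }

    static List<String> sweep(List<String> segments, String target) {
        List<String> current = new ArrayList<>(segments);
        if (target.isEmpty()) {
            return current; // nothing to remove
        }
        for (int pass = 0; pass < MAX_SWEEPS && containsTarget(current, target); pass++) {
            List<String> next = new ArrayList<>();
            for (String s : current) {
                next.add(s.replace(target, "")); // one deletion pass per segment
            }
            current = next;
        }
        // Residual safety pass: blank any segment that still carries a trace of the target.
        for (int i = 0; i < current.size(); i++) {
            if (current.get(i).contains(target)) {
                current.set(i, "");
            }
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> segments = new ArrayList<>();
        segments.add("aabb");   // removing "ab" once leaves a new "ab"
        segments.add("keep me");
        System.out.println(sweep(segments, "ab"));
    }
}
```

The final loop trades precision for safety: a segment that cannot be cleaned within the pass limit is wiped entirely rather than left with a recoverable trace.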

OCR Integration for Improved Detection

  • An OCR pass has been integrated as a fallback mechanism.

Why

  • Ensure targeted strings are actually removed from the source PDF, not just visually hidden, improving redaction safety.

Risks

  • Minor spacing shifts possible on complex or heavily subsetted fonts despite kerning correction.
  • Exotic operator sequences or rare encodings may require additional specialized handling. This is not something we/I can compensate for, so it is not a candidate for future improvement.
  • Some PDFs might take longer to process because of the number of removal strategies and fallbacks the code tries.

Front-end (preliminary)

image

TODOs

  • Front-end adjustments (done)
  • Minor touch ups (probably) (done)
  • Code quality improvements (done)
  • Feedback on current methods (done)
  • Explanation to put onto front-end (done)
  • Cleanup (done)
  • OCR Pass on PDF (might help achieving better results, help eligibility in moderate mode) (done)
  • PDF/A can better "burn" in the results, according to some guy on Stack Overflow. TODO: implement if it makes sense; to research before merge (discarded)

Closes: #499


Checklist

General

Documentation

UI Changes (if applicable)

  • [x] Screenshots or videos demonstrating the UI changes are attached (e.g., as comments or direct attachments in the PR)

Testing (if applicable)

  • [x] I have tested my changes locally. Refer to the Testing Guide for more details.

balazs-szucs avatar Aug 19 '25 20:08 balazs-szucs

Hey thank you so much for this function, it is something I have been looking for for some time now.

My use case for this function is actually slightly different from what you would imagine: I want to use it for watermark removal. Some of the PDFs I downloaded from a certain website have an ugly watermark text box at the bottom of each page, something like "Provided by XXX.COM", which I assume can be completely removed using the "aggressive" mode of redaction proposed in this PR. Since it's a watermark, a black box is not ideal, aesthetically speaking.

Anyway, wish to test it out if possible, I know this is still WIP so no rush, just wish to test it out when this PR is somewhat mature.

christaikobo avatar Aug 22 '25 00:08 christaikobo

Hi,

Thanks for your comment.

Anyway, wish to test it out if possible, I know this is still WIP so no rush, just wish to test it out when this PR is somewhat mature.

You can clone the branch from my profile at https://github.com/balazs-szucs/Stirling-PDF/tree/aggresive-redact and run it locally

Sadly, however, it may not be 100% functional in your use case. Because the PDF spec is rather permissive, if the author of the PDF really wanted to watermark it, they may have obfuscated the text in a way that makes it impossible to remove without more sophisticated methods than what's implemented here.

Generally, the focus of this implementation has been removing text that lives inside the PDF's /Contents object(s); text that lives in other types of objects (e.g., images, /XObjects, etc.) will not get removed. This has been a rather large feature, and it is more or less in line with other open-source projects' implementations (feature-wise), but compared to more "enterprise" solutions it's quite lackluster.

For better redaction we would more than likely need PDFium or something similar, because none of the current dependencies (e.g., PDFBox) were meant for this kind of usage.

To end this comment on a more positive note: hopefully no stone will be left unturned with this PR, so it should be a more than good enough feature for the majority of users (even with its limitations).

balazs-szucs avatar Aug 22 '25 21:08 balazs-szucs

Thanks for the response, much appreciated!

So far the watermarks I have encountered are plain text boxes, which can be easily edited or deleted in Adobe Acrobat; the only problem is that I have to do it for every page and the PDF is 2000 pages long... I am pretty optimistic that your implementation could work.

I will try to test it this weekend and if I encounter anything, I will report back to you.

christaikobo avatar Aug 23 '25 00:08 christaikobo

/prdeploy

Ludy87 avatar Aug 23 '25 12:08 Ludy87

Quick notes:

  • It is indeed a draft/work in progress
  • I am trying to aggressively cut down size and remove duplicate/circular methods
  • The brain methods should be rather flexible; the aesthetic stuff can be easily changed, these are "my" recommendations
  • Weird edge cases might still be there.

balazs-szucs avatar Aug 23 '25 13:08 balazs-szucs

Hi,

A small success update: redaction now supports forms/JavaScript (or, more accurately, does not ruin them).

Examples: forms_redacted_form.pdf PDFS_CopyPastListEntries_redacted_javascript.pdf

balazs-szucs avatar Aug 23 '25 20:08 balazs-szucs

I feel so sorry to ask this, but can you give me a few more hints on how to test your branch?

christaikobo avatar Aug 23 '25 20:08 christaikobo

I feel so sorry to ask this, but can you give me a few more hints on how to test your branch?

No worries, in the meantime the branch was deployed at:

🔗 Test URL: http://185.252.234.121:4240/

Otherwise you can clone the repo :)

git clone <repo_url>
cd <repo_folder>
git checkout aggresive-redact

edit: apparently you cannot fork someone's non-main repo, sorry about that; you have to clone my https://github.com/balazs-szucs/Stirling-PDF/tree/aggresive-redact I think? I am not much of a git wizard myself, sadly.

balazs-szucs avatar Aug 23 '25 20:08 balazs-szucs

Thank you so much for the testing URL.

Tried my sample PDF, works flawlessly.

Before: [image: 2]

After: [image: 3]

Also, the fact that it works on a 370M, 2500-page PDF is proof that it holds up under stress.

However, there seems to be a slight visual problem: [image: 1]

Anyway much appreciated for the PR, looking forward for the merge.

christaikobo avatar Aug 23 '25 21:08 christaikobo

Thanks for the feedback :)

However there seems to be some slight visual problem:

This is due to the fact that I have been laser-focused on the Java code and haven't made the time to update the properties files, but no worries, that will get fixed. This is also the reason I haven't included screenshots yet. I am also not 100% sure how the UI should actually look.

Anyway much appreciated for the PR, looking forward for the merge.

Thank you :smile:, I too am looking forward to the merge. Hopefully I can iron out everything and wrap up in a few days. As I said earlier, hopefully no stone will be left unturned.

balazs-szucs avatar Aug 23 '25 21:08 balazs-szucs

Hi,

Quick update on my side:

I’ve decided to use the quirky but more reliable “PDF image conversion + OCR” approach: if redaction can’t remove the text directly, it will place a box over it, convert the page to a PDF image, and then OCR the remaining text. This is effectively 100% reliable because OCR can’t read the text behind the box. This way, users won’t have to guess whether redaction worked; it will work every time, with varying outcomes. Best case: redaction is done via token removal. Worst case: via the conversion trick.
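
A minimal sketch of that best-case/worst-case order, for illustration only. It is generic over a document type D so it runs without PDFBox; the class, enum, and parameter names are hypothetical, and the check "does the target still appear after token removal?" is an assumption about how the fallback could be triggered, not a description of the actual code.

```java
import java.util.function.Predicate;
import java.util.function.UnaryOperator;

/** Sketch of "token removal first, box + rasterize + OCR as fallback". */
final class RedactionFallbackSketch {

    enum Outcome { TOKEN_REMOVAL, RASTERIZE_AND_OCR }

    static final class Result<D> {
        final D document;
        final Outcome outcome;

        Result(D document, Outcome outcome) {
            this.document = document;
            this.outcome = outcome;
        }
    }

    /**
     * Tries token removal and keeps its result only if the target can no longer be found;
     * otherwise falls back to the conversion trick on the untouched original.
     */
    static <D> Result<D> redact(D original,
                                UnaryOperator<D> tokenRemoval,
                                Predicate<D> stillContainsTarget,
                                UnaryOperator<D> boxRasterizeAndOcr) {
        try {
            D attempt = tokenRemoval.apply(original);
            if (attempt != null && !stillContainsTarget.test(attempt)) {
                return new Result<>(attempt, Outcome.TOKEN_REMOVAL);
            }
        } catch (RuntimeException e) {
            // A failed rewrite is treated like an unsuccessful one: use the fallback below.
        }
        return new Result<>(boxRasterizeAndOcr.apply(original), Outcome.RASTERIZE_AND_OCR);
    }

    public static void main(String[] args) {
        // Strings stand in for PDF documents in this demo.
        Result<String> result = redact(
                "Please pay Sales Tax here",
                doc -> doc.replace("Sales Tax ", ""),
                doc -> doc.contains("Sales Tax"),
                doc -> "[page image with box, OCR text layer]");
        System.out.println(result.outcome + ": " + result.document);
    }
}
```

Running the fallback on the original rather than on the half-edited attempt keeps a failed token rewrite from leaking into the final output.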

My (probably accurate) timeline: you can expect the final code by Tuesday at the latest. I’ll also clean up the existing code, which feels a bit bloated; I’m hopeful I can trim a few hundred lines.

Feedback and opinions welcome.

Here is what that looks like: invoicesample_redacted_OCR_redaction.pdf Lorem_ipsum_redacted_OCR_redaction.pdf

Obviously token removal gives far better quality, but this is not bad after all (and removal would be absolutely guaranteed).

@Frooodle Sorry for the ping; I know you’ve been occupied with V2 (I’m very excited about it). I’ve been doing a lot independently, so to avoid any surprises, would this approach work for you?

If so, I’ll simplify the UI and replace “Visual” "Moderate" "Aggressive" with options like “Structure-Preserving” and “Complete Removal” on the front end.

Structure-Preserving is meant to signal that there will be some box placeholder; with Complete Removal there is no such thing, just straight removal of the token with no placeholder.

(this is not set in stone feel free to add your input :smile:)

balazs-szucs avatar Aug 24 '25 20:08 balazs-szucs

Brilliant Stuff!

I think wording like 'Structure-Preserving' still requires technical knowledge to understand and might not fit all users. I think we are better off going for something like:

  • Smart Text Removal
  • Force Text Removal

It wouldn't make much sense for V1 since it's a bit awkward to do, but for V2 I would see this going into an advanced option that most users wouldn't even see/click, followed by an in-depth tooltip going into advanced detail for the people who want it.

Frooodle avatar Aug 24 '25 21:08 Frooodle

Structure-Preserving means there will be a black box, while Complete Removal means the space will be taken by subsequent text if there is any, right? E.g., if we are redacting "Sales Tax" and a text box only has "Sales Tax", it will be completely empty when using Complete Removal, and if a text box has "Please pay Sales Tax here", it will become "Please pay here" when using Complete Removal. Please correct me if I'm wrong.

I think categorizing them from an end result perspective is clearer to users. Users could potentially be overwhelmed and confused by the visual/moderate/aggressive naming.

However I am not so sure how OCR would fit into this, how is it decided whether "redaction can’t remove the text directly"?

christaikobo avatar Aug 24 '25 21:08 christaikobo

Hi,

Structure-Preserving means there will be a black box, while Complete Removal means the space will be taken by subsequent text if there is any, right?

This depends on how the PDF is encoded, but in the majority of cases, yes.

If we are redacting “Sales Tax,” and a text box only contains “Sales Tax,” it will be completely empty when using Complete Removal. If a text box contains “Please pay Sales Tax here,” it will become “Please pay here” when using Complete Removal.

Yes. Feel free to consult the examples I provided earlier. Aesthetically there will be no change to those just improved reliability.

I think categorizing them from an end-result perspective is clearer to users. Users could potentially be overwhelmed and confused by the visual/moderate/aggressive naming.

Fair point. 🙂

OCR would fit into this?

I didn’t explain this clearly before, so it's a good question how it ties together.

A quick explanation on how PDFs work: PDFs look consistent because the file contains both the encoded content and the “key” (technically, a mapping) to decode it. However, the PDF specification is quite permissive, so the mapping can vary widely. Decoding a PDF using its mapping is not easy but feasible. Adding text back that conforms perfectly to the original mapping especially with correct spacing and kerning is extremely challenging. Especially because one mistake can corrupt the whole file.

  • Moderate: Checks whether the PDF encoding is simple enough that we can remove the text and also reintroduce the necessary kerning/spacing to account for the removed text width (keep in mind, the "new" kerning is also encoded based on the mapping of the PDF, so we are not just pushing plain text into the PDF but strings that are encoded based on PDF's original mapping). This preserves the original structure, making before vs. after visually identical in layout.

  • Aggressive: Removes the text without reintroducing any kerning or placeholders, to avoid risking PDF corruption. Unfortunately, reliably adding text across all PDFs isn’t something I can guarantee yet.
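
To illustrate the kerning/spacing compensation that Moderate mode has to compute, here is a small sketch of the underlying arithmetic. It follows the text-positioning rules of the PDF specification (glyph widths in 1/1000 units, character spacing Tc, word spacing Tw, font size Tfs); horizontal scaling cancels out and is therefore omitted. The class and method names are hypothetical, vertical writing and Type 3 font matrices are ignored, and this is not the code from the PR.

```java
/** Computes the TJ adjustment that reserves the space of removed glyphs (horizontal writing only). */
final class KerningCompensationSketch {

    /**
     * @param glyphWidths  widths of the removed glyphs in glyph-space units (1/1000 of an em)
     * @param spaceCount   number of removed single-byte space characters (word spacing applies to them)
     * @param fontSize     Tfs, the current font size
     * @param charSpacing  Tc, in unscaled text-space units
     * @param wordSpacing  Tw, in unscaled text-space units
     * @return the number to place in the TJ array; negative values move the following text right
     */
    static double tjAdjustment(double[] glyphWidths, int spaceCount,
                               double fontSize, double charSpacing, double wordSpacing) {
        if (fontSize == 0.0) {
            throw new IllegalArgumentException("font size must be non-zero");
        }
        double widthSum = 0.0;
        for (double w : glyphWidths) {
            widthSum += w;
        }
        // Tc is added once per glyph, Tw once per space; both are converted to 1/1000 units.
        double spacing = glyphWidths.length * charSpacing + spaceCount * wordSpacing;
        return -(widthSum + 1000.0 * spacing / fontSize);
    }

    public static void main(String[] args) {
        double[] removed = {500.0, 250.0, 500.0}; // assumed widths, e.g. letter, space, letter
        System.out.println(tjAdjustment(removed, 1, 10.0, 0.0, 0.0));
    }
}
```

The hard part described above is not this arithmetic but obtaining trustworthy widths from subsetted or oddly encoded fonts; when they are unreliable, the computed number is wrong and the layout shifts.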

Why OCR works: OCR reads only what is visibly rendered. Redacted content placed under a box is not visible, so OCR cannot read it. The redaction is therefore preserved in the OCRed output, while the rest of the original text is "restored". It's a cheesy trick, and I didn't want to use it unless necessary. Looks like it is.

I don't have examples right now of how a PDF looks when it's corrupted, but basically what you can imagine is random characters scattered across the whole page and nothing readable (like ? ( ) random Unicode characters, etc.). This is what I want to avoid at all costs.

Hope I explained it well.

balazs-szucs avatar Aug 24 '25 21:08 balazs-szucs

A quick explanation on how PDFs work: PDFs look consistent because the file contains both the encoded content and the “key” (technically, a mapping) to decode it. However, the PDF specification is quite permissive, so the mapping can vary widely. Decoding a PDF using its mapping is not easy but feasible. Adding text back that conforms perfectly to the original mapping especially with correct spacing and kerning is extremely challenging. Especially because one mistake can corrupt the whole file.

  • Moderate: Checks whether the PDF encoding is simple enough that we can remove the text and also reintroduce the necessary kerning/spacing to account for the removed text width (keep in mind, the "new" kerning is also encoded based on the mapping of the PDF, so we are not just pushing plain text into the PDF but strings that are encoded based on PDF's original mapping). This preserves the original structure, making before vs. after visually identical in layout.
  • Aggressive: Removes the text without reintroducing any kerning or placeholders, to avoid risking PDF corruption. Unfortunately, reliably adding text across all PDFs isn’t something I can guarantee yet.

Thanks for the detailed explanation, this part is very clear, and I think you are suggesting aggressive is more reliable while moderate is quirky since we are trying to write something back to the PDF file as opposed to simply cutting some parts out.

Why OCR works: OCR reads only what is visibly rendered. Redacted content placed under a box is not visible, so OCR cannot read it, ensuring that the redaction is preserved in the OCRed output but the original text is "restored". It's cheesy trick, but I didn't want to do it unless it is necessary. Looks like it is.

While I do have some basic understanding about how OCR works, and why it is preferable in certain situations when we are trying to redact content from a PDF file (by converting PDF file to image with a black box we are erasing any "hidden" content, ensuring the redaction is 100% successful), what I am uncertain about is when will it be considered "redaction can’t remove the text directly", since you said

if redaction can’t remove the text directly, it will place a box over it, convert the page to a PDF image, and then OCR the remaining text

My guess is that, for Complete Removal, it is unlikely to trigger OCR, because we are not writing back to the PDF, a simple deletion is easy to execute and unlikely to cause corruption; for Structure-Preserving, since we are trying to write back, we need to encode it the way it originally encodes the PDF and that is not always possible, if the checks fail and we are facing a messy encode, OCR will be triggered to ensure sensitive information is 100% redacted for ease of mind.

If my guess is right, I think maybe we can preserve the 3 tier function, but with different names:

  1. Unguaranteed Black Box Redaction (Preserve remaining text)
  2. Guaranteed Black Box Redaction (OCR remaining text)
  3. Complete Removal (Preserve remaining text)

The names are not very fancy but I think users could understand what each option achieves.

I think the 3 key pieces of information are: a. what the visual outcome of the redaction is (black box vs. text removal), b. whether the redaction is guaranteed, i.e., whether sensitive information will be 100% redacted, and c. whether it preserves the original text (we know OCRed text can be incorrect).

christaikobo avatar Aug 24 '25 22:08 christaikobo

To preface my response: I've been more focused on the Java backend than on front-end details, so I'll refine these ideas in the next 1-2 days, once we have concrete examples and visuals. Things often click into place then, and it reduces the need for extended discussion (this is also getting a bit off-topic; personally I don't mind, just saying), especially since, from a user's perspective, naming matters far less than consistency. People just want to open the app, rely on muscle memory, and have features work as expected (or better) without disrupting their workflow.

  1. Unguaranteed Black Box Redaction (Preserve remaining text)
  2. Guaranteed Black Box Redaction (OCR remaining text)
  3. Complete Removal (Preserve remaining text)

Funnily enough, the new redact has VERY similar limitations to the current redact (especially with /XObjects), so Unguaranteed/Guaranteed are not necessary because people know what to expect. The default will also stay the same, so the new redact features are "opt-in", further signaling to users that they should take another look if they want to try it out; if they are satisfied with the old redact, they can continue doing the same thing they did before. The colour of the box can be adjusted, so there is no need to name a colour like black (it can be anything) :). "Preserve remaining text"/"OCR remaining text" are unnecessary too, because the remaining text is ALWAYS preserved: sometimes because it was never "lost" in the first place (in the best case, token removal does not affect the remaining text), other times because it was OCRed back into place. OCR is also generally VERY reliable. I know, I know, this is all very nitpicky :smile: but more generally:

I'm not in favor of these names because they're too implementation-specific and lengthy. Many users skip detailed labels anyway; they test things out (e.g., upload a PDF) and move on if it works as intended, without needing elaborate explanations. Tying the UI to particular techniques like "OCR" or "Unguaranteed" limits the flexibility to evolve the underlying technology without causing front-end disruption. Instead, I am for simple, generic, outcome-oriented, technology-agnostic labels to keep the experience stable.

Also, please consider we have more options than just hamburger menus, we can do checkboxes (E.g., "OCR" checkbox, PDF/A checkbox etc... although generally I'll try to keep UI simple, and try not to clutter stuff).

One important piece of information for you: most of the time, if the PDF was made with "proper" methods, the code can perform the token removal. I would personally say even moderate will handle the majority of the PDFs people use with no problem, so right now we are discussing the UI for a minority of people who do all of these things:

  • Opt-in for new feature
  • Opt in for specifically moderate redaction
  • Upload a PDF that was made with some tool that couldn't produce a PDF with standard encoding (like file to PDF conversion tool, or some web based tools things like that.)

Also, for example, I think the majority of PDFs in circulation are made from: LaTeX (confirmed to work well with Overleaf), Markdown (works well with all the tools tested), Word (I tested LibreOffice, not sure about MS stuff, but LibreOffice also works well), and Adobe (works well most of the time, if it was made with an Adobe edition later than 2007). I think none of these need the fallback and all will get token removal, and imho the majority of PDFs that people use were created with one of these tools. I don't think governments/companies are sending PDFs that were made with, for example, HTML-to-PDF conversion or something.

Hope this helps/gives some perspective :)

As a kind of PS: yes, I know I haven't addressed all your points, sorry about that, but I think that's it for today for me (it's 1:40 AM here). To reiterate: thanks for your feedback and input, it will definitely be considered either way. :)

balazs-szucs avatar Aug 24 '25 23:08 balazs-szucs

You are absolutely right, I agree with every part you have just said. I was a bit tunnel visioned and off topic, sorry about that.

Like what Frooodle said, this is brilliant work, thank you!

christaikobo avatar Aug 24 '25 23:08 christaikobo

🚀 Translation Verification Summary

🔄 Reference Branch: pr-branch-messages_en_GB.properties

📃 File Check: messages_en_GB.properties

  1. Test Status:Passed
  2. Test Status:Passed
  3. Test Status:Passed

✅ Overall Check Status: Success

Thanks @balazs-szucs for your help in keeping the translations up to date.

stirlingbot[bot] avatar Aug 25 '25 21:08 stirlingbot[bot]

Hi,

I just noticed the front-end got messed up due to the front-end fix PR; I'll fix it.

balazs-szucs avatar Sep 24 '25 18:09 balazs-szucs

I am sorry to ask, what is the current status of this PR?

christaikobo avatar Nov 02 '25 21:11 christaikobo

Hi @christaikobo

It will be moved to a 3rd party dependency instead of this Java implementation. It will be available in V2 version, and won't be ported to V1. Expected release is this month, hopefully. Sorry for not updating.

balazs-szucs avatar Nov 02 '25 21:11 balazs-szucs

Thank you so much for the swift reply! That sounds exciting! Looking forward to it.

christaikobo avatar Nov 02 '25 21:11 christaikobo