Automatically redact sensitive information from PDFs?


#1

I’m working on going paperless and have a few years of credit card statements and other documents with magic numbers that I would rather not have in the cloud. I’m scanning all of my documents into a home server running Ubuntu Server and running OCR on everything. I want this whole process to be as automated as possible. Here is my ideal workflow:

I scan everything into a folder on the server and run OCR.

Each file gets processed and categorized (utility bill, credit card statement). Account numbers and other sensitive information are removed or redacted. The file is then renamed according to a naming convention, placed into a folder hierarchy, and automatically backed up.
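A minimal sketch of the categorize-and-rename step, assuming the OCR text is already available as a string. The keyword table, the category names, the `MM/DD/YYYY` date format, and the `YYYY-MM_category.pdf` naming convention are all hypothetical placeholders, not something any particular tool prescribes.

```python
import re
from pathlib import Path

# Hypothetical keyword table: category -> phrases that suggest it.
CATEGORY_KEYWORDS = {
    "credit-card-statement": ("minimum payment", "credit limit", "statement balance"),
    "utility-bill": ("kwh", "meter reading", "water usage"),
}


def categorize(ocr_text: str) -> str:
    """Return the first category whose keywords appear in the text."""
    lowered = ocr_text.lower()
    for category, keywords in CATEGORY_KEYWORDS.items():
        if any(word in lowered for word in keywords):
            return category
    return "uncategorized"


def destination(ocr_text: str, root: Path) -> Path:
    """Build root/<category>/<YYYY-MM>_<category>.pdf from the text."""
    category = categorize(ocr_text)
    # Assumes the first MM/DD/YYYY date in the text is the statement date.
    match = re.search(r"\b(\d{2})/\d{2}/(\d{4})\b", ocr_text)
    stamp = f"{match.group(2)}-{match.group(1)}" if match else "undated"
    return root / category / f"{stamp}_{category}.pdf"
```

Calling `destination(text, Path("/srv/docs"))` only computes the target path; moving the file and backing it up would be separate steps.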


#2

These web service based solutions may be worth a closer look. Both should be viable options for Ubuntu.

The first is GNU Affero licensed while the second is commercial.

Handoff to other platforms might also open the field a little; e.g. PDFpen 6+ supports redaction via AppleScript - Smile.

Hope there’s something you can make use of there.


#3

Sending to a web service sort of defeats the purpose of redaction. I don’t have a use case but I’m sure my company, for one, would frown on the use of such a service.


#4

Agreed, but web services can be hosted on internal networks with no exposure to the Internet. The OP does state he is running Ubuntu Server, so with these options I was suggesting that the web service would be deployed and accessed locally on that server.

I believe you work for a certain TLA with a large colourful nickname, so I’d be surprised if your employer didn’t deploy at least a handful of web services internally and perhaps even deploy such solutions to clients for their internal use. :thinking:


#5

I’m rumbled @sylumer. I’m sure Little Lilac :slight_smile: has tons of internal web services. :slight_smile:

I just wanted to caution people about some of the considerations for sending stuff to a web service. It would be a no-no in Little Lilac and it should be a no-no for everyone else if it were an external web service of unknown provenance.


#6

Yeah, I want to do everything locally. I have a feeling that using my redaction stamp will be easier in the long run.


#7

You worried me with the word “stamp”.

If you’re simply overlaying the text, that isn’t safe: the original text is still in the file underneath the overlay. You need to replace the text itself with, e.g., black blob characters.
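To illustrate the difference, here’s a rough sketch of replacing the characters themselves rather than drawing over them. It works on extracted OCR text only; writing the result back into the PDF’s text layer and blacking out the page image would need a PDF library and is not shown. The pattern for what counts as an account number (13 to 19 digits, optionally separated by single spaces or hyphens) is an assumption, as is the function name `redact`.

```python
import re

# Assumed pattern: 13-19 digits, optionally grouped by single spaces or hyphens.
ACCOUNT_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def redact(text: str, blob: str = "\u2588") -> str:
    """Replace every digit of a matching number with a blob character."""
    def mask(match: re.Match) -> str:
        # Keep separators so the layout stays the same; only digits are lost.
        return re.sub(r"\d", blob, match.group(0))
    return ACCOUNT_RE.sub(mask, text)
```

Shorter digit runs such as amounts and dates are left alone, so the pattern would need tuning for account numbers in other formats.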


#8

One of these guys

https://www.officedepot.com/a/products/994172/2000-Plus-Self-Inking-Security-Stamp/