Automatically redact sensitive information from PDFs?

I’m working on going paperless and have a few years of credit card statements and other documents with magic numbers I would not like to have in the cloud. I’m scanning all of my documents into a home server running Ubuntu Server and running OCR on everything. I want this whole process to be as automated as possible. Here is my ideal workflow.

I scan everything into a folder on the server and run OCR.

Each file gets processed and categorized (utility bill, credit card statement). Account numbers and other sensitive information are removed or redacted. The file is then renamed based on a naming convention, placed into a folder hierarchy, and automatically backed up.
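The categorize-and-rename step could be roughed out along these lines. This is only a sketch working on the OCR text, not a finished tool: the keyword lists, the category names, and the `category/year/date_issuer.pdf` naming convention are all assumptions made up for illustration, so swap in whatever matches your own documents.

```python
import re
from datetime import date

# Hypothetical keyword rules; adjust these to your own documents.
CATEGORY_KEYWORDS = {
    "credit-card-statement": ("minimum payment", "credit limit", "statement balance"),
    "utility-bill": ("kwh", "meter reading", "water usage"),
}


def categorize(ocr_text: str) -> str:
    """Return the category with the most keyword hits, or 'uncategorized'."""
    text = ocr_text.lower()
    best, best_hits = "uncategorized", 0
    for category, keywords in CATEGORY_KEYWORDS.items():
        hits = sum(1 for kw in keywords if kw in text)
        if hits > best_hits:
            best, best_hits = category, hits
    return best


def target_path(category: str, doc_date: date, issuer: str) -> str:
    """Build a relative path using an assumed category/year/date_issuer.pdf convention."""
    # Collapse anything that is not a letter or digit into a single hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", issuer.lower()).strip("-")
    return f"{category}/{doc_date.year}/{doc_date.isoformat()}_{slug}.pdf"
```

A watcher script on the scan folder could call `categorize()` on each file's OCR text and then move the file to `target_path(...)`. Extracting the date and issuer from the text is left out here.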

These web service based solutions may be worth a closer look. Both should be viable options for Ubuntu.

The first is GNU Affero licensed while the second is commercial.

Handoff to other platforms might also open the field a little; e.g. PDFpen 6+ (from Smile) supports redaction via AppleScript.

Hope there’s something you can make use of there.

Sending to a web service sort of defeats the purpose of redaction. I don’t have a use case but I’m sure my company, for one, would frown on the use of such a service.

Agreed, but web services can be hosted on internal networks with no exposure to the Internet. The OP does state he is running Ubuntu Server, so with these options I was suggesting that the web service would be deployed and accessed locally on that server.

I believe you work for a certain TLA with a large colourful nickname, so I’d be surprised if your employer didn’t deploy at least a handful of web services internally and perhaps even deploy such solutions to clients for their internal use. :thinking:

I’m rumbled @sylumer. I’m sure Little Lilac :slight_smile: has tons of internal web services. :slight_smile:

I just wanted to caution people about some of the considerations for sending stuff to a web service. It would be a no-no in Little Lilac and it should be a no-no for everyone else if it were an external web service of unknown provenance.

Yeah I want to do everything locally. I have a feeling that using my redaction stamp will be easier in the long run.

You worried me with the word “stamp”.

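If you’re simply overlaying the text, that isn’t safe: the original text is still in the file underneath the overlay and can be copied or extracted. You need to replace the text itself with e.g. black blob characters.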
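To illustrate what "replace rather than overlay" means for the OCR text, here is a minimal sketch that swaps the digits of card-like numbers for block characters. The pattern (13 to 19 digits, optionally separated by spaces or hyphens) and the `blob_out` name are assumptions for the example, not anything from a particular PDF tool. It only handles the text; in a scanned PDF the number is also visible in the page image, which has to be blacked out separately.

```python
import re

# Assumed pattern: 13 to 19 digits, optionally separated by single spaces or hyphens.
CARD_LIKE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def blob_out(text: str, keep_last: int = 0) -> str:
    """Replace the digits of card-like numbers with block characters.

    keep_last leaves that many trailing digits readable (e.g. the last four).
    Separators are kept so the layout of the text does not shift.
    """

    def _mask(match: re.Match) -> str:
        original = match.group(0)
        total_digits = sum(ch.isdigit() for ch in original)
        seen = 0
        out = []
        for ch in original:
            if ch.isdigit():
                seen += 1
                # Keep only the final keep_last digits; blob everything else.
                out.append(ch if seen > total_digits - keep_last else "█")
            else:
                out.append(ch)
        return "".join(out)

    return CARD_LIKE.sub(_mask, text)
```

For example, `blob_out("Card 4111 1111 1111 1111 due", keep_last=4)` leaves only the final four digits readable. The key point is that the original digits are gone from the output, not just hidden.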

One of these guys