Introduction
I hate paper. It’s never where I need it, it lacks a search function, and in general it just takes up space. Getting rid of it has been a long-standing quest on my list. I thought of burning it, but in the end I settled for a more civilized option.
Unfortunately, as soon as you put a document through the shredder, you will get a call asking for exactly that information. So I needed a workflow to digitize my documents before going all psycho on them.
Collecting
I don’t always have my scanner set up, because I lack a nice flat surface where it could be permanently available. As I want to get rid of stuff ASAP, I set up the next best thing: a parking spot where incoming paper waits until the next scanning session.
Processing
Once every month (or two, depending on how much I can postpone stuff), I collect everything from the parking spot and prepare for scanning. So I get out my trusty HP printer with its automatic document feeder (ADF). While you could get something like a Snapscan, I already spent all my money on video games, and this is what I have available.
For the scanning I use gscan2pdf, a nice tool that lets you scan a lot of pages in a row. It has options for OCR, but it only adds the result to notes, and that’s not good enough for me.
After a lot of testing I found that scanning in Lineart mode at 300 DPI gives good results and doesn’t generate huge files.
Of course, if you have full-color holiday cards sent by your mother, I would keep those separate and scan them later using a color setting.
I love computers that are working for me, so I just fill the ADF with documents and switch on scanning. I get some coffee and stand next to it, going, “Hmm, yeah, HP did you file your TPS reports last week?”
Once this is done, you should have a long list of pages in gscan2pdf. Now all I do is hit Save and pick PNG, then select a directory and a name to generate a nice directory full of files.
A good tip is to re-order the files in gscan2pdf, as it’s much easier than doing it by hand later.
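If you would rather skip the GUI, the same settings can probably be driven from the command line with scanimage from the SANE tools. This is a rough sketch and not what I actually run: scan_batch is a made-up helper name, and the --mode, --resolution and --source values are assumptions that depend on your scanner's SANE backend, so check `scanimage --help` for your device first.

```shell
#!/bin/sh
# Rough command-line stand-in for the gscan2pdf steps above (a sketch).
# Assumption: the SANE backend accepts --mode, --resolution and --source with
# these values; they differ per device, so check `scanimage --help` first.

# Usage: scan_batch OUTDIR [MODE]
# Scans every sheet in the feeder into OUTDIR/page_001.png, page_002.png, ...
# The ( ) body runs in a subshell, so the variables stay local to the call.
scan_batch() (
    outdir=$1
    mode=${2:-Lineart}          # pass Color for the holiday cards
    mkdir -p "$outdir" || exit 1
    # SCANIMAGE can be overridden, e.g. SCANIMAGE=echo to preview the options.
    "${SCANIMAGE:-scanimage}" \
        --mode "$mode" \
        --resolution 300 \
        --source ADF \
        --format=png \
        --batch="$outdir/page_%03d.png"
)
```

Something like `scan_batch ~/scans/2024-05` would then fill that directory with numbered PNG files, and `scan_batch ~/scans/cards Color` would handle the color pile. You do lose the easy page re-ordering that gscan2pdf gives you.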
Adding OCR and creating PDF files
So, now that we have the .png files, we need to generate PDFs from them. I created a small bash script to do this. It will call scripts to OCR a single PNG file or a group of them and generate a nice PDF ready to upload to Google Docs.
https://github.com/bgrolleman/png2ocrpdf
For now this only works from the command line, but I would like to add it to Nautilus scripts so you can just select a group of files and say “OCR and PDF”.
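A Nautilus script wrapper could look roughly like the sketch below. It relies on Nautilus handing the selection to scripts in the NAUTILUS_SCRIPT_SELECTED_FILE_PATHS environment variable, one path per line. The function name ocr_selected is made up for illustration, and it assumes png2ocrpdf is on the PATH and takes the same -l, -t and -a options shown in this post.

```shell
#!/bin/sh
# Sketch of a Nautilus script wrapper; assumes png2ocrpdf is on the PATH.
# Nautilus passes the selection in NAUTILUS_SCRIPT_SELECTED_FILE_PATHS,
# one path per line.

# Usage: ocr_selected TITLE AUTHOR
# Feeds the selected .png files to png2ocrpdf, ignoring anything else.
# The ( ) body runs in a subshell, so the IFS and set -f changes stay local.
ocr_selected() (
    title=$1
    author=$2
    set --                      # start with an empty argument list
    # Split the selection on newlines only, and switch off globbing.
    IFS='
'
    set -f
    for path in $NAUTILUS_SCRIPT_SELECTED_FILE_PATHS; do
        case $path in
            *.png) set -- "$@" "$path" ;;
        esac
    done
    [ "$#" -gt 0 ] || exit 0    # no PNG files selected, nothing to do
    png2ocrpdf -l 'eng' -t "$title" -a "$author" "$@"
)

# Possible entry point, asking for the title with zenity:
# ocr_selected "$(zenity --entry --text='Document title')" "$USER"
```

On recent GNOME versions the script would go into ~/.local/share/nautilus/scripts/ and needs to be executable. The wrapper passes paths through intact, but the limitation on spaces in input filenames still applies to png2ocrpdf itself.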
For now, usage is as follows:
png2ocrpdf -l 'eng' -t 'Hello World' -a 'Me' Hello_01.png Hello_02.png Hello_03.png
This will use English OCR and create a file ‘Hello World.pdf’ with three pages.
The script doesn’t support spaces in the input filenames yet.
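To give an idea of what happens under the hood, here is a minimal sketch of the same idea. It is not the actual png2ocrpdf code. It assumes tesseract (with PDF output support) and pdfunite from poppler-utils are installed, the names run and ocr_to_pdf are made up, and it only uses the title as the output filename without writing any author metadata.

```shell
#!/bin/sh
# Sketch only, not the real png2ocrpdf. Assumes tesseract (built with PDF
# output) and pdfunite (from poppler-utils) are installed.

# Run a command, or only print it when DRY_RUN=1. The printed form is for
# reading, not for pasting back: arguments are shown unquoted.
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# Usage: ocr_to_pdf LANG TITLE PAGE.png...
# The ( ) body runs in a subshell, so variables and "exit" stay inside it.
ocr_to_pdf() (
    [ "$#" -ge 3 ] || { echo "usage: ocr_to_pdf LANG TITLE PAGE.png..." >&2; exit 2; }
    lang=$1
    title=$2
    shift 2
    count=$#
    tmpdir=$(mktemp -d) || exit 1

    i=0
    for png in "$@"; do
        i=$((i + 1))
        base="$tmpdir/page_$i"
        # Writes "$base.pdf": the page image plus an invisible text layer.
        run tesseract "$png" "$base" -l "$lang" pdf || { rm -rf "$tmpdir"; exit 1; }
        # Queue the page PDF behind the PNG names (the loop list is already fixed).
        set -- "$@" "$base.pdf"
    done
    shift "$count"              # drop the PNG names, keep the page PDFs

    if [ "$#" -eq 1 ]; then
        run cp "$1" "$title.pdf"
    else
        run pdfunite "$@" "$title.pdf"   # joins the pages in the order given
    fi
    status=$?
    rm -rf "$tmpdir"
    exit "$status"
)
```

Calling `ocr_to_pdf eng 'Hello World' Hello_01.png Hello_02.png Hello_03.png` mirrors the example above, and setting DRY_RUN=1 first only prints the commands. Every expansion is quoted, which is what it takes to let filenames with spaces through.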
Final step
Once the .pdf files are generated, you can upload them to Google Docs or Evernote to find them later. I use Google Docs and switch off all conversion options. This results in a nice, readable PDF file online that’s fully searchable and available on any laptop or my phone, so I can access it anywhere and anytime.