Bas Grolleman’s Blog

Place where I put my written thoughts, though I usually just make videos these days

Introduction

I hate paper. It’s never where I need it, it lacks a search function, and in general it just takes up space. So getting rid of it has been a long-standing quest on my list. I thought of burning it, but in the end I settled for a more civilized option.

Unfortunately, as soon as you put a document through the shredder, you will get a call asking for exactly that information. So I needed a workflow to digitize my documents before going all psycho on them.

Collecting

I don’t always have my scanner set up, since I lack a nice flat surface where it could always be available to me. As I want to get rid of stuff ASAP, I set up the next best thing: a parking spot where paper waits until scanning day.

Processing

Once every month (or two, depending on how much I can postpone stuff), I collect everything from the parking spot and prepare for scanning. So I get my trusty ADF-equipped HP printer. While you could get something like a Snapscan, I already spent all my money on video games, and this one is available.

For the scanning I use gscan2pdf, a nice tool that lets you scan a lot of pages in a row. It has options for OCR, but it only adds the recognized text to notes, and that’s not good enough for me.

After a lot of testing, I found that Lineart at 300 DPI gives good results and doesn’t generate huge files.
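For anyone who prefers a terminal, the same settings could be sketched with scanimage, the SANE command-line scanner tool. This is not what the post uses (that is gscan2pdf), and the function name scan_batch, the SCANIMAGE override, and the file prefix are made up for illustration. The --mode, --resolution, and --source options and their values come from the scanner backend, so treat them as assumptions and check `scanimage --help` for your device. PNG output also needs a reasonably recent scanimage.

```shell
# Hypothetical command-line take on the settings above (Lineart, 300 DPI, ADF).
# Backend options differ per scanner; verify them with `scanimage --help`.
scan_batch() {
  local prefix="${1:-scan}"
  # SCANIMAGE can be overridden (for example with `echo`) to preview the command.
  "${SCANIMAGE:-scanimage}" \
    --mode Lineart \
    --resolution 300 \
    --source ADF \
    --format=png \
    --batch="${prefix}_%03d.png"
}
```

Running `SCANIMAGE=echo scan_batch letters` only prints the options, which is a cheap way to check them before feeding paper into the ADF.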

Of course, if you’ve got full-color holiday cards sent by your mother, I would keep those separate and scan them later using a color setting.

I love computers that are working for me, so I just fill the ADF with documents and switch on scanning. I get some coffee and stand next to it, going, “Hmm, yeah, HP, did you file your TPS reports last week?”

Once this is done, you should have a long list of files in gscan2pdf. Now all I do is Save and pick PNG, then select a directory and a name to generate a nice directory full of files.

A good tip is to re-order the files in gscan2pdf, as it’s much easier than doing it by hand later.

Adding OCR and creating PDF files

So now that we have the .png files, we need to generate PDFs from them. I created a small bash script to do this. It calls scripts to OCR a single PNG file or a group of them and generates a nice PDF, ready to upload to Google Docs.

https://github.com/bgrolleman/png2ocrpdf
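To give an idea of what such a pipeline can look like, here is a minimal sketch. It is not the png2ocrpdf script itself: it assumes tesseract (with its pdf output config) for the OCR and pdfunite from poppler-utils for merging, and the names run, DRY_RUN, and ocr_pages_to_pdf are hypothetical.

```shell
# Hypothetical sketch of an OCR-to-PDF pipeline; not the png2ocrpdf script.
# Assumes tesseract and pdfunite (poppler-utils) are installed.

# run: execute a command, or only print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# ocr_pages_to_pdf LANG OUTPUT.pdf PAGE.png [PAGE.png ...]
ocr_pages_to_pdf() {
  local lang="$1" output="$2"
  shift 2
  [ "$#" -ge 1 ] || return 2

  local tmpdir count i page base status
  tmpdir=$(mktemp -d) || return 1
  count=$#
  i=0
  for page in "$@"; do
    i=$((i + 1))
    base="$tmpdir/page_$i"
    # tesseract appends ".pdf" to the output base by itself.
    run tesseract "$page" "$base" -l "$lang" pdf || { rm -rf "$tmpdir"; return 1; }
    # Append the per-page PDF to the argument list (the loop's list is already fixed).
    set -- "$@" "$base.pdf"
  done
  # Drop the original PNG names so only the per-page PDFs remain, in order.
  shift "$count"

  run pdfunite "$@" "$output"
  status=$?
  rm -rf "$tmpdir"
  return "$status"
}
```

Calling `DRY_RUN=1 ocr_pages_to_pdf eng out.pdf a.png b.png` prints the tesseract and pdfunite commands without running them, which is handy for checking the page order.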

For now this only works from the command prompt, but I would like to add it to Nautilus scripts so you can just select a group of files and say “OCR and PDF”.

For now, usage is as follows:

png2ocrpdf -l 'eng' -t 'Hello World' -a 'Me' Hello_01.png Hello_02.png Hello_03.png

This will use English OCR and create a file ‘Hello World.pdf’ with 3 pages.
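How the script reads those flags isn’t shown here, but a command line of this shape is commonly handled with getopts. The sketch below is only an illustration: reading -a as the author, defaulting the language to eng, and the names parse_args, output, and page_count are all assumptions.

```shell
# Hypothetical option parsing for a "-l LANG -t TITLE -a AUTHOR files..." command line.
# Results are left in the variables lang, title, author, output and page_count.
parse_args() {
  lang=eng      # assumed default, not confirmed by the script
  title=
  author=
  OPTIND=1      # reset so the function can be called more than once
  while getopts 'l:t:a:' opt; do
    case $opt in
      l) lang=$OPTARG ;;
      t) title=$OPTARG ;;
      a) author=$OPTARG ;;
      *) return 2 ;;
    esac
  done
  shift $((OPTIND - 1))

  # Without a title there is no sensible output filename.
  [ -n "$title" ] || { echo "missing -t TITLE" >&2; return 2; }

  output="$title.pdf"   # the title doubles as the output filename
  page_count=$#         # every remaining argument is treated as one page
}
```

After `parse_args -l eng -t 'Hello World' -a Me Hello_01.png Hello_02.png Hello_03.png`, the variable output holds `Hello World.pdf` and page_count holds 3.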

Note: the script doesn’t support spaces in the input filenames yet.
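In shell scripts, this kind of limitation usually comes from an unquoted expansion somewhere. Whether that is the cause in png2ocrpdf is an assumption, and the two helper functions below are hypothetical, but they show the difference quoting makes.

```shell
# Hypothetical helpers to show the effect of quoting; not taken from png2ocrpdf.

# Quoted "$@" keeps every filename as one argument, spaces included.
count_files_quoted() {
  local n=0 f
  for f in "$@"; do
    n=$((n + 1))
  done
  echo "$n"
}

# Unquoted $* is split on whitespace, so one filename turns into several words.
count_files_unquoted() {
  local n=0 f
  for f in $*; do
    n=$((n + 1))
  done
  echo "$n"
}
```

With the arguments 'My Scan 01.png' and 'My Scan 02.png', count_files_quoted prints 2 while count_files_unquoted prints 6, so looping over "$@" and quoting every variable that holds a filename is the usual fix.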

Final step

Once the .pdf files are generated, you can upload them to Google Docs or Evernote to find later. I use Google Docs and switch off all conversion options. This results in a nice, readable PDF file online that’s fully searchable and available on any laptop or my phone, so I can access it anywhere, anytime.