I'm currently undertaking the ridiculously stupid task of using Tesseract OCR to convert a scanned 1,000-page .pdf document into .txt documents (I volunteered to do it for a girl - I'm pathetic, I know). I figured out how to do this by reading a how-to written by "o.meyer" at howtoforge.com. Basically, I import one page of the .pdf at a time into GIMP, convert it to indexed colors, save it as an uncompressed .tif, and then run tesseract in a terminal on the saved .tif, after which it spits out a .txt file (which I instruct tesseract to name after the page number of the .tif; for example, 537.tif becomes 537.txt).
This process takes about a minute per page, so at 1,000 pages we're looking at 1,000 minutes, or close to 17 hours of clicking. That's a lot of minutes.
I've looked into scripting a little and I understand the very basics, but I don't know how to make one program, like GIMP, perform several specific steps in succession (import, convert to indexed colors, save as uncompressed .tif), and then have another program, like tesseract, do its part once the first program has finished. I'm not trying to ask anyone to do my work for me, but could someone give me some advice on this, or point me toward a newbie-friendly how-to that might help me out? I haven't found one. I'd like to finish this thing quickly, so as to maximize girl-impressing potential.
I asked for help at ubuntuforums.org but didn't receive any, so I looked harder and found a how-to for writing GIMP scripts, but it's more than I think I can learn before this project needs to be done.
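To make it clearer what I'm after, here is a minimal sketch in Python of the kind of loop I'm picturing. It is a guess under several assumptions, not something from the how-to: it skips GIMP entirely and instead assumes that pdftoppm (from poppler-utils) can render each page straight to a grayscale .tif, and that my tesseract version will accept that file, which I haven't confirmed. The function names, scan.pdf, and the out folder are made up for illustration.

```python
import subprocess
from pathlib import Path


def page_commands(pdf, page, out_dir, dpi=300):
    """Return the two commands (render, then OCR) for one page of the PDF."""
    base = Path(out_dir) / str(page)  # e.g. out/537
    render = [
        "pdftoppm",
        "-f", str(page), "-l", str(page),  # first and last page: just this one
        "-r", str(dpi),                    # resolution in DPI (300 is an assumption)
        "-gray", "-tiff", "-singlefile",   # grayscale TIFF written as <base>.tif
        str(pdf), str(base),
    ]
    # tesseract takes an image and an output base name, and appends .txt itself
    ocr = ["tesseract", f"{base}.tif", str(base)]
    return render, ocr


def convert(pdf, first, last, out_dir="out"):
    """Render and OCR pages first..last, one after the other."""
    Path(out_dir).mkdir(exist_ok=True)
    for page in range(first, last + 1):
        for cmd in page_commands(pdf, page, out_dir):
            # run() blocks until the program exits, so tesseract only
            # starts after the page image has been written
            subprocess.run(cmd, check=True)


# Hypothetical usage (scan.pdf is a placeholder name):
# convert("scan.pdf", 1, 1000)
```

If something like this works, page 537 would end up as out/537.tif and out/537.txt, the same naming I've been using by hand. Trying it on two or three pages first seems wise, to see whether the OCR quality matches what I get from the GIMP route.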
I'd appreciate any kind of help that anyone can give me.