Submitted by Tenna (not registered) on Thu, 2009-06-04 02:34.

I use GIMP 2.4.7 and Acrobat Reader 9 for Linux.

I had to convert a scanned PDF document to text, so your howto helped me a lot. Thanks for that.

Just a couple of things I did a little differently:

1. I selected and copied the text image in Acrobat. In GIMP I had already created a new image, using Advanced Options to specify Grayscale and 200 ppi.

2. I used Edit > Paste, then scaled the layer with Layer > Scale Layer until I could see the letters clearly, and cleaned dots and other specks out of the picture as you recommended.

3. I followed your directions for Image > Mode > Indexed, but I skipped the threshold step and used Colors > Brightness-Contrast instead. There I set Brightness to -127 and Contrast to 127 (of course, this depends on how clearly you can see the image).

4. File > Save As... Here I put .tif at the end of my chosen file name. GIMP then asks you to export, and you have to choose Flatten Image.

5. Then as you indicated:

$ tesseract filename.tif result
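
If you have several pages saved as TIFF files, the last step can probably be wrapped in a small script. This is only a rough sketch: the function names ocr_output_base and ocr_all are made up for illustration, and it assumes tesseract is on your PATH and is called as "tesseract image outputbase", as in the command above.

```shell
#!/bin/sh
# Hypothetical helper (not part of tesseract): derive the output base name
# from an image file name by dropping a trailing .tif or .tiff.
# tesseract appends .txt to the output base itself.
ocr_output_base() {
    base=${1%.tif}
    printf '%s\n' "${base%.tiff}"
}

# Hypothetical wrapper: run tesseract once per image given as an argument.
ocr_all() {
    for image in "$@"; do
        if [ ! -f "$image" ]; then
            echo "skipping $image: not a file" >&2
            continue
        fi
        tesseract "$image" "$(ocr_output_base "$image")"
    done
}

# With no arguments the loop body never runs, so this is a no-op.
ocr_all "$@"
```

Saved as, say, ocr-pages.sh (again just an example name), it could be run as "sh ocr-pages.sh page1.tif page2.tif", which should leave page1.txt and page2.txt next to the images.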

I just want to thank you again. 
