Optical Character Recognition With Tesseract OCR On Ubuntu 7.04

Submitted by o.meyer on Tue, 2007-08-28 16:56.

Version 1.0
Author: Oliver Meyer <o [dot] meyer [at] projektfarm [dot] de>
Last edited 08/23/2007

This document describes how to set up Tesseract OCR on Ubuntu 7.04. OCR means "Optical Character Recognition". The resulting system will be able to convert images with embedded text to text files. Tesseract is licensed under the Apache License v2.0.

This howto is meant as a practical guide; it does not cover the theoretical background, which is treated in many other documents on the web.

This document comes without warranty of any kind! This is not the only way of setting up such a system; there are many ways of achieving this goal, but this is the way I chose. I do not guarantee that this will work for you!

 

1 Preparation

Set up a basic Ubuntu 7.04 system and update it.

Get scanned images or scan documents yourself.

If you use a scanner, be sure that it is supported by SANE. A list of supported devices is available at http://www.sane-project.org/.

 

2 Get Imagemagick

The current version of tesseract provided in the Ubuntu repositories supports only uncompressed and G3-compressed tiff files.

To ensure that tesseract is able to process your images, you should convert them to uncompressed tiff.

Since my conversions to uncompressed tiff with Gimp were unusable, I used the convert tool, which is part of the Imagemagick package.

Install Imagemagick from the Ubuntu repositories with the Synaptic Package Manager.

 

3 Get Tesseract

Install the packages tesseract-ocr and tesseract-ocr-data from the Ubuntu repositories with the Synaptic Package Manager.
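
If you prefer a terminal to Synaptic, the packages from steps 2 and 3 can also be installed with apt-get. The following is only a sketch: the function names install_ocr_packages and missing_tools are made up for this example, and the package names are the ones given above for Ubuntu 7.04.

```shell
# Install the packages named in steps 2 and 3 (package names as used on
# Ubuntu 7.04; other releases may name the language data differently).
install_ocr_packages() {
    sudo apt-get update &&
    sudo apt-get install imagemagick tesseract-ocr tesseract-ocr-data
}

# Print the name of every given command that cannot be found.
# Prints nothing if all of them are available.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}
```

After running install_ocr_packages, a call such as "missing_tools convert tesseract" should print nothing.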

 

4 Prepare Images

To get the best results from tesseract, you have to optimize the images. I recommend the use of images with a minimum resolution of about 200dpi.
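
The resolution stored in an image file can be read with the identify tool from the Imagemagick package. The sketch below uses a made-up helper name (has_min_dpi) and assumes that the file carries resolution metadata and that the value is given in dots per inch; many images have no such metadata, in which case the check reports an error.

```shell
# Succeed if the horizontal resolution stored in the image is at least
# the given minimum (default 200). Exit status 2 means that no usable
# resolution was found in the file.
# $1 = image file, $2 = minimum resolution (optional)
has_min_dpi() (
    file=$1
    min=${2:-200}
    # %x is the horizontal resolution; keep only the first line in case
    # the file contains several pages.
    res=$(identify -format '%x\n' "$file" | head -n 1)
    res=${res%% *}   # drop a trailing unit such as "PixelsPerInch"
    res=${res%%.*}   # drop decimals
    case $res in
        ''|*[!0-9]*) exit 2 ;;
    esac
    [ "$res" -ge "$min" ]
)
```

For example: has_min_dpi document.jpg || echo "resolution is below 200dpi or unknown"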

I used Gimp for the following steps 4.1 - 4.3.

 

4.1 Cleaning

Remove any non-text content from the image to prevent tesseract from producing chaotic text blocks.

That can be done easily with the eraser tool within Gimp.

 

4.2 Threshold

Convert the image to RGB or Grayscale mode.

Within Gimp:

Image - Mode - RGB or Grayscale

Use the threshold function to compensate for uneven lighting and remove fragments. Move the sliders to set the boundary between bright and dark areas. Watch the preview while you are doing this to see the effects on the image.

Within Gimp:

Tools - Color Tools - Threshold

 

4.3 Black And White

To improve the text recognition, we reduce the colors to black and white by switching the image to indexed mode.

Within Gimp:

Image - Mode - Indexed

Be sure to turn off dithering.

Save the image after this step.
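
Steps 4.2 and 4.3 can also be approximated on the command line with the convert tool from Imagemagick instead of Gimp. This is a sketch, not the method used above: the helper name binarize is made up, and the threshold of 50% is only a starting value that has to be tuned per image, just like the sliders in Gimp.

```shell
# Convert an image to grayscale and then to pure black and white.
# -threshold makes every pixel either black or white, so no dithering
# is involved.
# $1 = input file, $2 = output file, $3 = threshold (default 50%)
binarize() {
    convert "$1" -colorspace Gray -threshold "${3:-50%}" "$2"
}
```

For example: binarize document.jpg document-bw.png 60%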

 

5 Convert To Tiff

Now you have to convert the image to uncompressed tiff.

convert %source_file% %destination_file%

e.g.:

convert document.jpg document.tif
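
If the resulting tiff still turns out to be compressed, convert can be told explicitly not to compress it with the -compress option. The helper below is a sketch with a made-up name (to_uncompressed_tiff); it derives the name of the tiff file from the name of the source file.

```shell
# Convert an image to an uncompressed tiff with the same base name,
# e.g. document.jpg becomes document.tif.
# Do not call it on a file that already ends in .tif, because input
# and output would then be the same file.
# $1 = source file
to_uncompressed_tiff() {
    convert "$1" -compress None "${1%.*}.tif"
}
```

For example: to_uncompressed_tiff document.jpg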

 

6 Use Tesseract

At this point all preparations are completed, so you can start using tesseract.

tesseract %tiff_file% %name_for_resulting_files%

e.g.:

tesseract document.tif result

Tesseract adds the file extensions for the resulting files itself. In this example tesseract would create result.txt, result.map and result.raw.
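
If you have several pages, a small loop saves typing. The following sketch (the function name ocr_dir is made up) runs tesseract on every .tif file in a directory and uses the file name without its extension as the name for the resulting files.

```shell
# Run tesseract on every .tif file in a directory.
# page1.tif produces page1.txt, and so on.
# $1 = directory (default: current directory)
ocr_dir() (
    cd "${1:-.}" || exit 1
    for tiff in *.tif; do
        # If there are no .tif files the pattern is left unexpanded.
        [ -e "$tiff" ] || continue
        tesseract "$tiff" "${tiff%.tif}"
    done
)
```

For example: ocr_dir ~/scans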

 

Submitted by Tamas KOOS (not registered) on Sun, 2009-12-06 20:36.

Thanks a lot, Oliver!

 

You helped me a lot!!! 

Even though I OCR'd Hungarian text... it HELPED.

 

Friendly thanks and greetings,

 

Tama

Submitted by Thomas Kent (not registered) on Fri, 2009-10-23 16:46.

A couple updates that I wanted to add:

1) If you run a modern Ubuntu (I tried with 9.04), you need to install the tesseract-ocr-eng data package (or the package for another language instead of -eng) instead of tesseract-ocr-data. (If you specify the package that ends in -eng, you don't have to specify the other package; it will be installed automatically because it is a dependency.)

2) tesseract seems to care very much about two things: being in TIFF format and using 1-bit color. I got the exact same results with and without doing steps 4.1 and 4.2, but 4.3 was essential. Additionally, you can use GIMP to create TIFF documents as well. I only tried making uncompressed and lzw ones. What I'd like to do, but probably won't get time for, is to use some command line tools like ImageMagick and libtiff to make a script that will take any kind of image file, do the necessary conversion to 1-bit and tiff, then pass it directly to tesseract without me having to jump in the middle; it seems like it should be very easy.
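
A script along these lines could look roughly like the sketch below. It is untested against real scans; the function name ocr_image is made up, the 50% threshold is an assumption that will need tuning, and it relies on convert writing the black and white result as a tiff that tesseract accepts.

```shell
# Take any image that convert can read, reduce it to black and white,
# write it as an uncompressed tiff to a temporary directory and pass
# it to tesseract.
# $1 = source image
# $2 = name for the resulting files (default: source without extension)
ocr_image() (
    src=$1
    base=${2:-${src%.*}}
    tmp=$(mktemp -d) || exit 1
    # Remove the temporary directory when the function is left.
    trap 'rm -rf "$tmp"' EXIT
    convert "$src" -colorspace Gray -threshold 50% -compress None "$tmp/page.tif" &&
    tesseract "$tmp/page.tif" "$base"
)
```

For example, "ocr_image letter.jpg" would leave the recognized text in letter.txt.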

Submitted by Tenna (not registered) on Thu, 2009-06-04 02:36.

I forgot to mention that I'm using Debian Lenny.

Bye.

Submitted by Tenna (not registered) on Thu, 2009-06-04 02:34.

I use Gimp 2.4.7, Acrobat Reader 9 for Linux.

I had to convert a scanned pdf document to text, so your howto helped me a lot. Thanks for that.

Just a couple of things I did a little differently:

1. I selected and copied the text image from Acrobat into a new Gimp image, which I had created beforehand using Advanced Options, where I specified Grayscale and 200ppi.

2. I used Edit > Paste and then scaled the layer with Layer > Scale Layer until I could see the letters very well, and I cleaned dots etc. out of the picture as you recommended.

3. I followed your directions for Image > Mode > Indexed but I didn't do the threshold step, just Color > Brightness & Contrast. There I put Brightness at -127 and Contrast at 127 (of course this depends on how well you can see the image).

4. File > Save as... Here I put .tif at the end of my chosen name. Gimp then asks you to Export, and you have to choose Flatten Image.

5. Then as you indicated:

$ tesseract   filename.tif  result

I just want to thank you again. 

Submitted by August Guillaume (not registered) on Wed, 2010-06-30 19:02.

I prepared an Ubuntu script using:

Sane, Gimp 2.6, Convert, Tesseract, gedit

Gimp 2.6 has its own scripting, but I found it very difficult and time-consuming to correct errors, so the Gimp step is manual. The script opens Gimp automatically, and all I have to do is set Mode / Indexed / black and white, make sure the image is saved as a jpeg again (grey scale), and close Gimp.

I use gedit to show the result.

If you are interested, please e-mail me at aguilla1@telusplanet.net

 

Submitted by liotier (registered user) on Fri, 2007-08-31 14:23.

Since Tesseract is a command-line application, it would have been more consistent to recommend command-line tools for the preparation phase as well. That way, the whole process could be more automated.

 

Scanning using SANE command line front-ends:

Image preparation using Imagemagick:

 

Actually you used Imagemagick in your 'convert' command line. You should have mentioned that, and while you are at it you can go all the way and avoid using Gimp entirely. Less fiddling with the clickodrome, more automation!
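
For the scanning step, SANE ships the command line front-end scanimage. The following is only a sketch: the function name scan_page is made up, and --mode and --resolution are options provided by the scanner backend, so their names and accepted values (for example Gray) depend on your device. "scanimage --help" lists what your scanner supports.

```shell
# Scan one page from the default scanner and write it as a tiff file.
# $1 = output file, $2 = resolution in dpi (default 300)
scan_page() {
    scanimage --format=tiff --mode Gray --resolution "${2:-300}" > "$1"
}
```

For example: scan_page page1.tif 200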