Optical Character Recognition With Tesseract OCR On Ubuntu 7.04
Author: Oliver Meyer <o [dot] meyer [at] projektfarm [dot] de>
This document describes how to set up Tesseract OCR on Ubuntu 7.04. OCR means "Optical Character Recognition". The resulting system will be able to convert images with embedded text to text files. Tesseract is licensed under the Apache License v2.0.
This howto is meant as a practical guide; it does not cover the theoretical background, which is treated in many other documents on the web.
This document comes without warranty of any kind! This is not the only way of setting up such a system. There are many ways of achieving this goal, but this is the way I take. I do not issue any guarantee that this will work for you!
Set up a basic Ubuntu 7.04 system and update it.
Get scanned images or scan documents yourself.
If you use a scanner, be sure that it is supported by SANE. A list of supported devices is available at http://www.sane-project.org/.
2 Get ImageMagick
The current version of tesseract provided in the Ubuntu repositories supports only uncompressed and G3-compressed tiff files.
To ensure that tesseract is able to process your images, you should convert them to uncompressed tiff.
Since the uncompressed tiff files I exported from Gimp were unusable, I used the convert tool, which is supplied by the ImageMagick package.
Install ImageMagick (package imagemagick) from the Ubuntu repositories with the Synaptic Package Manager.
3 Get Tesseract
Install the packages tesseract-ocr and tesseract-ocr-data from the Ubuntu repositories with the Synaptic Package Manager.
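If you prefer the terminal over Synaptic, the same packages can presumably be installed with apt-get. The following is a minimal sketch, assuming the package names given in this howto; the helper have_tool is a hypothetical name used only for illustration. It checks whether the two command line tools used later are on the PATH.

```shell
# Install the packages named in this howto from the command line
# (assumption: apt-get is used instead of Synaptic; run these by hand).
#   sudo apt-get update
#   sudo apt-get install imagemagick tesseract-ocr tesseract-ocr-data

# have_tool (hypothetical helper): succeeds if a command is on the PATH.
have_tool() {
    command -v "$1" >/dev/null 2>&1
}

# Report which of the tools used in this howto are available.
for tool in convert tesseract; do
    if have_tool "$tool"; then
        echo "$tool: found"
    else
        echo "$tool: missing"
    fi
done
```

If one of the tools is reported as missing, install the corresponding package before you continue.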
4 Prepare Images
To get the best results from tesseract, you have to optimize the images. I recommend the use of images with a minimum resolution of about 200dpi.
I used Gimp for the following steps 4.1 - 4.3.
Remove any non-alphanumeric content (pictures, logos, lines) from the image to prevent tesseract from producing chaotic text blocks.
That can be done easily with the eraser tool within Gimp.
Convert the image to RGB or Grayscale mode.
Image - Mode - RGB or Grayscale
Use the threshold function to reduce biased lighting and remove fragments. Move the sliders to define the delimitation of bright and dark areas. Have a look at the preview while you are doing this to see the effects on the image.
Tools - Color Tools - Threshold
4.3 Black And White
To improve the text recognition, we reduce the colors to black and white by switching the image to indexed mode.
Image - Mode - Indexed
Be sure to turn off dithering.
Save the image after this step.
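The threshold and black-and-white steps can presumably also be approximated on the command line with ImageMagick's convert. This is not the Gimp workflow described above but a swapped-in alternative: a minimal sketch, assuming that one fixed threshold suits the whole scan. The function name binarize and the DRY_RUN switch are hypothetical and exist only for this example.

```shell
# binarize (hypothetical helper): grayscale, then a hard threshold, no dithering.
# Usage: binarize SOURCE DESTINATION [LEVEL]   (LEVEL defaults to 50%)
# Set DRY_RUN=1 to print the command instead of running it.
binarize() {
    src=$1
    dst=$2
    level=${3:-50%}
    # -colorspace Gray drops the color information,
    # -threshold turns every pixel either black or white.
    ${DRY_RUN:+echo} convert "$src" -colorspace Gray -threshold "$level" "$dst"
}
```

For example, binarize scan.png scan_bw.png 60% would write a pure black-and-white copy of scan.png. Unlike the interactive Gimp preview, you have to find a suitable level by trial and error.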
5 Convert To Tiff
Now you have to convert the image to uncompressed tiff.
convert %source_file% %destination_file%
convert document.jpg document.tif
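If you want to be explicit about the compression, ImageMagick's -compress option can be set to none. The following is a sketch under that assumption, not a required change to the command above; tif_name and to_tiff are hypothetical helper names.

```shell
# tif_name (hypothetical helper): map an image path to a .tif path
# by replacing the last file extension.
tif_name() {
    printf '%s\n' "${1%.*}.tif"
}

# to_tiff (hypothetical helper): write an uncompressed tiff next to the source.
# "-compress none" asks ImageMagick explicitly for no compression.
to_tiff() {
    convert "$1" -compress none "$(tif_name "$1")"
}
```

With these definitions, to_tiff document.jpg writes document.tif, and a loop such as for f in *.jpg; do to_tiff "$f"; done converts a whole directory.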
6 Use Tesseract
At this point all preparations are completed, so you can start using tesseract.
tesseract %tiff_file% %name_for_resulting_files%
tesseract document.tif result
Tesseract adds the file extensions for the resulting files itself. In this example tesseract would create result.txt, result.map and result.raw.
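If you have many pages, the call shown above can be wrapped in a loop. The following is a minimal sketch, assuming all input files end in .tif; ocr_base and ocr_all are hypothetical helper names, and only the tesseract invocation itself comes from this howto.

```shell
# ocr_base (hypothetical helper): strip the .tif suffix to get the
# name for the resulting files.
ocr_base() {
    printf '%s\n' "${1%.tif}"
}

# ocr_all (hypothetical helper): run tesseract on every file given.
# Tesseract appends the file extensions to the base name itself.
ocr_all() {
    for tif in "$@"; do
        tesseract "$tif" "$(ocr_base "$tif")"
    done
}
```

Calling ocr_all *.tif in a directory with page1.tif and page2.tif would then produce page1.txt and page2.txt along with the other result files.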