Optical Character Recognition With Tesseract OCR On Ubuntu 7.04
Version 1.0

This document describes how to set up Tesseract OCR on Ubuntu 7.04. OCR stands for "Optical Character Recognition". The resulting system will be able to convert images containing text into text files. Tesseract is licensed under the Apache License v2.0.

This howto is meant as a practical guide; it does not cover the theoretical background, which is treated in many other documents on the web.

This document comes without warranty of any kind! This is not the only way of setting up such a system. There are many ways of achieving this goal, but this is the way I take. I do not guarantee that this will work for you!
1 Preparation
Set up a basic Ubuntu 7.04 system and update it. Get scanned images or scan documents yourself. If you use a scanner, be sure that it is supported by SANE. A list of supported devices is available at http://www.sane-project.org/.
2 Get ImageMagick
The version of tesseract currently provided in the Ubuntu repositories supports only uncompressed and G3-compressed TIFF files. To ensure that tesseract is able to process your images, you should convert them to uncompressed TIFF. Since the uncompressed TIFF files produced by Gimp were unusable, I used the convert tool, which is supplied by the ImageMagick package. Install ImageMagick from the Ubuntu repositories with the Synaptic Package Manager.
3 Get Tesseract
Install the packages tesseract-ocr and tesseract-ocr-data from the Ubuntu repositories with the Synaptic Package Manager.
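If you prefer the terminal to Synaptic, the following is a minimal sketch for checking that both tools are present. The helper name missing_tools is made up for this example, and the apt-get line assumes the ImageMagick package is called imagemagick; the two tesseract package names are the ones given above.

```shell
#!/bin/sh
# Print the name of every tool from the argument list that is NOT on the PATH.
# (missing_tools is a hypothetical helper, not part of any package.)
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

missing=$(missing_tools convert tesseract)
if [ -n "$missing" ]; then
    # $missing is left unquoted on purpose so the names end up on one line.
    echo "Missing tools:" $missing
    # Assumed package name for ImageMagick: imagemagick
    echo "Install with: sudo apt-get install imagemagick tesseract-ocr tesseract-ocr-data"
else
    echo "convert and tesseract are installed."
fi
```

The script only reports what is missing; it does not install anything itself.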
4 Prepare Images
To get the best results from tesseract, you have to optimize the images. I recommend using images with a resolution of at least about 200dpi. I used Gimp for the following steps 4.1 - 4.3.
4.1 Cleaning
Remove any non-alphanumeric content from the image to prevent tesseract from producing chaotic text blocks. This can be done easily with the Eraser tool in Gimp.
4.2 Threshold
Convert the image to RGB or grayscale mode.
Within Gimp: Image - Mode - RGB or Grayscale
Use the threshold function to compensate for uneven lighting and remove fragments. Move the sliders to set the boundary between bright and dark areas. Watch the preview while you do this to see the effect on the image.
Within Gimp: Tools - Color Tools - Threshold
4.3 Black And White
To improve the text recognition, we reduce the colors to black and white by switching the image to indexed mode.
Within Gimp: Image - Mode - Indexed
Be sure to turn off dithering. Save the image after this step.
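For many pages, clicking through Gimp gets tedious. As a rough command-line stand-in for steps 4.2 and 4.3 (not the Gimp procedure described above), ImageMagick's convert can turn an image to grayscale and apply a fixed threshold. The function name threshold_image and the default level of 50% are assumptions for this sketch; in Gimp you would pick the level by eye from the preview, so expect to experiment with the value. The manual cleaning from step 4.1 is not covered.

```shell
#!/bin/sh
# threshold_image SRC DST [LEVEL]
# Convert SRC to grayscale, then make every pixel brighter than LEVEL percent
# white and everything else black. LEVEL defaults to 50 (an assumed start value).
threshold_image() {
    src=$1
    dst=$2
    level=${3:-50}
    # Reject anything that is not a whole number between 0 and 100.
    case $level in
        ''|*[!0-9]*)
            echo "threshold must be an integer from 0 to 100" >&2
            return 2 ;;
    esac
    if [ "$level" -gt 100 ]; then
        echo "threshold must be an integer from 0 to 100" >&2
        return 2
    fi
    convert "$src" -colorspace Gray -threshold "${level}%" "$dst"
}

# Only run when at least a source and a destination file are given.
if [ "$#" -ge 2 ]; then
    threshold_image "$@"
fi
```

Saved as, say, threshold.sh, it could be called as: sh threshold.sh scan.jpg scan-bw.png 60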
5 Convert To Tiff
Now you have to convert the image to uncompressed tiff.

convert %source_file% %destination_file%

e.g.:

convert document.jpg document.tif
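To convert several images in one go, a small loop around convert is enough. This is a sketch under two assumptions: the helper tiff_name is invented for this example, and the option -compress None is added here to request uncompressed output explicitly, whereas the command above relies on convert's default behaviour.

```shell
#!/bin/sh
# Derive the TIFF file name from a source name: document.jpg -> document.tif
# (hypothetical helper; assumes the file name has an extension).
tiff_name() {
    printf '%s.tif\n' "${1%.*}"
}

# Convert every file named on the command line to an uncompressed TIFF.
# With no arguments the loop does nothing.
for src in "$@"; do
    convert "$src" -compress None "$(tiff_name "$src")"
done
```

Saved as, say, to-tiff.sh, it could be called as: sh to-tiff.sh page1.jpg page2.jpg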
6 Use Tesseract
At this point all preparations are complete, so you can start using tesseract.

tesseract %tiff_file% %name_for_resulting_files%

e.g.:

tesseract document.tif result

Tesseract adds the file extensions for the resulting files itself. In this example tesseract would create result.txt, result.map and result.raw.
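The same call can be repeated for a whole batch of TIFF files. The sketch below names each result after its input file; the helper output_base is an assumption made for this example, and the files are expected to end in .tif.

```shell
#!/bin/sh
# Strip a trailing .tif to get the base name tesseract should write to:
# document.tif -> document (hypothetical helper).
output_base() {
    printf '%s\n' "${1%.tif}"
}

# Run tesseract on every TIFF file named on the command line.
# With no arguments the loop does nothing.
for tif in "$@"; do
    tesseract "$tif" "$(output_base "$tif")"
done
```

Called with page1.tif and page2.tif, this would leave the recognized text in page1.txt and page2.txt, which you can view with cat or any text editor.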