Creating DjVu Documents Linux HOWTO
version 0.2 (12 July 2006)
1 Synopsis
This document explains some uses of the djvulibre implementation of DjVu for creating quality DjVu documents on Linux. The DjVu format features bitmap document compression and hypertext structure. It is used by numerous web sites around the world for storing and distributing digital documents, including scanned documents and high-resolution pictures. One advantage of DjVu files is that they are notably small, often smaller than PDF or JPEG files with the same content. This makes DjVu a helpful tool for digitizing books and journals, especially scientific ones.
The case considered below is that of a DjVu document created from a number of separate JPEG files, each containing a single page. JPEG is not a limitation here: the examples can be adapted to arbitrary image formats. Conversion from PDF to DjVu is also discussed. Usage of scanner software is not explained; refer to the relevant documentation.
Requirements. The packages djvulibre, jpeg and netpbm are required. The packages sane and xpdf are highly recommended.
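Before starting, it can help to check that the required programs are actually installed. The following is a minimal sketch, not part of the original scripts: the function name missing_tools is hypothetical, and the command names in the example are those used by the scripts in this HOWTO.

```shell
#!/bin/bash
# Report which of the given commands are missing from PATH.
# Prints each missing name on its own line; returns 1 if any is missing.
missing_tools() {
    local cmd status=0
    for cmd in "$@"; do
        if ! command -v "$cmd" >/dev/null 2>&1; then
            echo "$cmd"
            status=1
        fi
    done
    return $status
}

# Example (command names as used by the scripts in this HOWTO):
# missing_tools jpegtran anytopnm ppmtopgm pgmtopbm cjb2 cpaldjvu djvm pdftoppm
```

An empty output means that every listed program was found.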
2 Creating DjVu
2.1 Scanning a book
Suppose the following situation for this section. We have a book which needs to be scanned and stored in a digital format. For simplicity, suppose that all the contents of the book are black and white (text, formulas, diagrams, etc.) except for the cover, which is printed in colour. What we normally do is scan the book page by page and store the pages separately in some image format, like JPEG or PDF. Personally, I believe JPEG is the best choice. If you find, for instance, compressed TIFF more suitable for your purposes, this HOWTO might still be of some help to you, although the example scripts would need to be slightly amended. For the time being, let us stick with JPEG.
In our situation with the book, we scan the front cover (and the back cover too, if it contains any noticeable text or pictures) to colour JPEG files. Then we scan the rest to black and white JPEGs. This should give the optimal performance. When saving the scanned images, pay attention to the file names. For the purposes of conversion to DjVu, the alphabetical order of the file names must match the order of the pages. For example,
001.jpg, 002.jpg, ..., 012.jpg
is a right numbering; and
1.jpg, 2.jpg, ..., 12.jpg
is a wrong one because 12.jpg will appear before 2.jpg. Once the entire book is scanned, place all the image files into a separate directory.
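If the pages were already saved with plain numeric names, they can be renamed so that they sort in page order. The following is a minimal sketch under that assumption; the function name pad_page_names and the default width of 3 digits are illustrative choices, not part of the original HOWTO.

```shell
#!/bin/bash
# Rename purely numeric *.jpg files in the current directory so that all
# numbers have the same width (default 3), e.g. 2.jpg -> 002.jpg.
pad_page_names() {
    local width=${1:-3} f base new
    for f in *.jpg; do
        [ -e "$f" ] || continue                # no match: the glob stays literal
        base=${f%.jpg}
        [[ $base =~ ^[0-9]+$ ]] || continue    # skip non-numeric names
        # 10# forces base 10 so that names like 08 are not read as octal
        printf -v new '%0*d.jpg' "$width" "$((10#$base))"
        # -n: never overwrite an existing file
        [ "$f" = "$new" ] || mv -n -- "$f" "$new"
    done
}

# Example: pad_page_names 3
```

Files whose names are not purely numeric are left untouched.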
Depending on the scanner, the software and the method of scanning, you may need to rotate all or just some of the JPEG images, usually following some simple pattern. The script jpegsrotate below can be handy in such a case. For example, run it with the parameter --even to turn the even pages in the current directory upside down. The program jpegtran used in the script can rotate JPEGs only by 90, 180 or 270 degrees clockwise.
#!/bin/bash
#
# jpegsrotate
#
# Rotates JPEG files in the current directory in place using jpegtran.

shopt -s extglob

DEFMASK="*.jpg"
DEFEVENMASK="*[02468].jpg"
DEFODDMASK="*[13579].jpg"
DEFDEG=270

# usage must be defined before it is first called
function usage() {
    echo
    echo "usage:"
    echo "$0"
    echo "    rotates files with the mask $DEFMASK by $DEFDEG degrees clockwise"
    echo "$0 --even"
    echo "    rotates even files with the mask $DEFEVENMASK by 180 degrees"
    echo "$0 --odd"
    echo "    rotates odd files with the mask $DEFODDMASK by 180 degrees"
    echo "$0 --params \"MASK\" (90|180|270)"
    echo "    rotates files with the mask MASK by the given angle clockwise"
    echo
}

if ! command -v jpegtran >/dev/null; then
    usage
    echo "Error: jpegtran is needed"
    echo
    exit 1
fi

if [ "$1" == "--even" ]; then
    MASK=$DEFEVENMASK
    DEG=180
elif [ "$1" == "--odd" ]; then
    MASK=$DEFODDMASK
    DEG=180
elif [ "$1" == "--params" ]; then
    if [ -n "$2" ] && [ -n "$3" ]; then
        MASK=$2
        DEG=$3
    else
        usage
        exit 1
    fi
elif [ -n "$1" ]; then
    usage
    exit 1
else
    MASK=$DEFMASK
    DEG=$DEFDEG
fi

# $MASK is deliberately unquoted so that the shell expands the pattern
for i in $MASK; do
    if [ ! -e "$i" ]; then
        usage
        echo "Error: current directory must contain files with the mask $MASK"
        echo
        exit 1
    fi
    echo "$i"
    # replace the original only if jpegtran succeeded
    jpegtran -rotate "$DEG" "$i" > "$i.rotated" && mv "$i.rotated" "$i"
done
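Since the rotation is done in place, it may be worth checking which files a mask selects before running the script. The following is a minimal sketch; the function name preview_mask is hypothetical, and it only lists file names, expanding the mask the same way as the scripts in this HOWTO (unquoted, with extglob enabled).

```shell
#!/bin/bash
# Print the files in the current directory selected by a mask, one per line,
# without modifying anything. The function body runs in a subshell, so the
# shell options set here do not leak into the calling shell.
preview_mask() (
    shopt -s extglob nullglob
    for f in $1; do
        echo "$f"
    done
)

# Examples:
# preview_mask "*[02468].jpg"      # the even pages
# preview_mask "+(000|999).jpg"    # an extglob mask naming two files
```

Note that the mask must be quoted on the command line; otherwise the calling shell expands it before the function sees it.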
2.2 JPEG to bitonal DjVu
When the images are ready, each of them needs to be converted to a separate page in DjVu format by a DjVu encoder, such as cjb2 or cpaldjvu; the separate pages are then bundled into a single DjVu document by djvm. Save the following script as any2djvu-bw somewhere, e.g. in ~/bin/. Run it in the directory containing the source images to convert separate black and white pages.
#!/bin/bash
#
# any2djvu-bw
#
# Converts images in the current directory to single-page bitonal DjVu files.

shopt -s extglob

DEFMASK="*.jpg"
DPI=300
# uncomment the following line to compile a bundled DjVu document
#OUTFILE="#0-bw.djvu"

# usage must be defined before it is first called
function usage() {
    echo
    echo "usage:"
    echo
    echo "$0 [\"MASK\"]"
    echo "    converts single pages with the default mask $DEFMASK (or MASK if provided)"
    echo "    in the current directory to single-page black and white djvu documents"
    # uncomment the following line to compile a bundled DjVu document
    # echo "    and bundles them as a djvu file $OUTFILE"
    echo
}

# check each required tool separately
for cmd in anytopnm ppmtopgm pgmtopbm cjb2; do
    if ! command -v "$cmd" >/dev/null; then
        usage
        echo "Error: anytopnm, ppmtopgm, pgmtopbm and cjb2 are needed"
        echo
        exit 1
    fi
done

if [ -n "$1" ]; then
    MASK=$1
else
    MASK=$DEFMASK
fi

# $MASK is deliberately unquoted so that the shell expands the pattern
for i in $MASK; do
    if [ ! -e "$i" ]; then
        usage
        echo "Error: current directory must contain files with the mask $MASK"
        echo
        exit 1
    fi
    # pages that already have a .djvu file are skipped
    if [ ! -e "$i.djvu" ]; then
        echo "$i"
        anytopnm "$i" | ppmtopgm | pgmtopbm -value 0.499 > "$i.pbm"
        # in netpbm >= 10.23 the above line can be replaced with the following:
        # anytopnm "$i" | ppmtopgm | pamditherbw -value 0.499 > "$i.pbm"
        cjb2 -dpi "$DPI" "$i.pbm" "$i.djvu"
        rm -f "$i.pbm"
    fi
done

# uncomment the following line to compile a bundled DjVu document
#djvm -c "$OUTFILE" $MASK.djvu
If you run the script as
$ ~/bin/any2djvu-bw
it will take the default action and try to convert all the images *.jpg in the current directory to single-page DjVu files with the extension .jpg.djvu. You can change this behaviour by passing a file mask (the optional parameter). The dithering value 0.499 was obtained experimentally and represents a very good (if not the best) setting for bitonal images. You can also uncomment the indicated lines in any2djvu-bw to compile the final bundled black and white DjVu document in a single run of the script. If you do so and do not need any colour pages, you may skip the next subsection, which covers the conversion of colour images.
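Because the script skips every image that already has a matching .djvu file, an interrupted run can simply be restarted. The following is a minimal sketch for listing the pages that are still to be converted; the function name pending_pages is hypothetical, and the naming scheme (image name plus .djvu) is the one described above.

```shell
#!/bin/bash
# List images matching a mask (default *.jpg) that have no corresponding
# .djvu file yet. The function body runs in a subshell, so the shell
# options set here do not leak into the calling shell.
pending_pages() (
    shopt -s extglob nullglob
    for f in ${1:-*.jpg}; do
        [ -e "$f.djvu" ] || echo "$f"
    done
)

# Example: pending_pages "*.jpg"
```

An empty output means that every selected image has already been converted.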
2.3 JPEG to low colour DjVu
Next, we need to convert the colour images taken from the front and back covers of the book. Suppose the front cover is stored in 000.jpg and the back cover in 999.jpg, and each of them contains no more than, say, 8 colours. The previous run of any2djvu-bw left two unwanted DjVu files behind, namely the black and white versions 000.jpg.djvu and 999.jpg.djvu. Delete these two files. Then convert both 000.jpg and 999.jpg to colour DjVu pages by executing the following command (note that the quotation marks are necessary):
$ ~/bin/any2djvu-low "+(000|999).jpg" 8
where any2djvu-low is the script given below, which must be saved in ~/bin/ for the command to work.
#!/bin/bash
#
# any2djvu-low
#
# Converts images in the current directory to single-page low colour DjVu files.

shopt -s extglob

DEFMASK="*.jpg"
DPI=300
DEFNCOLORS=256
# uncomment the following line to compile a bundled DjVu document
#OUTFILE="#0-low.djvu"

# usage must be defined before it is first called
function usage() {
    echo
    echo "usage:"
    echo
    echo "$0 [\"MASK\" [INT]]"
    echo "    converts single pages with the default mask $DEFMASK (or MASK if provided)"
    echo "    in the current directory to single-page low colour djvu documents with the"
    echo "    number of colours $DEFNCOLORS (default) or INT (if provided)"
    # uncomment the following line to compile a bundled DjVu document
    # echo "    and bundles them as a djvu file $OUTFILE"
    echo
}

if ! command -v cpaldjvu >/dev/null; then
    usage
    echo "Error: cpaldjvu is needed"
    echo
    exit 1
fi

if [ -n "$1" ]; then
    MASK=$1
    if [ -n "$2" ]; then
        NCOLORS=$2
    else
        NCOLORS=$DEFNCOLORS
    fi
else
    MASK=$DEFMASK
    NCOLORS=$DEFNCOLORS
fi

# $MASK is deliberately unquoted so that the shell expands the pattern
for i in $MASK; do
    if [ ! -e "$i" ]; then
        usage
        echo "Error: current directory must contain files with the mask $MASK"
        echo
        exit 1
    fi
    # pages that already have a .djvu file are skipped
    if [ ! -e "$i.djvu" ]; then
        echo "$i"
        cpaldjvu -dpi "$DPI" -colors "$NCOLORS" "$i" "$i.djvu"
    fi
done

# uncomment the following line to compile a bundled DjVu document
#djvm -c "$OUTFILE" $MASK.djvu
The colour DjVu pages were produced by the low colour encoder cpaldjvu rather than by the bitonal encoder cjb2. Occasionally cpaldjvu with the 2 colour setting produces slightly smaller output files than cjb2. This may happen because black appears lighter in the output of cpaldjvu. Since bitonal images usually look nicer with a stronger black, cjb2 remains preferable for them. In addition, converting a JPEG image to a bitonal DjVu with cpaldjvu takes approximately 1.5 times longer than with cjb2.
Be prepared for cpaldjvu with the default number of colours (256) to produce an output almost the same size as the initial (even 16M colour) JPEG file. Reducing the number of colours with the option -colors n of cpaldjvu helps in many cases, but only slowly: for example, reducing n from 256 to 16 may give an output only 4 times smaller.
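To see how the number of colours affects the file size for a particular page, the page can be encoded with several settings and the results compared. The following is a minimal sketch; the function name try_colors and the output naming scheme are hypothetical, the resolution of 300 dpi is an assumption taken from the scripts above, and cpaldjvu is invoked in the same way as in any2djvu-low.

```shell
#!/bin/bash
# Encode one image with several palette sizes and print the size of each
# result. Output files are named after the image, e.g. 000-16.djvu.
try_colors() {
    local img=$1 n out
    shift
    for n in "$@"; do
        out="${img%.*}-$n.djvu"
        cpaldjvu -dpi 300 -colors "$n" "$img" "$out" || return 1
        # tr strips the padding that some versions of wc add
        printf '%s colours: %s bytes\n' "$n" "$(wc -c < "$out" | tr -d ' ')"
    done
}

# Example: try_colors 000.jpg 256 64 16 8
```

The trial files can be deleted once a suitable number of colours has been chosen.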
2.4 Binding DjVu
The final step is to bind all the separate DjVu pages into a multi-page DjVu document. The following script binddjvu does this.
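#!/bin/bash
#
# binddjvu
#
# Bundles single-page DjVu files into one multi-page document.

shopt -s extglob

OUTFILE="#0.djvu"
DEFMASK="*.jpg.djvu"

if [ -n "$1" ]; then
    MASK=$1
else
    MASK=$DEFMASK
fi

# $MASK is deliberately unquoted so that the shell expands the pattern
djvm -c "$OUTFILE" $MASK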
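The multi-page DjVu file #0.djvu can be given a more meaningful name. The old name must be quoted, because an unquoted # at the beginning of a word starts a comment in the shell:
$ mv '#0.djvu' thebook.djvu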
And we are done with our example.
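As a final check, the number of pages in the bundled document can be compared with the number of single-page files. The following is a minimal sketch; the function name check_page_count is hypothetical, and it assumes that djvused from djvulibre is installed (its command n prints the number of pages in a document).

```shell
#!/bin/bash
# Compare the page count of a bundled DjVu document with the number of
# single-page files given as the remaining arguments.
check_page_count() {
    local doc=$1 expected pages
    shift
    expected=$#
    pages=$(djvused -e n "$doc") || return 1
    if [ "$pages" -eq "$expected" ]; then
        echo "OK: $doc has $pages pages"
    else
        echo "Mismatch: $doc has $pages pages, expected $expected" >&2
        return 1
    fi
}

# Example: check_page_count thebook.djvu *.jpg.djvu
```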
2.5 PDF to DjVu
The PDF format is also used for digitizing documents, e.g. by jstor.org, and at present it is still more widespread than DjVu, only because many people have programs for reading PDF and have nothing for reading DjVu. There are several reasons to replace PDF with DjVu, including the following:
- On scanned documents the performance of DjVu is better than that of PDF. This is why it makes sense to convert a scanned PDF document to DjVu format.
- We may have many single-page PDF documents that we want to bind together, for example the pages of a PDF document downloaded from an internet library.
- Several PDFs, single-page or multi-page, can be merged into a single DjVu.
- Some scanners can scan directly to single-page PDF files. Then again it is convenient to bind these PDFs into a multi-page DjVu.
The following script pdfs2djvu suffices for each of the above actions. By default pdfs2djvu takes all the *.pdf files in the current directory in alphabetical order and produces a single multi-page bitonal DjVu file #0.djvu.
#!/bin/bash
#
# pdfs2djvu
#
# Converts PDF files in the current directory to one bitonal DjVu document.

# check each required tool separately
for cmd in pdftoppm cjb2 djvm; do
    if ! command -v "$cmd" >/dev/null; then
        echo
        echo "Error: pdftoppm, cjb2 and djvm are needed"
        echo
        exit 1
    fi
done

shopt -s extglob

OUTFILE="#0.djvu"
DEFMASK="*.pdf"
DPI=600

if [ -n "$1" ]; then
    MASK=$1
else
    MASK=$DEFMASK
fi

# $MASK is deliberately unquoted so that the shell expands the pattern
for PDF in $MASK; do
    if [ ! -e "$PDF" ]; then
        echo
        echo "Error: current directory must contain files with the mask $MASK"
        echo
        exit 1
    fi
    echo "$PDF"
    # render at the same resolution that is later passed to cjb2
    pdftoppm -mono -r "$DPI" -aa yes "$PDF" "$PDF"
    for PBM in "$PDF"*.pbm; do
        echo "$PBM"
        cjb2 -dpi "$DPI" "$PBM" "$PBM.djvu"
        rm -f "$PBM"
    done
done

djvm -c "$OUTFILE" $MASK*.pbm.djvu
After a run, the script pdfs2djvu leaves the DjVu-encoded pages as files *.pbm.djvu in the current directory.
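Once the bundled document has been checked, the leftover single pages can be removed. The following is a minimal sketch; the function name cleanup_pages is hypothetical, and it refuses to delete anything unless the bundled file (by default #0.djvu, the name used by pdfs2djvu) exists and is not empty.

```shell
#!/bin/bash
# Remove the intermediate *.pbm.djvu pages from the current directory, but
# only if the bundled document exists and is not empty.
cleanup_pages() {
    local out=${1:-'#0.djvu'}
    if [ ! -s "$out" ]; then
        echo "Error: $out is missing or empty, nothing removed" >&2
        return 1
    fi
    rm -f -- *.pbm.djvu
}

# Example: cleanup_pages '#0.djvu'
```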
3 Concluding remarks
This HOWTO was written by a user of DjVu, not by one of its developers, so it possibly lacks some technical details. If you wish to get more technical information on the commands, see the manpages or any other relevant documentation. I would suggest a very instructive
$ man djvu
to anybody beginning to use djvulibre on Linux.
Author: Vladimir Komendantsky <MY_LASTNAME at gmail.com>