University of Louisiana Digital Library
PDF Normal from Image compared to DjVu with hidden text
This comparison is made using academic articles published at the University of Louisiana Digital Library in PDF Normal format. The PDF files were created between 1996 and 1998. The Normal form of PDF is the smallest form of PDF that can be created from scanned page images.
In the Normal form of PDF, the text is converted to visible ASCII text with fonts, point sizes and text attributes such as bold and italics. Graphic illustrations, equations and symbols are left as graphic objects in the PDF. These graphic objects are "snippets" of the original complete scanned page image. Because each page holds only these snippets rather than the complete image, the file size is smaller in PDF Normal.
The downside of PDF Normal is that it is impossible to represent the layout of the original page with complete fidelity. Also, defects can occur in the ASCII text presentation, as you can see in these PDF examples.
DjVu takes a different approach from PDF Normal in producing a small full-text-searchable file. High-contrast analysis is performed to separate background graphics and foreground text. The text is compressed with the JB2 compression method, which is superior to the supported PDF methods for bitonal images. The graphics are compressed with IW44, which is superior to the supported PDF methods for graphics. The background and foreground layers are merged when the image is presented in the viewer.
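The separate-then-merge idea described above can be illustrated with a toy sketch. This is not the DjVu codec: it uses a fixed brightness threshold as a stand-in for the high-contrast analysis, and it skips JB2 and IW44 compression entirely. The function names `segment` and `merge`, the threshold value and the sample page are all assumptions made for illustration.

```python
# Toy model of a layered page: a bitonal text mask plus a background layer.
# Assumption: a fixed threshold stands in for the real high-contrast analysis.

def segment(page, threshold=128):
    """Split a grayscale page (rows of 0-255 ints) into a mask and a background.

    Mask is 1 where a pixel is dark enough to count as text. In the background,
    text pixels are filled with the average of the non-text pixels in that row.
    """
    mask = [[1 if px < threshold else 0 for px in row] for row in page]
    background = []
    for row, mask_row in zip(page, mask):
        non_text = [px for px, m in zip(row, mask_row) if not m]
        fill = sum(non_text) // len(non_text) if non_text else 255
        background.append([fill if m else px for px, m in zip(row, mask_row)])
    return mask, background


def merge(mask, background, foreground_value=0):
    """Recombine the layers: masked pixels take the foreground value."""
    return [
        [foreground_value if m else px for px, m in zip(bg_row, mask_row)]
        for bg_row, mask_row in zip(background, mask)
    ]


# A tiny 2x4 "page": light paper with a few dark text pixels.
page = [
    [250, 250, 10, 250],
    [240, 20, 20, 240],
]
mask, background = segment(page)
merged = merge(mask, background)
```

In this sketch the mask is the part that a bitonal compressor would handle, and the smoothed background is the part that a continuous-tone compressor would handle. The viewer-side step corresponds to `merge`.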
Like PDF Image + Text (now called Searchable Image PDF), DjVu has a searchable text layer hidden under the image, and in JRASearch both forms can be searched with resulting search term highlighting.
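As a rough illustration of how a hidden text layer can drive search term highlighting, the sketch below stores each recognized word with its position on the page image and returns the rectangles to highlight for a search term. The data model (`Word`, `bbox`) and the function `find_highlights` are hypothetical; they do not describe how JRASearch, DjVu or PDF actually store or search the text layer.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """One OCRed word in the hidden layer (hypothetical structure)."""
    text: str
    bbox: tuple  # (x, y, width, height) in page-image pixels


def find_highlights(words, term):
    """Return the bounding boxes of words containing the term, ignoring case."""
    needle = term.lower()
    return [w.bbox for w in words if needle in w.text.lower()]


# Made-up sample layer for one line of a page.
hidden_text = [
    Word("Historic", (10, 10, 80, 12)),
    Word("weather", (95, 10, 70, 12)),
    Word("logs", (170, 10, 40, 12)),
]
boxes = find_highlights(hidden_text, "Weather")
```

A viewer would then draw a highlight over each returned rectangle on top of the page image, while the text itself stays invisible.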
In the table below, click on the size to open that file:
Note that the DjVu file for cdx01240 is OCRed in three languages: English, German and French!
PDF Color Image-Only compared to DjVu Photo and Segmented
Historic weather logs from the University of Louisiana are presented as JPEG-compressed images in a "PDF Wrapper". We compare these to DjVu Photo (background IW44 compression only) and DjVu Segmented (background and foreground layers).
The defects in the DjVu Segmented version (text in the background) will be eliminated when the uncompressed color image file is used for DjVu encoding. In this presentation, the compressed color image file in the PDF was used, limiting the effectiveness of the segmenter. For an example of good segmentation of handwritten text using uncompressed color images as input, see: http://www.planetdjvu.com/gallery/jones.djvu.