AT YOUR SERVICE:
By Julie Gable
Getting high-quality paper documents onto the Web involves trade-offs. Cost, quality and file size are interwoven issues that can tangle decision making. New products with better compression algorithms for text and photographs can help, but not without a price. Here's what to consider.
Is the document collection's value greater than the cost to convert it to a Web-ready format? The Portable Document Format (PDF), developed by Adobe, San Jose, CA, is an easy choice for putting paper documents on the Web because it retains the original's look and feel. With PDF, however, conversion costs rise as the need for smaller files grows. Here's why.
Acrobat can produce three file types: PDF Image Only, PDF Normal and PDF Image Plus Text. [The last two have been renamed PDF Formatted Text and Graphics files and PDF Searchable Image files, respectively, in Acrobat Capture 3.0.] According to Tony McKinley, author of "Paper to the Web," a typical page of text scanned at 300 dpi and compressed with Group 4 is 50 K. When converted to an Image Only file, the PDF "wrapper" adds another 5 K for a file size of about 55 K per page. Any user with Acrobat Reader can view the image.
Image files converted to PDF Normal undergo optical character recognition (OCR) so that the text becomes searchable, and the image file is discarded. The resulting file size of about 10 K per page is easily downloadable. One difficulty with OCR in PDF is that the recognized text is correctable but not editable; that is, changes to the OCR text will not automatically reflow from line to line as they would in a word processing file. As with any OCR job, the file requires cleanup, a manual process that adds to conversion costs.
PDF Image Plus Text format retains the image file and places the converted text file behind it. The user searches the text file but actually sees the image. The trade-off here is that the text file may not require as much cleanup, but the file size is about 80 K to 90 K per page. A 100-page document scanned to this format, at roughly 8 MB to 9 MB, will take ages to arrive across a LAN, will require several minutes to open and will exceed the capacity of a floppy disk.
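The per-page figures quoted for the three PDF types make this comparison easy to tabulate. The following is a minimal sketch in Python, not part of any Adobe tool. It assumes that 1 K means 1,024 bytes, that the floppy is a 3.5-inch high-density disk holding 1,474,560 bytes, and that 85 K is a fair midpoint for Image Plus Text. The function and variable names are illustrative only.

```python
# Approximate per-page sizes in K, taken from the figures quoted in the text.
# 85 is an assumed midpoint of the 80 K to 90 K range for Image Plus Text.
PER_PAGE_K = {
    "PDF Image Only": 55,
    "PDF Normal": 10,
    "PDF Image Plus Text": 85,
}

BYTES_PER_K = 1024            # assumption: 1 K = 1,024 bytes
FLOPPY_BYTES = 1_474_560      # assumption: 3.5-inch high-density floppy


def document_bytes(pages, per_page_k):
    """Estimated size in bytes of a document with the given page count."""
    return pages * per_page_k * BYTES_PER_K


def fits_on_floppy(pages, per_page_k):
    """True if the estimated document size fits on one floppy disk."""
    return document_bytes(pages, per_page_k) <= FLOPPY_BYTES


if __name__ == "__main__":
    for name, k in PER_PAGE_K.items():
        size_mb = document_bytes(100, k) / (1024 * 1024)
        verdict = "fits" if fits_on_floppy(100, k) else "does not fit"
        print(f"{name}: {size_mb:.1f} MB for 100 pages, {verdict} on a floppy")
```

Under these assumptions, only the PDF Normal version of a 100-page document fits on a single floppy; both image-bearing formats exceed it several times over.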
PDF may not be the right choice if the documents to be scanned contain many color or halftone photographs. The scanning resolution of 300 dpi needed for high-contrast areas such as text is actually higher than the 72 dpi to 100 dpi resolution needed for a photograph's continuous tone. The photos don't compress well using Group 4, so the scanned page file can be a megabyte or more even after compression. [Adobe has improved on color and graphics compression with Acrobat Capture 3.0, but not enough to defeat bandwidth challenges.] Conversely, JPEG compression, which does pixel averaging for photos, erodes text quality. PDF Normal will reduce file size in this instance, but the trade-off is the labor involved in OCR cleanup.
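To see why resolution and color depth weigh so heavily on file size, it helps to count raw pixels before any compression is applied. The sketch below is a back-of-the-envelope calculation in Python. It assumes an 8.5 by 11 inch page, 1 bit per pixel for black-and-white text and 24 bits per pixel for color; none of these values comes from the article, and the function name is hypothetical.

```python
def uncompressed_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Raw size in bytes of a scanned page before any compression."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return pixels * bits_per_pixel // 8


# Assumption: an 8.5 x 11 inch page; the article does not state a page size.
WIDTH_IN, HEIGHT_IN = 8.5, 11

text_300_bilevel = uncompressed_bytes(WIDTH_IN, HEIGHT_IN, 300, 1)
photo_300_color = uncompressed_bytes(WIDTH_IN, HEIGHT_IN, 300, 24)
photo_100_color = uncompressed_bytes(WIDTH_IN, HEIGHT_IN, 100, 24)

if __name__ == "__main__":
    print(f"300 dpi, 1-bit text page:  {text_300_bilevel:,} bytes")
    print(f"300 dpi, 24-bit color page: {photo_300_color:,} bytes")
    print(f"100 dpi, 24-bit color page: {photo_100_color:,} bytes")
    print(f"300 dpi vs 100 dpi color:   {photo_300_color // photo_100_color}x")
```

With these assumed values, a color page scanned at 300 dpi carries nine times the raw data of the same page at 100 dpi, and 24 times that of a black-and-white scan at the same resolution, which is why a page with photos stays large even after compression.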
As another alternative, a service bureau can scan each page once to capture the text, then again on a color scanner to capture the photos. Each photo's color image is clipped with a tool like Photoshop and reinserted into the text page. The manual effort can drive up conversion costs to $5 or $6 per page. Tom Johnson, president of Root Technologies, Princeton, NJ, faced this situation in converting The Journal of the Acoustical Society for the Web.
"The back issues of the journal, from 1926 to 1996, had 200,000 pages, with many color and halftone photographs," says Johnson. "We needed a way to produce 75 K files suitable for Web viewing."
Root chose DjVu from LizardTech, Seattle. DjVu was developed by AT&T Labs expressly for scanning pages that contain both text and pictures. Once a page is scanned, the two object types are placed in separate layers and compressed with different methods. Both methods are lossy, but neither affects document readability. DjVu also eliminates redundant character information. For example, if the character "e" appears 100 times, DjVu can store the compressed image of the character once, with 99 pointers to its other locations. The resulting files are 30 percent smaller than TIFF images or PDF Image Only files, without the labor-intensive work associated with producing PDF Normal or with color image reinsertion. For color pages at 300 dpi that contain text and pictures, DjVu files are generally five to eight times smaller than GIF or JPEG.
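The idea of storing a character shape once and pointing to it elsewhere can be illustrated with a small Python sketch. This is a simplified illustration of the general technique, not LizardTech's actual encoder: it treats two glyphs as the same only when their bitmaps are exactly equal, whereas scanned characters rarely match pixel for pixel, so a real encoder would need some form of tolerant matching. The function name, the toy bitmaps and the position tuples are all hypothetical.

```python
def deduplicate_glyphs(glyphs):
    """Store each distinct glyph bitmap once and replace repeats with pointers.

    `glyphs` is a list of (bitmap, position) pairs in reading order, where
    bitmap is any hashable stand-in for a character image (here, a tuple of
    row strings). Returns (shapes, placements): the unique bitmaps and, for
    each glyph on the page, an (index into shapes, position) pair.
    """
    shapes = []          # unique bitmaps, each stored once
    index_of = {}        # bitmap -> its index in shapes
    placements = []      # one (shape index, position) pointer per glyph
    for bitmap, position in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(shapes)
            shapes.append(bitmap)
        placements.append((index_of[bitmap], position))
    return shapes, placements


# Two toy 3x3 bitmaps standing in for scanned character images.
E = ("###", "#..", "###")
T = ("###", ".#.", ".#.")

# A "page" with three occurrences of E and one of T, at made-up positions.
page = [(E, (0, 0)), (T, (4, 0)), (E, (8, 0)), (E, (12, 0))]
shapes, placements = deduplicate_glyphs(page)

if __name__ == "__main__":
    print(len(shapes), "shapes stored for", len(placements), "glyphs")
```

In this toy page, four glyphs are represented by two stored shapes plus four small pointers. The saving grows with the length of the document, since body text reuses the same few dozen character shapes over and over.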
"DjVu's new searchable text feature makes it the format of choice for scanning document images to the Web," says James Rile, president of independent consulting firm Rile Associates, Phoenixville, PA. Rile has done extensive work with PDF, but he found that a 32-page, full-color magazine averaged 40.6 K per page with DjVu.
So what's the trade-off? Displaying DjVu files requires the DjVu viewer (free at http://www.djvu.com/ or http://www.lizardtech.com/ as a one-click download). But the viewer works only in a Web browser, not on the desktop like Adobe Acrobat Reader, and it lacks Acrobat Reader's navigation features. To create DjVu files, LizardTech sells an enterprise version of its software for $7,000 per CPU and a personal version for $250. A workgroup version is planned.
DjVu offers an alternative to costlier conversion methods, potentially tipping the balance for placing high-quality content on the Web. Its smaller file sizes will appeal to users with little patience for slow downloads, and they'll be rewarded with excellent text and photo quality. Whether users will judge a site's content worth the effort of loading a special viewer remains to be seen.
Julie Gable (Juliegable@aol.com), CDIA, LIT, is an independent consultant. Product mention should not be construed as an endorsement.