Embedding Transcripts in Handwritten Pages
a groundbreaking demonstration
by James Rile, PlanetDjVu and Jeffery Triggs, Rutgers University, September 23, 2003
It is well known that OCR (Optical Character Recognition) works reliably only on images of printed text. Contemporary OCR engines lack the sophistication to recognize most handwritten text (unlike humans!). So there is no way to automatically generate machine-readable (ASCII) text for a handwritten page, whether embedded or not.
A transcript is a machine-readable rendition of the handwriting, typed by a human and saved as a separate text document (in a format such as Word or HTML).
An embedded transcript is likewise created externally by a human, but is then inserted into the page and mapped to the image of the handwriting, as if the page had been OCRed.
What are the benefits of embedded transcripts vs. external transcripts?
There are three:
1. You can put the cursor over a word you have trouble reading, and the transcript of that word will appear in a tooltip. You can also display the entire line of text for better contextual understanding.
2. You can copy and paste selected text from handwritten pages to another computer document, such as a Word file, as part of your research and knowledge discovery activities.
3. You can perform search and retrieval operations on handwritten pages, allowing handwritten pages to be integrated with printed pages in one archive.
About the Jefferson Letter
We are presenting here an example of handwritten pages with embedded transcripts. This is the well-known Jefferson letter to Angelica Schuyler Church from the DjVu Zone website, based on an original at the University of Virginia Library.
How It Was Done
The transcripts were embedded in these pages using the djedit program, though DjVu Solo 3.1 or DjVu Editor could have been used as well. Initially the words were inserted as annotations (highlights with transparent color) on rectangular areas dragged over the handwritten words. The DjVu files were then passed through a simple Perl script that extracted the annotation coordinates using djvused, converted them to djvused "text" coordinates, and re-inserted them as text with djvused.
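The coordinate conversion step can be sketched as follows. This is not the original Perl script, which is not reproduced here, but a minimal Python illustration under stated assumptions: that each annotation is a maparea entry whose comment field holds the transcribed word, that its rectangle is given as (rect xmin ymin width height), and that a hidden-text word zone is written as (word xmin ymin xmax ymax "text"). The function names are hypothetical, and placing word zones directly under the page zone is a simplification.

```python
import re

# Simplified pattern for one annotation:
#   (maparea "url" "comment" (rect xmin ymin width height) ...)
# Assumption: the comment holds the transcribed word and contains no
# escaped double quotes.
MAPAREA_RE = re.compile(
    r'\(maparea\s+"(?P<url>[^"]*)"\s+"(?P<comment>[^"]*)"\s+'
    r'\(rect\s+(?P<x>-?\d+)\s+(?P<y>-?\d+)\s+(?P<w>\d+)\s+(?P<h>\d+)\s*\)'
)


def annotations_to_words(annotation_text):
    """Return (xmin, ymin, xmax, ymax, word) tuples from annotation text."""
    words = []
    for match in MAPAREA_RE.finditer(annotation_text):
        x, y, w, h = (int(match.group(key)) for key in "xywh")
        # Annotation rectangles store width and height; text zones store
        # the opposite corner, so add the size to the origin.
        words.append((x, y, x + w, y + h, match.group("comment")))
    return words


def quote(text):
    """Quote a string for use inside an S-expression."""
    return '"' + text.replace("\\", "\\\\").replace('"', '\\"') + '"'


def words_to_text_layer(words, page_width, page_height):
    """Build a page-level hidden-text expression containing the words."""
    lines = [f"(page 0 0 {page_width} {page_height}"]
    for xmin, ymin, xmax, ymax, text in words:
        lines.append(f" (word {xmin} {ymin} {xmax} {ymax} {quote(text)})")
    return "\n".join(lines) + ")"
```

In such a workflow, the annotation text would come from the djvused print-ant command and the generated expression would be written back with set-txt. A real page would probably also group the words into line zones so that whole lines can be selected.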
To create the DjVu version with Line Tooltips, the DjVu files were edited in DjVu Solo 3.1. The word annotations were deleted and replaced by line annotations (while keeping the original text mapping created with the word annotations).
Another method for embedding text behind handwriting
Two years ago, we were the first to present ASCII text behind handwriting in our News Article: DjVu Text-Behind-Handwriting Introduced!
We used a different method than the one described above.
We first OCRed the handwritten page. While text could not be recognized, the bounding boxes for each word were generated in this way. We then used the DjVutoXML export utility (now part of both DjVuLibre and Document Express Enterprise (CLE)) to create a structured XML file. We manually edited the XML file, replacing the "garbage text" generated by OCR with the real words. We then used the XMLtoDjVu import utility to import the transcript text into the DjVu file.
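The manual XML editing step could in principle be partly automated. The sketch below is only an illustration, not the procedure we followed: it assumes that the exported XML stores each recognized word in a WORD element with a coords attribute, that the OCR produced exactly one bounding box per transcript word, and that the boxes appear in reading order. The function name replace_words is hypothetical.

```python
import xml.etree.ElementTree as ET


def replace_words(xml_text, transcript):
    """Replace the text of each WORD element with the next transcript word.

    Assumes one WORD element per transcript word, in reading order.
    The coords attributes (the bounding boxes) are left untouched.
    """
    root = ET.fromstring(xml_text)
    word_elements = list(root.iter("WORD"))
    new_words = transcript.split()
    if len(word_elements) != len(new_words):
        raise ValueError(
            f"{len(word_elements)} word boxes but {len(new_words)} transcript words"
        )
    for element, word in zip(word_elements, new_words):
        element.text = word
    return ET.tostring(root, encoding="unicode")
```

In practice the counts rarely match, because OCR run on handwriting tends to split or merge words, so the length check is there to stop the process and send the page back for manual correction rather than silently misalign the text.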
How should YOU embed transcripts into handwritten pages?
Follow either of the pioneering methods described above. Voice your interest in the Forum of PlanetDjVu for the development of production tools and editors for transcript-embedding.