Marriott Library Digitized Collections
 

Intermountain Ski Instructors Association Records

About this Digital Project


Scanning and OCR
Documents were scanned at 300 dpi on an Epson Expression 836XL flatbed scanner and saved as PDF. These image files were then run through Adobe Capture 2.01, an optical character recognition (OCR) program, and output as plain text files. Capture can also output HTML, but the resulting code contains extraneous markup and invalid tags.

OCR on older, typewritten documents produces poor results. Colored paper, light ink, and stray marks all contribute to illegible output, and handwritten documents usually produce no usable text at all. The quality of the original documents in this collection varies widely.

We created HTML templates in Dreamweaver 3.0 and pasted the OCR text into the body. There is one template for each box of manuscripts (11 total). The templates share a number of identical meta tags; the only indexing unique to each document is a keyed-in title containing a brief subject line, the date, and the place of the event the document covers. We used templates because Dreamweaver can instantly update the thousands of HTML files based on them. Each HTML file also contains a link to the PDF document; these links were inserted manually into the files generated from the templates.
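
The template approach described above can be illustrated with a short Python sketch. This is not the Dreamweaver workflow itself: the template markup, the meta tag name, and the function names build_title and build_page are all assumptions made for illustration, since the original templates are not reproduced here.

```python
from html import escape
from string import Template

# Hypothetical stand-in for one per-box template. The real Dreamweaver
# templates and their meta tags are not shown in this document.
BOX_TEMPLATE = Template("""<html>
<head>
<title>$title</title>
<meta name="collection" content="Intermountain Ski Instructors Association Records">
</head>
<body>
<p><a href="$pdf_href">View the scanned PDF</a></p>
<pre>$ocr_text</pre>
</body>
</html>
""")


def build_title(subject, date, place):
    """Join the three keyed-in fields into a single title string."""
    return f"{subject}, {date}, {place}"


def build_page(subject, date, place, pdf_href, ocr_text):
    """Fill the template with one document's title, PDF link, and OCR text."""
    return BOX_TEMPLATE.substitute(
        title=escape(build_title(subject, date, place)),
        pdf_href=escape(pdf_href, quote=True),
        ocr_text=escape(ocr_text),
    )
```

Escaping the OCR text matters in a sketch like this because raw OCR output can contain stray angle brackets and ampersands that would otherwise produce invalid HTML.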

Indexing and Searching
The HTML documents were indexed with SWISH-Enhanced (SWISH-E), a free indexing program. Each time a user submits a search, a Perl script queries the index created by SWISH-E and produces an HTML page of results. These search results link only to the corresponding PDF documents.
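
As a rough illustration of the last step, the Python sketch below turns a list of search hits into a results page whose links point at PDFs rather than at the indexed HTML files. It is not the project's Perl script and does not call SWISH-E. It assumes the hits arrive as (HTML path, title) pairs and that each PDF shares its HTML file's base name. Both assumptions, and the names pdf_href_for and render_results, are hypothetical.

```python
from html import escape
from pathlib import PurePosixPath


def pdf_href_for(html_path):
    """Map an indexed HTML path to its PDF (assumes a shared base name)."""
    return str(PurePosixPath(html_path).with_suffix(".pdf"))


def render_results(query, hits):
    """Build a results page from (html_path, title) pairs for one search."""
    items = [
        f'<li><a href="{escape(pdf_href_for(path), quote=True)}">{escape(title)}</a></li>'
        for path, title in hits
    ]
    # Show a placeholder entry when the search matched nothing.
    body = "\n".join(items) if items else "<li>No documents matched.</li>"
    return (
        f"<html><head><title>Results for {escape(query)}</title></head>\n"
        f"<body><ul>\n{body}\n</ul></body></html>\n"
    )
```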

Browsing
Another Perl script is used to browse the contents of each box and display an HTML page of results. The script reads the contents of the directory named in the link and displays each HTML filename along with the text inside that file's <title></title> tags. These results link to the HTML documents, and a link on each of those pages takes the user through to the PDF if they wish, or if the OCR'd text is illegible.
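
The browsing behavior can be sketched in Python as follows. This is an illustrative approximation, not the original Perl script. The function names, the regular expression used to find the title, and the "(untitled)" fallback are assumptions.

```python
import re
from html import escape, unescape
from pathlib import Path

# Matches the contents of the first <title>...</title> pair, in any letter case.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def read_title(html_file):
    """Return the text inside <title></title>, or None if the tag is missing."""
    text = Path(html_file).read_text(encoding="utf-8", errors="replace")
    match = TITLE_RE.search(text)
    if match is None:
        return None
    # Decode entities and collapse line breaks and runs of spaces.
    return " ".join(unescape(match.group(1)).split())


def browse_box(box_dir):
    """List (filename, title) for every HTML file directly inside box_dir."""
    files = sorted(
        p for p in Path(box_dir).iterdir()
        if p.is_file() and p.suffix.lower() in {".html", ".htm"}
    )
    return [(p.name, read_title(p) or "(untitled)") for p in files]


def render_listing(box_dir):
    """Render the box contents as an HTML list linking to each HTML document."""
    rows = [
        f'<li><a href="{escape(name, quote=True)}">{escape(name)}</a>: {escape(title)}</li>'
        for name, title in browse_box(box_dir)
    ]
    return "<ul>\n" + "\n".join(rows) + "\n</ul>\n"
```

Because the directory name comes from a link, a script along these lines would normally check it against the known list of boxes before reading from disk. That check is omitted here for brevity.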

Questions or Comments?
Please email Kenning Arlitsch, Head of the Marriott Library Digitization Center, or call (801) 585-3721.

 

Digitization Center Marriott Library