Computerising a library catalogue using optical character recognition
Citation: Glynn Anderson, 'Computerising a library catalogue using optical character recognition', [thesis], Trinity College (Dublin, Ireland). School of Computer Science & Statistics, 1993, pp 164
Anderson TCD THESIS 2842 Computerising a.pdf (PDF) 67.76Mb
Trinity College Library contains several million books. Catalogues for the more modern books have been computerised to give readers a fast and efficient means of locating a book. The 1872 Printed Catalogue, which lists books owned by the library before 1872, has not yet been computerised. The catalogue lists 165,000 books, some of which are the most valuable in the library. The purpose of this project is to write a computer program that will automatically computerise the catalogue using optical character recognition (OCR). OCR is the process by which a digital picture of a portion of text is converted into computer-readable text. Each character on the page is represented by a group, or 'blob', of dots (pixels). The role of the computer is twofold: first, to decide which pixels should be grouped together (i.e. which belong to the same character), and second, to decide which character each blob of pixels represents. The output of the OCR program is sent to a database and will eventually be incorporated into the existing DYNIX© database currently in use in the library. The thesis contains a review of several different approaches to OCR, including feature vector analysis, discrimination trees, stroke analysis and neural networks. The implementation and results of a selection of these methods are described. The recognition (classification) method used in this project, template matching, has not been implemented before as a primary classification method. The results of this thesis show that template matching compares very favourably with other classification methods. The thesis describes the considerable work undertaken in deriving a good matching algorithm, which is the key to the success of template matching. The segmentation of lines and characters is described in full, including the development of a very efficient perimeter tracing algorithm.
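The two roles described above, grouping pixels into blobs and then deciding which character each blob represents, can be illustrated with a minimal Python sketch. It is not the algorithm developed in the thesis: it assumes a binary image held as a list of lists (1 for ink), uses a plain breadth-first flood fill with 8-connectivity to group pixels, and scores each template by counting mismatched pixels. The names `find_blobs`, `normalise` and `classify` are hypothetical, and the thesis's own matching and perimeter tracing algorithms are not reproduced here.

```python
from collections import deque


def find_blobs(image):
    """Group ink pixels (value 1) into 8-connected blobs.

    Returns a list of sets of (row, col) coordinates, in row-major
    order of each blob's first pixel.
    """
    rows, cols = len(image), len(image[0])
    seen = set()
    blobs = []
    for r in range(rows):
        for c in range(cols):
            if image[r][c] != 1 or (r, c) in seen:
                continue
            # Breadth-first flood fill from this unvisited ink pixel.
            blob = set()
            queue = deque([(r, c)])
            seen.add((r, c))
            while queue:
                y, x = queue.popleft()
                blob.add((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < rows and 0 <= nx < cols
                                and image[ny][nx] == 1
                                and (ny, nx) not in seen):
                            seen.add((ny, nx))
                            queue.append((ny, nx))
            blobs.append(blob)
    return blobs


def normalise(blob):
    """Shift a blob so that its bounding box starts at (0, 0)."""
    top = min(y for y, _ in blob)
    left = min(x for _, x in blob)
    return frozenset((y - top, x - left) for y, x in blob)


def classify(blob, templates):
    """Return the label of the template with the fewest mismatched pixels.

    `templates` maps a label to a frozenset of (row, col) ink pixels.
    The mismatch count is the size of the symmetric difference; this
    scoring rule is an assumption made for the sketch.
    """
    shape = normalise(blob)
    return min(templates, key=lambda label: len(shape ^ templates[label]))
```

Given a small image containing a vertical bar and a horizontal bar, and one template for each shape, `[classify(b, templates) for b in find_blobs(image)]` labels the blobs in scan order. Characters made of more than one blob (such as 'i') and differences in size between a blob and its template would need extra handling that this sketch leaves out.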
Before the final chapters on results, conclusion and future work, there is a chapter explaining how a state machine is used, while classifying, to delimit the fields within each entry on a catalogue page.
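The idea of using a state machine to delimit fields while characters are being classified can be sketched as follows. The abstract does not give the actual fields or delimiters of the 1872 Printed Catalogue, so everything specific here is an assumption made for illustration: the three fields (`HEADING`, `TITLE`, `IMPRINT`), the delimiter characters in `TRANSITIONS`, and the function name `split_entry` are all hypothetical.

```python
from enum import Enum


class Field(Enum):
    # Hypothetical fields; the real catalogue layout is not given in the abstract.
    HEADING = "heading"
    TITLE = "title"
    IMPRINT = "imprint"


# Hypothetical transition table:
# (current field, delimiter character) -> next field.
TRANSITIONS = {
    (Field.HEADING, "."): Field.TITLE,
    (Field.TITLE, ";"): Field.IMPRINT,
}


def split_entry(chars):
    """Assign each recognised character to a field, switching on delimiters."""
    state = Field.HEADING
    fields = {field: "" for field in Field}
    for ch in chars:
        next_state = TRANSITIONS.get((state, ch))
        if next_state is not None:
            # The delimiter is consumed and the machine moves to the next field.
            state = next_state
        else:
            fields[state] += ch
    return {field.value: text.strip() for field, text in fields.items()}
```

For an invented entry such as `"SMITH, John. A treatise on optics; Dublin, 1850"`, `split_entry` yields a heading, a title and an imprint. Because the transition depends on the current state as well as the character, a full stop inside the title or imprint does not trigger a field change.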
Author: Anderson, Glynn
Advisor: Byrne, John G.
Qualification name: Master in Science (M.Sc.)
Publisher: Trinity College (Dublin, Ireland). School of Computer Science & Statistics
Type of material: thesis
Availability: Full text available