Creation of a customised character recognition application

This master’s thesis describes the creation of a customised optical character recognition (OCR) application, intended for use in the digitisation of theses submitted to Uppsala University in the 18th and 19th centuries. For this purpose, the open-source software Gamera was used for recognition and classification of the characters in the documents. The software provides specific algorithms for the analysis of heritage documents and is designed to be used as a tool for creating domain-specific (i.e. customised) recognition applications.

Using the Gamera classifier training interface, classifier data was created that reflects the characters in the particular theses. This data can then be loaded into one of Gamera’s classifiers and used for automatic recognition of ‘new’ characters. The output of Gamera is a set of classified glyphs (i.e. small images of characters), stored in an XML-based format.

However, as OCR typically involves translation of images of text into a machine-readable format, a complementary…


1. Introduction
1.2 Background
1.2.1 Project background
1.2.2 The heritage theses
1.2.3 Gamera
1.2.4 The process of digitisation and optical character recognition
Scanning and pre-processing
Segmentation and classification
Page segmentation, translation and output
1.3 Purpose and outline
2. Creation of classifier training data
2.1 Using classifier data
2.2 Broken and touching characters
2.3 Optimisation of training data
3. Page segmentation
3.1 Identification of text in images
3.2 A script for recognition of words
3.3 Modifications on the page segmentation module
3.2.1 Translation into ASCII and Unicode
4. Creation of a script for the OCR process and for a user interface
5. Recognition accuracy
5.1 Results
5.2 Other OCR-software
6. Discussion
6.1 Creation of representative training sets
6.3 Character sizes & isomorphic glyphs
6.4 Prerequisites
7. Future work
7.1 Pre-processing
7.2 Noise
8. Concluding remarks

Author: Sandgren, Frida

Source: Uppsala University Library
