Automatic document classification and extraction system (ADoCES)
Department of Computer and Information Science
Doctor of Philosophy
Computer and Information Science
Ng, Peter A.
Thomas, Gary L.
Kurfess, Franz J.
Curtis, Ronald S.
Data structures (Computer science).
Information storage and retrieval systems.
Document processing is a critical element of office automation. Document image processing begins with the Optical Character Recognition (OCR) phase, followed by more complex processing for document classification and extraction. Document classification assigns an incoming document to one of a set of predefined document types. Document extraction pulls the information pertinent to users from the content of a document and assigns it as the values of the “logical structure” of the document type. After classification and extraction, a paper document is therefore represented not by its original image file but by a digital form called a frame instance. A frame instance is an operable and efficient form that can be processed and manipulated during document filing and retrieval. This dissertation describes a system that supports the complete procedure, from scanning a paper document into the system to producing an effective digital form of the original document. It is a general-purpose system with “learning” ability, so it can be adapted easily to many application domains.
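As a rough illustration of the frame instance idea described above, the following minimal Python sketch models a document type's logical structure as a template and a frame instance as a set of values assigned to that template's attributes. The class names, the "memo" type, its attribute names, and the sample values are all hypothetical, chosen only for illustration; the abstract does not specify how ADoCES implements frames.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FrameTemplate:
    """Hypothetical logical structure of one document type."""
    doc_type: str
    attributes: tuple


@dataclass
class FrameInstance:
    """Hypothetical digital form of one classified document."""
    template: FrameTemplate
    values: dict = field(default_factory=dict)

    def fill(self, attribute, value):
        # Only attributes declared by the template may receive values.
        if attribute not in self.template.attributes:
            raise KeyError(f"{attribute!r} is not part of {self.template.doc_type}")
        self.values[attribute] = value

    def missing(self):
        # Attributes that extraction has not yet assigned, in template order.
        return [a for a in self.template.attributes if a not in self.values]


# Assumed example: a "memo" document type with four attributes.
memo = FrameTemplate("memo", ("sender", "receiver", "date", "subject"))
instance = FrameInstance(memo)
instance.fill("sender", "A. Author")
instance.fill("subject", "Budget review")
```

In this sketch, classification would correspond to choosing the template and extraction to the calls to `fill`; `missing()` reports which attributes still have no value.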
In this dissertation, the “logical closeness” segmentation method is proposed. A novel representation of document layout structure, the Labeled Directed Weighted Graph (LDWG), is described, along with a methodology for transforming a document segmentation into its LDWG representation. To find a match between two LDWGs, their string representations are matched first instead of comparing the graphs directly, which reduces the time needed for the comparison. Applying artificial intelligence techniques, the system is able to learn from experience and build sample LDWGs to represent each document type. In addition, the concept of frame templates is used to represent the logical structure of documents. The concept of the Document Type Hierarchy (DTH) is also enhanced to express the hierarchical relations among the logical structures of documents.
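The idea of matching string representations before comparing graphs can be sketched as follows. This is a simplified illustration under stated assumptions, not the dissertation's method: the `LDWG` class, the block labels, the edge weights, and the canonical string encoding are all hypothetical, and exact string equality stands in for whatever matching criterion the system actually applies.

```python
from dataclasses import dataclass, field


@dataclass
class LDWG:
    """Hypothetical labeled directed weighted graph over layout blocks."""
    labels: dict = field(default_factory=dict)  # node id -> block label
    edges: dict = field(default_factory=dict)   # (src, dst) -> weight

    def add_node(self, node, label):
        self.labels[node] = label

    def add_edge(self, src, dst, weight):
        self.edges[(src, dst)] = weight

    def to_string(self):
        # Assumed encoding: one "label>label:weight" token per edge,
        # sorted so that node ids and insertion order do not matter.
        parts = sorted(
            f"{self.labels[s]}>{self.labels[d]}:{w}"
            for (s, d), w in self.edges.items()
        )
        return "|".join(parts)


def candidate_match(a, b):
    # Cheap string comparison used as a filter before any graph comparison.
    return a.to_string() == b.to_string()


# A stored sample layout and an incoming layout with different node ids.
sample = LDWG()
for node, label in [(0, "header"), (1, "body"), (2, "signature")]:
    sample.add_node(node, label)
sample.add_edge(0, 1, 2)
sample.add_edge(1, 2, 1)

incoming = LDWG()
for node, label in [("c", "signature"), ("a", "header"), ("b", "body")]:
    incoming.add_node(node, label)
incoming.add_edge("b", "c", 1)
incoming.add_edge("a", "b", 2)
```

Because several blocks may share a label, equal strings in this encoding do not prove that two graphs are identical; the string test only narrows the set of candidates cheaply, which is the time saving the abstract refers to.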
njit-etd1999-074 (165 pages ~ 6,600 KB pdf)
Created August 5, 2008