Information retrieval and mining in high dimensional databases
Department of Computer and Information Science
Doctor of Philosophy
Wang, Jason T. L.
McHugh, James A.
Shih, Frank Y.
Image processing --Digital techniques.
This dissertation is composed of two parts. In the first part, we present a framework for finding information (more precisely, active patterns) in three dimensional (3D) graphs. Each node in a graph is an undecoraposable or atomic unit and has a label. Edges are links between the atomic units. Patterns are rigid substructures that may occur in a graph after allowing for an arbitrary number of whole-structure rotations and translations as well as a small number (specified by the user) of edit operations in the patterns or in the graph. (When a pattern appears in a graph only after the graph has been modified, we call that appearance "approximate occurrence.") The edit operations include relabeling a node, deleting a node and inserting a node. The proposed method is based on the geometric hashing technique, which hashes node-triplets of the graphs into a 3D table and compresses the label-triplets in the table. To demonstrate the utility of our algorithms, we discuss two applications of them in scientific data mining. First, we apply the method to locating frequently occurring motifs in two families of proteins pertaining to RNA-directed DNA Polymerase and Thymidylate Synthase, and use the motifs to classify the proteins. Then we apply the method to clustering chemical compounds pertaining to aromatic, bicyclicalkanes and photosynthesis. Experimental results indicate the good performance of our algorithms and high recall and precision rates for both classification and clustering. We also extend our algorithms for processing a class of similarity queries in databases of 3D graphs.
In the second part of the dissertation, we present an index structure, called MetricMap, that takes a set of objects and a distance metric and then maps those objects to a k-dimensional pseudo-Euclidean space in such a way that the distances among objects are approximately preserved. Our approach employs sampling and the calculation of eigenvalues and eigenvectors. The index structure is a useful tool for clustering and visualization in data intensive applications, because it replaces expensive distance calculations by sum-of-square calculations. This can make clustering in large databases with expensive distance metrics practical.
We compare the index structure with another data mining index structure, FastMap, proposed by Faloutsos and Lin, according to two criteria: relative error and clustering accuracy. For relative error, we show that (i) FastMap gives a lower relative error than MetrieMap for Euclidean distances, (ii) MetricMap gives a lower relative error than Fast Map for non-Euclidean distances (i.e., general distance metrics), and (iii) combining the two reduces the error yet further. A similar result is obtained when comparing the accuracy of clustering. These results hold for different data sizes. The main qualitative conclusion is that these two index structures capture complenleiltary information about distance metrics and therefore can be used together to great benefit. The net effect is that multi-day computations can be done in minutes.
We have implemented the proposed algorithms and the MetricMap index structure into a toolkit. This toolkit will be useful for data mining, visualization, and approximate retrieval in scientific, multimedia and high dimensional databases.
njit-etd2000-092 (179 pages ~ 7,980 KB pdf)
Please complete this Feedback Form to inform us about your experience using this website. It will assist us in better serving your information needs in the future. Thank You!
Created June 19, 2007