Efficient Approximate String Matching with Synonyms and Taxonomies
Strings are ubiquitous. When being collected from various sources, strings are often inconsistent, which means that they can have the same or similar meaning expressed in different forms, such as with typographical mistakes. Finding similar strings given such inconsistent datasets has been researched extensively during past years under an umbrella problem called approximate string matching.
This thesis aims to enhance the quality of the approximate string matching by detecting similar strings using their meanings besides typographical errors. Specifically, this thesis focuses on utilising synonyms and taxonomies, since both are commonly available knowledge sources. This research is to use each type of knowledge to address either a selection or join tasks, where the first task aims to find strings similar to a given string, and the second task is to find pairs of strings that are similar. The desired output is either all strings similar to a given extent (i.e., all-match) or the top-k most similar strings.
The first contribution of this thesis is to address the top-k selection problem considering synonyms. Here, we propose algorithms with different optimisation goals: to minimise the space cost, to maximise the selection speed, or to maximise the selection speed under a space constraint. We model the last goal as a variant of an 0/1 knapsack problem and propose an efficient solution based on the branch and bound paradigm.
Next, this thesis solves the top-k join problem considering taxonomy relations. Three algorithms, two based on sorted lists and one based on tries, are proposed, in which we use pre-computations to accelerate list scan or use predictions to eliminate unnecessary trie accesses. Experiments show that the trie-based algorithm has a very fast response time on a vast dataset.
The third contribution of this thesis is to deal with the all-match join problem considering taxonomy relations. To this end, we identify the shortcoming of a standard prefix filtering principle and propose an adaptive filtering algorithm that is tuneable towards the minimised join time. We also design a sampling-based estimation procedure to suggest the best parameter in a short time with high accuracy.
Lastly, this thesis researches the all-match join task by integrating typographical errors, synonyms, and taxonomies simultaneously. Key contributions here include a new unified similarity measure that employs multiple measures, as well as a non-trivial approximation algorithm with a tight theoretical guarantee. We furthermore propose two prefix filtering principles: a fast heuristic and accurate dynamic programming, to strive for the minimised join time.
Availability of the dissertation
An electronic version of the doctoral dissertation is available on the e-thesis site of the University of Helsinki at http://urn.fi/URN:ISBN:978-951-51-6988-4. The slide deck of Pengfei's lectio praecursoria is available here.