Top-k string auto-completion with synonyms

Keyword searching is a ubiquitous activity performed by millions of users daily. However, cognitively formulating and physically typing search queries is a time-consuming and error-prone process.

In response, keyword search engines have widely adopted auto-completion as a means of reducing the efforts required to submit a query. As users enter their query into the search box, auto-completion suggests possible queries the user may have in mind.

Example of usage scenario

Challenges

The existing solutions of auto-completion provide the suggestions based on the beginning of the current input character sequence (i.e. prefix). Although this approach provides satisfactory auto-completion in many cases, it is far from optimal since it fails to take into account the semantic of users' input characters. There are many practical applications where syntactically different strings can represent the same real-world object. For example, Bill is a short form of William and Database Management Systems can be abbreviated as DBMS. However, these equivalence information suggests semantically similar strings that may have been missed by simple prefix based approaches. In this project, we expose these relations between different strings, to support efficient top-k completion queries with synonyms for different space and time complexity trade-offs.

Further reading

You may find our research paper at https://arxiv.org/abs/1611.03751.

Contributions

TWIN TRIES (TT): Two tries are constructed to present strings and synonym rules respectively in order to minimize the space occupancy. Each trie is a compact data structure, where the children of each node are ordered by the highest score among their respective descendants. Applicable synonym rules are indicated by pointers between two tries. An efficient top-k algorithm is developed to search both tries to find the synonym rules.

EXPANSION TRIE (ET): A fast lookup-optimized solution by integrating synonym rules with the corresponding strings. Unlike TT, ET uses a single expended trie to represent both synonym and string rules. Therefore, by efficiently traversing this trie, ET is faster than TT to provide top-k completions. Meanwhile, ET often takes larger space overhead than TT, because ET needs to expand the strings with their applicable rules.

HYBRID TRIES (HT): An optimized structure to strike a good balance between space and time cost for TT and ET. We try to find a balance between lookup speed and space cost by judiciously select part of synonym rules to expand the strings. We show that given a predefined space constraint, the optimal selection of synonym rules is NP-hard, which can be reduced to a 0/1 knapsack problem with item interactions. We provide an empirically efficient heuristic algorithm by extending the branch and bound algorithm.

Download

The top-k auto-completion tool is released as Java source code with corresponding binary executable files for convenience. Note that the binary file requires at least JRE 8 to run.

Version Date Download link
0.1.1 20161129 Source Code Binary Releases and Sample Dataset

Usage:
topk SEARCH_STRING [K] [OPTION]

Perform top-K auto-completions for SEARCH_STRING using TRIE structure, the results are from DICT_FILE in respect of synonyms in SYN_FILE.

All parameters:

Parameter Default Value Comments
SEARCH_STRING The search string.
The maximum number of results returned.
-d /path/to/dict/file "dict.txt" List of dictionary strings with scores.
s /path/to/synonym/file "rule.txt" List of synonym rules.
-t TRIE "ET" The trie structure used in this search. Can be one of TT, ET or HT. If you choose HT, you may want to specify BUDGET.
-b BUDGET 5000 A number indicates the additional space (in bytes) budget can be occupied by HT.

Examples:

// Top-10 lookup for "Intl. Conf"
topk "Intl. Conf"

// Top-5 lookup for "Intl. Conf"
topk "Intl. Conf" 5

// Top-5 lookup for "Intl. Conf" using TT with "dict.txt" and "rule.txt"
topk "Intl. Conf" 5 -d dict.txt -s rule.txt -t TT

// Top-5 lookup for "Intl. Conf" using HT with budget equals 5000
topk "Intl. Conf" 5 -t TT -b 5000