Shard Ranking and Cutoff Estimation for Topically Partitioned Collections
Kulkarni, A. & Tigelaar, A.S. & Hiemstra, D. & Callan, J.
In Proceedings of CIKM 2012, Maui, Hawaii, United States of America.
Abstract
Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.
Presented by Anagha Kulkarni at the 2012 Conference on Information and Knowledge Management.