Sander Bockting: Collection Selection for Distributed Web Search

Almer S. Tigelaar 16 / 02 / 2009, 15:00

Collection Selection for Distributed Web Search using Highly Discriminative Keys, Query-driven Indexing and PageRank.
by Sander Bockting

Abstract
Current popular web search engines, such as Google, Live Search and Yahoo!, rely on crawling to build an index of the World Wide Web. Crawling is a continuous process that keeps the index fresh, and it generates an enormous amount of data traffic. By far the largest part of the web remains unindexed, because crawlers are unaware of the existence of many web pages and have difficulty crawling dynamically generated content. These problems were the main motivation for researching distributed web search.

We assume that web sites, or peers, can index a collection consisting of local content, and possibly also content from other web sites. Peers cooperate with a broker by sending it a part of their index. By receiving indices from many peers, the broker gains a global overview of the peers' content. When a user poses a query to a broker, the broker selects a few peers to which it forwards the query. The selected peers should be likely to produce a good result set with many relevant documents. The result sets are merged at the broker and sent to the user.

This research focuses on collection selection, which corresponds to the selection of the most promising peers. Highly discriminative keys are used as a strategy to select those peers. A highly discriminative key is a term set that is stored as an index entry at the broker. The key is highly discriminative with respect to the collections because the posting list pointing to the collections is relatively small. Query-driven indexing is applied to reduce the index size by storing only index entries that are part of popular queries. A PageRank-like algorithm is also tested to assign scores to collections that can be used for ranking.
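The broker-side idea described above can be illustrated with a small sketch. This is not the Sophos implementation: the function names (`term_sets`, `build_broker_index`, `select_peers`), the thresholds `max_size` and `max_postings`, the toy peers, and the scoring rule (summing the sizes of matching keys) are all assumptions made for illustration. The sketch builds term-set keys per peer, keeps only keys whose posting list is small (highly discriminative) and that occur in a popular query (query-driven indexing), and then ranks peers for a new query.

```python
from collections import defaultdict
from itertools import combinations


def term_sets(terms, max_size):
    """Yield every set of up to max_size distinct terms as a frozenset."""
    unique = sorted(set(terms))
    for size in range(1, max_size + 1):
        for combo in combinations(unique, size):
            yield frozenset(combo)


def build_broker_index(peers, popular_queries, max_size=2, max_postings=2):
    """Build a broker index that maps term-set keys to the peers holding them.

    Assumption: a key counts as highly discriminative when at most
    max_postings peers hold a document containing all of its terms.
    """
    postings = defaultdict(set)
    for peer, documents in peers.items():
        for doc in documents:
            for key in term_sets(doc, max_size):
                postings[key].add(peer)

    # Query-driven indexing: store only keys that are part of a popular query.
    wanted = {key for query in popular_queries
              for key in term_sets(query, max_size)}

    return {key: plist for key, plist in postings.items()
            if key in wanted and len(plist) <= max_postings}


def select_peers(index, query, k=2, max_size=2):
    """Rank peers by the total size of the indexed keys that match the query."""
    scores = defaultdict(int)
    for key in term_sets(query, max_size):
        for peer in index.get(key, ()):
            scores[peer] += len(key)  # hypothetical rule: longer keys weigh more
    ranked = sorted(scores, key=lambda peer: (-scores[peer], peer))
    return ranked[:k]


# Hypothetical toy data: each peer holds documents given as lists of terms.
peers = {
    "peer_a": [["distributed", "web", "search"], ["web", "crawler"]],
    "peer_b": [["web", "design"], ["search", "engine"]],
    "peer_c": [["web", "hosting"]],
}
popular_queries = [["distributed", "search"], ["web", "search"]]

index = build_broker_index(peers, popular_queries)
selected = select_peers(index, ["web", "search"])
```

In this toy setup the single term "web" occurs at all three peers, so it exceeds the posting-list threshold and is not stored, while the pair "web" and "search" points to only one peer and is kept. The pair "distributed" and "web" is discriminative as well, but it is dropped because no popular query contains it.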

The Sophos prototype was developed to test these methods. Sophos was evaluated on several aspects, such as collection selection performance and index size. The performance of the methods is compared to a baseline that applies language modeling to the merged documents in each collection. The results show that Sophos can outperform the baseline on ad-hoc queries on a web-based test set. Query-driven indexing substantially reduces index sizes at the cost of a small loss in collection selection performance. We also found large differences in how difficult queries are to answer across various corpus splits.
