Category: Graduation Committees

Bertold van Voorst: Cluster-based Collection Selection in Uncooperative Distributed Information Retrieval

Almer S. Tigelaar 13 / 07 / 2010, 14:00

Cluster-based Collection Selection in Uncooperative Distributed Information Retrieval
by Bertold van Voorst

Abstract
The focus of this research is collection selection for distributed information retrieval. The collection descriptions that are necessary for selecting the most relevant collections are often created from information gathered by random sampling. Collection selection based on an incomplete index constructed by using random sampling instead of a full index leads to inferior results.

In this research we propose to use collection clustering to compensate for the incompleteness of the indexes. When collection clustering is used, we select not only the collections that are considered relevant based on their collection descriptions, but also collections that have similar content in their indexes. Most existing clustering algorithms require the number of clusters to be specified prior to execution. We describe a new clustering algorithm that allows us to specify the sizes of the produced clusters instead of the number of clusters.

Our experiments show that collection clustering can indeed improve the performance of distributed information retrieval systems that use random sampling. There is not much difference in retrieval performance between our clustering algorithm and the well-known k-means algorithm. We suggest using the algorithm we proposed because it is more scalable.
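
The selection step described in the abstract can be illustrated with a minimal sketch. It assumes that collection scores and a cluster assignment are already available; the function name, the `top_n` parameter and the toy data are hypothetical and are not taken from the thesis, whose clustering algorithm is not reproduced here.

```python
from collections import defaultdict


def select_with_clusters(scores, cluster_of, top_n=2):
    """Pick the top_n collections by score, then add their cluster mates.

    scores:     collection name -> relevance score from its (sampled) description
    cluster_of: collection name -> cluster id (assumed to be computed elsewhere)
    """
    # Rank by descending score; break ties by name for a stable order.
    ranked = sorted(scores, key=lambda c: (-scores[c], c))
    selected = ranked[:top_n]

    # Invert the assignment so each cluster id lists its member collections.
    members = defaultdict(list)
    for coll, cluster in cluster_of.items():
        members[cluster].append(coll)

    # Expand the selection with collections that share a cluster with a pick.
    result = list(selected)
    for coll in selected:
        cluster = cluster_of.get(coll)
        if cluster is None:
            continue  # collection was never clustered; nothing to add
        for mate in sorted(members[cluster]):
            if mate not in result:
                result.append(mate)
    return result


# Toy data: "D" scores zero on its sampled description but shares a cluster with "A".
scores = {"A": 0.9, "B": 0.4, "C": 0.1, "D": 0.0}
cluster_of = {"A": 0, "D": 0, "B": 1, "C": 2}
print(select_with_clusters(scores, cluster_of, top_n=2))  # ['A', 'B', 'D']
```

In this toy case, collection "D" would be missed by score alone and is picked up only because it is clustered with "A", which is the effect the abstract attributes to clustering.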

Koen Lavooij: Near-real Time Statistics Gathered from a Continuous and Voluminous Data Mutation Stream

Almer S. Tigelaar 17 / 02 / 2010, 15:45

Near-real Time Statistics Gathered from a Continuous and Voluminous Data Mutation Stream
by Koen Lavooij

Abstract
The amount of digital data is growing fast. Given the amount of information available, merely providing that information as a service is not enough. To help users find information, supporting systems have been developed that extract specific information from a large amount of stored data.

Finding or extracting interesting information is at least as important as providing the original data. The “collective intelligence” of a large number of users can be used to order the information. The ordered information is of much greater value than the unordered information, because it provides the user with an overview of interesting and less interesting information.
Current database systems are not able to provide ranked information by analyzing a massive amount of user feedback (e.g. clicks) within a short period of time. Therefore, the systems update the answers periodically.

In this thesis, a Stream Processing Engine (SPE) is adapted. The modified SPE accepts a stream of mutations to a virtual data storage as opposed to a stream of tuples. The newly created system exploits the properties of statistical functions in order to efficiently aggregate live statistics over a large stream of mutations.
The newly created system is able to provide answers to a small set of continuous queries. The answers to the queries will be continuously maintained, instead of recalculated. Therefore, the system is able to provide the answers to the continuous queries instantly and with low latency for a large number of users.
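
The idea of maintaining an answer instead of recalculating it can be sketched as follows. This is a minimal illustration, not the thesis's SPE: the class, its method names and the sample mutation stream are assumptions chosen for the example. It relies on the fact that count, sum and mean can be adjusted in constant time for each insert, update or delete.

```python
class RunningStats:
    """Count, sum and mean maintained incrementally under mutations."""

    def __init__(self):
        self.count = 0
        self.total = 0.0

    def insert(self, value):
        # A new record contributes one item and its value.
        self.count += 1
        self.total += value

    def delete(self, value):
        # Removing a record reverses its earlier contribution.
        self.count -= 1
        self.total -= value

    def update(self, old, new):
        # An update only shifts the sum by the difference.
        self.total += new - old

    def mean(self):
        # Read the current answer without rescanning stored data.
        return self.total / self.count if self.count else None


# Hypothetical mutation stream: (operation, arguments...).
stream = [
    ("insert", 4),
    ("insert", 10),
    ("update", 4, 6),
    ("delete", 10),
    ("insert", 2),
]

stats = RunningStats()
for op, *args in stream:
    getattr(stats, op)(*args)

print(stats.count, stats.mean())  # 2 4.0
```

Each mutation costs constant time regardless of how much data has been stored, which is why a continuous query can be answered immediately. Statistics such as the median do not decompose this simply and would need a different structure.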

Kien Tjin-Kam-Jet: Result Merging for Efficient Distributed Information Retrieval

Almer S. Tigelaar 03 / 04 / 2009, 16:00

Result Merging for Efficient Distributed Information Retrieval
by Kien T.E. Tjin-Kam-Jet

Abstract
Centralized Web search has difficulties with crawling and indexing the Visible Web. The Invisible Web is estimated to contain much more content, and this content is even more difficult to crawl.
Metasearch, a form of distributed search, is a possible solution. However, a major problem is how to merge the results from several search engines into a single result list. We train two types of Support Vector Machines (SVMs): a regression model and a preference classification model. Round Robin (RR) is used as our merging baseline. We vary the number of search engines being merged, the selection policy, and the document collection size of the engines. Our findings show that RR is the fastest method and that, in a few cases, it performs as well as regression-SVM. Both SVM methods are much slower and, judging by performance, regression-SVM is the best of all three methods. The choice of which method to use depends strongly on the usage scenario. In most cases, we recommend using regression-SVM.
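
For reference, the Round Robin baseline can be sketched in a few lines. This is a generic version of the technique, not code from the thesis; the function name and sample lists are hypothetical, and dropping duplicate documents is an assumption made for this example.

```python
from itertools import zip_longest


def round_robin_merge(result_lists):
    """Interleave ranked lists: rank 1 of every engine, then rank 2, and so on."""
    merged, seen = [], set()
    # zip_longest pads shorter lists with None so longer lists are not cut off.
    for tier in zip_longest(*result_lists):
        for doc in tier:
            # Assumption: a document returned by several engines is kept once,
            # at the position where it first appears.
            if doc is not None and doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


engine_results = [
    ["d1", "d2", "d3"],  # ranked list from a first engine
    ["d4", "d1"],        # second engine, overlapping on d1
    ["d5"],              # third engine
]
print(round_robin_merge(engine_results))  # ['d1', 'd4', 'd5', 'd2', 'd3']
```

The merge needs no scores and no trained model, which is consistent with RR being the fastest of the three methods.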

Sander Bockting: Collection Selection for Distributed Web Search

Almer S. Tigelaar 16 / 02 / 2009, 15:00

Collection Selection for Distributed Web Search using Highly Discriminative Keys, Query-driven Indexing and PageRank.
by Sander Bockting

Abstract
Current popular web search engines, such as Google, Live Search and Yahoo!, rely on crawling to build an index of the World Wide Web. Crawling is a continuous process to keep the index fresh and generates an enormous amount of data traffic. By far the largest part of the web remains unindexed, because crawlers are unaware of the existence of web pages and they have difficulties crawling dynamically generated content. These problems were the main motivation to research distributed web search.

We assume that web sites, or peers, can index a collection consisting of local content, but possibly also content from other web sites. Peers cooperate with a broker by sending a part of their index. By receiving indices from many peers, the broker gains a global overview of the peers’ content. When a user poses a query to a broker, the broker selects a few peers to which it forwards the query. The selected peers should be those most likely to produce a good result set with many relevant documents. The result sets are merged at the broker and sent to the user. This research focuses on collection selection, which corresponds to the selection of the most promising peers. Highly discriminative keys are used as a strategy to select those peers. A highly discriminative key is a term set which is an index entry at the broker. The key is highly discriminative with respect to the collections because the posting lists pointing to the collections are relatively small. Query-driven indexing is applied to reduce the index size by only storing index entries that are part of popular queries. A PageRank-like algorithm is also tested to assign scores to collections that can be used for ranking.

The Sophos prototype was developed to test these methods. Sophos was evaluated on different aspects, such as collection selection performance and index sizes. The performance of the methods is compared to a baseline that applies language modeling to merged documents in collections. The results show that Sophos can outperform the baseline with ad-hoc queries on a web-based test set. Query-driven indexing is able to substantially reduce index sizes at the cost of a small loss in collection selection performance. We also found large differences in the level of difficulty of answering queries on various corpus splits.
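
A heavily simplified sketch of a broker index built from highly discriminative keys, with an optional query-driven filter, is shown below. It is not the Sophos implementation: the function names, the `max_postings` and `max_key_size` parameters, the filtering rule and the toy peers are all assumptions made for illustration. A term set is kept as a key only when few peers contain it.

```python
from collections import defaultdict
from itertools import combinations


def build_broker_index(peer_terms, max_postings=1, max_key_size=2,
                       popular_queries=None):
    """Map term sets (keys) to the small set of peers that contain them."""
    postings = defaultdict(set)
    for peer, terms in peer_terms.items():
        for size in range(1, max_key_size + 1):
            for key in combinations(sorted(terms), size):
                postings[frozenset(key)].add(peer)

    # Keep only keys with a short posting list, i.e. discriminative ones.
    index = {k: p for k, p in postings.items() if len(p) <= max_postings}

    # Query-driven filter (assumed rule): keep keys contained in a popular query.
    if popular_queries is not None:
        popular = [frozenset(q) for q in popular_queries]
        index = {k: p for k, p in index.items()
                 if any(k <= q for q in popular)}
    return index


def select_peers(index, query_terms):
    """Rank peers by how many index keys are covered by the query."""
    query = frozenset(query_terms)
    hits = defaultdict(int)
    for key, peers in index.items():
        if key <= query:
            for peer in peers:
                hits[peer] += 1
    return sorted(hits, key=lambda p: (-hits[p], p))


# Toy peers; "jazz" and "guitar" alone are too common to discriminate.
peer_terms = {
    "p1": {"jazz", "guitar"},
    "p2": {"jazz", "piano"},
    "p3": {"rock", "guitar"},
}
full_index = build_broker_index(peer_terms)
small_index = build_broker_index(peer_terms,
                                 popular_queries=[{"jazz", "guitar"}])

print(select_peers(full_index, {"jazz", "guitar"}))  # ['p1']
print(len(full_index), len(small_index))             # 5 1
```

In the toy data, the single terms "jazz" and "guitar" each occur at two peers and are dropped, while the pair identifies "p1" uniquely. The query-driven filter shrinks the index from five keys to one while still answering the popular query, which mirrors the trade-off the abstract reports between index size and selection performance.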
