Shard Ranking and Cutoff Estimation for Topically Partitioned Collections


Kulkarni, A. & Tigelaar, A.S. & Hiemstra, D. & Callan, J.
In Proceedings of CIKM 2012, Maui, Hawaii, United States of America.


Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.

Presented by Anagha Kulkarni at the 2012 Conference on Information and Knowledge Management.

Peer-to-Peer Information Retrieval


The following is a brief summary of my PhD thesis.

The Internet has become an integral part of our daily lives. However, the essential task of finding information is dominated by a handful of large centralised search engines. In this thesis we study an alternative to this approach. Instead of using large data centres, we propose using the machines that we all use every day: our desktop, laptop and tablet computers, to build a peer-to-peer web search engine. We provide a definition of the associated research field: peer-to-peer information retrieval. We examine what separates it from related fields, give an overview of the work done so far and provide an economic perspective on peer-to-peer search. Furthermore, we introduce our own architecture for peer-to-peer search systems, inspired by BitTorrent.

Distributing the task of providing search results for queries introduces the problem of query routing: a query needs to be sent to a peer that can provide relevant search results. We investigate how the content of peers can be represented so that queries can be directed to the best ones in terms of relevance. While cooperative peers can provide their own representation, the content of uncooperative peers can be accessed only through a search interface, so they cannot actively provide a description of themselves. We look into representing these uncooperative peers by probing their search interface to construct a representation. Finally, the capacity of the machines in peer-to-peer networks differs considerably, making it challenging to provide search results quickly. To address this, we present an approach where copies of search results for previous queries are retained at peers and used to serve future requests, and we show that participation can be incentivised using reputations.

There are still problems to be solved before a real-world peer-to-peer web search engine can be built. This thesis provides a starting point for this ambitious goal and also provides a solid basis for reasoning about peer-to-peer information retrieval systems in general.

Designing a Thesis

A while ago I spent quite some time researching the best options for designing my thesis. I used ideas from various sources, and in this brief article I will explain some of the choices I made, which will hopefully be useful for those who still need to complete their own thesis. Many of these insights are part of the excellent classic thesis style.

1) Tools
Before you start writing, you should pick your tools. I used LyX, a LaTeX front-end, to typeset my thesis. I have been using it for years, also for my publications, and have grown used to it. It’s stable and easy to use for beginners. Unlike with plain LaTeX, you don’t have to spend a lot of time memorizing arcane codes, which really is unnecessary anyway in a time where graphical user interfaces dominate. Of course, you still have the power of LaTeX underneath, which is nice, especially for more sophisticated typesetting tweaks.

2) Fonts
With regard to the document content, one of the first choices that you should make is that of the fonts you want to use. Although a particular font is never really right or wrong, LaTeX shields you from making really bad choices here, unlike for example Microsoft Word. There are three categories of fonts you will need to choose: a serif font, a sans-serif font and a mono-space font.


The serif font, sometimes termed roman font, is the most important. You probably remember the lined paper on which you learned to write. Those lines were not only there to force you to write on them, but also to guide your eyes. A serif font has subtle strokes on each character: when you view a page with serif text from a distance, and squint a bit, you will see that these strokes form ‘virtual’ lines as well. Hence, serifs aid reading by preventing blocks of text from looking ‘wobbly’. This is the reason why running text is usually set in a serif font. However, times are changing, and more blocks of text are being set in sans-serif these days.

The sans-serif font is, as the name implies, without serifs. Such a font is often used in glossy magazines, and has a cleaner, less cluttered, look. This also makes it more suitable for computer screens, because these typically have a lower display resolution, which can make serifs look ugly. In a thesis, it is used primarily for chapter and section titles. Alternatively, you can choose to use a serif font with ‘small caps’ for titles. This gives a more classical look, whereas sans-serif fonts give your text a more modern feel. Sans-serifs should generally not be used for large blocks of text, unless you really want that and know what you are doing.

The mono-space, or typewriter, font is normally used in places where each character needs to take up the same amount of space, for readability. The best example is a listing of computer code. However, mono-spaced text is not as comfortable to read as text set in a serif or sans-serif font. Since the space each character occupies needs to be equal, visual readability aids, like ligatures and kerning, cannot be used. Use mono-spaced text conservatively.

My choices were Palatino as serif, using small caps for titles, and Bera Mono for mono-space. A good overview of fonts for LaTeX can be found here. Make sure that, besides the font sizes, you choose other settings optimally for the fonts you pick, for example: for Palatino a slightly higher line-spread is better for readability, and Bera Mono needs to be scaled down in order to properly complement Palatino. Also, for the PDF output consider using microtype, which allows you to fine-tune settings such as protrusion and expansion.
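A minimal preamble sketch of these font choices. It assumes the mathpazo and beramono packages as the way to get Palatino and Bera Mono; the line spread and scale factor are illustrative values, not necessarily the settings used for the thesis itself:

```latex
\documentclass{book}
\usepackage[T1]{fontenc}            % T1 font encoding, needed by Bera Mono
\usepackage[sc]{mathpazo}           % Palatino as the serif font, with real small caps
\linespread{1.05}                   % slightly more line spread for Palatino (assumed value)
\usepackage[scaled=0.85]{beramono}  % Bera Mono, scaled down to sit next to Palatino (assumed value)
\usepackage{microtype}              % character protrusion and font expansion for PDF output

\begin{document}
Running text in Palatino, \textsc{a title in small caps},
and \texttt{code in Bera Mono}.
\end{document}
```

Compile a test page with a paragraph that mixes all three styles and adjust the scale factor until the mono-space text no longer looks larger than the surrounding serif text.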

As a final font tip: try to avoid using bold text where you can, particularly in running text, as this draws unnecessary visual attention in printed matter. Bold text in print is the typographic equivalent of a ‘blinking’ element on a web page: annoying. If you want to emphasize something use italic instead. Bold text is okay for titles, but I’d avoid it even there if possible.

3) Page Lay-out
Consult with your printer to see what type of output they want to have. It is common in the Netherlands to print a thesis in B5. While ISO B5 is 176 mm × 250 mm, some printers use other variants of B5 with slightly different dimensions, so ask for the exact trim size. Since you are making a book, make sure to select a double-sided lay-out and ask what the binding correction should be: this is an offset that pushes the page content slightly to the left on left pages and slightly to the right on right pages, so that the pages look optically centered once they are bound. Also double-check the page margins. LaTeX chooses very liberal margins by default, which you may want to reduce in order to use the available space more effectively.
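As a sketch, the page set-up could be expressed with the geometry package. All measurements below are placeholder assumptions; take the real values from your printer. Note that geometry's b5paper option means ISO B5 (176 mm × 250 mm); if your printer uses another variant, replace it with explicit paperwidth and paperheight values.

```latex
\documentclass[twoside]{book}   % double-sided lay-out
\usepackage[
  b5paper,            % ISO B5; swap for paperwidth=..., paperheight=... if the printer's B5 differs
  twoside,            % mirror the inner and outer margins on left and right pages
  bindingoffset=5mm,  % binding correction (assumed value, ask the printer)
  inner=18mm,         % margin on the spine side (assumed value)
  outer=22mm,         % margin on the outside edge (assumed value)
  top=20mm,
  bottom=25mm
]{geometry}

\begin{document}
Some text to check the margins on a test print.
\end{document}
```

Printing two facing pages and folding them together is a quick way to check whether the content looks centered with the chosen offset.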

Another point of attention with respect to page lay-out is where in the text a figure or table is mentioned, and where it really is in the document. LaTeX has a number of placement rules for this, which can be overridden. A good automatic result is usually obtained by placing the text that refers to a figure or table directly ‘below’ it in your TeX file, but in some cases you may need to override this placement. Keep in mind that you are working with a double-sided lay-out, which gives a bit more placement freedom, as people continuously see two pages in your document simultaneously.
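A small sketch of this ordering in the source file. The black box stands in for a real graphic, and the label name is made up for the example:

```latex
\documentclass[twoside]{book}

\begin{document}
\chapter{Example}

% The float comes first in the source ...
\begin{figure}[tbp]  % allow the top of a page, the bottom, or a separate float page
  \centering
  \rule{0.6\textwidth}{3cm}  % black box standing in for the actual graphic
  \caption{A placeholder figure.}
  \label{fig:placeholder}
\end{figure}

% ... and the text that refers to it follows directly below.
Figure~\ref{fig:placeholder} shows a placeholder.
\end{document}
```

If the automatic result is not acceptable, the placement specifier can be changed, for example to [!t] to push harder for the top of a page.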

Try to keep the number of color pages you have as low as you can, as these are expensive when you print your thesis. Restrict it to graphs or pages where a strong visual aesthetic matters.

4) Table of Contents
As a general rule: do not include more than two ‘levels’ in your table of contents: chapter and section. So, no subsections or subsubsections. Besides, if you need numbered subsubsections, you may want to consider restructuring your text entirely: perhaps the parent section should be a separate chapter instead.

There is a fair number of people that align all the titles to the left and the page numbers to the right. It doesn’t really make sense to do this (what are you going to do: add up the page numbers? really?). The page numbers are there to help the reader, and hence should be placed directly behind the titles. This also removes the need for the visual horror of thick or dotted horizontal lines in the table of contents. Finally, consider adding the bibliography as an item in the table of contents, since this is quite an important part of a scientific work.
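A sketch of the two-level rule and the bibliography entry, assuming the book class and the tocbibind package. The chapter, section and reference are placeholders, and moving the page numbers directly behind the titles is not covered here:

```latex
\documentclass{book}
\setcounter{tocdepth}{1}        % book class: 0 = chapter, 1 = section; deeper levels are left out
\usepackage[nottoc]{tocbibind}  % add the bibliography to the table of contents, but not the contents itself

\begin{document}
\tableofcontents

\chapter{Introduction}
\section{Motivation}
\subsection{Not listed in the table of contents}
Some text citing a placeholder entry~\cite{placeholder}.

\begin{thebibliography}{9}
\bibitem{placeholder} A. Author. \emph{A placeholder reference}.
\end{thebibliography}
\end{document}
```

By default tocbibind also adds the lists of figures and tables and the index when present; its notlof, notlot and notindex options switch those off.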


5) Chapter Openings
The convention is to put chapter opening pages on the right side of your book. Take special care of the opening page of each chapter. For LaTeX, there are many packages that can help you make these look more visually appealing, such as fncychap. The main rule is: keep it simple, less is more. Some people use a separate page for the start of new chapters, with only the heading, which can also be quite visually pleasing.
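A minimal sketch with fncychap. The Sonny style is picked only for illustration; the package ships several predefined styles and which one looks best is a matter of taste:

```latex
\documentclass[openright]{book}  % chapters open on right-hand pages (the default for book, stated explicitly)
\usepackage[Sonny]{fncychap}     % one of the predefined chapter heading styles (illustrative choice)

\begin{document}
\chapter{Introduction}
The first paragraph of the chapter.
\end{document}
```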

6) Tables
This is probably the most often abused visual element in any document. There are two important things to keep in mind. Firstly, use tables only for listing things structurally; that’s what they are for. In all other cases, use figures. Secondly, please do not use vertical lines in your table. If you need vertical lines, it’s not a table, it’s a figure. As a small visual test: create a table in your favorite word processor or spreadsheet and compare how it looks with only horizontal lines and with both horizontal and vertical lines. You will find that using only horizontal lines makes the table easier to read. Even when using only horizontal lines, use as few as possible and focus instead on properly aligning the data that you are presenting, which alleviates the need for lines in many cases. As a general rule you should use a line above and below the first row, and below the last row. The outside lines may be slightly bolder than the other lines. Tables may look better when they span the entire page width, but this depends on the content.
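One way to get exactly these three lines is the booktabs package, which is an assumption here rather than something named above. Its top and bottom rules are slightly heavier than the middle rule by default. The rows simply restate the font choices from the Fonts section:

```latex
\documentclass{book}
\usepackage{booktabs}  % \toprule, \midrule and \bottomrule

\begin{document}
\begin{table}
  \caption{Fonts by role.}  % table caption goes above the table
  \centering
  \begin{tabular}{ll}       % two left-aligned columns, no vertical lines
    \toprule
    Role          & Font                \\
    \midrule
    Running text  & Palatino            \\
    Titles        & Palatino small caps \\
    Code listings & Bera Mono           \\
    \bottomrule
  \end{tabular}
\end{table}
\end{document}
```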


7) Figures
When you include any figure in your document, really any figure, use a scalable (vector) format where possible. In LaTeX the most obvious choice for this is Encapsulated PostScript (EPS). If you must include a non-scalable (raster) image, avoid lossy formats like JPEG (use PNG instead). Also: include high resolution images. The reason for all this? Many theses include low resolution non-scalable graphics. Unfortunately, this looks horrible when printed: blocky and pixelated. Either your graphics need to have a higher resolution than the resolution used for the print (typically 300 ppi), or you need to use scalable (vector) graphics (which look optimal regardless of the printer’s ppi). A nice, but costly, way to convert raster to vector is by taking the raster image and drawing over it with a vector graphics tool.

8) Captions
Whether you use hanging or non-hanging captions for your tables and figures is a personal choice. The same goes for bold caption text (I’d personally try to avoid that). By convention, captions for tables are always placed above the table, and captions for figures are always placed below the figure. Try to keep your table captions as short as possible (avoid multi-line captions if you can). It’s visually nicer to have more elaborate text below the element you are presenting. Hence, for figures this type of text can go directly in the caption.

I hope these tips will help you design a better thesis in the short run and help you produce more visually appealing texts in the future.

Peer-to-Peer Information Retrieval: An Overview

Tigelaar, A.S. & Hiemstra, D. & Trieschnigg, D.
ACM Transactions on Information Systems, Volume 30, Issue 2, 2012, ISSN 1046-8188, (pp. 9:1-9:34).


Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these has seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralized solutions. In this article we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralized client-server solutions in terms of scalability, performance, user satisfaction, and freedom.

Search Result Caching in Peer-to-Peer Information Retrieval Networks

Tigelaar, A. S. & Hiemstra, D. & Trieschnigg, D.
In Proceedings of IRFC 2011, Vienna, Austria (pp. 134-148).


For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.

Presented at the Information Retrieval Facility Conference 2011 on June 6th 2011 in Vienna, Austria.