Shard Ranking and Cutoff Estimation for Topically Partitioned Collections


Shard Ranking and Cutoff Estimation for Topically Partitioned Collections
Kulkarni, A. & Tigelaar, A.S. & Hiemstra, D. & Callan, J.
In Proceedings of CIKM 2012, Maui, Hawaii, United States of America.

View in Repository

Large document collections can be partitioned into topical shards to facilitate distributed search. In a low-resource search environment only a few of the shards can be searched in parallel. Such a search environment faces two intertwined challenges. First, determining which shards to consult for a given query: shard ranking. Second, how many shards to consult from the ranking: cutoff estimation. In this paper we present a family of three algorithms that address both of these problems. As a basis we employ a commonly used data structure, the central sample index (CSI), to represent the shard contents. Running a query against the CSI yields a flat document ranking that each of our algorithms transforms into a tree structure. A bottom up traversal of the tree is used to infer a ranking of shards and also to estimate a stopping point in this ranking that yields cost-effective selective distributed search. As compared to a state-of-the-art shard ranking approach the proposed algorithms provide substantially higher search efficiency while providing comparable search effectiveness.

Presented by Anagha Kulkarni at the 2012 Conference on Information and Knowledge Management.

Peer-to-Peer Information Retrieval


The following is a brief summary of my PhD thesis. You may download the thesis by clicking here.

The Internet has become an integral part of our daily lives. However, the essential task of finding information is dominated by a handful of large centralised search engines. In this thesis we study an alternative to this approach. Instead of using large data centres, we propose using the machines that we all use every day: our desktop, laptop and tablet computers, to build a peer-to-peer web search engine. We provide a definition of the associated research field: peer-to-peer information retrieval. We examine what separates it from related fields, give an overview of the work done so far and provide an economic perspective on peer-to-peer search. Furthermore, we introduce our own architecture for peer-to-peer search systems, inspired by BitTorrent.

Distributing the task of providing search results for queries introduces the problem of query routing: a query needs to be sent to a peer that can provide relevant search results. We investigate how the content of peers can be represented so that queries can be directed to the best ones in terms of relevance. While cooperative peers can provide their own representation, the content of uncooperative peers can be accessed only through a search interface and thus they can not actively provide a description of themselves. We look into representing these uncooperative peers by probing their search interface to construct a representation. Finally, the capacity of the machines in peer-to-peer networks differs considerably, making it challenging to provide search results quickly. To address this, we present an approach where copies of search results for previous queries are retained at peers and used to serve future requests and show participation can be incentivised using reputations.

There are still problems to be solved before a real-world peer-to-peer web search engine can be built. This thesis provides a starting point for this ambitious goal and also provides a solid basis for reasoning about peer-to-peer information retrieval systems in general.

Designing a Thesis

A while ago I spent quite some time to research the best options for designing my thesis. I used ideas from various sources, and in this brief article I will explain some of the choices I made, which will hopefully be useful for those that still need to complete their own thesis. Many of these wisdoms are part of the excellent classic thesis style.

1) Tools
Before you start writing, you should pick your tools. I used LyX to typeset my thesis, which is a LaTeX front-end. I have been using it for years, also for my publications, and have grown used to it. It’s stable, and easy to use for beginners. Unlike LaTeX, you don’t have to spend a lot of time memorizing arcane codes, which really is unnecessary anyway in a time where graphical user interfaces dominate. Of course: you still have the power of LaTeX underneath, which is nice, especially for more sophisticated typesetting tweaks.

2) Fonts
With regard to the document content, one of the first choices that you should make is that of the fonts you want to use. Although a particular font is never really right or wrong, LaTeX shields you from making really bad choices here, unlike for example Microsoft Word. There are three categories of fonts you will need to choose: a serif font, a sans-serif font and a mono-space font.


The serif font, sometimes termed roman font, is the most important. You probably remember the lined paper on which you learned to write. Those lines were not only there to force you to write on them, but also to guide your eyes. A serif font has subtle strokes on each character: when you view a page with serif text from a distance, and squint your eyes a bit, you will see that these strokes form ‘virtual’ lines as well. Hence, serifs aid reading by preventing blocks of text from looking ‘wobbly’. This is the reason why most running text is usually set in a serif font. However, times are changing, and more blocks of text are being set in sans-serif these days.

The sans-serif font is, as the name implies, without serifs. Such a font is often used in glossy magazines, and has a cleaner, less cluttered, look. This also makes it more suitable for computer screens, because these typically have a lower display resolution, which can make serifs look ugly. In a thesis, it is used primarily for chapter and section titles. Alternatively, you can choose to use a serif font with ‘small caps’ for titles. This gives a more classical look, whereas sans-serif fonts give your text a more modern feel. Sans-serifs should generally not be used for large blocks of text, unless you really want that and know what you are doing.

The mono-space, or typewriter, font is normally used in places where each character needs to take up the same amount of space, for readability. The best example is a listing of computer code. However, mono-spaced text is not as comfortable to read as text set in a serif or sans-serif font. Since the space each character occupies needs to be equal, visual readability aids, like ligatures and kerning, can not be used. Use mono-spaced text conservatively.

My choices were Palatino as serif, using small caps for titles, and Bera Mono for mono-space. A good overview of fonts for LaTeX can be found here. Make sure that, besides the font sizes, you choose other settings optimally for the fonts you pick, for example: for Palatino a slightly higher line-spread is better for readability, and Bera Mono needs to be scaled down in order to properly complement Palatino. Also, for the PDF output consider using microtype, which allows you to fine-tune settings such as protrusion and expansion.

As a final font tip: try to avoid using bold text where you can, particularly in running text, as this draws unnecessary visual attention in printed matter. Bold text in print is the typographic equivalent of a ‘blinking’ element on a web page: annoying. If you want to emphasize something use italic instead. Bold text is okay for titles, but I’d avoid it even there if possible.

3) Page Lay-out
Consult with your printer to see what type of output they want to have. It is common in the Netherlands to print a thesis in B5. While that is 240mmx180mm by default, some printers use other variants of B5 with slightly different dimensions. Since you are making a book: make sure to select a double-sided lay-out and ask what the binding correction should be: this is an offset that pushes the center of the page content slightly to the left for left pages, and slightly to the right for right pages, which results in optically centered pages after they are bound. Also double-check the page margins. LaTeX chooses very liberal margins by default, which you may want to reduce in order to more effectively use the available space.

Another point of attention with respect to page lay-out is where in the text a figure or table is mentioned, and where it really is in the document. LaTeX has a number of placement rules for this, which can be overridden. A good automatic result is usually obtained by placing the text that refers to a figure or table directly ‘below’ it in in your TeX file, but in some cases you may need to override this placement. Keep in mind that you are working with a double-sided lay-out, which gives a bit more placement freedom, as people continuously see two pages in your document simultaneously.

Try to keep the number of color pages you have as low as you can, as these are expensive when you print your thesis. Restrict it to graphs or pages where a strong visual aesthetic matters.

4) Table of Contents
As a general rule: do not include more than two ‘levels’ in your table of contents: chapter and section. So, no subsections or subsubsections. Besides, if you need numbered subsubsections, you may want to consider restructuring your text entirely: perhaps the parent section should be a separate chapter instead.

There is a fair number of people that align all the titles to the left and the page numbers to the right. It doesn’t really make sense to do this (what are you going to do: add up the page numbers? really?). The page numbers are there to help the reader, and hence should be placed directly behind the titles. This also removes the need for the visual horror of thick or dotted horizontal lines in the table of contents. Finally, consider adding the bibliography as an item in the table of contents, since this is quite an important part of a scientific work.


5) Chapter Openings
The convention is to put chapter opening pages on the right side of your book. Take special care of the opening page of each chapter. For LaTeX, there are many packages that can help you make these look more visually appealing, such as fncychap. The main rule is: keep it simple, less is more. Some people use a separate page for the start of new chapters, with only the heading, which can also be quite visually pleasing.

6) Tables
This is probably the one most often abused visual element in any document. There are two important things to keep in mind. Firstly, use tables only for listing things structurally, that’s what they are for. In all other cases: use figures. Secondly, please do not use vertical lines in your table, if you need vertical lines: it’s not a table, it’s a figure. As a small visual test: create a table in your favorite word processor or spreadsheet and experiment with how it looks with only horizontal lines and horizontal and vertical lines. You will find that using only horizontal lines makes the table easier to read. Even when using only horizontal lines: use as few as possible, focus instead on properly aligning the data that you are presenting, which alleviates the need for lines in many cases. As a general rule you should use a line above and below the first row, and below the last row. The outside lines may be slightly bolder with respect to the other lines. Tables may look better when they span the entire page width, but this depends on the content.


7) Figures
When you include any figure in your document, really any figure, use a scalable (vector) format where possible. In LaTeX the most obvious choice for this is Encapsulated Postscript (EPS) files. If you must include a non-scalable (raster) image, avoid lossy formats, like JPEG (use PNG instead). Also: include high resolution images. The reason for all this? Many theses include low resolution non-scalable graphics. Unfortunately, this looks horrible when printed: blocky and pixelated. Either your graphics need to have a higher resolution then the resolution used for the print (typically 300ppi), or you need to use scalable (vector) graphics (which looks optimal regardless of the printer’s ppi). A nice, but costly, way to convert raster to vector is by taking the raster image and drawing over it with a vector graphics tool.

8) Captions
Whether you use hanging or non-hanging captions for your tables and figures is a personal choice, same goes for bold caption text or not (I’d personally try to avoid that). By convention, captions for tables are always placed above the table, and captions for figures are always placed below the figure. Try to keep your table captions as short as possible (avoid multi-line captions if you can). It’s visually nicer to have more elaborate text below the element you are presenting. Hence, for figures this type of text can go directly in the caption.

I hope these tips will help you design a better thesis in the short-run and help you produce more visually appealing texts in the future.

Peer-to-Peer Information Retrieval: An Overview

Peer-to-Peer Information Retrieval: An Overview
Tigelaar, A.S. & Hiemstra, D. & Trieschnigg, D.
ACM Transactions on Information Systems, Volume 30, Issue 2, 2012, ISSN 1046-8188, (pp. 9:1-9:34).

View at ACM Digital Library
View in Repository (Author’s Version)

Peer-to-peer technology is widely used for file sharing. In the past decade a number of prototype peer-to-peer information retrieval systems have been developed. Unfortunately, none of these has seen widespread real-world adoption and thus, in contrast with file sharing, information retrieval is still dominated by centralized solutions. In this article we provide an overview of the key challenges for peer-to-peer information retrieval and the work done so far. We want to stimulate and inspire further research to overcome these challenges. This will open the door to the development and large-scale deployment of real-world peer-to-peer information retrieval systems that rival existing centralized client-server solutions in terms of scalability, performance, user satisfaction, and freedom.

How Big is the Web?

The World Wide Web is the most visible part of the Internet consisting of a very large collection of pages. We all know that the web is big, but: how big is it really, and how big will it become?

We can express the size of the web in either a number of pages or a number of sites. The number of sites has been tracked by Netcraft for many years. Initially their estimate was based solely on the number of registered sites. However, at some point it became popular to register domain names without actually using them (for resale purposes). So, they adjusted their methodology to track the number of `active’ sites. Interestingly, the total number of sites seems to rise exponentially, while development of the number of active sites increases more slowly and is much closer to linearity. This can be seen in the image below:

I conducted a simple linear regression in July 2009 on the Netcraft data available back then: 80 million active sites. This yielded an estimate of about 100 million active sites for January 2013. However, as we can see in this graph: that point has already been surpassed, since there are nearly 110 million active sites as of July 2011. Hence, a more accurate estimate for January 2013 is around 125 million active sites: more than 1.5 times as many as in July 2009!

The ISC Domain Survey attempts to count the same as the blue `host names’ line in the Netcraft Survey. The gives an estimate of about 800 million for July 2010 which is significantly higher than that of Netcraft for the same period: 200 million. Some attribute this to methodological differences. Nevertheless, the exponential shape of the line is similar to that of Netcraft, except for the weird drop and stagnation around October 2009.


More than a quarter of the world population has access to Internet: as of March 2011 there are about 2 billion Internet users and 7 billion people on the planet. This number is expected to increase, albeit I suspect more slowly than before, since deployment in developing countries will be somewhat slower. There is of course a connection between the number of Internet users and the number of web pages. The numbers suggest that in the present situation there is about 1 active website for every 16 Internet users.

All of this gives us no information yet about the actual number of web pages. Based on data from 2005 Boutell estimates this to be 273 pages per website. However, their methodology is questionable. Google claimed to have passed the one trillion unique URL’s in mid-2008. Applying the same methodology this would mean that each active host would have about \frac{1\cdot10^{12}}{175\cdot10^{6}}\approx5714 unique URL’s on average which seems very high to me. However, assuming this number includes deep links, those with HTTP GET parameters in the URL, already makes it somewhat more realistic.

One site based on a master’s thesis gives a daily estimate of the size of the indexed world wide web: the part that can be found by search engines. It claims that there are about 17.83 billion indexed web pages. Given this number there are about \frac{18\cdot10^{9}}{175\cdot10^{6}}\approx103 web pages per active host on average. Keep in mind that these statistics follow a long tail distribution: a few big hosts contain a large number of pages and many small hosts have only very few pages.

If the size of the indexed web grows proportionally with the number of active sites, then there will be nearly 20 billion1 web pages in January 2013. Remember this is only the visible web: the part that search engines can see. The deep web, the part not indexed by search engines, may well be over 55 times2 bigger than this in 2013: about 1.2 trillion pages.

This entire methodology still relies on static web pages and dynamic web pages that work via HTTP GET. The AJAX methodology: a web page that retrieves content and updates itself dynamically without reloading, makes the situation even more complex. What counts as a page in such a set-up? Any URL can have many instantiations that depend not solely on time, but on the user’s characteristics and preferences.

So, this question is going to become increasingly hard to answer because new technologies make the meaning of `page’ and `site’ much less well defined.

1) 18*\frac{125}{110}\approx20
2) 5714/103\approx55