recent comments

recent articles

  • How long would it take to read Wikipedia?

    Almer S. Tigelaar 21 / 02 / 2012

    Wikipedia has become the de facto encyclopedia on the Internet. A traditional encyclopedia spans many textbook volumes which would take any normal person ages to read. Few people would likely engage in such an endeavor. However, since Wikipedia is readily accessible: should you take up the challenge?

    read more 0 comments
  • Life in a Day

    Almer S. Tigelaar 09 / 02 / 2012

    The premise behind the YouTube documentary “Life in a Day” is interesting: invite everyone around the world to shoot video on one specific day: July 24th 2010. Have people upload their raw footage and edit it so it becomes a short, ninety minute, documentary that chronicles a single day on our planet. Does this extreme form of crowdsourcing actually work?

    read more 0 comments
  • Top 8 Prejudices about Americans

    Almer S. Tigelaar 07 / 02 / 2012

    When travelling abroad it is difficult to go with an open mind. Despite our best efforts we bring with us an excess of prejudice shaped by our own culture and view of the destination country. So to it was for me when I visited the United States. When coming back, people at home are very insistent that you play into their prejudice regarding where you’ve been as well, perhaps as a means of reinforcing their own identity.

    read more 0 comments

Category: Research

How Big is the Web?

Almer S. Tigelaar 28 / 07 / 2011, 09:00

The World Wide Web is the most visible part of the Internet consisting of a very large collection of pages. We all know that the web is big, but: how big is it really, and how big will it become?

We can express the size of the web in either a number of pages or a number of sites. The number of sites has been tracked by Netcraft for many years. Initially their estimate was based solely on the number of registered sites. However, at some point it became popular to register domain names without actually using them (for resale purposes). So, they adjusted their methodology to track the number of `active’ sites. Interestingly, the total number of sites seems to rise exponentially, while development of the number of active sites increases more slowly and is much closer to linearity. This can be seen in the image below:

I conducted a simple linear regression in July 2009 on the Netcraft data available back then: 80 million active sites. This yielded an estimate of about 100 million active sites for January 2013. However, as we can see in this graph: that point has already been surpassed, since there are nearly 110 million active sites as of July 2011. Hence, a more accurate estimate for January 2013 is around 125 million active sites: more than 1.5 times as many as in July 2009!

The ISC Domain Survey attempts to count the same as the blue `host names’ line in the Netcraft Survey. The gives an estimate of about 800 million for July 2010 which is significantly higher than that of Netcraft for the same period: 200 million. Some attribute this to methodological differences. Nevertheless, the exponential shape of the line is similar to that of Netcraft, except for the weird drop and stagnation around October 2009.

More than a quarter of the world population has access to Internet: as of March 2011 there are about 2 billion Internet users and 7 billion people on the planet. This number is expected to increase, albeit I suspect more slowly than before, since deployment in developing countries will be somewhat slower. There is of course a connection between the number of Internet users and the number of web pages. The numbers suggest that in the present situation there is about 1 active website for every 16 Internet users.

All of this gives us no information yet about the actual number of web pages. Based on data from 2005 Boutell estimates this to be 273 pages per website. However, their methodology is questionable. Google claimed to have passed the one trillion unique URL’s in mid-2008. Applying the same methodology this would mean that each active host would have about \frac{1\cdot10^{12}}{175\cdot10^{6}}\approx5714 unique URL’s on average which seems very high to me. However, assuming this number includes deep links, those with HTTP GET parameters in the URL, already makes it somewhat more realistic.

One site based on a master’s thesis gives a daily estimate of the size of the indexed world wide web: the part that can be found by search engines. It claims that there are about 17.83 billion indexed web pages. Given this number there are about \frac{18\cdot10^{9}}{175\cdot10^{6}}\approx103 web pages per active host on average. Keep in mind that these statistics follow a long tail distribution: a few big hosts contain a large number of pages and many small hosts have only very few pages.

If the size of the indexed web grows proportionally with the number of active sites, then there will be nearly 20 billion[1] web pages in January 2013. Remember this is only the visible web: the part that search engines can see. The deep web, the part not indexed by search engines, may well be over 55 times[2] bigger than this in 2013: about 1.2 trillion pages.

This entire methodology still relies on static web pages and dynamic web pages that work via HTTP GET. The AJAX methodology: a web page that retrieves content and updates itself dynamically without reloading, makes the situation even more complex. What counts as a page in such a set-up? Any URL can have many instantiations that depend not solely on time, but on the user’s characteristics and preferences.

So, this question is going to become increasingly hard to answer because new technologies make the meaning of `page’ and `site’ much less well defined.

read more 0 comments

Search Result Caching in Peer-to-Peer Information Retrieval Networks

Almer S. Tigelaar 06 / 06 / 2011, 15:20

Search Result Caching in Peer-to-Peer Information Retrieval Networks
Tigelaar, A. S. & Hiemstra, D. & Trieschnigg, D.
In Proceedings of IRFC 2011, Vienna, Austria (pp. 134-148).

View in Repository
View at SpringerLink

Abstract
For peer-to-peer web search engines it is important to quickly process queries and return search results. How to keep the perceived latency low is an open challenge. In this paper we explore the solution potential of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that a small bounded cache offers performance comparable to an unbounded cache. Furthermore, we explore partially centralised and fully distributed scenarios, and find that in the most realistic distributed case caching can reduce the query load by thirty-three percent. With optimisations this can be boosted to nearly seventy percent.

Presented at the Information Retrieval Facility Conference 2011 on June 6th 2011 in Vienna, Austria.

Logo of the Information Retrieval Facility

read more 1 comments

Query-Load Balancing by Caching Search Results in Peer-to-Peer Information Retrieval Networks

Almer S. Tigelaar 04 / 02 / 2011, 12:15

Query-Load Balancing by Caching Search Results in Peer-to-Peer Information Retrieval Networks
Tigelaar, A. S. & Hiemstra, D.
In Proceedings of DIR 2011, Amsterdam, The Netherlands (pp. 28-31).

View in Repository

Abstract
For peer-to-peer web search engines it is important to keep the delay between receiving a query and providing search results within an acceptable range for the end user. How to achieve this remains an open challenge. One way to reduce delays is by caching search results for queries and allowing peers to access each others cache. In this paper we explore the limitations of search result caching in large-scale peer-to-peer information retrieval networks by simulating such networks with increasing levels of realism. We find that cache hit ratios of at least thirty-three percent are attainable.

Presented at Dutch-Belgian Information Retrieval 2011 Workshop, February 4th 2011, Amsterdam, The Netherlands.

University of Amsterdam

DIR2011 Workshop Logo

read more 0 comments

Query-Based Sampling using Snippets

Almer S. Tigelaar 23 / 07 / 2010, 11:11

Query-Based Sampling using Snippets
Tigelaar, A. S. & Hiemstra, D.

In Proceedings of LSDS-IR 2010, Geneva, Switzerland (pp. 9-14).

View in Repository

Abstract
Query-based sampling is a commonly used approach to model the content of servers. Conventionally, queries are sent to a server and the documents in the search results returned are downloaded in full as representation of the server’s content. We present an approach that uses the document snippets in the search results as samples instead of downloading the entire documents. We show this yields equal or better modeling performance for the same bandwidth consumption depending on collection characteristics, like document length distribution and homogeneity. Query-based sampling using snippets is a useful approach for real-world systems, since it requires no extra operations beyond exchanging queries and search results.

Presented at the Large-Scale Distributed Systems for Information Retrieval Workshop on July 23rd in Geneva, Switzerland.

sigir2010-13
More Photos

Large-Scale and Distributed Systems for Information Retrieval Workshop Logo

read more 0 comments

Bertold van Voorst: Cluster-based Collection Selection in Uncooperative Distributed Information Retrieval

Almer S. Tigelaar 13 / 07 / 2010, 14:00

Cluster-based Collection Selection in Uncooperative Distributed Information Retrieval
by Bertold van Voorst

View in Repository

Abstract
The focus of this research is collection selection for distributed information retrieval. The collection descriptions that are necessary for selecting the most relevant collections are often created from information gathered by random sampling. Collection selection based on an incomplete index constructed by using random sampling instead of a full index leads to inferior results.

In this research we propose to use collection clustering to  compensate for the incompleteness of the indexes. When collection clustering is used we do not only select the collections that are considered relevant based on their collection descriptions, but also collections that have similar content in their indexes. Most existing cluster algorithms require the specification of the number of clusters prior to execution. We describe a new clustering  algorithm that allows us to specify the sizes of the produced clusters instead of the number of clusters.

Our experiments show that that collection clustering can indeed improve the performance of distributed information retrieval systems that use random sampling. There is not much difference in retrieval performance between our clustering algorithm and the well-known k-means algorithm. We suggest to use the algorithm we proposed because it is more scalable.

read more 0 comments

Automatic Summarisation of Discussion Fora

Almer S. Tigelaar 24 / 03 / 2010, 23:59

Automatic Summarisation of Discussion Fora
Tigelaar, A. S. & Akker, R. op den & Hiemstra, D.
Natural Language Engineering Volume 16, Issue 2, 2010, ISSN 1351-3249, (pp. 161-192).

View at Cambridge Journals On-Line
View in Repository

Abstract
Web-based discussion fora proliferate on the Internet. These fora consist of threads about specific matters. Existing forum search facilities provide an easy way for finding threads of interest. However, understanding the content of threads is not always trivial. This problem becomes more pressing as threads become longer. It frustrates users that are looking for specific information and also makes it more difficult to make valuable contributions to a discussion. We postulate that having a concise summary of a thread would greatly help forum users. But, how would we best create such summaries? In this paper, we present an automated method of summarising threads in discussion fora. Compared with summarisation of unstructured texts and spoken dialogues, the structural characteristics of threads give important advantages. We studied how to best exploit these characteristics. Messages in threads contain both explicit and implicit references to each other and are structured. Therefore, we term the threads hierarchical dialogues. Our proposed summarisation algorithm produces one summary of an hierarchical dialogue by ‘cherry-picking’ sentences out of the original messages that make up a thread. We try to select sentences usable for obtaining an overview of the discussion. Our method is built around a set of heuristics based on observations of real fora discussions. The data used for this research was in Dutch, but the developed method equally applies to other languages. We evaluated our approach using a prototype. Users judged our summariser as very useful, half of them indicating they would use it regularly or always when visiting fora.

read more 0 comments

Query-Based Sampling: Can we do Better than Random?

Almer S. Tigelaar 23 / 02 / 2010, 17:00

Query-Based Sampling: Can we do Better than Random?
Tigelaar, A. S. & Hiemstra, D.
Technical Report TR-CTIT-10-04 (2010), Centre for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, ISSN 1381-3625.

View in Repository

Abstract
Many servers on the web offer content that is only accessible via a search interface. These are part of the deep web. Using conventional crawling to index the content of these remote servers is impossible without some form of cooperation. Query-based sampling provides an alternative to crawling requiring no cooperation beyond a basic search interface. In this approach, conventionally, random queries are sent to a server to obtain a sample of documents of the underlying collection. The sample represents the entire server content. This representation is called a resource description. In this research we explore if better resource descriptions can be obtained by using alternative query construction strategies. The results indicate that randomly choosing queries from the vocabulary of sampled documents is indeed a good strategy. However, we show that, when sampling a large collection, using the least frequent terms in the sample yields a better resource description than using randomly chosen terms.

read more 0 comments

Koen Lavooij: Near-real Time Statistics Gathered from a Continuous and Voluminous Data Mutation Stream

Almer S. Tigelaar 17 / 02 / 2010, 15:45

Near-real Time Statistics Gathered from a Continuous and Voluminous Data Mutation Stream
by Koen Lavooij

View in Repository

Abstract
The amount of digital data is growing fast. Providing that information as a service is not enough, with the amount of information available. To support the users in finding information, supporting systems have been developed to extract specific information from a large amount of stored data.

Finding or extracting interesting information is as least as important as providing the original data. The “collective intelligence? of a large number of users can be used to order the information. The ordered information is of much greater value when compared to the unordered information, because it provides the user with an overview of interesting and less interesting information.
Current database systems are not able to provide ranked information by analyzing a massive amount of user feedback (e.g. clicks) within a short period of time. Therefore, the systems update the answers periodically.

In this thesis, a Stream Processing Engine (SPE) is being adapted. The modified SPE accepts a stream of mutations to a virtual data storage as opposed a stream of tuples. The newly created system exploits the properties of statistical functions in order to efficiently aggregate live statistics over a large stream of mutations.
The newly created system is able to provide answers to a small set of continuous queries. The answers to the queries will be continuously maintained, instead of recalculated. Therefore, the system is able to provide the answers to the continuous queries instantly and with low latency for a large number of users.

read more 0 comments

Query-Based Sampling using Only Snippets

Almer S. Tigelaar 02 / 12 / 2009, 17:00

Query-Based Sampling using Only Snippets
Tigelaar, A. S. & Hiemstra, D.
Technical Report TR-CTIT-09-42 (2009), Centre for Telematics and Information Technology, University of Twente, Enschede, The Netherlands, ISSN 1381-3625.

View in Repository

Abstract
Query-based sampling is a popular approach to model the content of an uncooperative server. It works by sending queries to the server and downloading the returned documents in the search results in full. This sample of documents then represents the server’s content. We present an approach that uses the document snippets as samples instead of downloading entire documents. This yields more stable results at the same amount of bandwidth usage as the full document approach. Additionally, we show that using snippets does not necessarily incur more latency, but can actually save time.

read more 0 comments

Query-Based Sampling using Only Snippets (Poster)

Almer S. Tigelaar 29 / 09 / 2009, 15:00

Query-Based Sampling using Only Snippets
Tigelaar, A. S. & Hiemstra, D.
Poster NWO SIREN 2009, Enschede, The Netherlands.

Depicts the Query-Based Sampling Process

read more 0 comments