P2P Networks (TCD 4BA2 Project 2002/03)


1. Historical Development

2. Music and P2P

3. Copyright and P2P

4. Napster

5. GNUtella

6. YouServ

7. Freenet

8. P2P Search Engines



Peer-to-Peer Network Search Engines

Current P2P Search Implementations

JXTA Search

9. P2P Routing

10. P2P Security

Readers Guide

P2P Search Engines



Ed So sohm@tcd.ie
Michael Collins collinmg@tcd.ie
Richard Lee leerj@tcd.ie
Rob Lawless lawlessr@tcd.ie
Sean Reilly reillyse@netsoc.tcd.ie


How traditional Search Engines work and their disadvantages [Richard Lee]

In a couple of minutes you will know exactly how a peer-to-peer search engine works. First, though, this section introduces the concept of a search engine and describes how traditional, client-server based search engines work. If you already know all this, feel free to skip straight to the next section.

The World Wide Web (WWW) consists of literally billions of web pages, spread across thousands and thousands of servers all over the world. Since it would be physically impossible for an individual to sift through and examine all these pages to locate specific information, search engines exist to do this work for you.

Search engines use special software called "spiders" which roam the web, automatically following hyperlinks from one document to the next and extracting important textual information from each page. The search engine uses this information to build up a huge index correlating keywords to web pages. Spiders cannot just invent the URLs from which they start their crawl, however. Instead, every crawl originates from web pages specified manually by human users, and the spider subsequently follows the links on those pages to others. You might expect this to leave a great deal of the web undiscovered by spiders, and it does - a very large portion of the web is completely invisible to them. Even so, Google claims to have a massive 2 billion web pages in its database.
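
A minimal Python sketch of what a spider does, purely for illustration (it ignores robots.txt, relative links and proper HTML parsing, all of which a real crawler would need):

import re
from collections import defaultdict
from urllib.request import urlopen

def crawl(seed_urls, max_pages=50):
    """Toy spider: follow links from manually supplied seed pages and
    build an inverted index mapping each word to the pages it occurs on."""
    index = defaultdict(set)          # word -> set of URLs containing it
    to_visit = list(seed_urls)        # crawl frontier, seeded by humans
    seen = set()
    while to_visit and len(seen) < max_pages:
        url = to_visit.pop()
        if url in seen:
            continue
        seen.add(url)
        try:
            html = urlopen(url).read().decode("utf-8", errors="ignore")
        except Exception:
            continue                  # unreachable or malformed page: skip it
        # extract outgoing links so the spider can keep crawling
        for link in re.findall(r'href="(http[^"]+)"', html):
            to_visit.append(link)
        # index the textual content of the page
        for word in re.findall(r"[a-z]+", html.lower()):
            index[word].add(url)
    return index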

When a user enters a query into a search engine from their browser, their input is processed and used to search the database for occurrences of particular keywords. The web pages it finds are ordered using a ranking algorithm unique to each search engine.

Obviously the highest-ranked page should be the one the user is most likely to be interested in. Ranks are usually assigned as a weighted average of a number of relevance scores, derived from factors such as the number of occurrences of a word in a page and whether it appears in the page title, in a heading, in the URL itself, in a meta tag, and so on. Once the web pages have been sorted by relevance, the hit list is sent back to the browser, where the user can explore the results.
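
Continuing the sketch in Python, the query side can be pictured as follows; the weights below are invented purely for illustration and bear no relation to any real engine's ranking algorithm:

def score_page(page, word):
    """Toy relevance score: a weighted combination of a few signals."""
    score = 0.0
    score += 1.0 * page["body"].count(word)          # occurrences in the body
    score += 5.0 * (word in page["title"])           # bonus if it appears in the title
    score += 3.0 * (word in page["url"])             # bonus if it appears in the URL
    score += 2.0 * (word in page.get("meta", ""))    # bonus for the meta description
    return score

def search(pages, query):
    """Rank every indexed page against the query words and return a hit list."""
    words = query.lower().split()
    ranked = []
    for page in pages:
        total = sum(score_page(page, w) for w in words)
        if total > 0:
            ranked.append((total, page["url"]))
    ranked.sort(reverse=True)                        # highest-ranked page first
    return [url for _, url in ranked]

# Example: a tiny two-page "index"
pages = [
    {"url": "http://example.org/p2p", "title": "p2p search",
     "body": "peer to peer search engines", "meta": "p2p"},
    {"url": "http://example.org/cats", "title": "cats",
     "body": "pictures of cats", "meta": ""},
]
print(search(pages, "p2p search"))    # the p2p page ranks first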

So today's search engines do not actually search the web directly. They merely refer to a database located on a centralised server. Of course, since web pages can (and frequently do) change at any time, this means that when you “search the web” you are really searching through stale copies of web pages, which may no longer be relevant or may not even exist. You only find this out after spending time trying to retrieve the current version of the page. So, to cope with web documents being modified, moved, deleted or renamed, the index database needs to be updated continually and comprehensively to maintain high-quality search results and reduce the number of broken links reported to the user.

Naturally, searching a billion or more web pages for a specific piece of information is a very compute-intensive task. This leads to very complex and expensive hardware and software strategies to reduce search times. Powerful, dedicated servers are needed both to feed URLs to spiders during their crawl and to process the data (on the order of megabytes per second) returned by the spiders. Google, as an example, runs its own dedicated domain name server (DNS), simply to bypass the overhead of querying the worldwide DNS. Furthermore, companies like Google have massive storage and bandwidth requirements, which don't come cheap. On the software side, custom compression schemes (to reduce storage costs) and techniques such as hashing (for fast indexing of the database) are among the methods used to ensure efficiency and speed.

So while current search engines are adequate for general web searching, they still have many disadvantages. They are costly to build and maintain, they inadequately handle dynamic web pages whose content changes frequently, they are ignorant of the vast majority of web pages, which are unreachable by spiders, and the information they reference can quickly go out-of-date. Of course, all these problems are exacerbated by the rate at which the internet continues to grow, making it virtually impossible for any centralised search engine to repeatedly visit and index all publicly accessible web pages.


Peer-to-Peer Network Search Engines [Edmund So]

Effective discovery methods must rely on a larger variety of information about the desired resources, typically in the form of metadata.

The major types of discovery method are described in the sections that follow.


Current P2P Search Implementations – Michael Collins

Two main models of p2p network for file sharing have evolved: centralized (server-client) networks and decentralized networks.

Searching centralized server-client p2p networks

Searching on a centralized p2p network is made easy by the presence of a single central server system, which maintains directories of the shared files stored on the respective PCs of every user on the network. When a user searches for a file, the central server builds a list of matching files by cross-checking the request against its database of files belonging to users who are currently connected to the network. The central server then displays that list to the requesting user, who can choose files from the list and make direct connections to the individual computers which currently possess those files.
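
A minimal sketch of such a central index, written in Python purely for illustration (the class, method names and addresses below are invented, not taken from Napster or any real system):

class CentralServer:
    """Toy model of the central index in a server-client p2p network."""

    def __init__(self):
        self.shared = {}                      # peer address -> list of shared file names

    def connect(self, peer_addr, file_names):
        """A client logs on and registers the files it is sharing."""
        self.shared[peer_addr] = file_names

    def disconnect(self, peer_addr):
        self.shared.pop(peer_addr, None)

    def search(self, query):
        """Return (file, peer) pairs for every match among connected peers."""
        hits = []
        for peer, files in self.shared.items():
            for name in files:
                if query.lower() in name.lower():
                    hits.append((name, peer))
        return hits

server = CentralServer()
server.connect("10.0.0.5:6699", ["song_one.mp3", "lecture_notes.pdf"])
server.connect("10.0.0.9:6699", ["song_two.mp3"])
print(server.search("song"))   # the requester then downloads directly from the listed peers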

Advantages of the server-client architecture

The principal advantage of the server-client architecture is the central index, which locates files quickly and efficiently. Also, because all clients have to be registered as part of the network, search requests reach all logged-on clients, which ensures the search is as thorough as possible.

Disadvantages of the server-client architecture

The central server system provides a single point of failure and a visible target for legal attacks on the network. Also, because the central server index is only updated periodically, there is a possibility of a client receiving outdated information.

Server-client p2p networks include Napster, which is covered in section 4 of this project.

Searching decentralized p2p networks

The concept of decentralization is to remove the central structure of the network so that each peer communicates as an equal with any other peer. When a peer (A) connects to a decentralized network, it connects to another peer (B) to announce that it is live; B in turn announces to all the peers it is connected to (C, D, E, F, etc.) that A is alive, and C, D, E, F, etc. repeat the pattern. Once A has announced that it is alive, it can send a search request to B, which in turn passes it on to C, D, E, F, etc. If, for example, C has a copy of the file requested by A, it transmits a reply to B, which passes it back to A; A can then open a direct connection to C and download the file.

Although this theoretically allows for an infinite network, in practice a time to live (TTL) is used to control the number of nodes a request can reach.
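
A toy Python model of this flooding scheme and of the TTL limit (an illustration only, not the actual Gnutella protocol; in a real network the reply travels back along the query path and the download then happens over a direct connection):

class Peer:
    """Toy peer in a decentralized, flooding-based file-sharing network."""

    def __init__(self, name, files):
        self.name = name
        self.files = set(files)
        self.neighbours = []              # peers this peer is directly connected to

    def connect(self, other):
        self.neighbours.append(other)
        other.neighbours.append(self)

    def search(self, query, ttl, seen=None):
        """Flood the query to neighbours, collecting (file, peer) hits until the TTL expires."""
        seen = set() if seen is None else seen
        if ttl == 0 or self.name in seen:
            return []                     # stop: hop limit reached or peer already visited
        seen.add(self.name)
        hits = [(f, self.name) for f in self.files if query in f]
        for peer in self.neighbours:
            hits += peer.search(query, ttl - 1, seen)
        return hits

# A --- B --- C, where C holds the file
a, b, c = Peer("A", []), Peer("B", []), Peer("C", ["report.pdf"])
a.connect(b)
b.connect(c)
print(a.search("report", ttl=3))   # [('report.pdf', 'C')]: C is within reach
print(a.search("report", ttl=2))   # []: the TTL expires before the query reaches C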

Advantages of a decentralized architecture

They are more rugged, because a single point of failure is eliminated. They are also harder to shut down.

Disadvantages of a decentralized architecture

Searching a decentralized network is slower. You are not guaranteed to find a file even if it is on the network, because the peer holding it may be too many hops away for the search request to reach it before the TTL expires.

Decentralized p2p networks operating today include Gnutella and Freenet, which are covered in sections 5 and 7 of this project.


The origins of JXTA search – Sean Reilly

JXTA search originated with Gene Kan and a company he founded called InfraSearch. InfraSearch was developed after Kan realised that Gnutella was a distributed searching network and could be used to access all manner of data, completely independent of format (i.e. not just MP3s etc.). He found that typical web crawlers had stale data, taking weeks for newly posted documents to become available on the web, and that crawlers didn't reach the large databases open to the web, especially anything after a “?” in the URL. To solve these problems he came up with a prototype of InfraSearch based on Gnutella and using the Gnutella backend. The basic idea behind InfraSearch was to distribute the query to the edges of the network and let the intelligence of the peer receiving it process the query in whatever way is appropriate and respond. InfraSearch was bought by Sun Microsystems in March 2001 and the development team was incorporated into Sun's JXTA project, with the intention of using the ideas developed by InfraSearch to create JXTA search, the search method used in Project JXTA. The InfraSearch team acquired by Sun moved away from the Gnutella backend and developed their own XML-based protocol, which drew on the Gnutella protocol in some ways, e.g. the notion of letting each peer process queries as it sees fit and the distribution of queries among peers.

JXTA search

The JXTA search engine consists of the following participants: information providers, information consumers, and search hubs.

Hubs can also act as providers or consumers, and hubs can be chained together into a network.

Applications send requests to their nearest search hub, which then forwards them to appropriate service providers based on the metadata those providers supplied when registering with the hub. A provider sends its response back to the hub that requested it, and the response then travels from hub to hub until it reaches the application that issued the original request.
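
A rough sketch of the hub's routing role, in Python rather than the XML protocol JXTA search actually uses (the class, its methods and the example providers are invented purely to illustrate the idea):

class SearchHub:
    """Toy model of a search hub that routes queries to registered providers."""

    def __init__(self):
        self.registrations = []               # (keywords the provider handles, provider)

    def register(self, keywords, provider):
        """A provider registers metadata describing the queries it can answer."""
        self.registrations.append((set(keywords), provider))

    def query(self, text):
        """Forward the query only to matching providers and gather their responses."""
        words = set(text.lower().split())
        responses = []
        for keywords, provider in self.registrations:
            if words & keywords:              # provider's registration matches the query
                responses.append(provider(text))
        return responses

hub = SearchHub()
# each provider processes the query in whatever way suits its own data
hub.register(["music", "mp3"], lambda q: "music-db result for: " + q)
hub.register(["weather"],      lambda q: "weather-service result for: " + q)
print(hub.query("mp3 search"))   # only the music provider is consulted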

Wide and deep search methods

JXTA search has two complementary search techniques: wide search and deep search.

[Figure: Wide and deep search in the JXTA network. Source: search.JXTA.org]

The future of JXTA search

Sun's Project JXTA has been released to the open source community, in an initiative that Sun hopes will help Project JXTA become the number one peer-to-peer platform. This is a very important step in Project JXTA's development, because in the end it is the number of people using the JXTA platform that will decide whether or not Project JXTA is a success. At present Sun claims to have over 10,000 members in the Project JXTA community, and the number is still growing. Including the open source community in the project will, I think, result in much greater interest from developers, and therefore many more applications being written on the JXTA p2p platform. Hopefully in the near future we will see some successful implementations of JXTA search coming into widespread use, and then we will be able to properly gauge the success or otherwise of JXTA search.

Further information and references:

“Peer to Peer”, Bo Leuf, Addison-Wesley, 2002.

“Distributed Search in Peer-to-Peer Networks”, Steve Waterhouse, David M. Doolin, Gene Kan, Yaroslav Faybishenko.


Aside: What is metadata? [Edmund So]

Metadata is sometimes defined literally as ‘data about data’. It consists of labels like “title”, “author”, “type”, “height” and “language”, used to describe a book, a person, an article, etc.

Example of metadata in an HTML document:



<title>How do peer-to-peer search engines work</title>

<meta name="description" content="This article addresses…">

<meta name="keywords" content="metadata, rdf, peer-to-peer">

<meta name="publisher" content="Trinity College Dublin Computer Science Department">

<meta name="date" content="2002-11-24T00:12:00+00:00">

<meta name="type" content="article">

<meta name="language" content="en-IE">







The original HTML mechanism for embedding metadata has proven limited. There is no built-in convention to control the names given to the various embedded metadata fields, and these fields are often ignored by search engines.