P2P Search Engines
How traditional Search Engines work and their disadvantages [Richard Lee]
In a couple of minutes you will know exactly how a peer-to-peer search engine works. In case you don't already know, however, this section will first introduce you to the concept of a search engine and describe how traditional, client-server based search engines work. If you already know all this, feel free to skip straight to the next section.
The World Wide Web (WWW) consists of literally billions of web pages, spread across thousands and thousands of servers all over the world. Since it would be physically impossible for an individual to sift through and examine all these pages to locate specific information, search engines exist to do this work for you.
Search engines use special software called "spiders" which roam around the web, automatically following hyperlinks from one document to the next and extracting important textual information from them. The search engine uses this information to build up a huge index correlating keywords to web pages. Search engines can't just invent URLs from which spiders start their crawl, however. Instead, all crawls originate from web pages specified manually by human users, and the spider subsequently follows the links on these pages to others. You might expect that this would result in a great deal of the web being left undiscovered by spiders. This is indeed true - a very large portion of the web is completely invisible to spiders - yet Google still claims to have a massive 2 billion web pages in its database.
When a user enters a query into a search engine from their browser, their input is processed and used to search the database for occurrences of particular keywords. The web pages it finds are ordered using a ranking algorithm unique to each search engine.
Obviously the highest ranked page should be the one the user is most likely to be interested in. Ranks are usually assigned based on a weighted average of a number of relevance scores, which are derived from the number of occurrences of a word in a page, whether it appears in the page title, in a heading, in the URL itself, in the meta tags, and so on. After the web pages are sorted by relevance, the hit list is sent back to the browser, where the user can explore the results.
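As a rough illustration of the idea (and not any particular engine's formula), the sketch below scores a page for a query term as a weighted sum of simple signals; the signal names, weights and page structure are invented for the example.

# Hypothetical relevance scoring: a weighted sum of simple signals.
# Weights and signal names are illustrative only, not any real engine's ranking.
def relevance_score(term, page):
    body = page["body"].lower()
    signals = {
        "occurrences": body.count(term),              # raw frequency in the body text
        "in_title":    term in page["title"].lower(),
        "in_headings": any(term in h.lower() for h in page["headings"]),
        "in_url":      term in page["url"].lower(),
        "in_meta":     term in page["meta"].lower(),
    }
    weights = {"occurrences": 1.0, "in_title": 5.0, "in_headings": 3.0,
               "in_url": 2.0, "in_meta": 1.5}
    return sum(weights[name] * float(value) for name, value in signals.items())

page = {"url": "http://example.com/p2p-search",
        "title": "P2P search engines",
        "headings": ["How search engines work"],
        "meta": "peer-to-peer, search",
        "body": "A search engine builds an index of the web..."}
print(relevance_score("search", page))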
So today's search engines do not actually search the web directly. They merely refer to a database located on a centralised server. Of course, since web pages can (and frequently do) change at any time, this means that when you “search the web” you are really searching through stale copies of web pages, which may no longer be relevant or may not even exist any more. You will only find out whether this is the case after spending time (trying to) retrieve the current version of the page. So, to tackle the issue of web documents being modified, moved, deleted or renamed, the index database needs to be updated continually and comprehensively to maintain high quality search results and reduce the number of broken links reported to the user.
Naturally, searching a billion or more web pages for a specific piece of information is a very compute-intensive task. This leads to very complex and expensive hardware and software strategies to reduce search times. Powerful, dedicated servers are needed both to feed URLs to spiders during their crawl and to process the data (on the order of megabytes per second) returned by spiders. Google, as an example, has its own dedicated Domain Name Server (DNS), simply to bypass the overhead of querying a worldwide DNS. Furthermore, companies like Google have massive storage and bandwidth requirements, which don't come cheap. On the software side, custom compression solutions (to reduce storage costs) and algorithms such as hashing (for fast indexing of the database) are amongst the techniques used to ensure efficiency and speed.
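The hashing mentioned here usually takes the form of an inverted index: a hash table mapping each keyword to the pages containing it, so a query never has to scan the pages themselves. A minimal sketch, with invented names and deliberately naive tokenisation:

# Minimal inverted index: keyword -> set of page URLs.
# Lookup is a hash-table access, so query time does not grow with page count.
from collections import defaultdict

index = defaultdict(set)

def add_page(url, text):
    for word in text.lower().split():
        index[word].add(url)

def search(*terms):
    # Pages containing every query term (a simple AND query).
    results = [index.get(t.lower(), set()) for t in terms]
    return set.intersection(*results) if results else set()

add_page("http://a.example", "peer to peer search engines")
add_page("http://b.example", "centralised search engines and spiders")
print(search("search", "engines"))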
So while current search engines are adequate for general web searching, they still have many disadvantages. They are costly to build and maintain, they inadequately handle dynamic web pages whose content changes frequently, they are ignorant of the vast majority of web pages, which are unreachable by spiders, and the information they reference can quickly go out-of-date. Of course, all these problems are exacerbated by the rate at which the internet continues to grow, making it virtually impossible for any centralised search engine to repeatedly visit and index all publicly accessible web pages.
Peer-to-Peer Network Search Engines [Edmund So]
Effective discovery methods must rely on a wider variety of information about the desired resources than keywords alone, typically in the form of metadata.
The major types of discovery method are described below.
Flooding broadcast of queries
The original Gnutella implementation is an example of a flooding broadcast discovery mechanism. When a peer makes a query, the query is broadcast to all of its neighbouring peers. If a neighbour cannot satisfy the query, it rebroadcasts the query to its own neighbours. If a matching resource is found, the peer holding it sends a message back to the original sender indicating that it can satisfy the query, and a direct peer-to-peer connection is then established.
Each query has a time-to-live (TTL) counter. Typically, the TTL is set between 5 and 7, and the value is decremented by each node as it relays the message. Another counter tracks the number of hops. Once the TTL counter reaches zero, the query is discarded.
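The sketch below illustrates this flooding scheme in simplified form: each peer decrements the TTL and rebroadcasts the query to its neighbours until the TTL reaches zero or a peer holding the resource answers the originator. The peer and message structures are invented for the example and do not follow the real Gnutella wire format.

# Simplified flooding broadcast with a TTL, in the spirit of early Gnutella
# (illustrative only; not the actual Gnutella protocol).
class Peer:
    def __init__(self, name, files):
        self.name, self.files, self.neighbours = name, set(files), []

    def query(self, keyword, ttl=5, origin=None, seen=None):
        origin = origin or self
        seen = seen if seen is not None else set()
        if self.name in seen or ttl == 0:             # loop protection and TTL expiry
            return
        seen.add(self.name)
        if keyword in self.files:                     # hit: answer the originator
            print(f"{self.name} can serve '{keyword}' to {origin.name}")
            return
        for n in self.neighbours:                     # miss: flood to neighbours
            n.query(keyword, ttl - 1, origin, seen)

a, b, c = Peer("A", []), Peer("B", []), Peer("C", ["song.mp3"])
a.neighbours, b.neighbours = [b], [c]
a.query("song.mp3", ttl=5)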
Due to the broadcast nature of each query, the system does not scale well: network bandwidth consumption grows roughly as O(n²), far faster than the linear increase in the number of peers. Adding more peers quickly drives the network to bandwidth saturation.
This type of mechanism has the advantage of flexibility in the processing of queries: each peer can determine how it will process a query and respond accordingly. It is simple to design and efficient but, unfortunately, suitable only for small networks. It is also very susceptible to malicious activity: rogue peers can send out large numbers of queries, which place a significant load on the network.
Selective forwarding systems
Selective forwarding systems are more scalable than flooding broadcast systems. Instead of sending a query to all peers, it is selectively forwarded to specific peers that are considered likely to locate the resource. Peers become super peers automatically if they have sufficient bandwidth and processing power, e.g. a broadband connection and a fast processor. Peers with dial-up connections (low bandwidth) direct their queries to super peers. Systems of this type use a flow control algorithm (FCA), which applies a form of intelligent flow control to how a peer forwards request and response messages, together with a sensible priority scheme for dropping messages that will not fit into its connections.
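A minimal sketch of the super-peer idea, under stated assumptions: a super peer keeps a small routing table of which neighbours are likely to hold what, and forwards queries selectively rather than flooding. The flow-control and priority logic of a real FCA is not modelled, and all names are illustrative.

# Selective forwarding via super peers (illustrative; the real FCA and
# message priorities are omitted).
class SuperPeer:
    def __init__(self, name):
        self.name = name
        self.routes = {}              # keyword -> super peers likely to have matches

    def advertise(self, keyword, neighbour):
        self.routes.setdefault(keyword, []).append(neighbour)

    def handle_query(self, keyword):
        targets = self.routes.get(keyword, [])
        if not targets:
            print(f"{self.name}: no route for '{keyword}', dropping")
            return
        for peer in targets:          # forward only to peers thought to be relevant
            print(f"{self.name}: forwarding '{keyword}' to {peer.name}")

hub_a, hub_b = SuperPeer("hub-A"), SuperPeer("hub-B")
hub_a.advertise("music", hub_b)       # hub-A learns that hub-B indexes music content
hub_a.handle_query("music")
hub_a.handle_query("video")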
This approach greatly reduces the bandwidth limitations on scalability, but it is still susceptible to malicious activity: a rogue peer can insert itself into the network at various points and misroute queries, or discard them altogether.
Each peer must also hold a certain amount of information used to route or direct the queries it receives. The size of this information is negligible in a small network, but in large networks this overhead may grow to unsupportable levels, so the approach is not suitable for a very large peer network.
Decentralized hash table networks
In decentralized hash table networks, each file stored within the system is given a unique ID, typically an SHA-1 hash of its content, which is used to identify and locate the resource. Given this unique ID, a resource can be located quickly regardless of the size of the network. Since each resource is identified only by this key, it is impossible to perform a fuzzy or keyword search within the network: if a peer is looking for a file held by another peer, it must obtain the key first in order to retrieve the file.
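For example, deriving the lookup key from a file's content (rather than its name) might look like the following; the exact key format and lookup mechanics vary between systems.

# Deriving a lookup key as the SHA-1 hash of a file's content.
# Two files with identical bytes get the same key, regardless of filename.
import hashlib

def content_key(data: bytes) -> str:
    return hashlib.sha1(data).hexdigest()

key = content_key(b"the bytes of the shared file")
print(key)   # this key is what a peer would use to ask the network for the content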
These systems are also susceptible to malicious activity by rogue peers. They may discard a query, insert large amounts of frivolous data to clutter the keyspace, or flood the network with queries to degrade its performance.
Centralized indexes and repositories
Napster uses a centralized index and repository system. An index of all peers and their resources is kept on a main server. A query is sent to the server, which looks it up in the index; if the query can be satisfied, the server sends a message back to the original sender telling it where the file can be obtained.
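A toy version of such a central index is sketched below; the real Napster protocol and data structures are of course quite different, and all names here are invented.

# Toy central index in the style described above (not the Napster protocol).
class CentralServer:
    def __init__(self):
        self.index = {}                      # filename -> set of peer addresses

    def register(self, peer, filenames):
        for f in filenames:
            self.index.setdefault(f, set()).add(peer)

    def lookup(self, filename):
        return sorted(self.index.get(filename, set()))

server = CentralServer()
server.register("peer-1:6699", ["song.mp3", "talk.ogg"])
server.register("peer-2:6699", ["song.mp3"])
print(server.lookup("song.mp3"))   # the requester then connects to one of these peers directly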
Centralized indexes have provided the best performance for resource discovery, but the server in a centralized index and repository system is expensive: the bandwidth and hardware required to support a large network of peers do not come cheap.
If the server in the system fails to function properly, it brings down the network. In the case of Napster, a cluster of servers is used: if one server fails, the rest of the servers continue to support the network.
Recent court rulings cast serious doubt on the liability involved in using centralized servers to index resources in a peer-based network. It has been said that recent legal precedents require any such system to monitor the usage and activity of the network closely to ensure that no copyright violations are occurring. Monitoring and enforcing this requirement is quite challenging, and may be too much of a risk.
Distributed indexes and repositories
This approach eliminates the need for expensive centralized servers. The idea of a distributed index is that each content broker in the network keeps an index of its local files as well as an index of some of the files stored on neighbouring content brokers. When a content broker receives a query from a peer, it first checks whether the query can be satisfied locally. If it cannot, it uses its local index to decide which content broker to forward the request to. The index on each broker is not static and changes as files move through the system.
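The sketch below shows the basic routing decision a content broker makes: answer locally if possible, otherwise consult its partial index of neighbours. The broker structure and names are invented for illustration.

# Content broker with a local index plus a partial index of neighbours
# (an illustrative sketch of the routing decision, not a real system).
class ContentBroker:
    def __init__(self, name, local_files):
        self.name = name
        self.local = set(local_files)
        self.remote = {}                     # filename -> neighbouring broker

    def learn(self, filename, broker):
        self.remote[filename] = broker       # updated as files move around the system

    def query(self, filename):
        if filename in self.local:
            return f"{self.name} serves {filename} locally"
        if filename in self.remote:
            return self.remote[filename].query(filename)   # forward the request
        return f"{self.name}: {filename} not known"

b1, b2 = ContentBroker("broker-1", ["a.txt"]), ContentBroker("broker-2", ["b.txt"])
b1.learn("b.txt", b2)
print(b1.query("b.txt"))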
If well designed, this provides the best performance and scalability of any solution. It also has a very high tolerance of single points of failure: because a content broker only holds a relatively small index compared with a centralized server, the network will still function properly if one content broker goes down.
The most difficult problem with this type of indexing system is cache coherence of the indexed data. If a file is changed locally by a peer, the content broker is not aware that the file has changed; later, when another peer requests that file, the content broker will return an out-of-date copy. The overhead of keeping everything up-to-date and efficiently distributed is a major detriment to scalability.
Peers also join and leave the network from time to time; when a peer leaves the network, all the resource indexes stored on that peer become unavailable to other peers.
Distributed indexing systems as they currently exist cannot provide robust discovery in large networks.
Relevance driven network crawlers
Relevance driven network crawlers are a different approach to the resource discovery problem. Instead of performing a specific query based on a peer's request, they use a database of information the peer has already accumulated to determine which of the resources they encounter may or may not be relevant or interesting to the peer.
Over time, a large amount of information is accrued and analysed to determine what common elements the peer has found relevant. The crawler then traverses the network, which usually consists of HTML documents, looking for new information that matches the profile distilled from the peer's previous information.
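One simple way to distill such a profile is to count the terms in documents the peer has already found relevant and score newly encountered documents by how many of those frequent terms they contain; the sketch below is only one possible interpretation, with invented names.

# Illustrative relevance profile: frequent terms from documents the peer liked,
# used to score newly encountered documents.
from collections import Counter

def build_profile(liked_docs, top_n=10):
    counts = Counter(word for doc in liked_docs for word in doc.lower().split())
    return {word for word, _ in counts.most_common(top_n)}

def score(doc, profile):
    words = set(doc.lower().split())
    return len(words & profile)              # overlap with the peer's interests

profile = build_profile(["distributed search in peer to peer networks",
                         "gnutella flooding and peer discovery"])
print(score("a new paper on peer to peer search", profile))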
The time required for the crawler to traverse a large amount of content is very long, so it is not suitable for large networks.
Current P2P Search Implementations [Michael Collins]
Two main models of p2p networks for file sharing have evolved:
The centralized server-client model
The decentralized model.
Searching centralized server-client p2p networks
Searching on a centralized p2p network is made easy by the presence of a single central server system which maintains directories of the shared files stored on the respective PCs of every user on the network. When a user searches for a file, the central server creates a list of files matching the search request by cross-checking the request with the server's database of files belonging to users who are currently connected to the network. The central server then displays that list to the requesting user, who can then choose files from the list and make direct connections to the individual computers which currently possess those files.
Advantages of the server-client architecture
The principal advantage of the server-client architecture is the central index, which locates files quickly and efficiently. Also, because all clients have to be registered as part of the network, search requests reach all logged-on clients, which ensures the search is as thorough as possible.
Disadvantages of the server-client architecture
The central server system provides a single point of failure and a visible target for legal attacks on the network. Also, because the central server index is only updated periodically, there is a possibility of a client receiving outdated information.
Server-client p2p networks currently operating today include:
OpenNap - open source; a network of servers running the Napster server-client protocol.
KaZaA - uses a semi-centralized architecture and is the underlying network for Morpheus and Grokster. A proprietary protocol, it uses a self-organizing network that automatically assigns more tasks to peers with higher bandwidth.
eDonkey - also a semi-centralized network; anyone is free to run a server. There are loosely connected, separate index servers. When a client sends a search request to a server, only that server is searched; once the search is completed, the client can send the same request to the next server in its list if necessary. Once a file is in the client's download queue, further servers are queried. This is because eDonkey was the first client application to download simultaneously from multiple sources, and it also supports sharing of partially downloaded files, so other clients can start downloading a file from you whilst you are still downloading it.
Filetopia - a Spanish-developed client-server application which uses strong public key encryption to ensure security and anonymity.
Searching decentralized p2p networks
The concept of decentralization is to remove the central structure of a network so that each peer can communicate as an equal with any other peer. When a peer (A) connects to a decentralized network, it connects to another peer (B) to announce that it is live; B will in turn announce to all peers it is connected to (C, D, E, F, etc.) that A is alive, and C, D, E, F, etc. repeat the pattern. Once A has announced that it is alive, it can send a search request to B, which in turn passes it on to C, D, E, F, etc. If, for example, C has a copy of the file requested by A, it transmits a reply to B, which passes it back to A, which can then open a direct connection to C and download the file.
Although this theoretically allows for an infinite network, in practice a time to live (TTL) is used to control the number of nodes a request can reach.
Advantages of a decentralized architecture
They are more rugged, because a single point of failure is eliminated. They are also harder to shut down.
Disadvantages of a decentralized architecture
Searching a decentralized network is slower. You are not guaranteed to find a file even if it is on the network because it may be too far away for a search request to reach the peer which has it before the TTL expires.
Decentralized p2p networks currently operating today include:
Gnutella is by far the most popular alternative to the OpenNap network. The protocol has been reverse engineered and put under the control of an open source foundation. It has also been extended to support hashing, quality of service and multi-source downloading.
Mnet grew out of a commercial product, Mojo Nation. It is a universal file space; however, a file-sharing application was the first application developed for it.
The Freenet project is more than just a file-sharing p2p application: it allows completely anonymous information exchange. It also implements caching, hashes each filename so that servers do not know what they are storing, and uses strong encryption.
GNUnet is an attempt to create an anonymous, distributed, reputation-based network in which all connections are authenticated and encrypted.
The origins of JXTA search [Sean Reilly]
JXTA search originated with Gene Kan and a company he founded called InfraSearch. InfraSearch was developed after Kan realised that Gnutella was a distributed searching network and could be used to access all manner of data, completely independently of format (i.e. not just MP3s etc.). He found that typical web crawlers had stale data, with newly posted documents taking weeks to become available, and also that crawlers didn't access large databases that were open to the web, especially anything after a "?" in the URL. To solve these problems he came up with a prototype of InfraSearch based on Gnutella and using the Gnutella backend. The basic idea behind InfraSearch was to distribute the query to the edges of the network and let the intelligence of the peer it is sent to process it in whatever way is appropriate and respond. InfraSearch was bought by Sun Microsystems in March 2001 and the development team were incorporated into Sun's JXTA project, with the intention of using the ideas developed by InfraSearch to build JXTA search, the search method used in Project JXTA. The InfraSearch team acquired by Sun moved away from the Gnutella backend and developed their own XML-based protocol, which retained some Gnutella ideas, e.g. the notion of letting each peer process queries as it sees fit, and the distribution of queries among peers.
The JXTA search engine consists of the following participants:
- JXTA search information providers: anything that responds to requests formatted in the JXTA search QRP language. Information providers may be JXTA peers or web servers, such as cnn.com.
- JXTA search consumers: anything that makes requests in the JXTA search QRP language. Consumers may be JXTA peers or web sites with HTTP client interfaces to the JXTA search network.
- JXTA search hubs: a mechanism that facilitates efficient query routing over the JXTA search network by handling message routing between consumers and providers. Providers register a description of themselves with the hub and wait for matching requests. Consumers query the network through the hub and await responses.
Hubs can also be providers or consumers, and they can be chained together in a network.
Applications send requests to their nearest search hub, which forwards them to appropriate providers based on the metadata those providers supplied when registering with the hub. The provider then sends its response back to the hub that forwarded the request, and the response travels from hub to hub until it reaches the application that made the request.
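The register / route / respond cycle just described can be sketched generically as follows; this is not the real QRP message format or the JXTA search API, just an illustration of the idea with invented names.

# Generic register / route / respond cycle of a search hub (invented names;
# not the JXTA search API or the actual QRP message format).
class SearchHub:
    def __init__(self, name):
        self.name = name
        self.providers = []                  # (advertised keywords, provider callback)
        self.peer_hubs = []                  # hubs chained to this one

    def register(self, keywords, provider):
        self.providers.append((set(keywords), provider))

    def query(self, keywords, hops=3):
        results = []
        for advertised, provider in self.providers:
            if advertised & set(keywords):   # provider's registered description matches
                results.append(provider(keywords))
        if not results and hops > 0:         # otherwise pass the query on to other hubs
            for hub in self.peer_hubs:
                results.extend(hub.query(keywords, hops - 1))
        return results

hub = SearchHub("hub-1")
hub.register(["news", "sport"], lambda kw: f"news provider answers {kw}")
print(hub.query(["news"]))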
Wide and deep search methods
JXTA search has two complementary search techniques: wide search and deep search.
[Figure: wide and deep search in the JXTA network. Source: search.jxta.org]
In the JXTA search context, wide search means sending queries from search hub to search hub. These search hubs are an efficient compromise between the central-server bottleneck at one extreme and the wasted bandwidth and resources of a completely connected network at the other. Each hub is intended to be specialized in some way (e.g. by content or application), and if a hub cannot answer a request (or if the request is intended to go as wide as possible), it forwards the request to other hubs.
Web crawlers often have stale data and are only effective for static content such as home pages; dynamic pages such as news sites are not effectively indexed by web crawlers. Deep search finds data in large databases such as Amazon or large news sites: JXTA search decides which queries should be sent to these providers and returns the results. The JXTA developers intend this to give much wider and better coverage of the data than conventional search engines.
The future of JXTA search
Sun's Project JXTA has been released to the open source community in an initiative that Sun hopes will help Project JXTA become the number one peer-to-peer platform. This is a very important step in Project JXTA's development because, in the end, it is the number of people using the JXTA platform that will decide whether or not Project JXTA is a success. At present Sun claims to have over 10,000 members in the Project JXTA community, and the number is still growing. Including the open source community in the project will, I think, result in much greater interest from the developer community and therefore many more applications being written on the JXTA p2p platform. Hopefully in the near future we will see some successful implementations of JXTA search coming into widespread use, and then we will be able to properly gauge the success or otherwise of JXTA search.
Further information and references:
“Peer to Peer”, Bo Leuf, Addison Wesley, 2002
“Distributed Search in Peer-to-Peer Networks”, Steve Waterhouse, David M. Doolin, Gene Kan, Yaroslav Faybishenko
Aside: What is metadata? [Edmund So]
Metadata is sometimes defined literally as ‘data about data’. It uses labels like “title”, “author”, “type”, “height” and “language” to describe a book, person, article, etc.
Example of metadata in an HTML document:
<title>How do peer-to-peer search engines work?</title>
<meta name="description" content="this article addresses…">
<meta name="keywords" content="metadata, rdf, peer-to-peer">
<meta name="publisher" content="Trinity College Dublin Computer Science Department">
<meta name="date" content="2002-11-24T00:12:00+00:00">
<meta name="type" content="article">
<meta name="language" content="en-ie">
The original HTML mechanism for embedding metadata has proven limited. There is no built-in convention to control the names given to the various embedded metadata fields, and these fields are often ignored by search engines.