There are many peer-to-peer networks in existence today, each using different
techniques to route data from one peer to another, to discover resources
on the network and to adapt to changing topologies. This webpage describes
the routing techniques employed by four distinct schemes to overcome these
challenges.

GNUtella

Before one can give a detailed outline of the routing algorithms used by GNUtella, it is essential to give a broad outline of what such a system is trying to achieve in general. Most operations on the Internet work on a client-server model: individual client machines interact with a central server to retrieve information stored on that server, so the majority of information moves from the server to the client. The GNUtella protocol works on the basis that servers can act as clients and vice versa, i.e. it is a decentralized system. These individual systems are called GNUtella servents (SERVer + cliENT). A servent can accept queries from another servent which is trying to access files on its system, while at the same time requesting files from another servent's file system. Communication between servents is carried out using the TCP/IP protocol.
Once a servent is connected to the network it can send information into the network to find out about other servents in the system. The other servents can respond to this request by sending information about their own state, including their IP address, the number of files they have decided to share on the network and the total size of these files. A servent can then query the network for files meeting certain search criteria. If a servent has files meeting the criteria, it will respond with a list of appropriate file details. The servent that made the initial request can then ask an appropriate servent for a particular file and have that file routed to it.
Servents communicate with each other using a set of descriptors. There are currently 5 descriptors defined: Ping, Pong, Query, QueryHit and Push. These will be described in detail later.
Each descriptor is preceded by a Descriptor Header with the following structure:

| Byte Positions | 0 - 15 | 16 | 17 | 18 | 19 - 22 |
| Contents | Descriptor ID | Payload Descriptor | Time To Live | Hops | Payload Length |
Descriptor ID: This is used to uniquely identify the particular message on the network. It is created by the client and must (in theory) be unique, so that other servents can detect when they are seeing a message they have processed before.
Payload Descriptor: This defines the type of descriptor which is following the header. It can be any of the 5 choices.
Time To Live (TTL): This field gives the maximum number of servents that the message can be routed through before it must be discarded. Each time the message passes through a servent this field is decremented; when the value reaches 0 the message is discarded. This ensures that a particular message will not be routed around the network indefinitely, which would cause an immediate degradation of network performance if all servents were sending out many packets in succession that were never removed from the network.
Hops: This is a count of the number of servents through which the message has been passed. The count is incremented each time it passes through a new servent.
Payload Length: The length of the payload immediately following this header. This field is important because the protocol does not define any flags to mark where one payload ends and the next descriptor begins. If the fields in the descriptor header are invalid the message cannot be routed: it becomes impossible to find the beginning of the next descriptor, and the connection must be terminated.
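The fixed 23-byte header above maps naturally onto a struct layout. The sketch below is a minimal Python illustration; the little-endian byte order and the payload-descriptor byte values in the comments follow common GNUtella v0.4 conventions and are assumptions, not taken from the table here:

```python
import struct
import uuid

# Layout from the table above: 16-byte Descriptor ID, then one byte each
# for Payload Descriptor, TTL and Hops, then a 4-byte Payload Length.
# Byte order for the length is assumed little-endian ("<"). Typical
# payload descriptor values (assumed, from GNUtella v0.4 conventions):
# Ping = 0x00, Pong = 0x01, Push = 0x40, Query = 0x80, QueryHit = 0x81.
HEADER = struct.Struct("<16sBBBI")   # 23 bytes in total

def pack_header(descriptor_id, payload_type, ttl, hops, payload_length):
    return HEADER.pack(descriptor_id, payload_type, ttl, hops, payload_length)

def unpack_header(data):
    return HEADER.unpack(data[:HEADER.size])

# A Ping (payload descriptor 0x00) carries no payload, so its header is
# the entire message.
ping = pack_header(uuid.uuid4().bytes, 0x00, 7, 0, 0)
assert len(ping) == 23
```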
Ping, Pong, Query, QueryHit and Push

The following are the 5 descriptors which can follow the descriptor header outlined above:
Ping

A servent uses a ping descriptor as the mechanism to find out about other servents on the network. A ping is sent with no payload, so the descriptor header contains all the necessary information.
Pong

| Byte Positions | 0 - 1 | 2 - 5 | 6 - 9 | 10 - 13 |
| Contents | Port | IP Address | File Count | Total File Size |

This is the response that a servent may send upon receiving a ping. The response contains the following fields:
Port: This is the port on which the servent will listen for communication
IP Address: IP address of the servent
File count: This is a count of the number of files that the servent is sharing with other servents.
Total File Size: The total size of all the files that the servent is sharing on the network in kilobytes.
Query

| Byte Positions | 0 - 1 | 2 - N |
| Contents | Minimum Speed | Search Query |
When a servent wants to find a file on the network it sends out a query descriptor, which contains the following fields:
Minimum Speed: If a particular servent is to communicate with the requesting servent it must be able to communicate at or above the specified minimum speed. Only servents that meet this criterion can send a response to this query.
Search Query: This is a string of search data that contains the request of the servent. This string must be null terminated to mark the end of the descriptor.
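The two query fields can likewise be packed as a short sketch (byte order again assumed little-endian; the function names are hypothetical):

```python
import struct

def pack_query(min_speed, search):
    # 2-byte minimum speed followed by the null-terminated search string,
    # matching the byte layout in the table above.
    return struct.pack("<H", min_speed) + search.encode("ascii") + b"\x00"

def unpack_query(payload):
    (min_speed,) = struct.unpack_from("<H", payload)
    search = payload[2:payload.index(b"\x00", 2)].decode("ascii")
    return min_speed, search

assert unpack_query(pack_query(56, "routing survey")) == (56, "routing survey")
```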
QueryHit

| Byte Positions | 0 | 1 - 2 | 3 - 6 | 7 - 10 | 11 - (N-1) | N - (N+16) |
| Contents | Number of Hits | Port Number | IP Address | Speed | Result Set | Servent ID |
This descriptor is sent in response to a query placed on the network, but only if the responding servent can communicate at the speed outlined in the minimum speed field of a Query and if it contains a file which matches the search string. It contains the following fields:
Number of hits: The number of files that the servent has which match the query string. The individual file details are grouped in sets. These set structures will be described below.
Port Number: The port on which the servent will listen for communication.
IP Address: The address of the servent.
Speed: The speed of the servent in KB per second.
Result Set: The result sets contain file details: the index of the file stored on the host (the number given to the file by GNUtella), the size of the file in bytes, and a string giving the name of the file found, terminated by a double null. The file index and file size are both 4 bytes long; the file name string is at least 1 byte long.
Servent ID: This uniquely identifies a servent on a network.
Push

| Byte Positions | 0 - 15 | 16 - 19 | 20 - 23 | 24 - 25 |
| Contents | Servent ID | File Index | IP Address | Port |
If the servent is behind a firewall that does not allow incoming connections, the client can use a push descriptor to retrieve the data. The descriptor contains the following fields.
Servent ID: This is the unique ID of the servent which holds the required files. It is obtained from the QueryHit descriptor sent back by that particular servent.
File Index: This is the mechanism for the requesting servent to indicate which file it wants from the servent with servent ID. This index is obtained from the Query Hit message descriptor.
IP Address: Address of the host to which the file is to be transferred.
Port: Port number to which the file is to be transferred.
Routing Rules for descriptors
To try to minimise the traffic and the number of nodes involved in transferring packets around the network, each servent must maintain some form of routing table. This table is used by a servent to determine whether a particular descriptor has been seen before; in this way the servent can decide if a message needs to be forwarded or removed from the network. It is best to describe the process using an example:
Take the situation where node A is joining the network and wants to find out about the other nodes on the network. It does this by sending a ping to node B. On receipt of a ping, node B looks up its routing table to see if it has seen that particular ping before. Initially it has not, so B enters details about the message into the routing table. It responds to A by sending a pong descriptor containing its details, then forwards the ping from A to the other nodes to which B is connected, i.e. C, E & F. These nodes carry out the same operation: storing the details in the routing table if necessary, sending back a response to A and forwarding the message on. Because a loop exists, B will receive the ping message back around the loop. This scenario is dealt with when C & E check their routing tables and recognise that they have seen that same ping message before; in these cases the message can be discarded.
When node B receives the pong messages from C, D and E destined for A, it can look up its routing table, see that it routed the corresponding ping message, and forward the pongs on to A.
If the network were further expanded the ping message from A would propagate its way through the network until the TTL on the ping message expired, at which point the message would be removed from the network. This would happen if the network were large in relation to the TTL.
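The forwarding rule in the example above can be sketched in Python. The names and structure below are illustrative, not part of the GNUtella specification: each servent records the descriptor IDs it has seen, discards duplicates and expired messages, and floods everything else to its other neighbours.

```python
class Servent:
    def __init__(self, name):
        self.name = name
        self.neighbours = []      # directly connected servents
        self.routing_table = {}   # descriptor ID -> connection it arrived on

    def originate_ping(self, descriptor_id, ttl):
        self.routing_table[descriptor_id] = None   # remember our own ping
        for peer in self.neighbours:
            peer.receive_ping(descriptor_id, ttl, self)

    def receive_ping(self, descriptor_id, ttl, sender):
        if descriptor_id in self.routing_table:
            return                # seen before: discard (this breaks loops)
        self.routing_table[descriptor_id] = sender
        # ... a pong would be sent back along `sender` here ...
        if ttl - 1 == 0:
            return                # TTL exhausted: drop the message
        for peer in self.neighbours:
            if peer is not sender:
                peer.receive_ping(descriptor_id, ttl - 1, self)

# Three servents connected in a loop: the duplicate check stops the
# ping from circulating forever.
a, b, c = Servent("A"), Servent("B"), Servent("C")
a.neighbours = [b, c]
b.neighbours = [a, c]
c.neighbours = [a, b]
a.originate_ping("ping-1", ttl=7)
```

Note that the routing table also records which connection each ping arrived on, which is exactly what lets a servent route the corresponding pongs back along the reverse path.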
The same form of routing is used for querying the network and receiving query responses.
The major advantage of the GNUtella routing system is that unlike the well-known
Napster protocol, GNUtella does not use a centralized server. Less reliance
on central servers means that failures will affect very few servents.
GNUtella was designed to transfer any type of file, unlike Napster which would only allow the transfer of sound files.
The major disadvantage is that descriptors such as ping and query must be forwarded by each node to all its neighbours until the TTL reaches 0. This means that a single ping message causes each neighbouring servent to send ping messages to its neighbours, which in turn may generate thousands of pong messages going back to the servent that issued the original ping. The responses to these descriptors are sent in a more efficient manner, along the path travelled by the initial request.
Another disadvantage is that each servent must store information about the packets it has seen over a period of time. This is something of a dilemma: to keep a very good history of packet routing a servent would need to use a lot of memory, but with a small history store it may end up forwarding packets that it had previously forwarded, creating unnecessary network traffic. The appropriate history size depends on the size of the network, and since the GNUtella network does not follow any particular structure, the size of the network is always changing.
The protocol depends on each servent producing a globally unique identifier for a particular message. In practice this cannot be guaranteed, and the overall system breaks down if two messages that are supposed to be unique, i.e. sent from different servents, end up having the same identifier.
There will also be problems if one servent is about to fetch files from another servent and the sending servent leaves the network and rejoins with a different file index. The receiving servent may then index incorrectly into the file system, looking for a file that does not exist or retrieving the wrong file. A servent that rejoins in this way will also have lost all of its history.
JXTA

Introduction

JXTA (pronounced "juxta") was developed by Sun and launched on 25 April 2001. It was intended to be an open standard platform for the development of a wide and varied range of distributed applications, and in the relatively short time it has been in use it has become very popular. The platform provides tools and utilities allowing devices such as mobile phones, PCs, PDAs and servers to communicate and share information over a virtual network. Each device connected to the network is known as a peer. Peers are able to set up virtual networks in order to find, access and use the resources of other peers.
SUN had three main goals for JXTA. These were Interoperability, Platform
Independence and Ubiquity.
- Interoperability: this allows peers to locate each other easily as well as allow them to set up community activities and share resources seamlessly.
- Platform Independence: peers should be able to communicate and operate regardless of what operating system, language or networking protocol is used.
- Ubiquity: JXTA is designed to be usable on any digital device.
Unlike most other P2P systems JXTA has designed its network from a much lower level and has designed an entirely new infrastructure unrelated to any other existing P2P scheme. Peers are divided into peer-groups. Essentially the JXTA virtual network is a collection of peer groups. This approach is adopted for communication, security and performance reasons. Six core protocols have been defined by JXTA to allow peers to detect one another as well as set up peer groups without the need for network management. It can also operate through firewalls. Message passing is achieved using unidirectional pipes and the data passed between peers is in the form of XML documents.
Fig. 2.1: JXTA Virtual Network
JXTA Messages

Communication between peers is achieved using XML documents. Messages are sent from one peer to another as an ordered sequence of bytes in a single message unit. Messages are sent between peer endpoints. An endpoint is a destination on any network capable of receiving datagram-style messages. The mapping of endpoints to physical addresses is performed at runtime by a messaging layer, which uses the transport specified by the URI for message passing. JXTA can use connection-oriented (TCP/IP) as well as connectionless (UDP) transport protocols, and can also support emerging transport protocols.
Messages can be viewed as an envelope with a stack of protocol headers. The envelope contains a header, message digest, source endpoint and destination endpoint. The source endpoint is optional. The header consists of a protocol tag and message body length. The protocol tag specifies the protocol in use. The body's size is dependent on the protocol in use. When using an unreliable transport protocol messages may arrive at the endpoint more than once, in a different order than sent or may not arrive at all. Higher layers above the messaging layer are responsible for message re-ordering and duplicate message removal.
Also contained within the body of a message is a credential key that is used at the endpoint to identify a sender. It also ensures that the sender has a right to send a message to that specific endpoint. This credential must be used each time a message is sent.
|jxta:// Envelope Version|
|Message Digest (Kind, Length)|
|Message Body Header (Protocol Tag)|
|Message Body (Text, XML, ...)|
JXTA Pipes

Pipes form the virtual communication channels used to send and receive messages between peers. They are unidirectional, stateless and provide abstraction from the type of network being used. A pipe can connect one or more endpoints. At each endpoint it is assumed (though not required) that there is software managing the sending and receiving of messages using the pipe. Each peer endpoint has two associated pipes: an input pipe and an output pipe. Pipes are bound at runtime using the pipe binding protocol. When a peer sends a message into a pipe, all other peers connected to that pipe will receive it. The pipes that are currently connected can also be obtained using the pipe binding protocol.
Pipes can support two modes of communication. Firstly there is the standard point-to-point mode. Using this mode exactly two pipes are connected together, an input pipe and an output pipe. The input pipe will receive messages sent through the output pipe. There is no support for a reply operation. The additional information in the message envelope like the credential key is used to thread messages together. The second mode used is the propagate mode. This is where multiple input and output pipes are connected to network endpoints. Using this a message can be sent to all listening input pipes. Using propagate pipes may cause messages to be duplicated. As a result of this they are not always favoured, especially when optimal speeds and times are required for communication.
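The two pipe modes can be contrasted with a toy model. This is purely illustrative: the real JXTA API is Java, and pipes are bound at runtime through the pipe binding protocol rather than constructed directly as below.

```python
class InputPipe:
    def __init__(self):
        self.received = []
    def deliver(self, message):
        self.received.append(message)

class OutputPipe:
    def __init__(self, *listeners):
        # One listener models point-to-point mode; many model propagate mode.
        self.listeners = list(listeners)
    def send(self, message):
        # In propagate mode every listening input pipe gets a copy, so the
        # same message may be handled more than once across the group.
        for pipe in self.listeners:
            pipe.deliver(message)

inbox_a, inbox_b = InputPipe(), InputPipe()
point_to_point = OutputPipe(inbox_a)        # exactly one input pipe
propagate = OutputPipe(inbox_a, inbox_b)    # many input pipes

point_to_point.send("direct")
propagate.send("broadcast")
```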
JXTA Message Routing Protocols

As mentioned earlier, JXTA uses six protocols specifically designed to work together to allow for discovery, monitoring, organization and communication between peers. The protocols are used by peers to advertise and discover available resources. Message routing is cooperative, allowing all peers to remain connected. The protocols are:
- Peer Discovery Protocol
- Peer Resolver Protocol
- Peer Information Protocol
- Pipe Binding Protocol
- Endpoint Routing Protocol
- Rendezvous Protocol
Peer Discovery Protocol

This protocol allows peers to advertise their resources and find out what resources are available from other peers. Advertised resources are represented as XML documents. The protocol supports finding peers without knowing their names. Using a process known as screening it can also be used to probe the network for peers that belong to a specific group: each candidate is presented with a peer group name, and any peer that belongs to the group will respond to the message. A list of peers may be cached, allowing for unicast rendezvous communication instead of broadcast messages. Messages requesting advertisements from a certain part of the network will not always return an exhaustive list, and the list may even be empty.
- Discovery Query Message - used to find peers or peer groups. Message contains a threshold value indicating the maximum number of matches required.
- Discovery Response Message - used to answer a discovery query message.
Peer Resolver Protocol

This allows peers to send a query to one or more peers and receive one or more responses. The query is sent with a unique ID that is echoed in the response, identifying it as the response to that query. Such queries can be sent to a whole peer group or to specific peers. This protocol performs confirmation of messages and drops any rogue messages.
- Resolver Query Message - used to send a resolver query request to another peer. Query contains unique ID.
- Resolver Response Message - used to respond to a resolver query message. Response sent with unique ID of sender.
Peer Information Protocol

Allows peers to find out status information about other peers, i.e. state, uptime and capabilities. The message sent by the original sender contains a unique ID which must be used in the response. In the responding message each property of interest is listed as a name/value pair.
- Ping Message - sent to check if the peer is active and get information about the peer. The ping option specifies whether a simple response or full response is required.
- PeerInfo Message - sent in response to a ping message.
Pipe Binding Protocol

This is the mechanism that allows a peer to set up a virtual communication channel (pipe) between itself and one or more other peers. Essentially it binds the two ends of a pipe, creating this virtual channel. Pipes are the most common mechanism used to allow peers to communicate.
- Pipe Binding Query Message - a peer seeking a pipe endpoint bound to a particular peer sends this message.
- Pipe Binding Answer Message - the response is sent back to the requesting peer by each peer bound to the end of that pipe.
Endpoint Routing Protocol

This protocol allows a peer to discover a route between itself and another peer. If peers X and Y wish to communicate but there is no direct link between the two, this protocol is used to find the intermediary peers through which the message will have to be routed. If that route is later broken, the protocol can be used to find other possible routes.
- Route Query Request - sent by one peer router to another requesting route information.
- Route Answer Request - sent in response to a route query request.
Rendezvous Protocol

This protocol allows peers to become members of a propagation service. As a member of a propagation service a peer will receive messages and information sent out to all members of that service; the protocol thus allows peers to send messages to all members of the service. To become a member of a peer group a form must be obtained listing the requirements that must be met in order to join. This form is a structured document.
- Membership Apply Message - message sent to the application authenticator. A response will contain an application credential and a group advertisement that lists its membership service.
- Membership Join Message - sent by a peer to the application authenticator. This is an application to join the group. In order to do so the peer must pass an application credential. A successful response will include a full membership credential.
- Membership ACK Message - this message is sent by the membership authenticator to indicate to a peer whether it has been granted an application or full membership.
- Membership Renew Message - sent to renew access to a peer group.
- Membership Cancel Message - sent to cancel membership to a peer group.
Fig. 2.2: Message Passing Protocols
Tapestry

Traditional approaches to location services and routing over networks place
a large strain on components within the network infrastructure. Tapestry
aims to provide an alternative to the traditional network method. Using
only point-to-point links and no centralized resources, Tapestry is an overlay
location and routing infrastructure providing location-independent routing
of messages directly to its closest copy of an object or service. Tapestry
can boast features such as load balancing, fault-resilience, robustness,
scalability, and self-organisation.
The Tapestry routing architecture
Tapestry builds on the Plaxton location and routing mechanism, and its goals are adaptivity, self-management and fault-resilience. Tapestry's routing architecture efficiently routes requests to content even in the presence of node faults and heavy network loads. Through the use of randomness it achieves routing locality and load distribution. Tapestry deals with network problems by bypassing failed routes, removing nodes under attack from the service, transparently masking faulty components, and rapidly adapting communication topologies to current circumstances, depending on the type of problem encountered.
Because Tapestry has its roots firmly in the Plaxton mechanisms, it is appropriate to outline the Plaxton method in order to gain an understanding of the Tapestry routing mechanisms.
The Plaxton routing architecture
Plaxton routing mechanism
The Plaxton routing mechanism gives each node local routing maps, called neighbour maps, that incrementally route overlay messages towards the destination ID one digit at a time. Each node's neighbour map has multiple levels, and every level has a number of entries equal to the base of the ID: the i-th entry in the j-th level holds the ID and location of the closest node whose ID matches the current node's ID in the last j-1 digits and has i as the next digit. To find the next hop from a node that matches the destination in the last n digits, we look at its (n+1)-th level map and look up the entry matching the next digit of the destination ID. Using this routing method in a namespace of size N with IDs of base b guarantees that any existing unique node in the system will be found in at most Log_b N logical hops, assuming consistent neighbour maps at each node.
Because every neighbour map level at a node assumes that the preceding digits of candidate nodes all match the current node's suffix, each routing level only needs to keep a small constant number (b) of entries:

NeighbourMapSize = (entries/map) * (# of maps)
                 = b * Log_b N

This results in a compact neighbour map whose size grows only logarithmically with the size of the namespace.
Fig. 3.1: Plaxton routing example.
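The size bound above can be checked with a short sketch; integer digit counting is used instead of a floating-point logarithm:

```python
def id_digits(N, b):
    """Number of base-b digits needed to name a namespace of size N."""
    d = 0
    while b ** d < N:
        d += 1
    return d

def neighbour_map_size(b, N):
    # b entries per level, one level per digit of the ID: b * Log_b(N).
    return b * id_digits(N, b)

# 160-bit IDs written as 40 hexadecimal digits: b = 16, N = 2**160,
# so each node stores only 16 * 40 = 640 entries.
assert neighbour_map_size(16, 2**160) == 640
```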
Plaxton location mechanism
The Plaxton location mechanism allows a client to locate and send messages to a named object residing on a server. This is achieved by the server advertising that it has the object: it routes a message to the root node, a unique node in the network that holds the root of the object's location mapping. Because the root node serves this critical purpose and is the only one of its kind, it is a single point of failure. Publishing is the process whereby a message is sent to the root node and the <Object-ID (O), Server-ID (S)> location information is stored at each node along the path.
On the other hand a location query allows clients to send messages to objects. As the message progresses through its path, it checks the nodes it passes through. The message is redirected to the server containing the object if the message encounters a node that has the location information for the object sought. Otherwise, the message is forwarded one step closer to the root. The location information or mapping for the object is guaranteed to be found upon the message reaching the root. The Plaxton location mechanism chooses root nodes using a globally consistent deterministic algorithm.
Plaxton routing architecture - Review
Overall the Plaxton mechanism provides many good features for routing architecture. These include:
· Simple Fault Handling
· Exploitation of Locality (Plaxton has an explicit notion of locality)
· Proportional Route Distance (Total network distance travelled by a message during both location and routing phases is proportional to the underlying network distance)
However, there are some bad features to its design which include:
· Global Knowledge (Plaxton requires global knowledge of the network which complicates the addition and removing of nodes)
· Root Node Vulnerability (Having a critical purpose and being the only one makes the root node a single point of failure)
· Lack of Ability to Adapt (Plaxton mechanism lacks the ability to adapt to dynamic query patterns)
Tapestry location and routing mechanisms
The main features of the Tapestry mechanism are based heavily on the Plaxton mechanisms. Just like the Plaxton routing mechanism every node in the network has a neighbour map with multiple routing levels and each entry describes the location of its closest neighbour. The Tapestry location mechanism is similar to that of Plaxton's.
In the Plaxton method each node stores only the <Object-ID (O), Server-ID (S)> mapping for the replica closest to it. Tapestry, on the other hand, stores the location information of all object replicas, which increases semantic flexibility.
Fig. 3.2: A single Tapestry node. The complete components of a Tapestry node that acts as a client, object server, and router. Components include a neighbour map, hotspot monitor, object location pointers, and a store of objects.
The next section explains how Tapestry mechanisms detect, operate under, and recover from faults affecting routing and location functionality.
A Tapestry design decision was that components tackle the problem of fault adaptivity by using soft state to maintain cached content for graceful fault recovery, rather than hard state, which would require guarantees to be provided. Cached content is updated by regular refresh messages. Rather than treating faults as a special case, Tapestry handles them as part of its normal operation.
Neighbour map table corruption, server failures and link failures are the types of expected faults that need to be handled by the network. Ideally the system should quickly detect failures, operate under them, and recover router state when failures are repaired. Tapestry relies on TCP timeouts to detect link and server failures during normal operations. In addition, each node sends regular UDP packets along back pointers to its closest neighbours, to ensure that routing can still be processed through those nodes. Corrupted tables can be quickly and easily detected by using a message that checks the ID of each node it passes through.

In addition to its closest neighbour, the neighbour map of a node maintains two backup neighbours in every entry. If the primary neighbour ever fails, the node uses a backup neighbour instead. When a node fails it is usually found and repaired in a short time, so instead of removing the failed node from its neighbour map, a node marks it as invalid and routes through an alternative route until it is repaired. The node only remains in the neighbour map in its invalid state for a fixed time (e.g. one day) to give it a chance to be repaired; messages from the failed route are checked to see when the repair has happened. If the node has not been fixed within the specified time it is removed from the neighbour map. This process avoids the costly reinsertion of recovered nodes after the failure has been repaired.
In the Plaxton mechanism the root node was a single point of failure. Tapestry instead uses multiple roots: each object is assigned several root nodes, achieved by adding a constant sequence of values, called salt values, to each object ID. During the publishing process described previously, the roots are used to insert location information into the nodes, and the storage servers republish the location information of stored objects at regular intervals.
Tapestry uses a surrogate routing mechanism, a distributed algorithm that incrementally computes a unique root node. The algorithm is deterministic and scalable, and arrives at consistent results from any point in the network. Surrogate routing nominally chooses an object's root node to be the node whose name equals the object's ID. Because of the sparseness of the node namespace it is unlikely that this node actually exists, but Tapestry operates as if it did by attempting to route to it. A route towards a non-existent identifier will encounter empty neighbour entries at various positions along the way; at each such point the algorithm makes a deterministic selection among the existing neighbour pointers and follows that link instead. Routing finishes once a neighbour map is reached in which the only non-empty entry belongs to the current node, and that node is designated the surrogate root for the object.
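One way to make the fallback choice deterministic is sketched below. The wrap-around scan is an assumption for illustration, not Tapestry's actual selection function, but any rule applied identically at every node yields a consistent surrogate choice:

```python
def surrogate_next_hop(level_entries, desired_digit):
    """level_entries[i] holds the neighbour for digit i, or None if empty.

    If the desired entry is empty, scan forward through the level
    (wrapping around) and take the first non-empty entry. Every node
    applying the same rule to the same map makes the same choice.
    """
    b = len(level_entries)
    for offset in range(b):
        candidate = level_entries[(desired_digit + offset) % b]
        if candidate is not None:
            return candidate
    return None   # the whole level is empty: routing terminates here

# A base-4 level where only digits 0 and 3 have neighbours:
level = ["node-A", None, None, "node-D"]
assert surrogate_next_hop(level, 0) == "node-A"   # direct hit
assert surrogate_next_hop(level, 1) == "node-D"   # 1 -> 2 -> 3
```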
Kademlia

Kademlia is a new (circa March 2002) peer-to-peer information system that returns search results from the entire network in O(log n) time, where n is the number of nodes in the network. This scheme is extremely scalable, but it must be realised from the outset that a given piece of content must reside on a particular node in order for it to be found and retrieved. Given that no account is taken of the bandwidth available to a node, a PC with a 56k dial-up connection may be expected to host, for example, a large media file, and to make it available for download to anyone who wants it. Clearly the system is not suitable for large-scale Napster-style file sharing, but rather serves as a distributed file store on a high-speed, reasonably homogeneous network. Kademlia's searching and routing techniques are novel, and are worthy of investigation.
Keys & Content
In order to decide where on the network a piece of content should reside, a 160-bit hash of the content is first obtained. Hashes are assumed to be unique for different files. We can imagine a hashing scheme where a file is split into 160 pieces and the first bit of each piece is recorded; these 160 bits could be used as our hash, since it would be very unusual to find two different files with the same bits in the given positions. Kademlia uses a technique similar to this: a hash is obtained, which it refers to as the key. The file (or other content) is called the value, and in Kademlia the key/value pair is stored on the node whose 160-bit nodeID is closest to the key. The concept of "closeness" will be discussed later. The content must reside on this particular node so that it can easily be found; for this reason Kademlia is referred to as a Distributed Hash Table. In order to find a file Kademlia simply obtains its key and retrieves the file from the node with ID closest to the key. The responsibility of having to host someone else's content makes it very unlikely that Kademlia could ever be used as a replacement for a music-sharing network like Napster or GNUtella.
Given that a piece of content can have any 160-bit key, but
only a tiny fraction of 160-bit nodeIDs are used, there must be a way of
deciding where to store content whose key is not exactly equal to any nodeID.
Kademlia solves this problem by inventing a measure of "closeness"
between two 160-bit numbers (in this case, one a key and the other a nodeID).
This is referred to as the XOR metric, because the two numbers are simply
XORed together and the resulting value is treated as a distance. For example
(with two 6-bit numbers):
X = 010101, Y = 110001, X ⊕ Y = 100100, so d(X,Y) = 36, the distance between X and Y
A key/value pair is stored at the nodeID closest to the key, in the terms defined above. This metric is also used in determining how close two nodes are to one another, which is important when searching.
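In code the metric is a single bitwise operation. The following sketch uses two arbitrary 6-bit values for illustration and checks two properties the metric relies on, symmetry and zero self-distance:

```python
def xor_distance(x: int, y: int) -> int:
    """Kademlia's XOR metric: the distance between two IDs
    is simply their bitwise XOR, treated as an integer."""
    return x ^ y

x = 0b010101  # 21
y = 0b110001  # 49
assert xor_distance(x, y) == 0b100100  # 36
# Symmetry: d(x, y) == d(y, x)
assert xor_distance(x, y) == xor_distance(y, x)
# A node is at distance zero from itself: d(x, x) == 0
assert xor_distance(x, x) == 0
```

Symmetry matters in practice: it means two nodes agree on how far apart they are, so contact information learned in either direction is equally useful.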
Each node in the network keeps contact information for only O(log n) other nodes, its contacts. This information comprises the nodeID, IP address and UDP port of each contact. We can think of the network as a binary tree of nodes organised according to their nodeIDs (see Figure 4.1). Any given node must have a contact in each subtree in which it itself is not contained (in Figure 4.1 the black node has a contact in each of the grey ovals).
Fig. 4.1: from [Kademlia 1]
In reality each node has multiple (typically 20) contacts in each subtree. For each subtree the contacts are stored in a bucket. Figure 4.2 shows an example where each bucket holds at most 2 contacts.
Fig. 4.2: from [Kademlia 2]
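Which bucket a contact belongs in follows directly from the XOR metric: the most significant bit in which two IDs differ identifies the subtree that separates them. A minimal sketch (using small IDs for readability; `bucket_index` is an illustrative name, and real implementations vary in how they number buckets):

```python
def bucket_index(self_id: int, other_id: int) -> int:
    """Index of the bucket (subtree) a contact falls into,
    given by the position of the highest bit in which the two
    IDs differ. Larger indices correspond to more distant
    subtrees; contacts sharing a long ID prefix with us land
    in low-numbered (nearby) buckets."""
    distance = self_id ^ other_id
    return distance.bit_length() - 1

# With 6-bit IDs, a node differing from ours in the top bit
# lies in the "far half" of the tree (the largest oval):
assert bucket_index(0b000000, 0b100000) == 5
# A node differing only in the lowest bit is our closest neighbour:
assert bucket_index(0b000000, 0b000001) == 0
```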
In order to find a particular key, the searching node first determines the subtree that will contain the required node (based on the leading bits of the key). It then asks all its contacts in that subtree to return their contacts that lie in the target subtree. The searching node "homes in" on the target by iteratively finding contacts closer to it. Each step halves the pool of candidate nodes, which is why a lookup completes in O(log n) steps.
The algorithm given above will find the node with ID closest to the key sought. By definition this is where the required key/value pair resides.
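The iterative "homing in" can be sketched as follows. This is a simplified, sequential model (the real protocol queries several contacts in parallel over UDP); `find_closest`, the toy `network` table and the parameter `k` are all illustrative names, not part of the specification:

```python
def find_closest(start_contacts, target, get_contacts, k=2):
    """Simplified Kademlia-style iterative lookup: repeatedly ask
    the closest not-yet-queried candidates for their contacts,
    keeping the candidate list sorted by XOR distance to the
    target, until no unqueried candidate remains."""
    def dist(n):
        return n ^ target
    candidates = sorted(start_contacts, key=dist)
    queried = set()
    while True:
        to_query = [n for n in candidates if n not in queried][:k]
        if not to_query:
            break
        for n in to_query:
            queried.add(n)
            candidates.extend(c for c in get_contacts(n)
                              if c not in candidates)
        candidates.sort(key=dist)
    return candidates[0]  # node with ID closest to the target

# Toy network of 4-bit nodeIDs: each node knows a few others.
network = {
    0b0001: [0b1000, 0b0100],
    0b0100: [0b0110, 0b0001],
    0b0110: [0b0111, 0b0100],
    0b0111: [0b0110],
    0b1000: [0b0001],
}
# Starting from node 0001, the lookup converges on 0111,
# the node closest (by XOR) to the target key 0111.
assert find_closest([0b0001], 0b0111, lambda n: network[n]) == 0b0111
```

Each round of queries discovers contacts with a longer shared prefix with the target, which is the mechanism behind the 50% reduction per step described above.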
Nodes Joining and Leaving
Whenever one node asks another for its contacts, the called node stores the contact information of the caller. This is the primary mechanism by which each node's routing table is kept up to date. The symmetric property of XOR (X ⊕ Y = Y ⊕ X) ensures that a node receives the right distribution of contacts, i.e. it will gather equal numbers of contacts for each oval in Figure 4.1. Given that each contact bucket has finite capacity, it becomes necessary to discard some contacts when recording new ones. Kademlia implements a "least recently seen" eviction policy, removing the contacts that have not been heard from for the longest period of time.
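A bucket with this policy can be sketched with an ordered map (a simplification: the paper actually pings the least recently seen contact first and only evicts it if it fails to respond; the class and method names here are illustrative):

```python
from collections import OrderedDict

class Bucket:
    """Fixed-capacity contact bucket with least-recently-seen
    eviction. Insertion order doubles as recency order."""
    def __init__(self, capacity=20):
        self.capacity = capacity
        self.contacts = OrderedDict()  # node_id -> (ip, port)

    def record(self, node_id, address):
        if node_id in self.contacts:
            self.contacts.move_to_end(node_id)  # freshly seen
        elif len(self.contacts) >= self.capacity:
            self.contacts.popitem(last=False)   # evict oldest
        self.contacts[node_id] = address

b = Bucket(capacity=2)
b.record(1, ("10.0.0.1", 4000))
b.record(2, ("10.0.0.2", 4000))
b.record(1, ("10.0.0.1", 4000))  # node 1 seen again: refreshed
b.record(3, ("10.0.0.3", 4000))  # bucket full: node 2 evicted
assert list(b.contacts) == [1, 3]
```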
When a node joins the network it takes some of the contacts of an arbitrary node and uses them as its own. It then does a search for itself. This results in other nodes being called, which makes them aware of the new node's existence. Because a new nodeID has been assigned, the new node may have become the closest node to certain keys. Upon becoming aware of this, the previous closest nodes will replicate the appropriate key/value pairs to the new node. In this manner all key/value pairs will be found at the expected nodes. Ignoring replication the cost of a node joining is only O(log n) messages.
When a node leaves the network no action is required. If a searching node tries to contact the departed node, the request will simply time out and the departed node will be removed from the searcher's routing table. Given that a searching node has multiple contacts that can be pursued in parallel, the impact on performance is minimal.
Kademlia is a very new peer-to-peer information system that is at present
largely untested. Although it is not suited to arbitrary file-swapping, its
architecture could provide a sound basis for a distributed file store. Its
O(log n) scalability is its most attractive property, and if it can be
modified to take account of different node and network types (as its inventors
suggest) it could become a major peer-to-peer architecture.
Conclusion
The vastly different approaches to data routing, resource discovery and changing topologies described above show that in many respects peer-to-peer is in its infancy. Each routing scheme has its advantages and disadvantages, which means that at present there is no perfect peer-to-peer technology. However, if the history of the Internet and its growth is anything to go by, the one attribute that a routing technology must possess above all others is scalability.
References
"The Gnutella Protocol Specification v0.4"
Sun Microsystems Inc, "Project JXTA: Technical Specification Version 1.0", April 25th 2001
Dejan S. Milojicic, Vana Kalogeraki, Rajan Lukose, Kiran Nagaraja, Jim Pruyne, Bruno Richard, Sami Rollins, Zhichen Xu, HP Laboratories Palo Alto, "Peer-to-Peer Computing", March 8th 2002
The Internet Society, "JXTA v1.0 Protocols Specification", Revision 1.4.2, October 10th 2002
Sun Microsystems Inc, "Project JXTA Technology", September 2002
Ben Y. Zhao, John Kubiatowicz, Anthony D. Joseph, "Tapestry: An Infrastructure for Fault-tolerant Wide-area Location and Routing", April 2001
C. Greg Plaxton, Rajmohan Rajaraman, Andrea W. Richa, "Accessing Nearby Copies of Replicated Objects in a Distributed Environment", in Proceedings of the 9th Annual ACM Symposium on Parallel Algorithms and Architectures (SPAA), pp. 311-320, 1997
[Kademlia 1] Petar Maymounkov, David Mazieres, "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric", 2002
[Kademlia 2] Petar Maymounkov, David Mazieres, "Kademlia: A Peer-to-Peer Information System Based on the XOR Metric" (slideshow presentation)