Peer-to-Peer Routing
Peer-to-Peer (P2P) systems can take many forms. Email, Internet Relay Chat and Napster are all examples of P2P systems. Routing on these networks is either centralised or statically configured and is therefore unproblematic. Another class of P2P network is the overlay network. Overlay networks build a virtual topology on top of the physical links of the network. Nodes join and leave this network dynamically, and the average uptime of individual nodes is relatively low, so the topology of an overlay network may change constantly. Once a route is established, there is no guarantee of how long it will remain valid. Routing in these networks is therefore very problematic and will be the focus of our report. Some of the issues facing designers of P2P routing algorithms are:
- Scalability: how the system performs as the number of nodes and/or the number of messages on the network grows.
- Complexity: the order of the number of steps required for a message to travel from one host to another in the worst case.
- Anonymity: not a requirement of most P2P networks, but if a network is to be designed to provide anonymity, the problem must be solved at the routing level.

We will look at a few examples of routing algorithms from each of these perspectives.
Gnutella was arguably the first mainstream overlay network to enjoy widespread use. The concept behind it is simple. To join the network, a client must know the address of at least one node already on the network. Once the client has a connection to this node it can broadcast a ping to discover the addresses of other nodes. Each node maintains connections to a number of other nodes, normally about five. To search the network for a resource, the client sends a "Query" message to each of the nodes it is connected to. These nodes forward the message in turn, and when a resource is found the result, i.e. the resource name and address, is propagated back along the path. The number of nodes that get queried can be somewhat controlled using a Time-To-Live (TTL) counter. This type of routing is the simplest possible for an overlay network; it is not, however, without its problems.
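As a rough sketch of this flooding scheme, with a dictionary standing in for real network connections and a seen-set standing in for Gnutella's duplicate-message-ID check (all names here are illustrative):

```python
from collections import deque

def flood_query(neighbours, holdings, origin, resource, ttl):
    """Flood a Query from `origin` with a Time-To-Live counter.

    `neighbours` maps each node to the nodes it is connected to;
    `holdings` maps a node to the set of resources it shares.
    Returns the set of nodes that reported a hit.
    """
    seen = {origin}            # duplicate suppression, like Gnutella's message IDs
    hits = set()
    queue = deque([(origin, ttl)])
    while queue:
        node, hops = queue.popleft()
        if resource in holdings.get(node, ()):
            hits.add(node)
        if hops == 0:
            continue           # TTL exhausted: stop forwarding on this branch
        for peer in neighbours.get(node, ()):
            if peer not in seen:
                seen.add(peer)
                queue.append((peer, hops - 1))
    return hits
```

Note how the TTL only limits the radius of the flood; every reachable node within that radius still receives the query, which is the root of the scaling problem discussed next.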
Gnutella-style routing, or flooding, works well for small to medium sized networks. It has been shown that the cost of searching on a Gnutella-style network increases superlinearly as the number of nodes increases. When node saturation occurs the network can become fragmented. Gnutella, therefore, while a simple solution, does not scale well. Searching in a Gnutella network is roughly of exponential complexity: each search takes on the order of d^n steps, where n is the time-to-live and d is the number of peers per node. Flooding is clearly not an optimal solution for routing, and it wasn't long before more efficient P2P routing algorithms emerged. We will discuss a number of these, including semantic routing and distributed hash tables. Gnutella wasn't designed to be anonymous, and discovering who is making specific resources available is a simple matter of performing a search. One network which was designed specifically to be anonymous is Freenet; its routing algorithm was designed completely with anonymity in mind. We will take a look at the issues that anonymity presents in terms of scalability and complexity.
Distributed Hash Tables
Distributed hash table (DHT) algorithms are useful for sharing files or other data across a peer-to-peer network. A hash function takes a variable-length string of bytes and deterministically maps it to a number. DHT algorithms work by hashing all file/data identifiers and storing their locations in a giant hash table, which is distributed across the participating nodes. One research group at MIT has developed a system called Chord, which is a good example of a DHT algorithm.
Chord works as follows. Assume we have a single hash function, 'hash', as described above. Every file or data item in the network has an identifier, which is hashed with this function to give a key for that resource. When a node needs a file, it hashes the file's name and sends a request using the resulting key.
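A minimal sketch of this key derivation. The Chord paper uses SHA-1; here the digest is truncated into a small identifier space for readability, and the space size M is an illustrative choice:

```python
import hashlib

M = 2 ** 16  # size of the identifier space (assumed small for illustration)

def chord_hash(identifier: str) -> int:
    # Hash an identifier (file name or IP address) into the ring [0, M).
    # The same function is used for both resources and nodes.
    digest = hashlib.sha1(identifier.encode()).digest()
    return int.from_bytes(digest, "big") % M
```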
All n nodes also use this function to hash their IP address, and conceptually the nodes form a ring in ascending order of their hashed IPs. Not every identifier in the ring corresponds to a real node, so each real node handles requests for all identifiers between its predecessor and itself, including requests addressed to itself. This gives us the notion of a successor: successor(x) is the next actual node in the logical ring at or after x. When a node wants to share a file or some data, it hashes the identifier to generate a key, and sends its IP and the file identifier to successor(key), where they are stored. In this way, all resources are indexed in a large hash table distributed across the participating nodes. If two or more nodes hold a given file or resource, their keys will be stored at the same node in the DHT, giving the requesting node a choice.
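The successor function can be sketched as a search in a sorted list of the hashed IDs of the live nodes (the list and names here are illustrative):

```python
from bisect import bisect_left

def successor(key, ring):
    # `ring` is the sorted list of hashed node identifiers actually
    # present. The successor of a key is the first real node at or
    # after it, wrapping around at the top of the identifier space.
    i = bisect_left(ring, key)
    return ring[i % len(ring)]
```

In the real system no node holds this full sorted list; the finger tables described next let each node find successor(key) with only partial knowledge of the ring.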
So when a node wants something, it hashes the identifier and sends a request to successor(key), which replies with the IP of the node holding the actual data. But how does a node contact successor(key) when it knows only the key, not the IP? Chord solves this with a structure known as a finger table. Each node's finger table lists keys and their successors' IPs, organised so that a node holds the IPs of an exponential sequence of nodes that follow it: entry i of node k's finger table holds the IP of successor(k + 2^i). Searching for a node is therefore a log(n) procedure. Each node knows the IP of the next real node in the ring. If the key lies between k and the next real node, then successor(key) is that next node. Otherwise the finger table is searched for the closest predecessor of the key, the request is forwarded to that node, and it repeats the same procedure until the responsible node is found. The request contains the IP of the requesting node, so when the search terminates the final node can reply directly, with no back-propagation required.
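The log(n) lookup described above can be sketched as follows. The ring of node IDs and the 6-bit identifier space are an illustrative small example, and the helper names are our own:

```python
M_BITS = 6                                      # 2**6 = 64 identifiers
RING = [1, 8, 14, 21, 32, 38, 42, 48, 51, 56]   # hashed IDs of the live nodes

def successor_of(x):
    # First live node at or after x, wrapping at the top of the ring.
    above = [n for n in RING if n >= x]
    return min(above) if above else min(RING)

# Entry i of node n's finger table is successor(n + 2**i).
FINGERS = {n: [successor_of((n + 2 ** i) % 2 ** M_BITS) for i in range(M_BITS)]
           for n in RING}

def strictly_between(x, a, b):
    # True if x lies in the open ring interval (a, b).
    if a < b:
        return a < x < b
    return x > a or x < b

def in_half_open(x, a, b):
    # True if x lies in the ring interval (a, b].
    if a < b:
        return a < x <= b
    return x > a or x <= b

def find_successor(start, key):
    # Hop from node to node via the finger tables until the key falls
    # between the current node and its immediate successor.
    node, hops = start, 0
    while True:
        succ = successor_of((node + 1) % 2 ** M_BITS)
        if in_half_open(key, node, succ):
            return succ, hops
        # Forward to the closest finger preceding the key, if any;
        # otherwise fall back to the immediate successor.
        nxt = next((f for f in reversed(FINGERS[node])
                    if strictly_between(f, node, key)), succ)
        node, hops = nxt, hops + 1
```

Because each finger roughly halves the remaining distance to the key, the hop count stays logarithmic in the number of nodes.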
When a node joins or leaves the network, a series of messages is sent to redistribute parts of the hash table. On joining, a node hashes its IP to form a number and asks any node already on the network to find successor(number). It then takes over the relevant part of its successor's hash table, and a message is sent to the predecessor to inform it of its new successor. When a node leaves the network, it hands its table to its successor and tells its predecessor of its departure. After both events some finger tables will hold stale values, so each node runs a periodic process which sends messages around the ring to discover new nodes.
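A sketch of the key hand-over on join, under the simplifying assumption that each node's slice of the hash table is a Python dict and messages are direct function calls:

```python
def ring_successor(x, ring):
    # First live node at or after x, wrapping at the top of the ring.
    above = [n for n in sorted(ring) if n >= x]
    return above[0] if above else min(ring)

def in_interval(k, a, b):
    # Half-open ring interval (a, b]: the slice node b is responsible for.
    if a < b:
        return a < k <= b
    return k > a or k <= b

def join(store, ring, new_id):
    # The joining node finds its successor, then takes over the keys in
    # (predecessor, new_id] that the successor was holding on its behalf.
    succ = ring_successor(new_id, ring)
    pred = max([n for n in ring if n < new_id], default=max(ring))
    ring.append(new_id)
    store[new_id] = {}
    for key in [k for k in store[succ] if in_interval(k, pred, new_id)]:
        store[new_id][key] = store[succ].pop(key)
```

Leaving is the mirror image: the departing node merges its dict into its successor's.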
Chord's main advantage is the guarantee that you will get a reply within log(n) steps; another significant advantage is the lack of redundant overhead. Both give it a huge edge over any flooding algorithm. In general, because DHT algorithms store their data references in an organised way, they will always beat flooding algorithms in this respect. Chord therefore scales well, and its search of order log(n) keeps its complexity relatively low.
[Figure: Node 1 finger table]
[Figure: Node 6 finger table]
Semantic Routing

Semantic routing is a method of routing which focuses more on the nature of the query being routed than on the network topology. Essentially, semantic routing improves on traditional routing by prioritising nodes which have previously been good at providing information about the kind of content referred to by the query.
To search for information on a P2P network semantically, the data needs a semantic description associated with it; one popular solution is to use RDF metadata for this purpose. Tagging documents and data with RDF would provide a rich 'semantic web' which could be structured in a P2P fashion, and a schema-based P2P network such as this would benefit greatly from semantic routing. Semantic routing differs fundamentally from other routing techniques in that prospective nodes are selected because of another node's confidence in their ability to respond correctly to a given query, irrespective of their position within the network.
Several projects have been started with a view to examining the power of semantic routing, for instance Neurogrid and the work of Nejdl, Wolpers, Siberski et al. and of Tempich, Staab and Wranik. Arguably the most interesting of these is the Remindin system, which incorporates a semantic routing algorithm developed with the intention of mimicking social networks. In addition, many of the techniques developed for collaborative filtering are equally applicable to the ranking of peers within a semantically routed network.
Each time a node answers a query, its peers adjust their confidence in that node with regard to that type of query. The nature of this adjustment depends on whether the node answered correctly, i.e. whether the search result was selected by the searcher. The information associated with 'types' depends greatly on the kind of semantic data handled by the network and the strictness of the peer-confidence ranking algorithm. How this data is stored, and how the semantic routing algorithm uses it, is beyond the scope of this article; however, it has been shown that automated relaxation of queries can lead to more robust searches.
For the routing algorithm to be effective in the long term, persistence is important: nodes must have a constant identifier within the network's namespace if they are to retain their confidence ratings. There is an initial forming stage in which no peer has ratings for any node, so nodes might be chosen randomly (or indeed using traditional routing methods). It has been observed that a certain proportion of random responses to requests, even after the forming stage, can avoid an effect called 'overfitting', in which the confidence data associated with the nodes becomes too rigid and inflexible.
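One simple way to implement the confidence adjustment and random exploration described above is an exponential moving average combined with epsilon-greedy peer selection. This is an illustrative sketch, not the Remindin algorithm; ALPHA and EPSILON are assumed tuning parameters:

```python
import random

ALPHA = 0.3      # learning rate for confidence updates (assumed)
EPSILON = 0.1    # fraction of random forwards, to avoid overfitting (assumed)

confidence = {}  # (peer_id, query_type) -> rating in [0, 1]

def pick_peer(peers, query_type, rng=random.random, choice=random.choice):
    # With probability EPSILON route to a random peer; otherwise to the
    # peer with the highest confidence for this type of query.
    if rng() < EPSILON:
        return choice(peers)
    return max(peers, key=lambda p: confidence.get((p, query_type), 0.5))

def update(peer, query_type, answered_correctly):
    # Exponential moving average: reinforce peers whose results the
    # searcher actually selected, decay those whose results were ignored.
    old = confidence.get((peer, query_type), 0.5)
    target = 1.0 if answered_correctly else 0.0
    confidence[(peer, query_type)] = old + ALPHA * (target - old)
```

The unrated default of 0.5 gives new nodes a fair chance during the forming stage, while EPSILON keeps the ratings from becoming too rigid later on.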
Semantic routing is a reasonably new idea and as such is in its infancy as a technique. No universal consensus on a specific algorithmic approach has been reached, and there is not yet a paradigm under which research into P2P semantic routing is conducted. The advent of semantic networking is making this an important topic, and one that is sure to grow and flourish in future.
Freenet

Freenet is a peer-to-peer networking system with very specific goals, both political (anonymity and freedom of speech) and technical (decentralisation of all network functions, and efficient dynamic storage and routing of information).
Each node on the network acts as a data store, and has no say in what data it contains: any other node may read from or write to it. This enables the network to act as a distributed file system. The data on each node is encrypted, providing further anonymity.
Files on the system are referenced and located using hash keys. To search for a file, a user must know its hash key, or at least how to calculate it. Once this is known, the user sends a request to their own Freenet node. Each request has a hops-to-live value and a randomly generated ID, so that it can be recognised and rejected by nodes which have seen it before. When a node receives a request, it first checks its own data store to see if it contains the relevant file. If it does, it returns the file along with a message identifying itself as the source of the data. If not, it decrements the hops-to-live value and forwards the request to one of the neighbours in its routing table. If this request fails, it asks another neighbour from the routing table. If none of its neighbours return a positive response, it returns a failure message itself; a failure message is also returned when the hops-to-live value is exhausted. When a node receives a positive response, it passes the data back to the node it received the request from, and caches a copy in its own data store. If the data store is already full, the least recently used files are deleted to make room for the new one. In this way, more popular data is cached on multiple nodes throughout the network, and data that no one requests is eventually removed entirely.
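The request-handling steps above can be sketched as follows, with a seen-set standing in for the random message ID, recursion standing in for network messages, and an assumed per-node store capacity:

```python
from collections import OrderedDict

STORE_CAPACITY = 3   # files per node (assumed, for illustration)

class Node:
    def __init__(self, name):
        self.name = name
        self.neighbours = []        # ordered by routing preference
        self.store = OrderedDict()  # key -> data, kept in LRU order

    def request(self, key, htl, seen=None):
        # Depth-first search: try the local store, then each neighbour
        # in turn, decrementing hops-to-live; cache any data returned.
        seen = seen if seen is not None else set()
        if self.name in seen:
            return None                      # already-seen request: reject
        seen.add(self.name)
        if key in self.store:
            self.store.move_to_end(key)      # mark as recently used
            return self.store[key]
        if htl <= 0:
            return None                      # hops-to-live exhausted
        for n in self.neighbours:
            data = n.request(key, htl - 1, seen)
            if data is not None:
                self.store[key] = data       # cache along the return path
                if len(self.store) > STORE_CAPACITY:
                    self.store.popitem(last=False)  # evict least recently used
                return data
        return None                          # all neighbours failed
```

Note how caching on the return path is what makes popular data replicate towards its requesters, while the LRU eviction slowly forgets unrequested data.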
Inserting Data into the Network:
To insert a file into the network, a request is first made using the file's hash key. If the file is found by another node before the hops-to-live expires, the file is returned, so there is no need to re-insert the data into the network. If the request fails, however, the user can then send the data onto the network. The data follows the same path as the initial request and is copied by all nodes along the way. Each node can claim that it, or any other node, is the source of the data, thus providing anonymity for the original source.
Choosing the next neighbour to request content from is an important part of Freenet's behaviour, as it is this that allows it to adapt to usage patterns and network changes.
Each node has a routing table containing the addresses of its neighbours, along with their performance in returning particular keys. This performance rating was initially just an indication of how successful the node had been in retrieving the key, but it has since been extended to include response time and transfer time. When a node receives a request for a particular key, it searches its table for the node which was most successful in returning a similar key, and requests the file from that node. It is important to note that it is the keys which are compared for closeness, not the files, so no semantic similarity is implied.
When a node receives a successful response from another node, it creates a new entry in its table associating the source node with the requested key. An emergent property of this is that nodes start to specialise in particular keys. When the network is first set up there are no performance records for any nodes, so requests are sent to a node chosen essentially at random. If it successfully returns the requested file, an entry is created for that node with the relevant key. Requests for similar keys are then routed towards the successful node, and as a result it eventually starts to specialise in those keys.
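A sketch of this key-closeness selection, under the assumption that closeness between hashed keys is measured as plain numeric distance (Freenet's actual metric differs in detail, and real ratings also weigh response and transfer times):

```python
def closest_neighbour(routing_table, key):
    """Pick the neighbour whose best-known key is closest to `key`.

    `routing_table` maps each neighbour to the list of keys it has
    successfully returned. The comparison is between keys, not files,
    so no semantic similarity is implied.
    """
    def distance(neighbour):
        return min(abs(k - key) for k in routing_table[neighbour])
    return min(routing_table, key=distance)
```

Because each success adds the responding node under the returned key, nodes that answer well for a region of the key space attract ever more requests for that region, which is exactly the specialisation effect described above.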
Advantages:
- Popular data is replicated throughout the system.
- Data is distributed anonymously and freely - the main goals of the system.
- The routing algorithm is designed to adapt and improve efficiency over time.
Disadvantages:
- People are hesitant to donate part of their hard drive to the system, particularly when it could be used to store information they don't approve of.
- The network is not easily searchable: users need to know the key to find a file.
- The network is slow, as the data must pass through all intermediate nodes. Searches can be particularly slow because they don't use multicasting.
References

Gnutella
- Gnutella Protocol Description
- The cost of peer discovery and searching in the Gnutella peer-to-peer file sharing protocol
- Wikipedia article on Gnutella
Semantic Routing
- W3C RDF page
- Super-peer-based routing and clustering strategies for RDF-based peer-to-peer networks
- Resources on Collaborative Filtering