Peer-to-Peer Technologies and Protocols
This article discusses the peer-to-peer technologies in common use today and takes a look at possible future applications in this area. The architectures and protocols are examined in enough detail to give a basic understanding of their pros and cons and to highlight the difficulties that must be overcome for future generations of peer-to-peer.
Peer-to-peer, although not a new idea, has been catapulted into the news in recent months thanks primarily to Napster and, to a lesser extent, Seti@Home and GNUtella. Intel are pushing it as part of their marketing campaign for the Pentium IV, and O'Reilly are holding summits on it and publishing $500 books on it.
However, peer-to-peer is far from a new technology. The servers in many old technologies cooperate in a peer-to-peer manner to exchange required information. News, Email and IRC all fall into this category. In fact, IRC takes it a step further, with clients on the network being able to connect to each other directly to exchange resources.
The term "Peer-to-peer" has no strict and formal definition and a lot of disagreement exists over its exact meaning, but as a rule of thumb it can be thought of as any system that allows the utilisation of "the dark matter of the Internet" - home PCs. Shirky defines a "litmus test" for peer-to-peer as follows:
Peer-to-peer technologies can be categorised along a range from completely centralised to completely decentralised. In one corner we have Napster and Seti@Home, which are centralised, and in the other Freenet and GNUtella, which are decentralised. Napster incorporates a centralised indexing server, which keeps track of which files are on which clients and to which clients send their searches when locating resources. Freenet and GNUtella, on the other hand, distribute the database of shared resources across the clients on the network, removing any need for a central server.
The protocols and topologies of the centralised peer-to-peer technologies aren't remarkably exciting or complex. They operate on the client-server design with file transfers (in Napster's case) occurring on the client-client level. As a result there isn't very much of interest to discuss.
At the other extreme are the completely distributed architectures, which have very interesting and often quite complex topologies and protocols, and which will be discussed in much greater detail.
This section discusses topologies that fall into the category of centralised peer-to-peer, such as those of Napster and Seti@Home. There are many more applications with similar topologies, but these two are the best known.
The program "Napster" came into being in January 1999 when Shawn Fanning, a freshman at Northeastern University, wrote an application to allow music sharing between people in his dormitory. Napster Inc. was founded in May of that year and scaled up to a massive 21 million users. In December 1999 the record industry sued Napster for copyright infringement and currently, although Napster is still in operation, it has only a tiny fraction of its once huge user base.
Napster is based on a client-server architecture. The role of the server is to hold a searchable index of the mp3s held by all the currently connected clients. The server is actually a pool of very high-specification machines that load balance the requests from clients. This makes scaling the service simply a matter of adding machines to the server pool, and it provides redundancy, since servers can fail and be replaced without significant disruption to the service. Redundancy is also needed for the connection between client and server, so the servers are placed on multiple connections to different large ISPs.
The clients can index the shared mp3s on their own machine and associate meta-data with them. This information is sent to the Napster servers when the client connects. From that point the client may search all clients connected to Napster by sending search queries to the Napster server. The server searches its internal indexes of currently shared files and returns the matching results. The results contain the meta-data about each file, the location of the file and the speed of the clients that are sharing it. If the client wishes to download one of the files in the search results, it connects directly to the client sharing the file and begins the download. The file itself never passes through, and is never stored on, the Napster server. This is the peer-to-peer aspect of the protocol.
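The division of labour described above can be illustrated with a minimal sketch. It is not Napster's actual protocol: the class names, the host strings and the speed field are hypothetical, and the index is a plain in-memory structure. What it shows is that the server only ever holds an index, while the download itself would go directly to the host named in a search result.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedFile:
    title: str       # meta-data the client extracted from the mp3
    host: str        # address of the client sharing the file (hypothetical format)
    speed_kbps: int  # link speed the client reported


class IndexServer:
    """Holds only an index; file contents never pass through it."""

    def __init__(self):
        self._by_host = defaultdict(list)

    def connect(self, host, speed_kbps, titles):
        # A client uploads its list of shared files when it connects.
        self._by_host[host] = [SharedFile(t, host, speed_kbps) for t in titles]

    def disconnect(self, host):
        # Only currently connected clients are searchable.
        self._by_host.pop(host, None)

    def search(self, keyword):
        keyword = keyword.lower()
        return [f for files in self._by_host.values() for f in files
                if keyword in f.title.lower()]


server = IndexServer()
server.connect("10.0.0.5:6699", 56, ["Blue Monday.mp3", "Temptation.mp3"])
server.connect("10.0.0.9:6699", 512, ["Blue Monday (live).mp3"])

hits = server.search("blue monday")
# The requester would now connect directly to one of the hosts in `hits`.
best = max(hits, key=lambda f: f.speed_kbps)
```

Because the index is rebuilt from what clients report on connection, a client that disconnects simply disappears from later search results.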
Example of a Napster query and download.
Seti@Home is part of the Search for Extraterrestrial Intelligence and has over two million computers downloading and crunching data gathered from the Arecibo radio telescope in Puerto Rico. The project cost US$500,000 for Berkeley to set up but produces over 15 teraflops of processing throughput. As a comparison, ASCI White, the world's most powerful supercomputer, produces 12 teraflops of processing throughput at a cost of US$110 million. The SETI@Home project is widely regarded as the fastest computer in the world. In fact, the project has already performed the single largest cumulative computation to date.
From the architecture point of view Seti@Home is based upon client-server. The centralised servers hold enormous amounts of data gathered from the Arecibo radio telescope "listening" to the skies. That data needs to be analysed for distinct or unusual radio waves that might suggest extraterrestrial communications. The servers chunk the data up into small packets, which the clients download and analyse; the results are then uploaded and verified. The servers must have huge bandwidth available to them and enough processing power to verify that the data being returned hasn't been faked by the clients.
The clients need very little functionality. They have the ability to do calculations on the data provided and are able to communicate with the central servers. The clients run the calculations continuously with a low priority and communicate with the server only to return results and to ask for new data.
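The chunk, analyse and verify cycle can be sketched roughly as follows. The analysis function is a stand-in, and the verification rule (hand the same chunk to several clients and accept only a majority answer) is an assumption made for illustration; the article does not say how Seti@Home actually detects faked results.

```python
def split_into_work_units(samples, unit_size):
    """Server side: chunk the recorded data into small work units."""
    return [samples[i:i + unit_size] for i in range(0, len(samples), unit_size)]


def analyse(unit):
    """Client side: stand-in analysis that reports the strongest sample."""
    return max(unit)


def verify(results):
    """Server side (assumed rule): accept a result only if a majority agree."""
    best = max(set(results), key=results.count)
    return best if results.count(best) > len(results) / 2 else None


samples = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3]
units = split_into_work_units(samples, 4)

honest = analyse(units[1])
# Two honest clients and one client returning a faked value.
accepted = verify([honest, honest, 42])
```

Redundant assignment of this kind trades client throughput for trust, which fits the observation that the servers need processing power of their own to check what comes back.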
This behaviour doesn't seem to be peer-to-peer at all though. It seems just like the old asymmetric client-server architecture in which clients download data, crunch away and then upload the results back onto the central server. There are no client-to-client, or peer-to-peer, connections at all. However, if you look at what's happening you will realise that the clients aren't just dumb browsers. They are taking an active role in the functioning of the network. In fact, without them, the network wouldn't work at all. It's what Clay Shirky called utilising "the dark matter of the Internet". "There's this vast ocean of untapped computation and storage power at the edge of the Net that we're only now just now beginning to integrate with the fabric of the Internet as we know it today."
In June 1999 Ian Clarke, an undergraduate at Edinburgh University, Scotland, completed his final year project, called Freenet, and made it available to the world over the net. The project described a distributed, decentralised peer-to-peer resource-sharing network that focussed on anonymity and freedom from any type of control. Freenet currently has a relatively small user base, primarily because it's quite difficult to find the resources you are looking for. However, Orwant has performed research indicating that, among Freenet users, 60% of the material being shared is illegal and 50% of it infringes on copyright.
Freenet's architecture is completely decentralised and distributed, meaning that there are no central servers and that all computations and interactions happen between clients. On Freenet, all connections to the network are equal. Clients connecting to Freenet connect randomly to any clients available making an unorganised scattered topology.
Communications on Freenet occur by sending a request to a client you are connected to, who in turn sends it on to another client they are connected to, and so on. When a client receives a packet from another client, it doesn't know whether the packet originated from the client who sent it or whether it originated elsewhere, which lends itself to anonymity on Freenet. Freenet lets users insert resources into the network and search for and retrieve resources.
To insert a resource into the network, the resource is given a descriptive title that is then hashed using SHA-1 to generate a unique key to identify the resource. The resource is then associated with that key and stored locally. A request for insertion into the network might also be made, in which the resource with its key attached is sent to other nodes to store. The "direction" in which this packet will travel is based upon the "nearness" of its key to the keys that other nodes are storing. This means that data with keys that are close to one another reside close to each other in the network. This allows intelligent routing when passing packets from one node to the next, as each node knows which is the best way to send the packet.
Searching and retrieval are one and the same. To search for a resource on the network you must first know its title. You then hash that title using SHA-1 and send a request for the resulting key to the most likely place to have the resource, based, once again, on key closeness. A depth-first search is then performed for nodes that contain the required key. The search backtracks if the hop count reaches zero or if a node is seen twice during the search. When the required resource is found the search terminates, and the client with the resource begins to send it back along the search route to the client who requested it. All clients along the way cache the passing data. This aids the replication of popular resources: frequently requested data is cached and dispersed widely around the network, increasing redundancy and reducing access times. It also supports the anonymous side of Freenet by providing anonymity for both publisher and consumer of the resource, because no client involved in the retrieval knows whether the resource it is downloading and passing on is coming from the original publisher and going to the original requester, or whether it's just coming from or going to some other link in the chain.
One important thing about Freenet is that file sharing in a Napster-like fashion is only one of its possible uses. Ian Clarke originally envisioned it more as a means of anonymously publishing and retrieving political discussions and other sensitive or potentially damning resources. Freenet also has the potential to create an anonymous and secure network in which copyrighted and illegal information can be traded with no fear of reprisal, and if it can scale to any large degree the implications could be frightening.
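As a rough illustration of key generation and closeness-based forwarding, the sketch below hashes a title with SHA-1 and picks the neighbour whose known keys are nearest. The closeness metric (absolute difference of the keys read as integers) and the shape of the neighbour table are assumptions for the example; the article does not define how Freenet measures "nearness".

```python
import hashlib


def key_for(title):
    """Hash a descriptive title with SHA-1 to obtain the resource key."""
    return int.from_bytes(hashlib.sha1(title.encode("utf-8")).digest(), "big")


def distance(a, b):
    # Assumed closeness metric: absolute numeric difference between keys.
    return abs(a - b)


def next_hop(key, neighbour_keys):
    """Pick the neighbour that stores the key closest to `key`.

    `neighbour_keys` maps a neighbour name to the keys it is known to store.
    """
    return min(neighbour_keys,
               key=lambda n: min(distance(key, k) for k in neighbour_keys[n]))


k = key_for("freedom of speech essay")
# Hypothetical neighbour table: node-a holds a key very near to k.
neighbours = {"node-a": [k + 10, k - 5000], "node-b": [k + 900]}
chosen = next_hop(k, neighbours)
```

The same title always hashes to the same key, which is why a requester who knows only the title can compute where to start looking.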
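The search described above can be sketched as a recursive depth-first walk. This is a simplified model under stated assumptions: keys are small integers rather than SHA-1 hashes, closeness is absolute difference, and the set of visited nodes is passed along with the request, where a real node would have to recognise repeated requests itself. The `Node` class and `request` function are hypothetical names.

```python
class Node:
    def __init__(self, name):
        self.name = name
        self.store = {}        # key -> data held (or cached) locally
        self.neighbours = []

    def closest_key_distance(self, key):
        # Assumed closeness metric: absolute difference of integer keys.
        return min((abs(key - k) for k in self.store), default=float("inf"))


def request(node, key, hops, seen=None):
    """Depth-first search; returns the data or None if this branch fails."""
    seen = set() if seen is None else seen
    if node.name in seen or hops == 0:
        return None                      # backtrack: loop or hop count spent
    seen.add(node.name)
    if key in node.store:
        return node.store[key]
    # Try neighbours in order of key closeness, backtracking on failure.
    for nxt in sorted(node.neighbours, key=lambda n: n.closest_key_distance(key)):
        data = request(nxt, key, hops - 1, seen)
        if data is not None:
            node.store[key] = data       # cache on the way back
            return data
    return None


a, b, c, d = (Node(n) for n in "abcd")
a.neighbours = [b, c]
b.neighbours = [a]            # dead end that loops back to the requester
c.neighbours = [d]
b.store[90] = "other"
d.store[100] = "essay"

found = request(a, 100, hops=5)
```

Here node b looks most promising because it holds a nearby key, but it is a dead end, so the search backtracks to c and finds the data at d. On the way back c and a keep a copy, which is the caching behaviour the text describes.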
On the 14th of March 2000 Nullsoft, a subsidiary of America Online, released a file sharing application called GNUtella that allowed file swapping without the need of a central indexing server and therefore no central point of failure, and no central point to sue for copyright infringements. On April 10th America Online declared GNUtella to be a rogue project and terminated it, but not before the program had been downloaded and replicated by thousands of users around the net. Over the next few weeks the protocol was reverse engineered and GNUtella clones began to appear.
GNUtella's architecture is similar to Freenet's in that it is completely decentralised and distributed, meaning that there are no central servers and that all computations and interactions happen between clients. All connections on the network are equal. When a client wishes to connect to the network, it runs through a list of nodes that are most likely to be up, or takes a list from a website, and then connects to however many nodes it wants. This produces a random, unstructured network topology.
Routing in the network is accomplished through broadcasting. When a search request arrives at a client, that client searches itself for the file and broadcasts the request to all its other connections. Broadcasts are cut off by a time to live that specifies how many hops they may cover before clients should drop them rather than broadcast them. This packet routing technique provides a small degree of anonymity on GNUtella networks: a client that receives a packet doesn't know whether the client it received the packet from is the original sender or just another link in the chain. This is somewhat undermined, however, by the fact that nearly all packets on the network start with a TTL (time to live) of 7, so if you receive a packet with a TTL of 7 you can be nearly certain that it originated from your immediate upstream neighbour. The GNUtella network itself only supports searching for files. All other operations, such as uploads and downloads, occur outside of the network and will be explained later.
Searching on GNUtella is accomplished by creating a keyword string that describes the file you want and broadcasting that string to all your connected neighbours. Your neighbours will then in turn broadcast that message to all their connected neighbours and so on until the packet's TTL has been reached.
Example of a GNUtella broadcast search. Illustrates searching, replying, packet meeting TTL and redundant loops.
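A broadcast search of this kind can be simulated in a few lines. The sketch below is a toy model, not the GNUtella wire protocol: it processes the flood hop by hop with a queue, and a single shared `seen` set stands in for each peer's own record of queries it has already handled. The `Peer`, `link` and `flood` names are hypothetical.

```python
from collections import deque


class Peer:
    def __init__(self, name, files=()):
        self.name = name
        self.files = set(files)
        self.neighbours = []


def link(x, y):
    x.neighbours.append(y)
    y.neighbours.append(x)


def flood(origin, keyword, ttl):
    """Broadcast a keyword query; returns (peer name, file name) hits."""
    hits = []
    seen = {origin.name}                 # stands in for per-peer duplicate checks
    queue = deque((n, ttl) for n in origin.neighbours)
    while queue:
        peer, ttl_left = queue.popleft()
        if peer.name in seen:
            continue                     # redundant loop: drop the duplicate
        seen.add(peer.name)
        hits.extend((peer.name, f) for f in sorted(peer.files) if keyword in f)
        if ttl_left > 1:                 # otherwise the TTL is used up
            queue.extend((n, ttl_left - 1) for n in peer.neighbours)
    return hits


a, b = Peer("a"), Peer("b")
c = Peer("c", ["song.mp3"])
d = Peer("d", ["song-remix.mp3"])
link(a, b); link(b, c); link(c, d); link(a, c)   # a-b-c forms a redundant loop

one_hop = flood(a, "song", ttl=1)
two_hops = flood(a, "song", ttl=2)
```

With a TTL of 1 the query stops at a's direct neighbours and never reaches d; with a TTL of 2 it does, while the duplicate copies travelling around the a-b-c loop are discarded.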
As a way of reaching a lot of nodes with a search, this method can be quite effective. With a TTL of 7 on the packet and an average of 8 neighbours per client you can in theory reach 1,000,000 clients. In real-world situations this is not the case: GNUtella does not scale up to numbers of that magnitude. Although on a good day there may be up to 40,000 clients on GNUtella, a search can only ever reach 2,000-4,000 of them. Moreover, if GNUtella were to scale up to a size similar to Napster's, the network would have to move 2.4 gigabytes per second on a slow day and 8 gigabytes per second on a heavy day.
On a query match, a client creates a packet that contains information on how to locate it and the file (a URL). To route the replies, each client sends the query reply back along the path that the query came by. Eventually, after an undefined length of time, all results will arrive back at the client that originally sent the query out. At this point the client can decide which, if any, of the files it wants to download.
To download a file, the client creates a direct connection to the client with the file it wants and sends an HTTP request for the file. The client with the file interprets this and sends a standard HTTP response. However, this direct connection removes any anonymity from the system, as there is no way to anonymously publish or consume resources.
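One way to arrive at a figure close to the quoted 1,000,000 is to treat the network as an idealised tree in which no two paths ever lead to the same client. That model is an assumption made here for the arithmetic, and the function name is hypothetical.

```python
def theoretical_reach(neighbours, ttl):
    """Clients reachable if every client has `neighbours` links and no paths overlap."""
    # Hop 1 reaches `neighbours` clients. At each later hop every client
    # forwards to (neighbours - 1) others, since one link is the way in.
    return sum(neighbours * (neighbours - 1) ** (hop - 1)
               for hop in range(1, ttl + 1))


reach = theoretical_reach(8, 7)   # 1098056
```

Real topologies contain loops and clients with fewer connections, so the number of distinct clients actually reached is far lower, which is consistent with the 2,000-4,000 figure given above.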
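Reverse-path routing of replies can be sketched by having each client remember which neighbour first delivered a given query. The class, the method names and the example URL are hypothetical; only the idea of sending replies back along the path the query took comes from the text.

```python
class Peer:
    def __init__(self, name):
        self.name = name
        self.came_from = {}    # query id -> neighbour that delivered the query
        self.inbox = []        # replies to queries this peer issued itself

    def issue(self, query_id):
        self.came_from[query_id] = None          # None marks "we are the origin"

    def on_query(self, query_id, sender):
        # Remember only the first upstream neighbour for each query.
        self.came_from.setdefault(query_id, sender)

    def on_reply(self, query_id, reply):
        if query_id not in self.came_from:
            return False                         # path unknown: reply is lost
        upstream = self.came_from[query_id]
        if upstream is None:
            self.inbox.append(reply)             # back at the originator
            return True
        return upstream.on_reply(query_id, reply)


a, b, c = Peer("a"), Peer("b"), Peer("c")
a.issue("q1")
b.on_query("q1", sender=a)
c.on_query("q1", sender=b)

delivered = c.on_reply("q1", "http://c.example/song.mp3")
```

The reply depends on every intermediate client still being present when it travels back; if one has left, the reply has nowhere to go.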
Flash presentation showing query and download on GNUtella.
FastTrack is a recent arrival to the peer-to-peer scene and with its coming it brings a new, more scalable, architecture that still follows a decentralised design. The FastTrack protocol is currently used by two file sharing applications, KaZaA and Morpheus. The KaZaA application has had upwards of 20 million downloads and KaZaA can have anywhere up to 800,000 users connected at one time.
The FastTrack architecture follows a 2-tier system in which the first tier consists of fast connections to the network (Cable/DSL and up) and the second tier consists of slower connections to the network (modem and slower). Clients on the first tier are known as SuperNodes and clients on the second tier are known as Nodes. Upon connection to the network, the client decides whether it is suitable to become a SuperNode. If it can become a SuperNode, it connects to other SuperNodes and starts taking connections from ordinary Nodes. If it becomes a Node, it finds a SuperNode that will accept it and connects to that SuperNode. This produces a two-tier topology in which the clients at the centre of the network are faster and therefore form a more reliable and stable backbone. This allows more messages to be routed than if the backbone were slower, and therefore allows greater scalability.
Routing on FastTrack is accomplished by broadcasting between the SuperNodes. For example, when a Node issues a search request to the SuperNode it is connected to, that SuperNode takes the request and broadcasts it to all the SuperNodes it is in turn connected to. The search continues in this way until its TTL has reached zero. Every SuperNode that the request reaches searches an index that contains all the files of its connected Nodes. This means that, with a TTL of 7 and an average of 10 Nodes per SuperNode, a search request will search 11 times more clients on a FastTrack network than on GNUtella. Unfortunately, since the searches are being broadcast, the network will still produce enormous amounts of data that needs to be passed from SuperNode to SuperNode. However, since the SuperNodes are guaranteed to be reasonably fast, this is not as large a problem as on GNUtella.
Routing of replies follows the same lines as on GNUtella. Replies are routed back along the path that they came from until they reach the clients that originally issued the queries. A large problem with this type of routing in GNUtella was that the clients making up its backbone were very transient, connecting to and disconnecting from the network sporadically, which meant that packets being routed back along the path they came by could find the path gone because a link in the chain had disconnected. This problem occurs less on FastTrack, as the clients making up the backbone are guaranteed to be faster and more stable, and therefore the return paths should be more reliable.
Downloading on FastTrack is the same as on GNUtella. Once the location of the file has been found, the client that wants the file connects to the client that hosts the file and sends an HTTP request for it. The client hosting the file interprets the request and sends an HTTP response. The HTTP headers used in FastTrack have been modified to accommodate extra information such as meta-data, but standard HTTP/1.1 headers are supported, which means that files can be downloaded from KaZaA clients through a web browser such as Internet Explorer or Mozilla.
Unfortunately, although the FastTrack topology is a decentralised one, KaZaA's implementation requires that all clients register with a central server before being allowed to connect to the network, which invalidates the advantages of having a decentralised topology. KaZaA are currently undergoing legal proceedings with the RIAA (Recording Industry Association of America), and if they lose and are shut down the KaZaA network will cease to function. Removing the decentralised aspect of FastTrack's design has introduced a single point of failure to the network.
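The two-tier search can be sketched as a broadcast that only ever travels between SuperNodes, each of which answers on behalf of its attached Nodes. FastTrack's protocol is not described in detail in this article, so the class, the index layout and the TTL handling below are assumptions for illustration.

```python
from collections import deque


class SuperNode:
    def __init__(self, name):
        self.name = name
        self.index = []        # (file name, ordinary Node sharing it)
        self.peers = []        # other SuperNodes this one is connected to

    def attach(self, node_name, files):
        # An ordinary Node reports its shared files to its SuperNode.
        self.index.extend((f, node_name) for f in files)


def search(start, keyword, ttl):
    """Broadcast between SuperNodes only; ordinary Nodes never see the query."""
    hits, seen = [], set()
    queue = deque([(start, ttl)])
    while queue:
        sn, ttl_left = queue.popleft()
        if sn.name in seen:
            continue
        seen.add(sn.name)
        hits.extend((node, f) for f, node in sn.index if keyword in f)
        if ttl_left > 0:
            queue.extend((p, ttl_left - 1) for p in sn.peers)
    return hits


s1, s2, s3 = SuperNode("s1"), SuperNode("s2"), SuperNode("s3")
s1.peers, s2.peers, s3.peers = [s2], [s1, s3], [s2]
s1.attach("modem-1", ["talk.mp3"])
s3.attach("modem-7", ["song.mp3", "notes.txt"])

near = search(s1, "song", ttl=1)
far = search(s1, "song", ttl=2)
```

Each SuperNode visited covers itself plus its attached Nodes, so with 10 Nodes per SuperNode every hop accounts for 1 + 10 = 11 clients, which is where the factor of 11 in the text comes from.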
For there to be a future in large scale peer-to-peer applications many issues must first be addressed.
The primary problem with all current implementations of P2P applications is their lack of scalability. The next generations of peer-to-peer applications must be able to scale to sizes such that every person on the Internet can be running a P2P client without having to worry about issues such as network fragmentation and loss of search replies.
The second major issue to be addressed is trust in the network. If P2P is to become a viable application for the commons it must be possible to trust the transactions that occur. One must be able to trust that the resource being acquired is what one thinks it is and, when sharing resources, one must have the ability, if desired, to share with specific groups of trusted clients and be sure that no one is faking that trust. For P2P to be widely adopted it also has to allow, in a trusted environment, the ability to pay for resources and to accept payments for resources, which no current architecture allows.
Finally, for a peer-to-peer network application to become, as Bob Cringely calls it, "the killer app" it must be robust. Currently on networks such as Napster and GNUtella downloads of files can quite commonly be interrupted or cancelled entirely by clients becoming unresponsive or logging off the network and dropping all their open connections. This can make downloading of resources very difficult, and for P2P applications to be widely accepted this must first be addressed. Currently KaZaA is one of the few applications that address this issue by attempting to stream downloads from multiple sources and automatically resuming downloads on failure. This technology however is in its infancy and needs to be developed further before it can be considered a successful solution to the problem.
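The idea of drawing a download from several sources and carrying on when one of them fails can be sketched as below. This is an illustrative model only: the article does not describe how KaZaA implements it, and the chunking scheme, the `download` function and the callable "sources" are assumptions.

```python
def download(sources, total_chunks):
    """Fetch chunks in turn from several sources, dropping any that fail.

    `sources` maps a source name to a function chunk_index -> bytes that
    raises ConnectionError once that peer has gone away.
    """
    chunks = {}
    alive = list(sources)
    for i in range(total_chunks):
        while i not in chunks:
            if not alive:
                raise ConnectionError("no sources left; resume later from chunk %d" % i)
            name = alive[i % len(alive)]
            try:
                chunks[i] = sources[name](i)
            except ConnectionError:
                alive.remove(name)       # drop the dead peer, retry elsewhere
    return b"".join(chunks[i] for i in range(total_chunks))


data = b"abcdefgh"


def steady(i):
    return data[2 * i:2 * i + 2]


def flaky(i):
    if i >= 1:
        raise ConnectionError("peer logged off")
    return data[2 * i:2 * i + 2]


result = download({"steady": steady, "flaky": flaky}, total_chunks=4)
```

Because each chunk is requested independently, losing one source costs only a retry of the chunk in flight rather than the whole file.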
 Shirky, C. 2001. "Listening to Napster". In Oram, A. (ed.), "Peer-to-Peer: Harnessing the Benefits of a Disruptive Technology", O'Reilly and Associates, Inc., Sebastopol, California.
 Sean Blanchfield, 2001. "An Anonymous and Scaleable Distributed Peer-to-Peer System". Unpublished undergraduate thesis, Trinity College Dublin
 "Advogato", 2000. "Advogato's Number: Peer to Peer", available online at: http://www.advogato.org/article/180.html
 Seti@Home, Frequently asked questions about Seti@Home, available online at: http://setiathome.ssl.berkeley.edu/faq.html
 Accelerated Strategic Computing Initiative (ASCI), ASCI White News, available online at: http://www.llnl.gov/asci/news/white_news.html
 Network World Fusion, "A peer-to-peer revolution", Mark Eggleston, available online at: http://www.nwfusion.com/newsletters/techexec/2000/0828techexec1.html
 Clay Shirky, "O'Reilly Network: Peer-to-Peer Makes the Internet Interesting Again", Andy Oram, available online at: http://linux.oreillynet.com/lpt/a//linux/2000/09/22/p2psummit.html
 Jon Orwant, "OpenP2P.com: What's on Freenet", available online at: http://www.openp2p.com/pub/a/P2P/2000/11/21/freenetcontent.html
 Jordan Ritter, "Why GNUtella Can't Scale. No, Really.", available online at: http://www.darkridge.com/~jpr5/doc/gnutella.html
 Bob Cringely, "In Defense of the Free Ride", available online at: http://www.pbs.org/cringely/pulpit/pulpit20010208.html
Hotline Communications is founded, giving consumers software that lets them offer files for download from their own computers.
Scour, an entertainment portal with multimedia search technology, is founded.
Shawn Fanning, 18, creates the Napster application and service while a freshman at Northeastern University.
Napster Inc. is founded.
London programmer Ian Clarke completes the original Freenet design as a student at Edinburgh University, Scotland, and makes it available on the Internet.
December 7, 1999:
The record industry sues Napster for copyright infringement.
March 14, 2000:
America Online's Nullsoft releases a file-swapping program dubbed Gnutella.
April 4, 2000:
Scour announces the beta launch of Scour Exchange--file-sharing technology that lets people search for and trade video, picture, music and text files.
April 10, 2000:
AOL shuts down the Gnutella project.
July 20, 2000:
The record and motion picture industries sue Scour, alleging copyright infringement.
July 26, 2000:
A federal judge orders Napster to halt the trading of copyrighted material.
July 28, 2000:
FastTrack announces the launch of KaZaA.
July 28, 2000:
A federal appeals court stays the order against Napster pending an appeal of the decision.
August 24, 2000:
Intel forms a peer-to-peer working group with IBM and Hewlett-Packard, among others.
October 12, 2000:
Scour files for bankruptcy protection.
January 29, 2001:
Bertelsmann announces that Napster will introduce a membership fee for users in the summer of 2001.
February 12, 2001:
The 9th U.S. Circuit Court of Appeals rules that Napster knew its users were violating copyright laws through its music file-sharing service, but the court allowed the Web site to stay in business until a lower court redrafts its injunction. The three-judge panel specifically cited a memo drafted by Napster's co-founder Sean Parker as evidence the Web site knew its users were violating copyright laws. In that memo, the court said, Parker said the company needed to remain ignorant about the "real names" of the users because "they are exchanging pirated music." For that reason, the court found that Napster was involved in "contributory and vicarious infringement," and had full knowledge that it was allowing its users to infringe upon copyright laws.
February 20, 2001:
Napster offers $1 billion settlement to record companies to drop their suits. The offer is rejected two days later.
March 2, 2001:
Napster lawyers tell a federal district court that they will implement a plan to prevent the songs from being traded and begin filtering a list of 1 million copyrighted files from its system.
March 14, 2001:
Napster signs an agreement with California-based Gracenote, whose online song database is used for online information access and software applications. Napster will have full access to the database, which will help with the complex task of filtering out copyrighted material.
June 5, 2001:
MusicNet and Napster reach an agreement to license its digital music on Napster's new site.
June 25, 2001:
Napster signs a worldwide licensing agreement with the United Kingdom's Association of Independent Music and the Independent Music Companies Association to provide music for its new subscription service.
June 27, 2001:
The Academy of Motion Picture Arts and Sciences files suit against Napster, charging that the online service has allowed users to download recordings of artists' performances recorded during Oscar telecasts.
July 12, 2001:
U.S. District Judge Marilyn Hall Patel orders Napster to remain offline until it can show that it is able to effectively block access to copyrighted works. Metallica and Dr. Dre settle their legal disputes with Napster, ending all legal actions between the parties.
October 2, 2001:
RIAA and MPAA file lawsuit against KaZaA.