CCOW™ Replicast is a storage protocol that uses multicast communications to determine which storage servers will store content and then retrieve it for a consumer. It also allows content to be accepted/delivered/replicated using multicast communications. Content can be placed once and received concurrently on multiple storage servers. Replicast can also scale to very large clusters and can support multiple sites, and each site can be as large as the networking elements will allow.
To understand how Replicast works, you must first understand how it uses multicast addressing, and specifically how the roles of the Negotiating Group and the Rendezvous Group differ from Consistent Hashing algorithms, which are the usual solution for distributed storage systems.
How Conventional Object Storage Systems use Consistent Hashing
The Object name, or sometimes the payload of a chunk, is used to calculate a Hash ID. This ID is then mapped to an aggregate container for multiple objects/chunks (for OpenStack Swift these are called “Swift Partitions” and for CEPH they are called “Placement Groups”). Although the quality of the hashing algorithm can vary, the content of a chunk has to map to a set of storage servers that is based on an Object Name in order to achieve a consistent hash algorithm. If you start with the same set of storage servers, the same content will always map to the same storage servers.
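The name-to-partition-to-servers mapping described above can be sketched in a few lines. This is a deliberately simplified illustration, not how Swift or CEPH actually build their rings or placement maps: the function names, the use of MD5, and the "next servers in the list" placement rule are all assumptions made for the example. What it does show is the key property: given the same server list, the same name always lands on the same servers.

```python
import hashlib
from typing import List


def hash_id(name: str) -> int:
    """Hash an object name to an integer (MD5 is used only for illustration)."""
    return int.from_bytes(hashlib.md5(name.encode("utf-8")).digest(), "big")


def partition_for(name: str, partition_count: int) -> int:
    """Map a name to an aggregate container (a 'partition' or 'placement group')."""
    return hash_id(name) % partition_count


def servers_for(partition: int, servers: List[str], replicas: int = 3) -> List[str]:
    """Static placement: the partition alone decides the servers.

    Hypothetical rule: take `replicas` consecutive servers starting at an
    offset derived from the partition number. Server load plays no part.
    """
    if replicas > len(servers):
        raise ValueError("not enough servers for the requested replica count")
    start = partition % len(servers)
    return [servers[(start + i) % len(servers)] for i in range(replicas)]


if __name__ == "__main__":
    cluster = ["s1", "s2", "s3", "s4", "s5"]
    part = partition_for("photos/cat.jpg", 1024)
    print(part, servers_for(part, cluster))
```

Note that nothing in `servers_for` looks at how busy a server is, which is the limitation the rest of this article addresses.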
Promoters of Consistent Hashing make the point that Consistent Hashing limits the amount of content that must be moved when the set of storage servers changes. If there is a 1% change in the cluster membership then 1% of the content must be relocated. In the long run, you actually want 1% of the content to move to the new servers. Should 1% of the content be lost, you will want to create new replicas of the lost 1% on other servers anyway.
Where CCOW Replicast differs is that it can be far more flexible about when that replication occurs and more selective as to which data is replicated. Replicast has a different method of assigning locations, one that deals more efficiently with evolving cluster membership and achieves far higher utilization of cluster resources when the membership isn’t changing.
CCOW Replicast uses a “Negotiating Group” to effectively serve as a chunk’s “location”. An object name still yields a Name Hash ID (using the Name of the Named Manifest), but that Hash ID maps to a Negotiating Group. When a Manifest references a Chunk, the Chunk is found by mapping its Chunk ID (which is the Content Hash ID of the Chunk) to a Negotiating Group.
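As a rough sketch of the two mappings just described, the following shows a name hash and a content hash both resolving to a Negotiating Group number. The hash function (SHA-256), the group count, and the modulo mapping are assumptions chosen for illustration; the source does not specify how CCOW Replicast derives the group from a Hash ID.

```python
import hashlib

GROUP_COUNT = 64  # hypothetical number of Negotiating Groups in the cluster


def name_hash_id(manifest_name: str) -> bytes:
    """Name Hash ID: derived from the name of the Named Manifest."""
    return hashlib.sha256(manifest_name.encode("utf-8")).digest()


def chunk_id(payload: bytes) -> bytes:
    """Chunk ID: the content hash of the chunk payload."""
    return hashlib.sha256(payload).digest()


def negotiating_group(hash_id: bytes, group_count: int = GROUP_COUNT) -> int:
    """Map either kind of Hash ID to a Negotiating Group (assumed: simple modulo)."""
    return int.from_bytes(hash_id[:8], "big") % group_count


if __name__ == "__main__":
    print(negotiating_group(name_hash_id("tenant/bucket/report.pdf")))
    print(negotiating_group(chunk_id(b"chunk payload bytes")))
```

The point of the sketch is that the hash identifies a group of candidate servers, not the final servers themselves.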
The Negotiating Group will be larger than the set of servers that would have been assigned by Consistent Hashing. Typically, a Negotiating Group of ten to twenty members is preferred. The key is that the client, or more typically the Putget Broker on the client’s behalf, uses multicast messaging to communicate with the entire Negotiating Group at the same time. Effectively the Putget Broker asks “Hey you guys in Group X, I need three of you to store this Chunk”. A Negotiation then occurs amongst the members of the Negotiating Group to determine which three (or more) members will accept the Chunk, when, and at what bandwidth.
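The outcome of such a negotiation can be illustrated with a small in-memory model. This is only a sketch under stated assumptions: the `Bid` structure, its `earliest_start_ms` field, and the "earliest start wins" rule are hypothetical, and no real multicast traffic is involved. The actual Replicast message formats and selection criteria are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Bid:
    """A group member's (hypothetical) answer to a put request."""
    server: str
    earliest_start_ms: int  # how soon this server could begin receiving the chunk


def select_targets(bids: List[Bid], replicas: int = 3) -> List[str]:
    """Pick the `replicas` members that can accept the chunk soonest.

    Ties are broken by server name so the result is deterministic.
    """
    if len(bids) < replicas:
        raise ValueError("not enough bids to satisfy the replica count")
    ranked = sorted(bids, key=lambda b: (b.earliest_start_ms, b.server))
    return [b.server for b in ranked[:replicas]]


if __name__ == "__main__":
    bids = [Bid("s1", 40), Bid("s2", 5), Bid("s3", 20), Bid("s4", 0), Bid("s5", 90)]
    print(select_targets(bids))
```

Unlike static placement, the chosen servers here depend on the current state each member reports, so a busy member is simply passed over.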
“Negotiating” sounds complex, but the required number of message exchanges is actually the same as for any TCP/IP connection setup. So the Negotiating Group can determine where the Chunk will be stored with the same number of network interactions as Swift requires for the first TCP/IP connection. For the default replication count of three, Swift requires three connections to be set up.
More importantly, a consistent hashing algorithm (such as the one Swift uses) will always pick the same storage servers, regardless of the workload on those servers. Many consider this the price of eliminating the need for a central metadata server.
With Consistent Hashing, the 3 servers with the lightest workload are selected out of 3 storage servers (assuming the replication count is 3). Of course, that means you are also selecting the 3 busiest servers. With CCOW Replicast, you select the 3 servers with the least workload from all the servers in the Negotiating Group.
Implications of Dynamic Selection
With dynamic load-sensitive selection, CCOW Replicast enables you to a) run your cluster at higher performance levels than Consistent Hashing would allow, and b) still have lower latency.
A well-balanced storage cluster will, at peak usage, want individual storage servers to be loaded only 50% of the time. If they are heavily loaded less than 50% of the time, then the cluster could accommodate heavier peak traffic and you have overspent on your cluster. If they are loaded more than 50% of the time, then some requests will be badly delayed and your users could start complaining. If the chance of a randomly selected storage server being busy is 50%, what are the chances that none of 3 randomly selected storage servers is already working on at least one request? Only 0.5 × 0.5 × 0.5, or 12.5%.
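The arithmetic can be extended to compare the two approaches. The sketch below assumes each server is busy independently with the same probability, and that dynamic selection succeeds whenever at least three members of the Negotiating Group are idle; both are simplifying assumptions, and the function names are made up for this example. Under those assumptions, three fixed servers are all idle 12.5% of the time, while a group of ten contains three idle servers about 94.5% of the time and a group of twenty about 99.98% of the time.

```python
from math import comb


def p_static_all_idle(p_busy: float, replicas: int = 3) -> float:
    """Chance that every one of the `replicas` fixed servers is idle."""
    return (1.0 - p_busy) ** replicas


def p_group_has_idle(p_busy: float, group_size: int, replicas: int = 3) -> float:
    """Chance that at least `replicas` of `group_size` servers are idle.

    Computed as 1 minus the binomial probability of fewer than `replicas` idle.
    """
    p_idle = 1.0 - p_busy
    fewer = sum(
        comb(group_size, k) * p_idle**k * p_busy ** (group_size - k)
        for k in range(replicas)
    )
    return 1.0 - fewer


if __name__ == "__main__":
    print(f"fixed 3 servers: {p_static_all_idle(0.5):.4f}")
    for size in (10, 20):
        print(f"group of {size}:     {p_group_has_idle(0.5, size):.4f}")
```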
When it is time to retrieve a Chunk, the client or Putget Broker does not need to know which servers were selected. It merely sends a request to the Negotiating Group. The Negotiating Group picks one of its members that holds the desired Chunk, and a rendezvous is scheduled to transfer the data.
While the Negotiating Group plans a transfer, the transfer is executed by the Rendezvous Group. The Rendezvous Group implements Replicast’s most obvious feature: Send Once, Receive Many times. Transfers sent via the Rendezvous Group are efficient not only because they need to be sent only once, but also because all Rendezvous Transfers use reserved bandwidth, so they can start at full speed. There is no need for a TCP/IP ramp-up.
An important aspect of Rendezvous Groups is that they are easily understood and verified with a known relationship to the membership in the Negotiating Group:
- Put transactions – every member of a Rendezvous Group is either a member of the Negotiating Group with a planned Rendezvous Transfer, or a slave-drive under active control of a member.
- Get transactions – the principal member of the Rendezvous Group is the client or Putget Broker that initiated the get transaction. Additional storage servers could also have been part of a put transaction. These additional targets are piggy-backing creation of extra replicas on the delivery that the client required anyway.
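The two membership rules above lend themselves to a simple check. The following is a minimal sketch using plain sets of member names; the function names, the `planned` set, and the `controller_of` mapping from a slave-drive to its controlling server are hypothetical constructs for this example, not part of any published Replicast API. The get check verifies only what the text states outright, namely that the initiator is in the group.

```python
from typing import Dict, Set


def valid_put_group(
    rendezvous: Set[str],
    planned: Set[str],
    negotiating: Set[str],
    controller_of: Dict[str, str],
) -> bool:
    """Every member must be a Negotiating Group member with a planned
    transfer, or a slave-drive controlled by a Negotiating Group member."""
    for member in rendezvous:
        if member in negotiating and member in planned:
            continue
        if controller_of.get(member) in negotiating:
            continue
        return False
    return True


def valid_get_group(rendezvous: Set[str], initiator: str) -> bool:
    """The client or Putget Broker that issued the get must be in the group."""
    return initiator in rendezvous


if __name__ == "__main__":
    group = {"s1", "s2", "s3", "s4"}
    print(valid_put_group({"s1", "s2", "d9"}, {"s1", "s2"}, group, {"d9": "s3"}))
    print(valid_get_group({"broker", "s2"}, "broker"))
```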