Friday, August 28, 2009

WEBDFS on 20 Amazon ec2 instances

Hello Again,

Download WEBDFS here:  http://code.google.com/p/webdfs/downloads/list

This blog post is about a real-world test of WEBDFS on 20 Amazon EC2 instances.

I set up 20 m1.large Amazon instances, uploaded ~250 GB of data, and downloaded ~1.5 TB of data from the files that were uploaded.  Total transfer was ~1.8 TB.

I have included all of the pertinent information at the end of this blog post.  Here are some highlights:
  • 500 threads (5 Java clients, 100 threads each)
  • 15 servers
  • ~250 GB uploaded (PUT requests) (individual files between 50 KB and 10 MB)
  • ~1.5 TB downloaded (GET requests)
  • ~1.8 TB transferred in total
  • ~47 MB/sec upload rate
  • ~201 requests/sec overall
  • The data was evenly distributed across all nodes
  • 40 replicas were lost and totally unrecoverable, amounting to ~0.03% data loss
The final results tell me that WEBDFS performed quite well.  However, the data loss is undesirable and needs to be addressed.  How do we prevent it?  One strategy is to be atomic and durable: hold on to the client until we know without a doubt that the first replica has been successfully written to disk.  This can be achieved with some sort of checksum mechanism to make sure the file upload is not corrupted.  We could also add an fsync() call after each file write (though that is not always reliable; I actually did this, tested it, and will report the results in a future post).  The point is that we do not want to tell the client that an upload was OK when it was not, and we do not want to replicate a corrupted file.  That is exactly what happened here: 40 replicas became totally unrecoverable, meaning we did not have a good copy from which we could make other replicas.
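
To make the checksum idea concrete, here is a minimal sketch (in PHP, since WEBDFS is PHP) of what a PUT handler could do before acknowledging an upload. This is not the actual WEBDFS code, and the X-Webdfs-Checksum header name is made up for illustration; the point is the ordering: verify first, then store, then acknowledge, then replicate.

<?php
// Minimal sketch, not the actual WEBDFS code. The X-Webdfs-Checksum
// header name is hypothetical; any client-supplied MD5 would do.
function acceptUpload($tmpPath, $finalPath, $expectedMd5)
{
    // hash what actually landed on disk, not the request buffer
    $actualMd5 = md5_file($tmpPath);

    if ($expectedMd5 !== null && strtolower($expectedMd5) !== $actualMd5) {
        unlink($tmpPath);
        header('HTTP/1.1 400 Bad Request');
        echo "checksum mismatch, refusing to store or replicate\n";
        return false;
    }

    // only now move the file into place and tell the client it is safe;
    // an fsync() would be ideal here, but plain PHP of this era only
    // offers fflush(), which is weaker (see the caveat above)
    rename($tmpPath, $finalPath);
    header('HTTP/1.1 201 Created');
    return true;
}

// e.g. from the PUT handler, after streaming the request body to $tmpPath:
// acceptUpload($tmpPath, $finalPath,
//              isset($_SERVER['HTTP_X_WEBDFS_CHECKSUM'])
//                  ? $_SERVER['HTTP_X_WEBDFS_CHECKSUM'] : null);

Only once the first replica passes a check like this would we return success to the client and start replicating to the other nodes.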

Another thing that might help would be to increase the replication factor.  However, there is something to consider when increasing the replication factor in WEBDFS (at least in its current state): it increases the write load on your overall system.  In the case of this test, increasing the replication factor from 2 to 3 would increase the write load by 50%, which could be quite significant.  As things scale, the relative impact of each increase in the replication factor decreases, so this is a diminishing problem, but it is still something to consider.
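
A quick back-of-the-envelope helper (not WEBDFS code) for that write-load point: every uploaded byte is written to disk once per replica, so the relative increase from raising the replication factor is just the change divided by the old factor.

<?php
// Back-of-the-envelope only; not part of WEBDFS.
function writeLoadIncrease($oldFactor, $newFactor)
{
    return ($newFactor - $oldFactor) / $oldFactor * 100.0;
}

echo writeLoadIncrease(2, 3) . "%\n";  // 50% more bytes written, as in this test
echo writeLoadIncrease(4, 5) . "%\n";  // only 25% more once the factor is already higher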

So how did the servers perform?  I have included a couple of load average graphs below to show how a typical box performed, along with the graph from the one server that got really hot.

The graphs show the machines stayed mostly cool: load averages were in the 3-4 range most of the time on most machines, and the CPUs were rarely maxed out.  There was one exception: one machine went to a load average of 100 and stayed there for most of the test until I rebooted it, after which point it went to and stayed at a level similar to the rest of the boxes in the cluster.  I am not sure why that one server got so hot.  After analyzing the server logs, I do not see any particular request pattern that would explain it.  However, it is obvious that this server was the reason for the data loss and the problems experienced during the test.

Load avg graph of typical server during the test:



Below is the load avg graph of the server that became very hot during the test.  You can see that after about 1000 seconds the box suddenly became very hot.  None of the other boxes exhibited this behavior.  You can also see the obvious point where I rebooted the box and how the server never went back to a high load average.  Further analysis is needed to determine why this happened.



Below are the specifics of the test and the results:

Hardware and Software:

Hardware: Amazon EC2 instance, m1.large
CPU: 4 EC2 Compute Units (2 virtual cores with 2 EC2 Compute Units each)
Operating System: Fedora Core 9, 64-bit
Memory: 7.5 GB
File System: ext3
I/O: "High" - exact SAN specs not available from Amazon (AFAICT)
Storage: 850 GB instance storage (2 × 420 GB plus a 10 GB root partition)


Test Parameters:

Total servers: 15
Sub-clusters: 5
Servers per sub-cluster: 3
Total clients: 5
Threads per client: 100
Total threads: 500
Read:write ratio: 8:2
Replication degree: 2
Total to be uploaded: 250 GB
Total to be written to disk: 500 GB
Total uploaded by each client: 50 GB
Thread task description: Each thread uploaded and downloaded files between 50 KB and 10 MB in size.  An attempt was made to simulate real-world usage by having a 20% write load, so for every 8 GETs there would be on average 2 PUTs.  Each thread slept for a random number of milliseconds between 0 and 1000 (0 and 1 second) between each request.  (A sketch of this loop appears after the file size list below.)
The file sizes uploaded were:
  • 50 KB
  • 100 KB
  • 500 KB
  • 1 MB
  • 3 MB
  • 5 MB
  • 10 MB
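
The real load clients were multi-threaded Java programs; the single-threaded PHP sketch below only shows the shape of each worker's loop under the parameters above. The host name and URL layout are assumptions for illustration, not the actual WEBDFS API.

<?php
// Hypothetical load-generator loop; host name and path layout are made up.
function httpRequest($method, $url, $body = null)
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_CUSTOMREQUEST, $method);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    if ($body !== null) {
        curl_setopt($ch, CURLOPT_POSTFIELDS, $body);
    }
    $response = curl_exec($ch);
    curl_close($ch);
    return $response;
}

$fileSizes = array(51200, 102400, 512000, 1048576, 3145728, 5242880, 10485760);
$uploaded  = array();   // keys we have PUT and can later GET back

for ($i = 0; $i < 1000; $i++) {
    // 8:2 read:write ratio => roughly 2 of every 10 requests are PUTs
    $isWrite = (mt_rand(1, 10) <= 2);

    if ($isWrite || count($uploaded) === 0) {
        $size = $fileSizes[array_rand($fileSizes)];
        $key  = uniqid('obj', true);
        httpRequest('PUT', "http://webdfs-node/dfs/$key", str_repeat('x', $size));
        $uploaded[] = $key;
    } else {
        $key = $uploaded[array_rand($uploaded)];
        httpRequest('GET', "http://webdfs-node/dfs/$key");
    }

    // sleep a random 0-1000 ms between requests, as described above
    usleep(mt_rand(0, 1000) * 1000);
}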


Test Results


Total requests: 1067159
Total run time: ~5300 seconds
Total uploaded: 250000850000 bytes
Total upload rate: ~47169971.7 bytes / second
Avg download per replica: 1933823.62 bytes
Total downloads (GET requests): 808603
Requests / second: ~201
Total data transferred: 1813697450000 bytes
Overall transfer rate: ~342207066 bytes / second
Total original replicas: 129278
Total replicas written to disk: 258556
Total written to disk: 499200710256 bytes
Avg written to disk per replica: 1930725.69 bytes / replica
Total lost replicas: 0
Total corrupted but recoverable replicas: 2071
Total non-recoverable replicas due to corruption (both copies were corrupted): 40

Data Distribution:

The table below describes how many replicas were saved on each server.  As we can see, the data was well distributed amongst all nodes.


Server 1: 17406    Server 9:  17623
Server 2: 17914    Server 10: 16679
Server 3: 17153    Server 11: 17606
Server 4: 17329    Server 12: 17256
Server 5: 17426    Server 13: 16703
Server 6: 16955    Server 14: 17190
Server 7: 17082    Server 15: 17145
Server 8: 17089

Thursday, July 30, 2009

Algorithm changes and the next release

Hello again,

There is a new release of WEBDFS available here:

http://code.google.com/p/webdfs/downloads/list
There are test results below for the new version.

I said in my last post that I was going to talk about some algorithm changes, so let's do that. There were two main drawbacks to the RUSHp algorithm implemented in the previous release of WEBDFS:
  • Reorganization

    As can be seen in the prior test results (in the prior blog post), data reorganization was optimal only when a new sub-cluster was added. All other configuration changes caused a significant amount of data to be moved. In some cases, more than 90% of the data was reorganized. This is obviously undesirable. Say an installation contained 100 terabytes of data; it is clearly not good to have to move 90+ terabytes just to accommodate removing the first sub-cluster of old machines. This could cause serious degradation in service quality if the internal network upon which the system is installed becomes saturated simply by moving data between machines. Also, when that much data starts to move around, the probability of failure increases, which is also undesirable.

  • Replication Policies

    A replication policy is the number of replicas that a certain object has in the system. The previous version of WEBDFS using RUSHp had some drawbacks related to replication policies: you could not have a replication factor greater than the number of nodes in the first sub-cluster. This means that if you started with 2 nodes in your first sub-cluster (which many folks would probably do), then you would not be able to increase the replication factor as you scaled and added servers unless you changed the number of nodes in the first sub-cluster, which would result in most of the data being moved, which as we pointed out above is undesirable.
Well, the current release of WEBDFS addresses these problems by using the RUSHr variant of the RUSH algorithms. RUSHr delivers near-optimal reorganization when a sub-cluster is removed or re-weighted. Additionally, the replication policy is adjustable with few constraints; the only constraint is that you cannot have a replication policy that exceeds the number of servers in the system, which would not make sense anyway. So don't do that. :P The primary reason for the improvements is that we use the hypergeometric distribution for selecting replicas and deciding where they belong. I will get into more detail about this in a future blog post.
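
For a rough feel of the hypergeometric idea (this is an illustration, not the WEBDFS implementation), think of it as answering "of this object's replicas, how many land in the sub-cluster currently being examined?" given how many of the remaining placement slots that sub-cluster holds:

<?php
// Toy hypergeometric draw; the real RUSHr math over weighted sub-clusters
// is more involved, this only shows the flavor.
function hypergeometricDraw($slotsRemaining, $slotsInSubCluster, $replicasToPlace, $seed)
{
    mt_srand($seed);   // deterministic per object so every client agrees
    $assigned = 0;
    for ($i = 0; $i < $replicasToPlace; $i++) {
        // probability the next replica falls into this sub-cluster's slots
        if (mt_rand(0, $slotsRemaining - 1) < $slotsInSubCluster) {
            $assigned++;
            $slotsInSubCluster--;   // that slot is now used up
        }
        $slotsRemaining--;          // one fewer slot remains overall
    }
    return $assigned;
}

// e.g. 3 replicas to place, 15 slots still in play, 5 of them in this sub-cluster:
// echo hypergeometricDraw(15, 5, 3, crc32('object-key'));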

Ok, below are the test results for the latest release of WEBDFS. We use the same tests as we did in the last blog post. I am quite pleased with these results and probably will not be making many changes to the core algorithms going forward. I might switch to using SPRNG to increase the lookup speed, but that will be all. In the future, blog posts will focus on using WEBDFS in the real world.

on to the results:

I used the paper here as a reference for the test results (section 3 specifically):

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

all graphs were generated with JpGraph:

http://www.aditus.nu/jpgraph/

All test code was executed on a MacBook with a 1.83 GHz Intel Core Duo and 2 GB of 667 MHz DDR2 SDRAM.

We cover four areas:
  • object distribution
  • failure resilience (replica distribution)
  • reorganization (adding and removing servers)
  • look-up performance
All tests except for the look-up performance were done with a configuration of three sub-clusters each containing 5 nodes. 10000 objects were created and placed with three replicas each. I started with a weight of 1 and each sub-cluster was weighted twice as much as the cluster to its left. Meaning:

cluster 1 has a weight of 1
cluster 2 has a weight of 2
cluster 3 has a weight of 4
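
As a quick sanity check on what an "optimal" distribution should look like under these weights, here is a small helper (not part of WEBDFS) computing the expected replica counts per sub-cluster:

<?php
// Expected counts for 30000 replicas (10000 objects x 3 replicas) across
// sub-clusters weighted 1, 2 and 4.
function expectedPerSubCluster($totalReplicas, array $weights)
{
    $totalWeight = array_sum($weights);
    $expected = array();
    foreach ($weights as $i => $w) {
        $expected[$i] = $totalReplicas * $w / $totalWeight;
    }
    return $expected;
}

print_r(expectedPerSubCluster(30000, array(1, 2, 4)));
// => roughly 4286, 8571 and 17143 replicas, or about 857, 1714 and 3429
//    per disk with five disks in each sub-cluster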

===============================

Object Distribution:

we can see here that the object distribution is similar to what we saw in the previous tests.

================================

Failure Resilience:

Failure resilience deals with how WEBDFS will distribute load when a disk fails. The load distribution will be determined by the location of the replicas for the failed disk. We can see from the following three graphs that WEBDFS does an excellent job of distributing the load when a disk fails.

We can see that replica distribution follows the weighting pattern and that the ensuing load due to the failure will be spread amongst all other nodes. This is exactly what we want for dealing with failed servers. Below we include the replica distribution for disks 1 and 15 to illustrate that all nodes express the same characteristics for replica distribution.

replica distribution for disk 15

replica distribution for disk 1

================================

Data Reorganization:

For the reorganization tests I started with the configuration as described above and:
  • added a sub-cluster
  • removed the first sub cluster
  • re-weighted the second sub-cluster
  • removed the first disk from the first sub-cluster
What we are concerning ourselves with here is the number of objects that are moved during a reorganization. Reorganizing typically means we are adding new machines or de-allocating old resources. We compare our results to what we consider to be optimal, optimal meaning that we move the least number of objects needed to accommodate the new configuration. When we add servers, we expect the optimal number of moved objects to be each server's share according to its sub-cluster weight. When we de-allocate resources, we expect the optimal number to be the number of objects held by the de-allocated resource. We see in the results below that WEBDFS does an optimal job when adding new sub-clusters, and performs at a near-optimal level otherwise.
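
To make that "optimal" yardstick concrete, here is a small helper (not part of WEBDFS; the new sub-cluster weight in the example is hypothetical): when a sub-cluster is added, the least we can move is its weighted share of all replicas; when a sub-cluster is drained, the least we can move is whatever it holds.

<?php
// Back-of-the-envelope helpers, not WEBDFS code.
function optimalMovesOnAdd($totalReplicas, array $existingWeights, $newWeight)
{
    $totalWeight = array_sum($existingWeights) + $newWeight;
    return (int) round($totalReplicas * $newWeight / $totalWeight);
}

function optimalMovesOnRemove($replicasHeldBySubCluster)
{
    return $replicasHeldBySubCluster;   // nothing else should have to move
}

// 30000 replicas in sub-clusters weighted 1, 2 and 4; add a new cluster weighted 8:
// echo optimalMovesOnAdd(30000, array(1, 2, 4), 8);   // its weighted share
// Draining the first sub-cluster (the 4247-object figure quoted below):
// echo optimalMovesOnRemove(4247);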

The graph to the left indicates what happens when we add a new sub-cluster of five disks. We see that an optimal number of objects is moved to accommodate the new resources. This is good, as adding new servers will probably happen more frequently than de-allocating old servers.

The graph to the left shows the objects that move when the first sub-cluster is removed. Removing a sub-cluster means setting its weight to 0 and waiting for the objects to be moved, after which point the servers can be taken off-line.

The graph to the left expresses near-optimal behavior. We say near-optimal because optimal would mean that only the objects in the first sub-cluster would be moved, and we can see that in addition to the objects in the first sub-cluster, a small number of objects in the second sub-cluster are also moved. Specifically, we expected 4247 objects to be moved, and 5581 objects were actually moved. This is still a huge improvement over the last release of WEBDFS.

The graph to the left shows what happens when we increase the weight of the second sub-cluster from 2 to 4. This might happen when replacing a sub-cluster's machines with new ones that are as powerful as the servers in the more recently added sub-clusters.

We can see from the graph on the left that no objects are moved from the second sub-cluster at all and that all moved objects come from the first and third sub-clusters. This is exactly what we want to happen. The next question is whether an optimal or near-optimal amount of data is moved. Optimally, we would expect 4761 objects to be moved to accommodate the new weighting; our tests moved 5368, which is near optimal and again a huge improvement over the last release of WEBDFS.

This next graph to the left shows how many objects get moved when we remove a disk or server from the first sub-cluster. This does not seem like something that would happen very often in a real deployment, if at all, but we include it here for completeness.

We see that the data movement is not quite optimal. We expect that only the objects from a single disk (in this case disk 1) should be moved, but we can see that this is not the case: objects from the other disks in both sub-clusters 1 and 2 are moved as well. However, not many are moved, so removing a single disk will not cause an extreme disruption, as was the case with the prior implementation of WEBDFS where more than 50% of the objects in the system were moved.

Specifically, we would expect that only the 877 objects on the first disk would be moved; however, our tests showed that 2762 objects in total were moved. Again, this is not an extreme amount and is a huge improvement over the previous implementation of WEBDFS.

================================

Lookup performance:

For the look-up performance tests, I created 1 million objects, started with a single sub-cluster of five disks, ran the look-ups for all 1 million objects, and took the average look-up time across all look-ups. This was repeated up to 100 sub-clusters of five disks each, adding one sub-cluster at a time with an exponential weighting increase of 1.1, and recording the average look-up time at each step. I then repeated this experiment with even weighting so we could get a look at how weighting affects look-up performance.

The graph to the left shows lookup performance and scaling characteristics. The biggest factor affecting scaling in WEBDFS is the number of sub-clusters that must be investigated when performing the lookup.

With evenly weighted disks we expect to see super-linear scaling: roughly O(nr) time, where n is the number of sub-clusters investigated and r is the time taken to generate the random numbers for investigating a cluster.

But with weighting we expect sub-linear scaling, because more objects will be located on the more heavily weighted disks, and the more heavily weighted disks will be "rightmost" among the sub-clusters, meaning they are investigated first, so we will probably find the object we are looking for within the first few sub-clusters.

We can see from the graph that our expectations are met. When the servers are all weighted evenly we have slightly super-linear scaling. This is because of the number of times we have to call the random number generator when investigating clusters. We could bring an evenly weighted configuration much closer to linear scaling by using a random number generator that provides some sort of jump-ahead functionality, so we would not have to call the rand() function a number of times equal to the id of the sub-cluster we are investigating.

In all likelihood, deployments of WEBDFS will have sub-clusters with uneven weights, where the most recently added clusters are more heavily weighted than older clusters, so we should get sub-linear scaling in real deployments of WEBDFS.

OK, so the test results above show a marked improvement in WEBDFS, which is very exciting. I believe that we are getting ready to do a beta release. Before that happens, though, I want to improve the unit tests and the client library. So be on the lookout for the beta release.

Friday, July 17, 2009

First round of test results

Below are the test results for WEBDFS in its current implementation.

In summary, WEBDFS in its current implementation performs as expected according to the RUSH variant algorithm described in this paper:

http://users.soe.ucsc.edu/~elm/Papers/ipdps03.pdf

That is the main point of these tests: to show that the current implementation of WEBDFS is correct and does indeed do what it is supposed to do. However, after testing I am not sure that the current implementation will fit the needs of a cloud-like store for a web application. There are three main variants of the RUSH algorithms, known as RUSHp, RUSHr, and RUSHt. WEBDFS currently has a RUSHp implementation at its core. I am looking at RUSHr as the best choice because it seems to have the most flexible replication policy for an object, which better maps to the needs of a web app where the popularity of an asset can suddenly change (say, due to a post on Slashdot, Digg, etc.). I will get into my reasoning more in depth in the next blog post.

on to the results...

I used the paper here as a reference for the test results (section 3 specifically):

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

all graphs were generated with JpGraph:

http://www.aditus.nu/jpgraph/

All test code was executed on a MacBook with a 1.83 GHz Intel Core Duo and 2 GB of 667 MHz DDR2 SDRAM.

We cover four areas:
  • object distribution
  • failure resilience (replica distribution)
  • reorganization (adding and removing servers)
  • look-up performance
All tests except for the look-up performance were done with a configuration of three sub-clusters each containing 5 nodes. 10000 objects were created and placed with three replicas each. I started with a weight of 1 and each sub-cluster was weighted twice as much as the cluster to its left. Meaning:

cluster 1 has a weight of 1
cluster 2 has a weight of 2
cluster 3 has a weight of 4

===============================

Object Distribution:

We can see here that WEBDFS is very close to optimal distribution. All of the RUSH variants exhibit this same behavior.

=================================

Failure Resilience:

This graph shows the replica distribution of objects that were located on disk 7, which has theoretically failed.

What we can see here is that the more heavily weighted servers would take most of the request load from a failed disk, which is what we would want, assuming that the more heavily weighted servers are more powerful.

Immediately below are two more graphs that show the replica distribution for disks 1 and 15. You will see from the graphs that the replica distribution has a tendency to favor the more heavily weighted disks, which again is what we want, assuming the more heavily weighted disks are the more powerful servers.


disk 15 replica distribution

disk 1 replica distribution

=================================

Reorganization:

For the reorganization tests I started with the configuration as described above and:
  • added a sub-cluster
  • removed the last disk
  • removed the first sub cluster
What we are concerning ourselves with here is the number of objects that are moved during a reorganization. Reorganizing typically means we are adding new machines or de-allocating old resources. We compare our results to what we consider to be optimal, optimal meaning that we move the least number of objects needed to accommodate the new configuration. When we add servers, we expect the optimal number of moved objects to be each server's share according to its sub-cluster weight. When we de-allocate resources, we expect the optimal number to be the number of objects held by the de-allocated resource. We see in the results below that WEBDFS does an optimal job when adding new sub-clusters, but performs in a sub-optimal fashion when de-allocating resources. This is expected, according to the authors of the core RUSH algorithm.

The graph to the left indicates what happens when we add a new sub-cluster of five disks. We see that an optimal number of objects is moved to accommodate the new resources. This is good, as adding new servers will probably happen more frequently than de-allocating old servers.

This next graph shows what happens when the first sub-cluster is removed. This is something that might happen when a whole cluster of machines has been determined to be too slow to serve any further purpose, or when they could be put to better use elsewhere.

In our example the first sub-cluster contains approximately 4265 objects, so we expect the optimal number of objects moved to be 4265. However, we can see from the graph that MANY MANY more than that were moved. In fact, the total actually moved was 27943! That means 27943 / 30000 ≈ 93% of the objects get moved. That is almost a complete reorganization! As I stated above, we are going to use a different RUSH variant that will not cause such huge movement when we de-allocate resources. RUSHr does allow for sub-cluster removal and has near-optimal reorganization characteristics.
This next graph shows what happens when a disk is removed from the last sub-cluster.

Again we see a less-than-optimal number of objects moved.

The total number of objects that we expected to move was around 3480, but we actually moved a total of 16567, which means that we moved more than 50% of the objects to accommodate removing a single disk. This just affirms the sub-optimal de-allocation behavior we saw above, and the idea that we should use a different variant of the RUSH algorithm.




=================================

Lookup Performance:

For the look-up performance tests, I created 1 million objects, started with a single sub-cluster of five disks, ran the look-ups for all 1 million objects, and took the average look-up time across all look-ups. This was repeated up to 100 sub-clusters of five disks each, adding one sub-cluster at a time with an exponential weighting increase of 1.1, and recording the average look-up time at each step. I then repeated this experiment with even weighting so we could get a look at how weighting affects look-up performance.

The graph to the left shows lookup performance and scaling characteristics. The biggest factor affecting scaling in WEBDFS is the number of sub-clusters that must be investigated when performing the lookup.

With evenly weighted disks we expect to see super-linear scaling: roughly O(nr) time, where n is the number of sub-clusters investigated and r is the time taken to generate the random numbers for investigating a cluster.

But with weighting we expect sub-linear scaling, because more objects will be located on the more heavily weighted disks, and the more heavily weighted disks will be "rightmost" among the sub-clusters, meaning they are investigated first, so we will probably find the object we are looking for within the first few sub-clusters.

We can see from the graph that our expectations are met. When the servers are all weighted evenly we have slightly super-linear scaling. This is because of the number of times we have to call the random number generator when investigating clusters. We could bring an evenly weighted configuration much closer to linear scaling by using a random number generator that provides some sort of jump-ahead functionality, so we would not have to call the rand() function a number of times equal to the id of the sub-cluster we are investigating.

In all likelihood, deployments of webdfs will have sub-clusters with uneven weights, where the most recently added clusters are more heavily weighted than older clusters, so we should get sub-linear scaling in real deployments of webdfs.

So that's it for now. As you can see, the current implementation of WEBDFS meets the criteria for most of the stated claims. The problem area is reorganization, so that will be addressed in upcoming releases. You can download the latest version of the code here:

http://code.google.com/p/webdfs/downloads/list

My next blog post will be about what algorithm changes we want to make and why, based on an analysis of the needs of a cloud-like system that stores many small images for a web application.

peace,

-Shane

Wednesday, July 8, 2009

scaling improvements

howdy folks,

I have been speaking a bit with the authors of the core algorithm that I implemented in webdfs, and they directed me to some follow-up research and papers that contain information on some scaling improvements.

If you are interested, you can read the papers here:

http://users.soe.ucsc.edu/~elm/Papers/ipdps04.pdf

http://www.ssrc.ucsc.edu/Papers/weil-sc06.pdf

These papers are great. I am just now getting into collecting empirical data about webdfs to prove that the implementation does, in fact, do what it is supposed to do, and the first paper above has a section on "RUSH in Practice" (section 3) which can serve as a reference when I do my own tests. sweet.

Since I am able to easily swap out the locating algorithm in webdfs, I think what I will do before working on the improved implementation of RUSH is test what I currently have and develop a solid test suite that can be used going forward. I know that I should have had more fully fleshed-out tests before implementing, but I was just too damn excited about doing this. sue me. :P

peace, look for test results soon...

Monday, July 6, 2009

pre-pre-alpha

here it is. the first release of webdfs.

http://code.google.com/p/webdfs/downloads/list

This is the first release and I would not call it ready for production, but it certainly is ready for some pounding, which is exactly what I intend to do.

now that this first release is out, I am going to be focusing on:
  • improving the tests. Currently there are tests for the core algorithm but not much else. I want to add tests not only for the main WEBDFS class but also to prove that the system and the code actually do what they claim to do. This will include empirical tests showing that the distribution is correct and that an optimal (minimal) amount of data is moved when new resources (new servers) are added.

  • improving and generating documentation.

as things go forward, I will be putting together scaling, maintenance, and performance guides and posting them here.

If anybody happens to read this blog and downloads webdfs and tries it out, please give me any feedback that you might have.

peace mi amigos y amigas,

-Shane

Tuesday, June 30, 2009

Hello World

This post introduces a project I started about a week ago. I have called it webDFS, meaning web distributed file system (DFS). The aim of the project is to give the PHP community a DFS that is similar to MogileFS (which totally rocks). Obviously, Danga was much more clever than I in choosing names for their systems. huh, oh well.

source code is here:

http://code.google.com/p/webdfs/

version 0.00 is imminent.

The biggest reason I have started this project is that I have a love and fascination for distributed, scalable systems, and this will be one of many expressions of it. :) Also, I will of course use it in future projects.

webDFS is mostly based on the algorithm described in this paper ( PDF ):

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=1B5D780A8525B36C150B0D028DC73F4F?doi=10.1.1.12.6274&rep=rep1&type=url&i=0

The algorithm comes from a family of algorithms known as the RUSH family: Replication Under Scalable Hashing. If built correctly, a system built on the RUSH algorithms will have the following characteristics (some of the text below is taken from the algorithm whitepaper):
  • Ability to map replicated objects to a scalable collection of storage servers or disks without the use of a central directory.

  • Redistributes as few objects as possible when new servers are added or existing servers are removed

  • Guarantees that no two replicas of a particular object are ever placed on the same server.

  • No central directory, clients can compute data locations in parallel, allowing thousands of clients to access objects on thousands of servers simultaneously.

  • Facilitates the distribution of multiple replicas of objects among thousands of disks. Allows individual clients to compute the location of all of the replicas of a particular object in the system algorithmically using just a list of storage servers rather than relying on a directory.

  • Easy scaling management. Scaling out is just a matter of deploying new servers and then propagating a new configuration to all of the nodes. The data will automatically and optimally be moved to accommodate the new resources.

    De-allocating resources is basically the same process in reverse. Simply deploy the new configuration and the data will be moved off the old resources automatically. After the data has been moved, simply take the old resources offline.

  • Easier server management. Since there is no central directory, there are no master or slaves to configure. No master or slaves means that all resources are utilized and no servers sit unused as "hot" spares or backups.

  • No single point of failure. As long as the replica to node ratio is correct, your data will be safe, redundant, and durable; able to withstand major server outages with no loss.
That's pretty cool. I hope that webDFS will capture all of the above for the PHP and Web communities in a very easy to use and extend package.
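
To give a flavor of the "no central directory" point above, here is a toy sketch (not RUSH itself, and not the webDFS code): given only the shared server list and an object key, every client computes the same candidate servers on its own, so no lookup service is needed. RUSH adds weighted sub-clusters, optimal data movement, and the other guarantees listed above; this only shows the basic idea.

<?php
// Toy, directoryless replica location; illustration only.
function locateReplicas($objectKey, array $servers, $replicaCount)
{
    // seed the PRNG from the object key so every client "shuffles"
    // the server list identically
    mt_srand(crc32($objectKey));

    $candidates = $servers;
    $chosen = array();
    while (count($chosen) < $replicaCount && count($candidates) > 0) {
        $i = mt_rand(0, count($candidates) - 1);
        $picked = array_splice($candidates, $i, 1);
        $chosen[] = $picked[0];   // never place two replicas on the same server
    }
    return $chosen;
}

$servers = array('data1:8080', 'data2:8080', 'data3:8080', 'data4:8080');
print_r(locateReplicas('photos/123.jpg', $servers, 2));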

ok, I am going back to work on webDFS. A release is coming soon.

peace,

-Shane