Today’s storage administrators are looking for the best performance at the lowest cost to satisfy their enterprise data requirements. Performance is often improved by adding more solid-state drives, but these come at a cost premium. For enterprises looking to save money and resources while meeting performance requirements, data reduction is a key component in creating the ideal solution.
Options for data reduction fall into two main categories, each with its own purpose: data deduplication and compression.
Inline vs. Post-Process Deduplication
Data deduplication is, in its simplest form, a process of removing duplicates. Take this blog as an example: by the 11th word we have already repeated the word “the”. Deduplication would remove 3 of the 59 words in the first paragraph. If we did this for every word in the first paragraph, we would be down to 47 total words, or about a 20% reduction.
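To make the word-counting idea concrete, here is a minimal sketch in Python. This is purely illustrative (real storage arrays work on data blocks, not words): each unique word is stored once, and every repeat becomes a pointer back to the stored copy.

```python
def dedupe_words(text):
    """Store each unique word once; represent repeats as index pointers."""
    store = []     # unique words, each stored exactly once
    index = {}     # word -> position in store
    pointers = []  # the original text, rebuilt as pointers into store
    for word in text.lower().split():
        if word not in index:
            index[word] = len(store)
            store.append(word)
        pointers.append(index[word])
    return store, pointers

store, pointers = dedupe_words(
    "the quick brown fox jumps over the lazy dog near the river"
)
print(len(pointers), "words stored as", len(store), "unique words")
# Output: 12 words stored as 10 unique words ("the" appears three times)
```

The original text can always be reconstructed by following the pointers in order, which is exactly why deduplication is a lossless form of data reduction.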
Storage systems accomplish this with data blocks in one of two ways today: inline or post-process. Inline means that as the data is written, duplicates are detected and pointers are created instead of writing the full data. Post-process deduplication removes duplicates on a scheduled basis, after the data has already landed on disk.
Inline deduplication requires more compute resources, but post-process deduplication puts the user at risk: a large file loaded onto the system can fill the capacity before the system has a chance to deduplicate it.
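At the block level, this is typically done with content fingerprints. The sketch below is a simplified illustration, not any vendor’s actual implementation: each fixed-size block is hashed as it is written, and if the fingerprint has been seen before, only a pointer is recorded instead of the data. The class name, block size, and structure are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4096  # illustrative fixed block size

class InlineDedupStore:
    """Toy inline deduplication: hash each block on write and
    record a pointer instead of the data when the hash is already known."""

    def __init__(self):
        self.blocks = {}   # fingerprint -> block data (written once)
        self.layout = []   # ordered fingerprints that reconstruct the stream

    def write(self, data):
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            fp = hashlib.sha256(block).hexdigest()
            # Inline step: the duplicate is caught *before* it hits disk
            if fp not in self.blocks:
                self.blocks[fp] = block
            self.layout.append(fp)

    def read(self):
        return b"".join(self.blocks[fp] for fp in self.layout)

store = InlineDedupStore()
# Three identical blocks plus one unique block
store.write(b"A" * BLOCK_SIZE * 3 + b"B" * BLOCK_SIZE)
print("logical blocks:", len(store.layout), "physical blocks:", len(store.blocks))
# Output: logical blocks: 4 physical blocks: 2
```

A post-process system would instead write all four blocks to disk immediately and run the hashing pass later on a schedule, which is exactly why it carries the capacity risk described above.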
Compression
Compression is the next option for reducing the amount of data stored, and it is the one people are typically more familiar with, even if they don’t realize it. Any time you download a .zip file, you are receiving a compressed file.
To give you a real-world example, think of the popular vacuum storage bags (in the event that you live under a rock and haven’t seen these in action, here is a 13-minute infomercial all about them). Traditionally, when you fold clothes, each item takes up a given amount of space. However, when you use space-saving bags to remove as much of the air as possible, you reduce the amount of space needed to store the contents. The good thing about compression is that, in many cases, it has a minimal impact on the compute resources of a storage array and can even make the system perform more efficiently.
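You can see the same squeezing-out-the-air effect in software with Python’s built-in zlib module. This is just a quick illustration of the principle, not a storage-array implementation; the sample data is made deliberately repetitive, so the ratio it prints is far better than typical real-world data would achieve.

```python
import zlib

# Repetitive data compresses well, much like air squeezed from a vacuum bag
original = b"enterprise storage " * 1000
compressed = zlib.compress(original)

ratio = len(original) / len(compressed)
print(f"{len(original)} bytes -> {len(compressed)} bytes ({ratio:.0f}:1)")

# Decompression restores the data exactly: compression here is lossless
assert zlib.decompress(compressed) == original
```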
So which is better?
Both compression and deduplication have a place in the datacenter, but it’s important to understand when each is most effective. You will normally get your highest data reduction ratios with deduplication, but those primarily apply to desktop and server virtualization workloads, where many virtual machines share largely identical data.
Even in this case, technologies like linked clones for VMware View can reduce the need for deduplication. Messaging and collaboration tools are another space where you will see deduplication used frequently, but there it is often built into the application layer and relies less on storage-level deduplication.
For most other workloads, compression is ideal: from files in media and entertainment to databases, analytics, and more. Increasingly, compression will become the most widely used and effective data reduction strategy.