What You Need to Know About Data Storage
Written by: Marc Creviere, Principal Systems Engineer, US Signal
Object storage was conceived in the late 1990s to address a problem that was already emerging: the need to store and access vast amounts of data, in sometimes rapidly expanding datasets, in an efficient and scalable way.
That need is even greater today given what can only be called a data explosion. It’s estimated that every day, internet users alone are creating 2.5 quintillion bytes of data. Analysts predict that by 2021, every person will generate about 1.7 megabytes of data per second.
While object storage isn’t the only data storage option, for many use cases it is the preferred cloud storage choice. To understand when and how to use object storage, you should first understand another storage choice: block storage.
Block vs Object Storage
Traditional block storage splits each file into evenly sized ‘blocks’ of data. Each block has its own address and represents a physical location on storage media. Blocks are unique within the volume they are part of, but are bound by the limits of that volume.
The number and size of blocks determines the size of the volume. Metadata (data about the data) is stored in these blocks as part of the data being stored. This means that if a host system doesn’t already know the exact block location of the data it’s looking for (i.e., the data is unstructured), it has to scan the full content of a potentially large number of blocks to find and retrieve the desired data.
This can result in multiple I/O operations for a single read request, a problem compounded when blocks are highly fragmented across storage media. That may be fine for a relatively small dataset, but it clearly doesn’t scale. Block storage does, however, allow individual blocks to be changed and updated in a very granular way. If you had to re-write an entire database file because you added one row, performance would be untenable.
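The tradeoff described above, scanning many blocks to find data versus cheaply rewriting one block in place, can be sketched in a toy model. The function names and the 4-byte block size below are illustrative assumptions, not any real storage system's implementation:

```python
# Toy model of block storage (illustrative only; real systems use
# block sizes like 4 KiB and hardware-level addressing).
BLOCK_SIZE = 4  # bytes per block

def split_into_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> dict[int, bytes]:
    """Split a payload into fixed-size blocks keyed by block address."""
    return {addr: data[i:i + block_size]
            for addr, i in enumerate(range(0, len(data), block_size))}

def scan_for(blocks: dict[int, bytes], needle: bytes) -> list[int]:
    """Without a known address, the host must scan every block's content."""
    return [addr for addr, chunk in blocks.items() if needle in chunk]

blocks = split_into_blocks(b"hello world, block storage!")
# Finding data without an address means reading every block:
hits = scan_for(blocks, b"world")
# Granular update: rewrite only the one block at a known address,
# leaving every other block untouched.
blocks[0] = b"HELL"
```

The scan touches every block, while the in-place update touches exactly one; that asymmetry is why block storage shines for small, frequent edits but struggles when an application must search large unstructured datasets.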
Object storage, by comparison, stores metadata as a component of the object. Each object is made up of three parts – the data, the metadata, and a unique identifier. In order to know what the data is, the application only has to scan the metadata for each object.
As opposed to block identifiers, which are unique within and bound to a single volume, object identifiers are unique within an object cluster. The cluster can span multiple drives, physical servers, and even different physical data centers, which allows a single object cluster to scale to massive proportions. Objects can’t be edited in place; they must be downloaded in their entirety, updated, and then re-uploaded.
Object storage isn’t suited for applications where existing files (potentially quite large files) need to be modified. All operations with object storage are performed via an HTTP-based API, using familiar HTTP verbs such as PUT and GET.
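A minimal in-memory sketch of these semantics follows: each object bundles data, metadata, and a unique identifier; PUT and GET move whole objects; and lookups scan only metadata. The `ObjectStore` class and its methods are hypothetical, not a real provider SDK:

```python
import uuid

class ObjectStore:
    """Toy object store: each object = (data, metadata) under a unique id."""

    def __init__(self):
        self._objects = {}  # object id -> (data, metadata)

    def put(self, data: bytes, metadata: dict) -> str:
        """PUT uploads a whole object and returns its unique identifier."""
        obj_id = str(uuid.uuid4())
        self._objects[obj_id] = (data, metadata)
        return obj_id

    def get(self, obj_id: str) -> bytes:
        """GET always retrieves the object in its entirety."""
        return self._objects[obj_id][0]

    def find(self, **criteria) -> list[str]:
        """Locate objects by scanning metadata only, never the data itself."""
        return [oid for oid, (_, meta) in self._objects.items()
                if all(meta.get(k) == v for k, v in criteria.items())]

store = ObjectStore()
photo_id = store.put(b"<jpeg bytes>", {"type": "photo", "album": "vacation"})
# "Editing" means download the whole object, modify it, and re-upload:
edited = store.get(photo_id) + b"<edit>"
new_id = store.put(edited, {"type": "photo", "album": "vacation"})
```

Note that the edit produced a second, independent upload rather than an in-place change, which is why object storage fits write-once content better than frequently modified files.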
Object Storage Use Cases
So when does it make sense to consider an object storage solution? Object storage is primarily used when you have unstructured datasets with data that is stored and retrieved in its entirety (though it has likely been broken up into multiple smaller fragments on the back end). Examples include static content such as photos, music, backups, etc. Think of when you’re uploading or viewing pictures on Facebook or listening to music on Spotify. Their servers always want to store or retrieve the full media files. In both cases, retrieving only a portion of the required data would not be useful or efficient.
Because of its massive scalability and ability to run on commodity hardware in a distributed fashion at comparatively low cost, object storage has been a natural technology for backup software vendors to embrace as a target for storing backup data. Because it’s accessed via API over an HTTPS interface, it can be made ubiquitously available either via public-facing providers such as US Signal, or privately for organizations choosing to maintain their own object storage implementations.
That doesn’t mean that block storage is going away. Structured data, such as that in relational databases, often requires and performs well with granular control of blocks and low latency access to the underlying media. Operating systems require you to be able to open and edit files in place rather than issuing GET and PUT requests for entire files at a time.
Some vendors have even implemented block storage that uses object storage in the background. VMware’s vSAN is a notable example. If you’ve ever wondered why vSAN presents a single volume (datastore) that can be scaled out by adding nodes, or how it can offer multiple redundancy options within that one datastore, it’s because under the hood it’s an object storage system.
What to Know as a Consumer
Selecting a data storage solution — and a provider for it — can be daunting. And nobody likes being surprised by unexpected costs. Asking the following questions can help ensure you get what you need.
What is the base rate for storage?
All providers charge a certain rate per GB or TB of data. This may be based on actual consumption, or on a provisioned quota or maximum.
Do I get charged for uploading data?
With some providers, there is a per GB charge to transfer data into object storage. This varies by provider and can even differ between tiers with some providers.
Do I get charged for downloading data?
Similar to uploading, some providers charge you to retrieve data – often at a higher rate.
Do I get charged for the number of API commands issued?
Different applications/backup platforms manage storage differently. You could incur costs based on their implementation of object storage access and use.
Is there a minimum amount of time I need to keep my data there?
Some providers have a minimum storage time for data, typically in cheaper long-term archive tiers. For example, if the minimum is 90 days, and you upload data but delete it that same day, you still pay for that used capacity for the full 90 days.
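The minimum-retention example above can be made concrete with a hedged cost sketch. The function, the $0.004/GB-month rate, and the 30-day billing month are illustrative assumptions, not any provider's actual pricing:

```python
# Hypothetical billing sketch for a tier with a minimum storage period.
def storage_charge(gb: float, days_stored: int, rate_per_gb_month: float,
                   minimum_days: int = 90) -> float:
    """Bill for at least `minimum_days` of storage, even if deleted sooner."""
    billable_days = max(days_stored, minimum_days)
    return gb * rate_per_gb_month * (billable_days / 30)

# Upload 100 GB and delete it the same day: still billed for 90 days.
same_day = storage_charge(100, days_stored=1, rate_per_gb_month=0.004)
# Keep the same 100 GB for 120 days: billed for the actual 120 days.
long_term = storage_charge(100, days_stored=120, rate_per_gb_month=0.004)
```

Under these assumed numbers, the same-day deletion still costs as much as three full months of storage, which is why minimum retention periods matter most for short-lived data.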
Is there a delay in data retrieval?
To make some tiers less expensive, some providers use power management to reduce costs. They use slower rotational drives and, in most cases, limit how many are powered on or spinning at full speed at any given time. This can add hours to data retrieval times. Depending on the urgency of your retrieval needs, this could cause unwanted delays in an important restore operation.
Is data available immediately after upload?
Most current solutions use what is called strong consistency, in which data is available to be read or listed as soon as it has been written. In some cases, though, delays in replicating data between cluster locations mean data isn’t fully available until consistency is reached between redundancy partners.
Does my backup vendor support the tier I am choosing?
Given the variation in features and performance of various storage tiers, there may be restrictions on what services can be used with the platform.
The US Signal Advantage
US Signal Cloud Storage eliminates guesswork and simplifies data management and cost accounting by only charging for data consumed. That’s it. One line item. No ingress/egress. No limit on the number of API requests. No delays in data availability. No minimum retention periods.
US Signal also supports both S3 and Swift API transactions for maximum compatibility across software vendors and use cases. You get highly available REST API gateways, paired with multiple location options, and secure data transport via SSL encryption – all for one low cost.
To learn more about cloud storage options, talk to a US Signal solution engineer. Call us at 866.2.SIGNAL or email: [email protected]