How tricky is it to save 20x with AWS Glaciers?
When you accumulate a large amount of data stored within Amazon S3, you begin to think about how to reduce costs. Amazon comes to the rescue with Deep Glacier, a storage option costing 23 times less than S3 Standard! The question immediately arises: Why are we not storing everything there and saving a ton of money?
Obviously, this price difference is not without reason. What exactly is the catch? Amazon provides a large amount of documentation, but it was very difficult for me to understand it. I decided to make a guide for people like me who understand the essence of things through their limitations, and not through a detailed description.
Each of the AWS Glacier options has its own specifics.
AWS Glaciers types
No, this is not a typo. I deliberately wrote “Glaciers” in the plural. AWS has several types of very cold storage, and all of their names include the word “Glacier”. These types are:
- S3 object storage class “Glacier” (accessible via S3 API)
- S3 object storage class “Glacier Deep Archive” (accessible via S3 API)
- S3 Glacier Service (accessible via Glacier API)
Don't be fooled by the names: these are three different products, each with its own particularities. For clarity, we will refer to them by the following names:
- S3 Glacier
- S3 Deep Glacier
- Glacier Service
What are their main differences? Glacier Service is a separate service for storing files that are rarely accessed. It has its own pricing page, documentation, API and all the other independent AWS resources. Working with Glacier Service differs from working with S3 even in terms of naming: instead of buckets you create vaults, and instead of objects you upload archives, which you can’t even choose a name for (the ID is generated automatically by the service). To read something, you need to request the previously uploaded archive in a special way and wait until it becomes available for reading.
S3 Glacier and S3 Deep Glacier are object storage classes in S3. That is, you can simply change the storage type of your object with a regular S3 request or load it into S3 immediately with this storage class. Internally, S3 acts as a client of Glacier Service and automatically uploads files with the Glacier and Deep Glacier storage classes there. How it combines our S3 objects into Glacier Service archives is unknown, but we don’t need to know this.
To read an object with such a storage class from S3, we need to perform a RESTORE request on our object. Then wait a few minutes, hours or days (you're in control of restore speed, but a faster restore has a higher price), and then the object becomes readable as an object with the S3 Standard storage class.
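As a minimal sketch of such a RESTORE request, the snippet below uses boto3's `restore_object` call. The helper names `build_restore_request` and `restore_object`, as well as the bucket and key you would pass in, are illustrative assumptions and not part of the original setup.

```python
from typing import Any, Dict

# Retrieval tiers accepted by the S3 RestoreObject API.
TIERS = {"Expedited", "Standard", "Bulk"}


def build_restore_request(days: int, tier: str = "Standard") -> Dict[str, Any]:
    """Build the RestoreRequest payload (hypothetical helper)."""
    if days < 1:
        raise ValueError("days must be at least 1")
    if tier not in TIERS:
        raise ValueError(f"unknown tier: {tier!r}")
    # Days: how long the restored copy stays readable.
    # Tier: the restore speed (faster tiers cost more).
    return {"Days": days, "GlacierJobParameters": {"Tier": tier}}


def restore_object(bucket: str, key: str, days: int, tier: str = "Standard") -> None:
    """Ask S3 to restore one archived object (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3")
    s3.restore_object(
        Bucket=bucket,
        Key=key,
        RestoreRequest=build_restore_request(days, tier),
    )
```

The call only starts the restore. The object becomes readable later, once the restore has finished.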
Understanding the specifics of the different types of Glacier is very important before starting the design of the cold storage system in your project.
Permissions
As usual in AWS, it all starts with permissions. Just keep in mind that you need different permissions for different Glaciers.
| | S3 Glacier | S3 Deep Glacier | Glacier Service |
|---|---|---|---|
| Required permissions | Usual S3 permissions | Usual S3 permissions | Glacier Service permissions to create vaults and archives |
Write specifics
| | S3 Glacier | S3 Deep Glacier | Glacier Service |
|---|---|---|---|
| Write process | Specify this storage class at creation time, set it for an existing S3 object, or set up an S3 lifecycle policy to apply it to objects automatically. | Same as S3 Glacier. | Calculate a tree hash for each upload so that Glacier Service can verify the file was not damaged in transit. When the upload succeeds, Glacier Service returns the new archive ID, which you must save in order to read the archive later. |
| Price | You pay $0.05 for every 1000 new objects with this storage class. | Same as S3 Glacier. | Same here, $0.05 for every 1000 new archives. You can't edit archives. |
Prices are correct for US East region, at the time of writing.
The per-file write price for cold storage is 10 times higher than for S3 Standard, where 1000 PUT requests cost $0.005. That is, the same amount of information is cheaper to write when it is packed into fewer files.
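To make the per-file effect concrete, here is a small worked calculation. The cold-storage price comes from the table above, the S3 Standard price is derived from the "10 times" ratio, and the object counts are made-up illustration values.

```python
COLD_USD_PER_1000_WRITES = 0.05  # S3 Glacier, S3 Deep Glacier, Glacier Service
STANDARD_USD_PER_1000_WRITES = COLD_USD_PER_1000_WRITES / 10  # the 10x ratio


def write_cost_usd(object_count: int, usd_per_1000: float) -> float:
    """Request cost of writing object_count objects or archives."""
    return object_count / 1000 * usd_per_1000


# Hypothetical workload: the same data as many small files or as a few bundles.
small_files = write_cost_usd(10_000_000, COLD_USD_PER_1000_WRITES)
bundles = write_cost_usd(10_000, COLD_USD_PER_1000_WRITES)
print(f"10M small files: ${small_files:.2f}, 10k bundles: ${bundles:.2f}")
```

With these numbers, writing ten million small files costs about $500 in requests alone, while the same data packed into ten thousand archives costs about $0.50.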
Creating archives in the AWS cold store is not difficult. Let's proceed to more complex aspects of it.
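The tree hash that Glacier Service expects with each upload is the one step of the write path that is not obvious. A minimal sketch, assuming the whole payload fits in memory: hash every 1 MiB chunk with SHA-256, then hash neighbouring digests pairwise until a single root remains. The function name `tree_hash` is an illustrative choice.

```python
import hashlib

CHUNK_SIZE = 1024 * 1024  # the tree hash works on 1 MiB chunks


def tree_hash(data: bytes) -> str:
    """Return the hex SHA-256 tree hash of data."""
    # Leaf level: one SHA-256 digest per 1 MiB chunk
    # (a single digest of the empty string for empty input).
    level = [
        hashlib.sha256(data[i:i + CHUNK_SIZE]).digest()
        for i in range(0, len(data), CHUNK_SIZE)
    ] or [hashlib.sha256(b"").digest()]

    # Combine neighbours pairwise until one root digest remains.
    # An unpaired digest at the end of a level moves up unchanged.
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level), 2):
            pair = level[i:i + 2]
            if len(pair) == 2:
                next_level.append(hashlib.sha256(pair[0] + pair[1]).digest())
            else:
                next_level.append(pair[0])
        level = next_level
    return level[0].hex()
```

For payloads of up to 1 MiB the tree hash equals the plain SHA-256 of the data. For real archives you would read the file in chunks instead of loading it whole.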
Storage specifics
Storing data is more fun than writing!
| | S3 Glacier | S3 Deep Glacier | Glacier Service |
|---|---|---|---|
| Storage overhead | Additional 32 KB of data per object. On top of that, every object requires 8 KB of S3 Standard storage to maintain metadata. | Same as S3 Glacier. | None |
| Minimum storage duration, days | 90 | 180 | 90 |
| Storage price, USD† | \[\left(1+\frac{N \times 32\,\mathrm{KB}}{1000\,\mathrm{GB}}\right)\times 4 + \frac{N \times 8\,\mathrm{KB}}{1000\,\mathrm{GB}}\times 23\] | \[\left(1+\frac{N \times 32\,\mathrm{KB}}{1000\,\mathrm{GB}}\right)\times 0.99 + \frac{N \times 8\,\mathrm{KB}}{1000\,\mathrm{GB}}\times 23\] | 4.5 |
† — monthly storage cost of 1000GB of data distributed between N objects/archives
The scary formulas are the result of the storage overhead. With a fixed amount of data, they depend only on N, which confirms what we already knew: storing a lot of small files in cold storage is expensive.
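The formulas are easier to judge with numbers plugged in. The sketch below evaluates them for 1000 GB of data split into N objects. It assumes decimal units (1 GB = 1,000,000 KB), which the original does not state, and the function name is illustrative.

```python
KB_PER_GB = 1_000_000        # assumption: decimal units
DATA_GB = 1000               # the formulas are defined for 1000 GB of data
STANDARD_USD_PER_1000GB = 23.0
GLACIER_USD_PER_1000GB = 4.0
DEEP_GLACIER_USD_PER_1000GB = 0.99


def monthly_cost_usd(n_objects: int, cold_usd_per_1000gb: float) -> float:
    """Monthly cost of 1000 GB stored as n_objects objects in an S3 cold class."""
    # 32 KB per object billed at the cold rate, as a fraction of 1000 GB.
    cold_overhead = n_objects * 32 / KB_PER_GB / DATA_GB
    # 8 KB per object billed at the S3 Standard rate, as a fraction of 1000 GB.
    standard_overhead = n_objects * 8 / KB_PER_GB / DATA_GB
    return (1 + cold_overhead) * cold_usd_per_1000gb + standard_overhead * STANDARD_USD_PER_1000GB


for n in (1_000, 1_000_000, 1_000_000_000):
    glacier = monthly_cost_usd(n, GLACIER_USD_PER_1000GB)
    deep = monthly_cost_usd(n, DEEP_GLACIER_USD_PER_1000GB)
    print(f"N={n:>13,}: S3 Glacier ${glacier:.2f}, S3 Deep Glacier ${deep:.2f}")
```

With a million objects the overhead is small (roughly $4.31 per month for S3 Glacier). With a billion objects of about 1 KB each, the same 1000 GB costs about $316 per month, most of it for metadata billed at the S3 Standard rate.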
However, we now understand it is better to avoid storing temporary or short-lived files in cold storage. It is also clear that S3 Deep Glacier has the cheapest storage.
At Sweatcoin, we had to glue small files with tar into large ones to avoid the cost of a large number of requests and storage of meta-information.
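Gluing small files together can be as simple as the sketch below, which uses Python's standard `tarfile` module. This is an illustration of the idea and not the actual Sweatcoin tooling. The function name `bundle` and the flat member naming are assumptions.

```python
import tarfile
from pathlib import Path
from typing import Iterable


def bundle(files: Iterable[Path], archive_path: Path) -> int:
    """Pack files into one uncompressed tar archive and return the member count."""
    count = 0
    with tarfile.open(archive_path, "w") as tar:
        for path in files:
            # Store each file under its bare name; files with the same name
            # from different directories would end up as duplicate members.
            tar.add(path, arcname=path.name)
            count += 1
    return count
```

The resulting tar file is then uploaded as a single object, so the per-object request price and storage overhead are paid once per bundle and not once per small file. The trade-off is that reading one small file later means restoring the whole bundle.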
Read specifics
You need to decide how you will read your data from Glaciers, because it will cost both money and time, since the data must be restored before reading.
| | S3 Glacier | S3 Deep Glacier | Glacier Service |
|---|---|---|---|
| Restore process | Issue a RESTORE request first. You can specify the restore speed and how long the restored data should stay readable (in days). | Same as S3 Glacier. | Initiate an archive retrieval job. Once the job completes, the archive is available for reading for 24 hours. |
| Restore speed | 1 minute to 12 hours, depending on the requested restore speed | Within 12-48 hours, depending on the requested restore speed | 1 minute to 12 hours, depending on the requested retrieval speed |
| Partial restore | No, you can only restore the whole object | Same as S3 Glacier. | Yes, you can retrieve a megabyte-aligned range of bytes of the archive |
| Restore costs | You pay for restore requests, for the amount of data retrieved, and for temporary storage of the restored copy. Exact prices are on the “Requests and data retrievals” tab of the Amazon S3 pricing page. | Same as S3 Glacier. | You pay for retrieval requests and for the amount of data retrieved. You do not pay for the temporary storage of retrieved data. |
| Read process | The restored object is readable as a regular S3 object by its key. | Same as S3 Glacier. | After a retrieval job is complete, you can get its output. |
| Partial read | Yes, as for a regular S3 object | Same as S3 Glacier. | Yes, you can read a megabyte-aligned range of bytes from the retrieved archive |
| Provisioned capacity available | Yes | No | No |
| Random restore request limit | Glaciers are designed to handle 35 random restore requests per pebibyte (PiB) stored per day. | Same as S3 Glacier. | Same as S3 Glacier. |
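Partial retrieval from Glacier Service only accepts megabyte-aligned ranges, so an arbitrary byte range has to be widened first. The sketch below shows one way to do that. The function name `aligned_range` is hypothetical, and the `start-end` string follows the form the Glacier API uses for its `RetrievalByteRange` job parameter.

```python
MIB = 1024 * 1024  # alignment unit for partial retrievals


def aligned_range(start: int, end: int, archive_size: int) -> str:
    """Widen the inclusive byte range [start, end] to megabyte boundaries."""
    if not 0 <= start <= end < archive_size:
        raise ValueError("range must lie inside the archive")
    # Round the start down to a multiple of 1 MiB.
    aligned_start = start // MIB * MIB
    # Round the end up so that end + 1 is a multiple of 1 MiB,
    # but never past the last byte of the archive.
    aligned_end = min((end // MIB + 1) * MIB, archive_size) - 1
    return f"{aligned_start}-{aligned_end}"


# Bytes 1,500,000..1,600,000 of a 10 MiB archive fall inside the second MiB.
print(aligned_range(1_500_000, 1_600_000, 10 * MIB))  # 1048576-2097151
```

You pay for the widened range, so up to 1 MiB of extra data on each side, which is still far less than retrieving a large archive whole.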
What we learned here:
- The sooner you want to read the file, the more expensive it is
- You can save on read costs if you restore data in pieces, but this can only be done when using the Glacier Service
- “Glaciers are designed to handle 35 random restore requests per pebibyte (PiB) stored per day”.
I spent a very long time trying to find out what the last point means, so I will devote the next section to its explanation.
Random restore request limit
Most likely, Glaciers store data on magnetic tape or optical disks, which is why we are warned about the limitations of random access, because it is difficult and time consuming with that media format. We talked to at least three people from Amazon about the restriction and never received accurate information on how the 35 random restore requests per pebibyte (PiB) stored per day limitation works. However, we have a few facts that will help you not to run into this limit:
- This is neither a hard limit nor a minimum guaranteed number of requests. Apparently it is a guideline. When the Glacier Service is not very busy, it lets you make many more random requests, but you cannot know in advance how generous it will be on any particular day.
- Bulk restore requests are available for S3 Glacier and S3 Deep Glacier objects. If you need to restore many different files, list them all in a single bulk request, and S3 will read them from Glacier Service in an optimal order. Reportedly, a single bulk request can cover billions of objects.
- All Glaciers have an inventory: a list of all objects/archives inside the bucket/vault. Reportedly, if you send restore requests in the recording order reflected in the inventory files, such requests are not considered random and do not count against the limit.
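For planning purposes, the guideline can at least be turned into a rough daily number. The sketch below assumes the figure scales linearly with the amount of stored data, which is our reading of the wording and was never confirmed by anyone at Amazon. The function name is illustrative.

```python
PIB = 1024 ** 5                 # bytes in one pebibyte
TIB = 1024 ** 4                 # bytes in one tebibyte
REQUESTS_PER_PIB_PER_DAY = 35   # the documented guideline


def random_restore_budget(stored_bytes: int) -> float:
    """Estimated random restore requests per day (assumes linear scaling)."""
    return REQUESTS_PER_PIB_PER_DAY * stored_bytes / PIB


print(f"{random_restore_budget(100 * TIB):.1f} random restores per day for 100 TiB")
```

Under this assumption, 100 TiB of cold data gives a budget of only about 3.4 random restores per day, which is why bulk requests and inventory-ordered restores matter.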
Conclusion
AWS Glaciers have a lot of nuances. However, the general rules for designing cold storage systems with Glaciers can be stated quite briefly:
- The cheaper the storage, the more expensive and longer the reading
- There are 3 types of Glaciers with their own characteristics
- Storing a lot of small objects is more expensive than a few large ones
- Glacier Service allows you to restore parts of the archive, but S3 Glaciers do not. Otherwise, S3 Glaciers are more convenient and familiar
- Programmers will have to spend time to fully implement and test the flows of writes and reads, taking into account all the nuances
- You'll have to count a lot
- It’s better not to write anything to Glaciers unless you know how you'll read it
Given that, we imagine the following use cases of Glaciers:
- Backups
- Logs
- Lossless versions of media files for (possible) future use
- Replacement of in-house magnetic tape archiving
Thanks
This article wouldn't be possible without Sirag Nabih and Nick Malyon.