|Description||AWS Simple Storage Service|
|Quality||No Sentry, no SONAR|
AWS Simple Storage Service (S3) is an object storage service. It is used for the following purposes:
- Storing first-class data objects, e.g. Events
- Unbounded general-purpose key-value store, e.g. Percolator Checkpoints
- Indexing data objects, e.g. Events by prefix
- Serving websites via CloudFront, e.g. the current Event Data User Guide.
S3 storage is durable, and we trust it to store our files. However, unexpected things can happen, including human error, so buckets can be automatically replicated, including across regions.
Key-value store
The S3 interface is a simple key-value store. This means it can be used as a key-value store for things like checkpointing, where we might otherwise use tools like Redis, CouchDB, MongoDB, ElasticSearch, etc. This gives us the benefit of zero maintenance and effectively unlimited scalability. The trade-off is latency, but there are situations where this makes sense.
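A minimal sketch of this pattern in Python. The `FakeS3Client` is an in-memory stand-in used purely for illustration; its `put_object`/`get_object` methods mirror the boto3 S3 client's signatures, and the `checkpoints/` key prefix and bucket name are made-up examples:

```python
import io

class FakeS3Client:
    """In-memory stand-in for a boto3 S3 client (illustration only)."""
    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = Body

    def get_object(self, Bucket, Key):
        # The real client returns a streaming body; BytesIO is close enough here.
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}

class CheckpointStore:
    """Use an S3 bucket as a plain key-value store for checkpoints."""
    def __init__(self, client, bucket):
        self.client = client
        self.bucket = bucket

    def save(self, name, value):
        self.client.put_object(Bucket=self.bucket,
                               Key="checkpoints/" + name,
                               Body=value)

    def load(self, name):
        response = self.client.get_object(Bucket=self.bucket,
                                          Key="checkpoints/" + name)
        return response["Body"].read()

store = CheckpointStore(FakeS3Client(), "example-bucket")
store.save("percolator", b"offset-12345")
print(store.load("percolator"))  # b'offset-12345'
```

Swapping the fake for a real `boto3.client("s3")` would give the zero-maintenance store described above, at the cost of a network round-trip per read or write.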
S3 provides a prefix-based index. This means keys can be listed not only by a query like “/path/to/” but also by a partial prefix like “/path/to/12”. Although slashes are used by convention as directory delimiters, they have no special significance to S3. This is leveraged in Event Data, which can, for example, retrieve Events by variable prefixes (e.g. “1*”, “ab*” etc) when creating archive files.
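A small simulation of what the prefix index gives us (the keys are invented for illustration; real code would pass the prefix as the `Prefix` parameter of a list request):

```python
def list_by_prefix(keys, prefix):
    """Return all keys starting with the given prefix, as S3's prefix
    index would. Note slashes get no special treatment: a prefix can
    stop mid-"filename"."""
    return sorted(k for k in keys if k.startswith(prefix))

keys = [
    "2017/01/event-ab123.json",
    "2017/01/event-ab456.json",
    "2017/01/event-cd789.json",
]

# Prefix ends partway through a key segment, not at a slash.
print(list_by_prefix(keys, "2017/01/event-ab"))
# ['2017/01/event-ab123.json', '2017/01/event-ab456.json']
```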
S3 has a defined maximum file size per transfer. Command-line tools will automatically perform multi-part uploads when the file exceeds the threshold, but the standard Java client doesn’t do this unless you specifically request it. When downloading or uploading large files, e.g. archives, if there’s an unexpected error, check the file size. You may need to modify code as appropriate.
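As a back-of-the-envelope check, the arithmetic looks like this. The 5 GiB single-PUT limit and 10,000-part maximum are S3’s documented limits; the default part size here is an arbitrary choice for illustration:

```python
import math

GIB = 1024 ** 3
SINGLE_PUT_LIMIT = 5 * GIB   # largest object a single PUT can upload
MAX_PARTS = 10_000           # S3's cap on parts per multipart upload

def plan_upload(size_bytes, part_size=100 * 1024 * 1024):
    """Decide between a single PUT and a multipart upload,
    returning the number of parts needed."""
    if size_bytes <= SINGLE_PUT_LIMIT:
        return 1
    parts = math.ceil(size_bytes / part_size)
    if parts > MAX_PARTS:
        raise ValueError("part size too small for this object")
    return parts

print(plan_upload(1 * GIB))  # 1  (fits in a single PUT)
print(plan_upload(6 * GIB))  # 62 (needs multipart: ceil(6144 MiB / 100 MiB))
```

A client that doesn’t make this decision for you will simply fail somewhere past the single-PUT limit, which is why checking the file size is the first debugging step.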
S3 offers strict read-after-write semantics when creating a new object (within certain parameters). This means that if an object is stored, it can be retrieved immediately.
When updating an object, the model is ‘eventually consistent’: a read immediately after an overwrite may return the old version. This consistency model is another factor in the trade-off.
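One way to cope with this on the read side is to retry until the value we just wrote appears. The following is a simulation, not real S3 code: `fetch` stands in for a GET against a possibly-stale replica:

```python
def read_until(fetch, expected, max_attempts=10):
    """Poll a possibly-stale read until it returns the value we just
    wrote, giving up after max_attempts (simulation only)."""
    for attempt in range(1, max_attempts + 1):
        value = fetch()
        if value == expected:
            return value, attempt
    raise TimeoutError("replica never converged")

# Simulate a replica that serves the old value twice before converging.
staleness = iter([b"old", b"old", b"new", b"new"])
value, attempts = read_until(lambda: next(staleness), b"new")
print(value, attempts)  # b'new' 3
```

For checkpoint-style workloads this is rarely needed in practice, because new-object writes get the stricter read-after-write behaviour described above.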
The S3 interface is implemented by other services, both self-hosted and from other cloud providers. If necessary, we can transition away from S3 by copying the files. The only slightly exotic feature we use is prefix indexing; if push comes to shove, this functionality can be replicated.