light-speed-io
High performance cloud object storage (for reading chunked multi-dimensional arrays)
Some notes (in no particular order) about speeding up reading many chunks of data from cloud storage.
General info about cloud storage buckets
In general, cloud storage buckets are highly distributed storage systems. They can be capable of very high bandwidth. But - crucially - they also have very high latencies of 100-200 milliseconds for first-byte-out.
Contrast this with read latencies of locally attached storage: HDD: ~5 ms; SSD: ~80 µs; DDR5 RAM: 90 ns. In other words: cloud storage latencies are roughly two orders of magnitude higher than a local HDD, and more than three orders of magnitude higher than a local SSD!
How many IOPS can we get from cloud storage to a single machine?
AWS docs say:
Your applications can easily achieve thousands of transactions per second in request performance when uploading and retrieving storage from Amazon S3. Amazon S3 automatically scales to high request rates [although the scaling takes time]. For example, your application can achieve at least 3,500 PUT/COPY/POST/DELETE or 5,500 GET/HEAD requests per second per partitioned Amazon S3 prefix. There are no limits to the number of prefixes in a bucket. You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second...
Some data lake applications on Amazon S3 scan millions or billions of objects for queries that run over petabytes of data. These data lake applications achieve single-instance transfer rates that maximize the network interface use for their Amazon EC2 instance, which can be up to 100 Gb/s on a single instance. These applications then aggregate throughput across multiple instances to get multiple terabits per second.
(Also see this Stack Overflow discussion.)
So... it sounds like we could achieve arbitrarily high IOPS by using lots of prefixes.
How fast can a single VM go? A 100 Gbps NIC could submit 20 million GET requests per second (assuming each GET request is 500 bytes). And a 100 Gbps NIC could receive 2 million 4 kB chunks per second (assuming a 20% overhead, so each request is 5 kB); or 2,000 x 4 MB chunks per second.
To hit 2 million IOPS, you'd need to evenly distribute your reads across 364 prefixes.
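To make those back-of-envelope figures concrete, here is a small worked calculation. It assumes (as the numbers above implicitly do) roughly 80 Gbps of usable goodput on a 100 Gbps NIC, plus the 5,500 GET requests per second per prefix quoted from the AWS docs above:

```rust
// Back-of-envelope only: assumes ~80 Gbps of usable goodput on a 100 Gbps NIC.
fn main() {
    let goodput_bytes_per_sec: f64 = 80e9 / 8.0; // ~10 GB/s of usable goodput

    // 500-byte GET requests on the transmit side:
    let get_request_bytes = 500.0;
    println!(
        "GET requests/s the NIC could submit: {:.0}", // ~20 million
        goodput_bytes_per_sec / get_request_bytes
    );

    // 4 kB chunks with ~20% HTTP overhead (so ~5 kB on the wire) on the receive side:
    let chunk_on_wire_bytes = 4_000.0 * 1.25;
    println!(
        "4 kB chunks/s the NIC could receive: {:.0}", // ~2 million
        goodput_bytes_per_sec / chunk_on_wire_bytes
    );

    // Prefixes needed to sustain 2 million GETs/s at 5,500 GETs/s per prefix:
    println!("prefixes needed: {:.0}", (2e6_f64 / 5_500.0).ceil()); // 364
}
```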
2 million requests per second per machine is a lot! It wasn't that long ago that it was considered very impressive if a single HTTP server could handle 10,000 RPS. I'm not even sure if 2 million RPS is actually possible! Here's a benchmark of a proxy server handling 2 million RPS, on a single machine.
I'd assume that, in order for the operating system to handle anything close to 2 million requests per second, we'd have to submit our GET requests using io_uring.
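As a very rough illustration of what that could look like (a sketch, not LSIO's actual design), here is how a batch of pre-serialized GET requests could be submitted with a single syscall using the Rust `io_uring` crate. It assumes the TCP connections are already open and the ring is large enough for the batch, and it ignores TLS and response parsing entirely:

```rust
// Sketch only: batch-submit pre-serialized HTTP GET requests with one syscall.
// Assumes `sockets` are already-connected TCP streams to the object store and
// `requests[i]` is the serialized GET request to send on `sockets[i]`.
// TLS, response handling, and error recovery are all omitted.
use std::net::TcpStream;
use std::os::unix::io::AsRawFd;

use io_uring::{opcode, types, IoUring};

fn submit_gets(
    ring: &mut IoUring,
    sockets: &[TcpStream],
    requests: &[Vec<u8>],
) -> std::io::Result<()> {
    {
        let mut sq = ring.submission();
        for (i, (sock, req)) in sockets.iter().zip(requests).enumerate() {
            // One SQE per GET request: send the request bytes on that socket.
            let sqe = opcode::Send::new(types::Fd(sock.as_raw_fd()), req.as_ptr(), req.len() as u32)
                .build()
                .user_data(i as u64);
            unsafe { sq.push(&sqe).expect("submission queue full") };
        }
    } // drop `sq` so the kernel sees the new entries when we submit

    // One syscall submits the whole batch and waits for all the sends to complete,
    // instead of one write()-style syscall per request.
    ring.submit_and_wait(requests.len())?;

    for cqe in ring.completion() {
        assert!(cqe.result() >= 0, "send failed for request {}", cqe.user_data());
    }
    Ok(())
}
```

The point is just that the per-request syscall cost is amortised across the whole batch; a real implementation would also read the responses via further SQEs, reuse registered buffers, and handle partial sends.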
Some "tricks" that LSIO could use to improve read bandwidth from cloud storage buckets to a single VM:
- Submit many GET requests concurrently (already planned)
- GET large files in multiple "parts" (using the HTTP `Range` header, or `partNumber`), and download the parts in parallel (already planned)
- Stripe data across many prefixes (like RAID 0; see the sketch below this list). LSIO could act just like a RAID controller: LSIO's user would read and write single files. Under the hood, LSIO would stripe that data over a user-configurable number of prefixes. The RAID chunk size would also be user-configurable. Or, perhaps, when the user submits writes as a sequence of arbitrary-sized chunks (like Zarr), LSIO could split those chunks across multiple S3 prefixes. The docs could tell users that chunks which are often read together should be close together in the submitted sequence of chunks: that works for striping, and also works for merging reads into sequential reads for HDDs.
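Here is a minimal sketch of the striping idea. The `stripe-{i}/` prefix scheme, the round-robin placement, and the Zarr-style chunk key are all illustrative assumptions, not LSIO's actual layout:

```rust
// Sketch only: the prefix scheme and placement rule below are illustrative assumptions.

/// Map the `seq_index`-th chunk of the user's submitted sequence onto one of
/// `n_prefixes` prefixes, round-robin, like RAID 0. Consecutive chunks land on
/// different prefixes, so reading a run of neighbouring chunks spreads the GET
/// requests across the per-prefix request-rate limits.
fn striped_key(seq_index: usize, chunk_key: &str, n_prefixes: usize) -> String {
    let stripe = seq_index % n_prefixes;
    format!("stripe-{stripe}/{chunk_key}")
}

fn main() {
    // e.g. the 7th chunk in the sequence, striped over 16 prefixes:
    assert_eq!(striped_key(7, "my_array/c/0/7", 16), "stripe-7/my_array/c/0/7");
}
```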
More info about high-speed IO from Google Cloud Storage:
For cloud buckets, re-implement the cloud storage API using io_uring. However, cloud storage buckets can be severely limited in terms of read IOPS. For example, the Google Cloud Storage docs state that "buckets have an initial IO capacity of... Approximately 5,000 object read requests per second, which includes listing objects, reading object data, and reading object metadata." The documentation continues: "A longer randomized prefix provides more effective auto-scaling when ramping to very high read and write rates. For example, a 1-character prefix using a random hex value provides effective auto-scaling from the initial 5,000/1,000 reads/writes per second up to roughly 80,000/16,000 reads/writes per second, because the prefix has 16 potential values." This suggests that a longer random prefix could increase read IOPS indefinitely. The "quotas" page says the "Maximum rate of object reads in a bucket" is "unlimited". Bandwidth is limited to 200 Gbps (about 25 GB/s). AWS S3 has similar behaviour (also see this Stack Overflow question).

This does raise a problem with Zarr sharding, though: you might actually want to take chunks which are nearby in n-dimensional space and spread them randomly across many shards, where each shard has a random prefix in its filename (although that forces you to make lots of requests; it may be better to keep nearby chunks close together, and coalesce reads).
Another "trick" to improve performance: use HTTP/3 endpoints:
https://www.linkedin.com/pulse/http-10-vs-11-20-30-swadhin-pattnaik
GCS supports HTTP/3.
Also, on a different topic: when LSIO merges byte ranges, it should avoid hitting API rate limits for the number of GET requests.
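As a rough sketch of the kind of merging meant here (not LSIO's actual algorithm), nearby byte ranges can be coalesced into fewer GETs; `max_gap` is a hypothetical tuning knob for how many wasted bytes we're willing to download in exchange for issuing fewer requests:

```rust
// Sketch only (not LSIO's actual algorithm). `max_gap` is a hypothetical tuning knob:
// ranges separated by at most this many bytes are merged into one GET, downloading
// (and discarding) the gap in exchange for issuing fewer requests.
use std::ops::Range;

fn coalesce_ranges(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        if let Some(last) = merged.last_mut() {
            if r.start <= last.end.saturating_add(max_gap) {
                // Close enough to the previous range: extend it instead of
                // starting a new GET request.
                last.end = last.end.max(r.end);
                continue;
            }
        }
        merged.push(r);
    }
    merged
}

fn main() {
    let ranges: Vec<Range<u64>> = vec![0..100, 90..200, 1_000..1_100, 1_150..1_200];
    let expected: Vec<Range<u64>> = vec![0..200, 1_000..1_200];
    // The first two ranges overlap; the last two are 50 bytes apart, within the 64-byte gap.
    assert_eq!(coalesce_ranges(ranges, 64), expected);
}
```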
Although, it sounds like HTTP/3 isn't always faster. And, in `object_store`, using HTTP/2 with GCS is currently considerably slower than using HTTP/1.1 (that issue also has some interesting benchmarks showing 1.3 GB/s from GCS with 20 concurrent connections using HTTP/1.1). Although that might be due to an issue with `hyper`.
Related
- https://github.com/apache/arrow-rs/issues/5194
- https://github.com/hyperium/hyper/issues/1813
- https://github.com/hyperium/hyper/issues/3071
One other thought: Z-Order will be very important for fast reads on cloud object storage.
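For reference, a Z-order (Morton) key simply interleaves the bits of a chunk's grid coordinates, so chunks that are close in n-dimensional space tend to end up with nearby keys (and hence nearby object names or shard offsets). A minimal 2-D sketch, not tied to any particular Zarr layout:

```rust
// Sketch only: a 2-D Morton (Z-order) key.
// `x` occupies the even bits of the result and `y` the odd bits.
fn morton_2d(x: u32, y: u32) -> u64 {
    // Spread the 32 bits of `v` apart so there is a zero bit between each original bit.
    fn spread(v: u32) -> u64 {
        let mut v = v as u64;
        v = (v | (v << 16)) & 0x0000_FFFF_0000_FFFF;
        v = (v | (v << 8)) & 0x00FF_00FF_00FF_00FF;
        v = (v | (v << 4)) & 0x0F0F_0F0F_0F0F_0F0F;
        v = (v | (v << 2)) & 0x3333_3333_3333_3333;
        v = (v | (v << 1)) & 0x5555_5555_5555_5555;
        v
    }
    spread(x) | (spread(y) << 1)
}

fn main() {
    // Neighbouring chunks (2, 3) and (3, 3) get adjacent Morton codes:
    assert_eq!(morton_2d(2, 3), 0b1110);
    assert_eq!(morton_2d(3, 3), 0b1111);
}
```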
Also note that, when performing ranged GETs on GCS, the checksum is computed for the complete object, so it isn't useful for verifying the integrity of a single chunk. So we'll need checksums for each chunk in the Zarr metadata, I guess.
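A minimal sketch of per-chunk verification under that assumption: the chunk's CRC32C (the same checksum algorithm GCS uses) would be computed at write time and stored in the Zarr metadata. The `crc32c` crate is used here, and the metadata plumbing is left out:

```rust
// Sketch only. Assumes `expected_crc32c` was computed at write time and stored alongside
// the chunk (e.g. in the Zarr metadata), so a ranged GET of the chunk can be verified
// on its own, independently of the whole-object checksum.
fn verify_chunk(chunk_bytes: &[u8], expected_crc32c: u32) -> bool {
    crc32c::crc32c(chunk_bytes) == expected_crc32c
}
```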
Specific cloud VMs with high-performance networking (VM-to-VM bandwidth)
AWS
In summary: The AWS instances with the fastest network adaptors tend to be the "compute optimised" instances, where their name ends in `n` (for networking).

- `c5n`: 100 Gbps (if you have 72 vCPUs!). Intel CPUs.
- `c6gn`: 100 Gbps. AWS Graviton2 CPUs.
- `c6in`: 100, 150, 200 Gbps. Intel CPUs.
- `c7gn`: 100, 150, 200 Gbps. AWS Graviton3E CPUs.
AWS instances with fast networking and a GPU:
- `P5`: "up to 3,200 Gbps of EFAv2 networking", NVIDIA H100s, "support Amazon FSx for Lustre file systems so you can access data at the hundreds of GB/s of throughput and millions of IOPS required for large-scale DL and HPC workloads".
- `G6`: 100 Gbps, NVIDIA L4 GPUs, AMD EPYC CPUs.
- `G4dn`: 100 Gbps, NVIDIA T4 GPUs, custom Intel Cascade Lake CPUs.
- `P3dn`: 100 Gbps, NVIDIA V100 GPUs, Intel Xeon (Skylake) CPUs.
Google Cloud
See:
- https://cloud.google.com/compute/docs/network-bandwidth#summary-table
- https://cloud.google.com/compute/all-pricing#high_bandwidth_configuration
In summary: Some VMs support 100 Gbps. Some 200 Gbps. Some 1,000 Gbps!!
References
- See the "Acknowledgments" section of the Vortex README for some great links on high-throughput compression. Especially:
- The BtrBlocks paper (2023). Shows that parquet+zstd or parquet+snappy are CPU-bound when decompressing data from 100 Gbps S3. parquet+zstd gets 25 Gbps; parquet+snappy gets about 30 Gbps. In contrast,
btrblocks
achieves very close to the 100 Gbps max available bandwidth.
- The BtrBlocks paper (2023). Shows that parquet+zstd or parquet+snappy are CPU-bound when decompressing data from 100 Gbps S3. parquet+zstd gets 25 Gbps; parquet+snappy gets about 30 Gbps. In contrast,
Papers to read
- Jens Axboe: io_uring and networking in 2023
- Analyzing the Performance of Linux Networking Approaches for Packet Processing - a 2023 MSc thesis. Contains a good introduction to the relevant software architectures, and interesting results.
io_uring:
- Zero-copy network rx: https://www.youtube.com/watch?v=LCuTSNDe1nQ (but this work is still at an early stage: for example, it doesn't yet support containers or VMs; only supports Broadcom hardware; the API requires the application to manage lots of new rx queues (so the Rust io_uring crate will need quite a lot of new features))
- Zero-copy network tx: https://lwn.net/Articles/879724/
- Kernel 6.10 improves zero-copy network tx: https://www.phoronix.com/news/Linux-6.10-IO_uring
Also see: 2023: Google Posts Experimental Linux Code For "Device Memory TCP" - Network To/From Accelerator RAM. "If Devmem TCP takes flight, it will allow the accelerator/GPU/device memory to be exposed directly to the network for inbound/outbound traffic."
The zero-copy network rx io_uring folks are talking to the Google folks.
See this paper:
"Exploiting Cloud Object Storage for High-Performance Analytics", Dominik Durner, Viktor Leis, and Thomas Neumann, PVLDB 16 (11), 2023, 49th International Conference on Very Large Data Bases
(Hat tip to Nick Gates for telling me about this paper!)
Very interesting stuff. Lots of useful graphs. Including:
- Detailed, quantitative analysis of performance on various cloud object storage platforms (mostly focused on AWS). And a description of the architecture of cloud object storage systems and pricing.
- Compelling evidence that you need to run on the order of 200 concurrent requests in order to saturate a 100 Gbps network connection to AWS S3.
- Existing cloud object storage libraries spin up one operating system thread per network request. So, with 200 concurrent requests, you spin up 200 threads. That can be expensive. (See the sketch after this list for an async alternative that avoids one-thread-per-request.)
- The authors develop a download manager called AnyBlob (implemented in C++) which uses io_uring and achieves better performance than AWS's C++ libraries.
(note to self: I should aim for a similar graph for LSIO)
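To make the "~200 concurrent requests without 200 OS threads" point concrete, here is a hedged sketch using the `object_store` and `futures` crates on top of tokio (so: async tasks multiplexed over a small thread pool, rather than AnyBlob's io_uring approach, and not LSIO's planned design either). The bucket name and object paths are placeholders:

```rust
// Sketch only: ~200 GETs kept in flight from a handful of OS threads, using the
// `object_store` and `futures` crates on top of tokio. Bucket name and object
// paths are placeholders; errors are simply skipped rather than handled properly.
use futures::stream::{self, StreamExt};
use object_store::{aws::AmazonS3Builder, path::Path, ObjectStore};

#[tokio::main]
async fn main() -> object_store::Result<()> {
    let store = AmazonS3Builder::from_env()
        .with_bucket_name("my-bucket") // placeholder
        .build()?;

    // 10,000 hypothetical chunk keys.
    let paths: Vec<Path> = (0..10_000)
        .map(|i| Path::from(format!("chunks/{i}")))
        .collect();

    // Keep ~200 GETs in flight at once; tokio multiplexes them over its worker
    // threads rather than spawning one OS thread per request.
    let total_bytes: usize = stream::iter(paths)
        .map(|path| {
            let store = &store;
            async move { store.get(&path).await?.bytes().await }
        })
        .buffer_unordered(200)
        .fold(0usize, |acc, result| async move {
            acc + result.map(|bytes| bytes.len()).unwrap_or(0)
        })
        .await;

    println!("downloaded {total_bytes} bytes");
    Ok(())
}
```

An io_uring-based design should be able to shave off further per-request overhead, which is exactly the comparison the AnyBlob paper (and, hopefully, a future LSIO benchmark) makes.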
https://dynamical.org are doing a great job, publishing NWPs as Zarr. Maybe one of my areas of focus could be to provide high-performance read access to data on dynamical.org? e.g. to train ML models directly from dynamical.org's Zarrs?