batch-inference-benchmarks

Comparison feedback

Open · morelen17 opened this issue 2 years ago • 3 comments

Hey @amogkam !

Thanks for the blog post! Although I found the Ray vs. other services results impressive, I decided to conduct my own experiments on AWS SageMaker. After reviewing the source code and running my benchmarks, I am ready to share the results and my concerns regarding the comparison approach.


Concerns:

  1. You compared the performance of Ray on parquet data (batched reading, preprocessing, inference) against SageMaker Batch Transform on image data (a single image per request, ×4 instance count).
  2. For Ray you computed script execution time (source code), while for SageMaker you measured the whole Batch Transform job time (last two cells in the corresponding notebook), which includes instance provisioning, Docker image pull, etc. (see the timing sketch after this list).
  3. No cost comparison was carried out. In SageMaker, 4 × ml.g4dn.xlarge is ~40% cheaper than 1 × ml.g4dn.12xlarge ($0.736/hr per instance, i.e. $2.944/hr total, vs. $4.89/hr, compute only).
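To make the timing comparison apples-to-apples, the billed and end-to-end durations can be read from the job description rather than from notebook wall clock. A minimal sketch with boto3 (the job names are hypothetical placeholders; the describe calls and timestamp fields are standard SageMaker API):

```python
import boto3

sm = boto3.client("sagemaker")

# Batch Transform: billed time runs from TransformStartTime to TransformEndTime;
# end-to-end time additionally includes provisioning and image pull after CreationTime.
job = sm.describe_transform_job(TransformJobName="my-transform-job")  # hypothetical name
billed_s = (job["TransformEndTime"] - job["TransformStartTime"]).total_seconds()
end_to_end_s = (job["TransformEndTime"] - job["CreationTime"]).total_seconds()

# Processing job (the Ray-on-SageMaker run): analogous fields.
proc = sm.describe_processing_job(ProcessingJobName="my-processing-job")  # hypothetical name
proc_billed_s = (proc["ProcessingEndTime"] - proc["ProcessingStartTime"]).total_seconds()

print(f"billed: {billed_s:.0f}s, end-to-end: {end_to_end_s:.0f}s")
```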

Benchmark results (10 GB dataset):

| Service | Job type | Data | Settings | Throughput (billed time), img/s | Throughput (script time), img/s | Price, $ |
| --- | --- | --- | --- | --- | --- | --- |
| Ray | SageMaker Processing job | 16 × 190 MB parquet files | Same as in source code | 50.26 | 101.11 | 0.44 |
| SageMaker | SageMaker Batch Transform job | 120 × ~25.3 MB parquet files | `max_concurrent_transforms=2`, `max_payload=50`, the rest as in the notebook | 58.69 | - | 0.23 |

  • Throughput (billed time), img/s: number of images in the dataset / billed job time in seconds
  • Throughput (script time), img/s: number of images in the dataset / script time in seconds (as measured in the source code)
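For reference, here is a minimal sketch of how the Batch Transform settings in the table map onto the SageMaker Python SDK (the model name, instance configuration, content type, and S3 paths are hypothetical placeholders):

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="resnet-model",             # hypothetical model name
    instance_count=4,                      # 4 x ml.g4dn.xlarge, as in the cost comparison above
    instance_type="ml.g4dn.xlarge",
    max_concurrent_transforms=2,           # settings from the table
    max_payload=50,                        # max request payload, in MB
    output_path="s3://my-bucket/output/",  # hypothetical bucket
)

# 120 x ~25.3 MB parquet files under one S3 prefix
transformer.transform(
    data="s3://my-bucket/parquet-input/",  # hypothetical bucket
    data_type="S3Prefix",
    content_type="application/x-parquet",
)
transformer.wait()
```

The two throughput columns then follow directly from the footnote definitions: the number of images in the dataset divided by billed or script seconds, respectively.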

Would love to discuss my results! And please feel free to point out if I missed something or if I'm wrong about anything. Also ping me if any clarifications from my side are required. Thank you!

morelen17 · Jul 11 '23 13:07

Bump! Curious if these concerns have been looked at. The results were recently presented at Ray Summit 2024, but these concerns were not discussed in the presentation, just the original performance numbers.

rbavery · Oct 06 '24 04:10

Thanks for bumping this, and thank you for the thoughtful initial post.

Addressing the concerns:

  1. I have not had a chance to try out the patch, but indeed, if that change fixes the issue with reading parquet files in SageMaker, then it should be used for the benchmark.
  2. That is a good callout. Cluster startup time on Anyscale for the Ray benchmark and on Databricks for the Spark benchmark should be included. Note that the first row in the linked results table is Ray on SageMaker, which is not one of the configurations reported in the original post; what the blog post reports is Ray on Anyscale.
  3. That's right, there was no cost comparison in the blog post. But it can be calculated from the instance pricing in your chosen region and the total job completion time, as in the sketch below.
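As a rough sketch of that calculation (the rates below are assumed us-east-1 on-demand prices and the durations are made up; check current pricing for your region):

```python
def job_cost(duration_s: float, price_per_hour: float, instance_count: int = 1) -> float:
    """Compute-only cost of a job: billed hours x hourly rate x instance count."""
    return duration_s / 3600 * price_per_hour * instance_count

# Assumed rates for illustration only.
print(job_cost(duration_s=1800, price_per_hour=0.736, instance_count=4))  # 4 x ml.g4dn.xlarge -> ~$1.47
print(job_cost(duration_s=1800, price_per_hour=4.89))                     # 1 x ml.g4dn.12xlarge -> ~$2.45
```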

With these changes, the exact benchmark numbers will be different, but I don't expect any major changes in the overall trends/takeaways from the benchmarks. For fully updated numbers, it would probably be best to re-run all the benchmarks with the more recent updates from the past year.

amogkam · Oct 07 '24 04:10

Thanks @amogkam !

I do think that the 17x SageMaker performance multiple might fall quite steeply once cluster startup and Docker pull are included. Curious whether Anyscale has any plans to rerun these; I'd love to see the results.

rbavery · Oct 07 '24 18:10