batch-inference-benchmarks

Comparison feedback

Open · morelen17 opened this issue 2 years ago • 3 comments

Hey @amogkam !

Thanks for the blog post! Although I found the Ray vs. other services results impressive, I decided to conduct my own experiments on AWS SageMaker. After reviewing the source code and running my benchmarks, I am ready to share the results and my concerns regarding the comparison approach.


Concerns:

  1. You compared the performance of Ray on parquet data (batched reading, preprocessing, inference) against SageMaker Batch Transform on image data (a single image per request, ×4 instance count).
  2. For Ray you computed script execution time (source code), while for SageMaker you measured the whole Batch Transform job time (last two cells in the corresponding notebook), which includes instance provisioning, Docker image pull, etc. (see the timing sketch after this list).
  3. No cost comparison was carried out. In SageMaker, 4 × ml.g4dn.xlarge is ~40% cheaper than 1 × ml.g4dn.12xlarge ($0.736/hr per instance, i.e. $2.944/hr total, vs. $4.89/hr, compute only).
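To make the timing comparison apples-to-apples, the billed and end-to-end durations can be read from the job description rather than from notebook wall clock. A minimal sketch with boto3 (the job names are hypothetical placeholders; the describe calls and timestamp fields are standard SageMaker API):

```python
import boto3

sm = boto3.client("sagemaker")

# Batch Transform: billed time runs from TransformStartTime to TransformEndTime;
# end-to-end time additionally includes provisioning and image pull after CreationTime.
job = sm.describe_transform_job(TransformJobName="my-transform-job")  # hypothetical name
billed_s = (job["TransformEndTime"] - job["TransformStartTime"]).total_seconds()
end_to_end_s = (job["TransformEndTime"] - job["CreationTime"]).total_seconds()

# Processing job (the Ray-on-SageMaker run): analogous fields.
proc = sm.describe_processing_job(ProcessingJobName="my-processing-job")  # hypothetical name
proc_billed_s = (proc["ProcessingEndTime"] - proc["ProcessingStartTime"]).total_seconds()

print(f"billed: {billed_s:.0f}s, end-to-end: {end_to_end_s:.0f}s")
```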

Benchmark results (10 GB dataset):

| Service | Job type | Data | Settings | Throughput (billed time), img/s | Throughput (script time), img/s | Price, $ |
| --- | --- | --- | --- | --- | --- | --- |
| Ray | SageMaker Processing job | 16 × 190 MB parquet files | Same as in source code | 50.26 | 101.11 | 0.44 |
| SageMaker | SageMaker Batch Transform job | 120 × ~25.3 MB parquet files | `max_concurrent_transforms=2`, `max_payload=50`, the rest as in the notebook | 58.69 | - | 0.23 |

  • Throughput (billed time), img/s: number of images in the dataset / billed job time in seconds
  • Throughput (script time), img/s: number of images in the dataset / script time in seconds (as measured in the source code)
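For reference, here is a minimal sketch of how the Batch Transform settings in the table map onto the SageMaker Python SDK (the model name, instance configuration, content type, and S3 paths are hypothetical placeholders):

```python
from sagemaker.transformer import Transformer

transformer = Transformer(
    model_name="resnet-model",             # hypothetical model name
    instance_count=4,                      # 4 x ml.g4dn.xlarge, as in the cost comparison above
    instance_type="ml.g4dn.xlarge",
    max_concurrent_transforms=2,           # settings from the table
    max_payload=50,                        # max request payload, in MB
    output_path="s3://my-bucket/output/",  # hypothetical bucket
)

# 120 x ~25.3 MB parquet files under one S3 prefix
transformer.transform(
    data="s3://my-bucket/parquet-input/",  # hypothetical bucket
    data_type="S3Prefix",
    content_type="application/x-parquet",
)
transformer.wait()
```

The two throughput columns then follow directly from the footnote definitions: the number of images in the dataset divided by billed or script seconds, respectively.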

Would love to discuss my results! And please feel free to point out if I missed something or if I'm wrong about anything. Also ping me if any clarifications from my side are required. Thank you!

morelen17 · Jul 11 '23 13:07

Bump! Curious if these concerns have been looked at. The results were recently presented at Ray Summit 2024, but these concerns were not discussed in the presentation, just the original performance numbers.

rbavery · Oct 06 '24 04:10

Thanks for bumping this, and thank you for the thoughtful initial post.

Addressing the concerns:

  1. I have not had a chance to try out the patch, but indeed, if that change fixes the issue with reading parquet files in SageMaker, then it should be used for the benchmark.
  2. That is a good callout. Cluster startup time on Anyscale for the Ray benchmark and on Databricks for the Spark benchmark should be included. Note that the first row in the linked results table is Ray on SageMaker, which is not one of the configurations reported in the original post; what the blog post reports is Ray on Anyscale.
  3. That's right, there was no cost comparison in the blog post. But it can be calculated from the instance pricing in your chosen region and the total job completion time, as in the sketch below.
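As a rough sketch of that calculation (the rates below are assumed us-east-1 on-demand prices and the durations are made up; check current pricing for your region):

```python
def job_cost(duration_s: float, price_per_hour: float, instance_count: int = 1) -> float:
    """Compute-only cost of a job: billed hours x hourly rate x instance count."""
    return duration_s / 3600 * price_per_hour * instance_count

# Assumed rates for illustration only.
print(job_cost(duration_s=1800, price_per_hour=0.736, instance_count=4))  # 4 x ml.g4dn.xlarge -> ~$1.47
print(job_cost(duration_s=1800, price_per_hour=4.89))                     # 1 x ml.g4dn.12xlarge -> ~$2.45
```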

With these changes, the exact benchmark numbers will be different, but I don't expect any major changes in the overall trends/takeaways from the benchmarks. For fully updated numbers, it would probably be best to re-run all the benchmarks with the more recent updates from the past year.

amogkam · Oct 07 '24 04:10

Thanks @amogkam !

I do think that the 17x SageMaker performance multiple might fall quite steeply once cluster startup and Docker pull are included. Curious whether Anyscale has any plans to rerun these; I'd love to see the results.

rbavery · Oct 07 '24 18:10