dvc-bench
s3 benchmarks: recovering from errors
s3 benchmarks are heavily influenced by AWS and network conditions. For example, the latest change by @isidentical had to be rerun until the PR build passed. We need to investigate whether we can fix such issues, and consider other options if we cannot. For example, we could set up the s3 remote on a local MinIO instance to avoid the network problems.
It seems like the daily run failed because of this: https://github.com/iterative/dvc-bench/runs/2579933406?check_suite_focus=true. @pared what do you think about implementing something like a retry step for the data pull stage? It would restart itself on failure (most likely from 500 errors) a certain number of times before failing the whole step and cancelling that day's run.
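A minimal sketch of what such a retry step could look like, assuming the pull stage boils down to running `dvc pull` and that a non-zero exit code is a good enough failure signal. The helper name `run_with_retry`, the attempt count, and the backoff delays are hypothetical, not something that exists in dvc-bench today.

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, base_delay=5.0,
                   sleep=time.sleep, run=subprocess.run):
    """Run cmd, retrying on a non-zero exit code.

    Returns the number of the attempt that succeeded and raises
    RuntimeError once all attempts are used up. `sleep` and `run` are
    injectable only so the helper can be exercised without a network.
    """
    for attempt in range(1, attempts + 1):
        result = run(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")
```

The pull stage would then call something like `run_with_retry(["dvc", "pull"])`, so a transient 500 only costs a short wait, while a persistent outage still fails the step and cancels the run.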
@isidentical that might be a good idea. I still wonder whether we should try to implement a MinIO remote. But I guess retry won't conflict with MinIO if we decide to go this way in the future.
Possible solutions to this issue:
- [ ] retry pull
- [ ] migrate from s3 to local MinIO (need to consider possible impact on our benchmarks: the MinIO server will require some computational resources)
Feel free to edit and extend the list
If we are only talking about pulling cats_dogs, then I think both work. But I'm not really inclined to use a local instance for the actual benchmarks as well, since it might mask the cost of the extra API calls, which would cost 1-2 seconds each in the real world (maybe not so much for an ec2 instance in the same region, though it will still cost a lot more than a local instance and be observable).