
s3 benchmarks: recovering from errors

Open pared opened this issue 4 years ago • 4 comments

s3 benchmarks are heavily influenced by AWS and network moods. For example, the latest @isidentical change had to be rerun until the PR build started passing. We need to investigate whether we can fix such issues, and consider other options if we cannot. For example, we could consider setting up the s3 remote on a local MinIO, to avoid the network problems.

pared avatar May 05 '21 09:05 pared

It seems like the daily run failed because of this: https://github.com/iterative/dvc-bench/runs/2579933406?check_suite_focus=true. @pared what do you think about implementing something like a retry step for the data pull stage? It would restart itself if it fails (most likely from 500 errors) a certain number of times before failing the whole step and cancelling that day's run.
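A rough sketch of what such a retry step could look like, assuming the pull is a plain `dvc pull` invoked from a Python helper. The function name `run_with_retry`, the attempt count, and the backoff delays are all hypothetical, not something dvc-bench currently has:

```python
import subprocess
import time


def run_with_retry(cmd, attempts=3, base_delay=5.0,
                   runner=subprocess.run, sleep=time.sleep):
    """Run cmd until it exits with 0, at most `attempts` times.

    Waits base_delay, 2*base_delay, 4*base_delay, ... seconds between
    tries. Returns the number of the attempt that succeeded, or raises
    RuntimeError once every attempt has failed.
    """
    for attempt in range(1, attempts + 1):
        result = runner(cmd)
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Exponential backoff gives a transient 500 time to clear.
            sleep(base_delay * 2 ** (attempt - 1))
    raise RuntimeError(f"{cmd!r} failed after {attempts} attempts")


if __name__ == "__main__":
    # Hypothetical usage: retry the data pull before giving up on the run.
    run_with_retry(["dvc", "pull"])
```

Raising after the last attempt keeps the "fail the whole step" behaviour, so a persistent outage still cancels that day's run instead of hanging.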

isidentical avatar May 14 '21 12:05 isidentical

@isidentical that might be a good idea. I still wonder whether we should try to implement a MinIO remote. But I guess retry won't conflict with MinIO if we decide to go this way in the future.

pared avatar May 14 '21 13:05 pared

Possible solutions to this issue:

  • [ ] - retry pull
  • [ ] - migrate from s3 to local MinIO (need to consider possible impact on our benchmarks - MinIO server will require some computational resources)

Feel free to edit and extend the list

pared avatar May 14 '21 13:05 pared

If we are only talking about pulling the cats_dogs, then I think both work. But I'm not really inclined to use a local instance for the actual benchmarks as well, since it might mask the cost of the extra API calls that would cost 1-2 seconds each in the real world (maybe not so much for an EC2 instance in the same region, though it would still cost a lot more than a local instance and be observable).

isidentical avatar May 15 '21 23:05 isidentical