test-infra

Move perf benchmark and stats data to a dynamo table

Open · ZainRizvi opened this issue 11 months ago · 1 comment

We shouldn't really be uploading original data like our perf benchmarks directly to Rockset. Rather, it should be uploaded to AWS first and then imported into Rockset.

One of our goals is to not lose data in case Rockset ever becomes unavailable, which is why we usually import data into Rockset from AWS. If we want to take Rockset data, process it, and store it back into Rockset, that's still okay, since the original data remains in AWS and can be recomputed if necessary. But original data like our benchmark info should probably go somewhere else first.

(Aside: One could argue that even this is actually a Bad Thing since it’s another set of logic the team will need to learn the caveats of…say for example, knowing that they should limit upload batch sizes to 5000)
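To make the batch-size caveat concrete, here is a minimal sketch of how an uploader could chunk records before calling a bulk-write API. The helper names (`chunked`, `upload_in_batches`) are hypothetical, not part of any existing test-infra tooling, and the 5000 limit is the one mentioned above; other backends have different caps (e.g. DynamoDB's `BatchWriteItem` accepts at most 25 items per call).

```python
# Hypothetical batching helpers -- not actual test-infra code.
# Bulk-write APIs cap request sizes (e.g. ~5000 docs per Rockset
# write, 25 items per DynamoDB BatchWriteItem call), so uploads
# must be split into batches first.

def chunked(records, batch_size):
    """Yield successive batches of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def upload_in_batches(records, write_batch, batch_size=5000):
    """Send records through write_batch, one capped batch at a time."""
    for batch in chunked(records, batch_size):
        write_batch(batch)

# Example: a 12000-record upload needs three write calls at size 5000.
calls = []
upload_in_batches(list(range(12000)), calls.append, batch_size=5000)
print(len(calls))  # → 3
```

Swapping `write_batch` for a real client call (Rockset, `boto3` DynamoDB, etc.) is then a one-line change, which keeps the batching caveat in one place instead of scattered across upload scripts.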

Context: https://github.com/pytorch/pytorch/pull/107095#pullrequestreview-1577370920

ZainRizvi avatar Aug 14 '23 18:08 ZainRizvi

The upload-stats environment used to upload benchmark and stats data to Rockset relies on ROCKSET_API_KEY, and these jobs cannot be limited to only protected branches, i.e. main. Also, Rockset probably doesn't support OpenID authentication AFAIK. We should prioritize this task.

A side benefit of writing benchmark and stats data to DynamoDB or S3 is that the connection from the GitHub runner to AWS services can use OIDC, so no long-lived API key needs to be exposed to the job.
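For illustration, an OIDC-based upload job could look roughly like the sketch below. This is an assumed workflow fragment, not the actual test-infra configuration: the role ARN, region, and script path are placeholders.

```yaml
# Sketch of a job authenticating to AWS via OIDC instead of a stored
# secret. Role ARN and script path below are placeholders.
permissions:
  id-token: write   # required so the runner can request an OIDC token
  contents: read

jobs:
  upload-benchmark-results:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: aws-actions/configure-aws-credentials@v4
        with:
          role-to-assume: arn:aws:iam::123456789012:role/gha-upload-stats
          aws-region: us-east-1
      - name: Upload to S3 / DynamoDB
        run: python tools/upload_stats.py  # hypothetical upload script
```

Because the IAM role's trust policy can restrict which repository and branch may assume it, this setup enforces on the AWS side what a repository secret like ROCKSET_API_KEY cannot.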

huydhn avatar Dec 12 '23 01:12 huydhn