
Support fuzzy string matching to compare failures

Opened by huydhn · 3 comments

Use Jaro-Winkler string matching to compare failures. This helps in cases where the error contains randomly generated strings, for example https://github.com/pytorch/pytorch/pull/114697:

jaroWinkler(
  "/tmp/pip-install-1ffb916n/fbgemm-gpu_a232bb6f0fa24cea8b498f73f367969c/fbgemm_gpu/src/sparse_ops/sparse_ops_cpu.cpp:129:7: error: ‘optTypeMetaToScalarType’ was not declared in this scope; did you mean ‘c10::optTypeMetaToScalarType’?", 
  "/tmp/pip-install-g1l1attb/fbgemm-gpu_a8335f2b184946059273dcfd4193adee/fbgemm_gpu/src/sparse_ops/sparse_ops_cpu.cpp:129:7: error: ‘optTypeMetaToScalarType’ was not declared in this scope; did you mean ‘c10::optTypeMetaToScalarType’?"
)

returns 0.8987928326805548.
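As an illustration of how that score comes about, here is a minimal plain-Python sketch of Jaro-Winkler similarity (the actual torchci code presumably calls a library implementation; exact scores can differ slightly between implementations depending on window and prefix-boost conventions):

```python
def jaro(s1: str, s2: str) -> float:
    """Jaro similarity: fraction of matching characters, penalized by transpositions."""
    if s1 == s2:
        return 1.0
    len1, len2 = len(s1), len(s2)
    if len1 == 0 or len2 == 0:
        return 0.0
    # Characters count as a match if equal and within this sliding window.
    window = max(len1, len2) // 2 - 1
    matched1 = [False] * len1
    matched2 = [False] * len2
    matches = 0
    for i, c in enumerate(s1):
        lo = max(0, i - window)
        hi = min(i + window + 1, len2)
        for j in range(lo, hi):
            if not matched2[j] and s2[j] == c:
                matched1[i] = matched2[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Count transpositions: matched characters that appear in a different order.
    transpositions = 0
    j = 0
    for i in range(len1):
        if matched1[i]:
            while not matched2[j]:
                j += 1
            if s1[i] != s2[j]:
                transpositions += 1
            j += 1
    transpositions //= 2
    return (matches / len1 + matches / len2
            + (matches - transpositions) / matches) / 3.0


def jaro_winkler(s1: str, s2: str, scale: float = 0.1) -> float:
    """Boost the Jaro score for strings that share a common prefix (up to 4 chars)."""
    sim = jaro(s1, s2)
    prefix = 0
    for a, b in zip(s1, s2):
        if a != b or prefix == 4:
            break
        prefix += 1
    return sim + prefix * scale * (1.0 - sim)
```

Applied to the two fbgemm-gpu error lines above, this scores well above 0.85 even though the randomly named pip temp directories differ, which is exactly the case an exact-match comparison misses.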

So I set the threshold to 0.85 and tried it out. A threshold of 1.0 is equivalent to === string comparison.

Testing

Failures on https://github.com/pytorch/pytorch/pull/114697 are correctly shown as flaky and broken trunk

curl --request POST \
--url "http://localhost:3000/api/drci/drci?prNumber=114697" \
--header "Authorization: TOKEN" \
--data 'repo=pytorch'

huydhn avatar Dec 02 '23 02:12 huydhn

@huydhn is attempting to deploy a commit to the Meta Open Source Team on Vercel.

A member of the Team first needs to authorize it.

vercel[bot] avatar Dec 02 '23 02:12 vercel[bot]


Do you mind also checking that the threshold is enough to prevent similar test names from being marked as the same? I'm also just generally interested in what would be counted as similar based on this.

That's a fair point. Let me find more examples for or against it.

On the other hand, this looks more flexible than the current way we compare failures, so I could set the threshold to 1.0 here if we're not entirely sure, and tweak the value later.
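That rollout plan can be sketched as a small comparator where the similarity function is injected and a threshold of 1.0 short-circuits to plain equality; the name `is_same_failure` and the 0.85 default here are purely illustrative, not the actual torchci API:

```python
from typing import Callable


def is_same_failure(
    a: str,
    b: str,
    similarity: Callable[[str, str], float],
    threshold: float = 0.85,  # illustrative default; 1.0 falls back to exact match
) -> bool:
    """Treat two failure messages as the same if their similarity clears the threshold.

    A threshold of 1.0 reduces to plain equality, matching the old ===
    comparison, so fuzzy matching can be enabled later just by lowering it.
    """
    if threshold >= 1.0:
        return a == b
    return similarity(a, b) >= threshold
```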

huydhn avatar Dec 04 '23 22:12 huydhn