Mismatched total test numbers are confusing

Open domenic opened this issue 8 years ago • 24 comments

E.g. every browser has a different number of total tests for css/ and webgl/; in the webgl/ case the totals can differ by almost an order of magnitude.

What does this mean? If there are tests we aren't running in a given browser, shouldn't that cause us to mark them as failing? So ideally each total should be equal to the currently-highest number.

domenic avatar Aug 24 '17 23:08 domenic

The webgl/ case seems to be the aggregate effect of many tests that errored out at the beginning, e.g.: http://wpt.fyi/webgl/conformance-1.0.3/conformance/glsl/functions/glsl-function-abs.html

In the context of one browser's test run, we can't know about tests that aren't run. For the purposes of the dashboard metrics, we could cross-check every subtest to generate numbers in the way you describe. However I do vaguely remember there are some tests that purposefully have different numbers of subtests based on which platform is being run (although I can't provide any examples, and tests that do that would be against the spirit of platform predictability).

I think it would be reasonable to mark them not as FAIL but as NOTRUN so devs could distinguish between test failures and tests that can't be run yet for some reason. Most of the work here would be on the front-end when we initially download and structure the data.
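The cross-check described above could look something like the following sketch: take the union of subtest names seen in any browser's run of a test file, then fill in NOTRUN for any run that has no result for a name. The dict shapes here are hypothetical and are not the actual wpt.fyi data model.

```python
def cross_check(runs):
    """runs: {browser: {subtest_name: status}} for one test file (illustrative shape)."""
    # Union of every subtest name any browser reported a result for.
    all_subtests = set()
    for results in runs.values():
        all_subtests.update(results)

    # Fill gaps with NOTRUN instead of FAIL, so devs can tell the
    # difference between a test that failed and one that never ran.
    checked = {}
    for browser, results in runs.items():
        checked[browser] = {
            name: results.get(name, "NOTRUN") for name in sorted(all_subtests)
        }
    return checked

runs = {
    "chrome": {"a": "PASS", "b": "FAIL"},
    "safari": {"a": "PASS"},  # never produced a result for "b"
}
print(cross_check(runs)["safari"]["b"])  # NOTRUN
```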

jeffcarp avatar Aug 25 '17 00:08 jeffcarp

FWIW I don't think the numbers should be massaged simply to help people make cheap cross-browser comparisons. The differences we are seeing are presumably real, and provide a clue that more investigation is needed into why the differences exist. Padding the results by assuming that the "real" number of tests for each file/directory is equal to the maximum recorded in any browser is misleading and buries useful information.

jgraham avatar Aug 25 '17 10:08 jgraham

Right now I believe supremely useful information is being buried: how many tests remain to be worked on to get interop.

domenic avatar Aug 25 '17 12:08 domenic

I don't think we have that information, and I don't think it's particularly useful even if we do. We don't have the information because we have no way of knowing how complete the testsuite is, or how many tests will end up running in a hypothetical perfect implementation. I don't think it's particularly useful because it doesn't tell you how many actual bugs need to be fixed; lots of testsuites have multiple tests failing due to a single issue.

The number you want is probably pretty close to max(tests run for a directory) - tests passing in implementation. I think vendors are very capable of calculating that if they find it useful.
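A rough version of that calculation, with made-up counts (the numbers mirror the 995/1000 vs. 9/10 example discussed later in this thread, not real wpt.fyi data):

```python
def remaining_for_interop(totals_by_browser, passing_by_browser, browser):
    """Approximate 'tests left to fix' for one browser in one directory.

    totals_by_browser: {browser: tests run}
    passing_by_browser: {browser: tests passed}
    """
    # Treat the largest recorded total as the best available estimate
    # of the "real" number of tests in the directory.
    real_total = max(totals_by_browser.values())
    return real_total - passing_by_browser[browser]

totals = {"chrome": 1000, "firefox": 10, "safari": 10}
passing = {"chrome": 995, "firefox": 9, "safari": 9}
print(remaining_for_interop(totals, passing, "firefox"))  # 991
```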

Note that I agree that testsuites should try to run the same number of tests all the time, and particularly should avoid misbehaviour like putting variable data in the test title.

jgraham avatar Aug 25 '17 12:08 jgraham

@jgraham, let me try to state this in terms of your use cases from https://github.com/GoogleChrome/wptdashboard/issues/83#issuecomment-323409326.

  1. Find technologies that are widely implemented and so should be promoted as safe to use. The current state is confusing, as a technology might appear to be widely implemented when you see 995/1000 for one browser, and 9/10 for other browsers. All browsers are doing pretty well, it looks like; they all pass at least 90% of the tests! Of course this is incorrect. It'd be better for this use case to display 995/1000, and then 9/1000 x3, to make it clear that the technologies are not widely implemented.
  2. As a browser developer, find areas in which my implementation has interop issues that can be fixed. Again this is difficult to discern with the current setup. In the previous example, it appears I only need to fix one test. So probably that folder is not worth investing much time in. In reality, I need to fix 991 tests, but that is not apparent from the dashboard.
  3. As a test author, find tests that are suspiciously not passing in any browsers I believe implement the specification. This isn't really affected, since in this case someone is looking at the whole row, and drilling down into specific tests, instead of looking at directory or test totals.
  4. Find areas of the Web Platform that have interop concerns and use it to prioritise future engineering effort. This runs into similar problems as 1. If it appears that everyone's doing pretty well, when in fact some browsers are not running the majority of tests, important areas of non-interop will be missed.

domenic avatar Aug 25 '17 17:08 domenic

I think people can notice that different implementations ran different numbers of tests, and infer that something is worth investigating. Evidence: you noticed and filed this bug. I also think that if people are naively computing percentage scores we have big problems even in the case where there are the same number of tests running in each implementation.

If you think this is too hard to notice or people might not understand how to interpret it, I can get behind the idea of a little warning icon, with a title text like "Implementations got results for different tests on the same revision! This suggests an interop problem.". I would also be in favour of including any differences between tests run as a data point if we change the presentation to reflect interoperability rather than browser performance.

I continue to believe that massaging the numbers to make it easy to compute a score, but hard to see what actually happened, is unhelpful to the goals here.

jgraham avatar Aug 25 '17 17:08 jgraham

This is one of the things that initially threw me off and I still don't understand why the test totals are different. As a web developer, I would like to see how well browsers perform and, if I ran into problems on a particular browser, how well that particular feature is covered. With different test totals, I am not able to understand which browser is doing well compared to others. A browser that fails 50% of its 1000 tests might actually be doing fine compared to one that passes 90 of 100. I am unable to reasonably determine which case is better or worse, or whether a certain browser has a lot of issues with a particular feature.

I understand that massaging the numbers is not preferred, but maybe a different solution can be found to show the relative difference between browsers.

TimvdLippe avatar Aug 26 '17 18:08 TimvdLippe

I've also been investigating this issue since Friday. @bobholt sent me the link to this thread earlier today, after I presented him with a report of how many test files are producing erroneous test counts (3873 from the e511e5e8af test run). I'm working towards producing a useful report of the data I've collected.

This is one of the things that initially threw me off and I still don't understand why the test totals are different.

At this point, I can only say that there are a lot of reasons why this is happening. Hopefully I will know more once my report on the e511e5e8af run is complete.

rwaldron avatar Aug 28 '17 17:08 rwaldron

Hello again, @boazsender and I have prepared a "report" that illustrates all of the disparities in test completion by test file; each disparity record includes platform, sha, "did not run" info, completion counts, and expectation counts. The report is built on static data that was derived from the e511e5e8af build that is presently the active result set on wpt.fyi. Using that sha to generate these urls:

  • https://storage.googleapis.com/wptd/e511e5e8af/chrome-62.0-linux-summary.json.gz
  • https://storage.googleapis.com/wptd/e511e5e8af/edge-15-windows-10-sauce-summary.json.gz
  • https://storage.googleapis.com/wptd/e511e5e8af/firefox-57.0-linux-summary.json.gz
  • https://storage.googleapis.com/wptd/e511e5e8af/safari-10-macos-10.12-sauce-summary.json.gz

The content was then used to generate a consolidated data set, which is available here: https://raw.githubusercontent.com/bocoup/wpt-error-report/master/consolidated.json. This data was then used to generate the report shown here: https://bocoup.github.io/wpt-error-report/
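A consolidation along these lines could be sketched as below. This assumes each summary file is a JSON object mapping a test path to a [passing, total] pair, which is how the legacy wpt.fyi summaries were shaped; the sample data is illustrative, not taken from the e511e5e8af run.

```python
def find_disparities(summaries):
    """summaries: {platform: {test_path: [passing, total]}} (assumed shape)."""
    # Union of every test path that appears in any platform's summary.
    all_paths = set()
    for summary in summaries.values():
        all_paths.update(summary)

    disparities = []
    for path in sorted(all_paths):
        totals = {
            platform: summary.get(path, [0, 0])[1]
            for platform, summary in summaries.items()
        }
        # A disparity is any test file whose total differs across platforms.
        if len(set(totals.values())) > 1:
            disparities.append({"test": path, "totals": totals})
    return disparities

summaries = {
    "chrome": {"/2dcontext/a.html": [5, 6]},
    "firefox": {"/2dcontext/a.html": [1, 1]},
}
print(find_disparities(summaries))
```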

rwaldron avatar Aug 29 '17 20:08 rwaldron

TypeError: results.import is undefined[Learn More] wpt-error-report:30:7
	<anonymous> https://bocoup.github.io/wpt-error-report/:30:7

jgraham avatar Aug 29 '17 20:08 jgraham

Thanks for sharing this Rick! I took a look at the report but I'm having a bit of trouble grokking what it's saying. Is the report mostly just pointing out all of the test files that aren't getting run on at least 1 browser?

drufball avatar Aug 29 '17 20:08 drufball

@jgraham Thanks, go ahead and check that again.

@drufball this is just a consolidated report of only the problematic test files, which can be used as a "work list" for rectification.

rwaldron avatar Aug 29 '17 20:08 rwaldron

Looking at some of the errors, is it also a problem of differentiating between:

  1. some tests aren't reporting results because of timeouts (such as not triggering the proper event or callback), so effectively the lack of result should be treated as a failure
  2. some tests aren't executed because the test was ill-conceived.

plehegar avatar Aug 29 '17 21:08 plehegar

Thanks for fixing the page (and for doing this work in the first place of course). It's disappointing that you're still using a nonstandard feature (HTML imports) in an interoperability report…

The numbers are a little hard to follow, but I think plh's taxonomy is pretty much correct. In lots of cases it seems like missing results are because the test doesn't define all the subtests upfront so if the implementation isn't passing the tests it gets ERROR or TIMEOUT for the parent and is missing subtest results. This can be avoided in the test itself, but empirically people don't do that.

jgraham avatar Aug 29 '17 21:08 jgraham

Good analysis @plehegar. @rwaldron and I spent some time yesterday fleshing this out on the report as a starting point for new WPT contributors (check https://bocoup.github.io/wpt-error-report again to see the updates). These inconsistencies simultaneously represent a good backlog for WPT stability, and a good set of starter issues for new contributors. As an aside, I think we could consider guarding against tests that don't run to completion in all target browsers with the PR build bot. CC @bobholt @foolip.

Sorry about that @jgraham, that's my fault. I introduced the import eager to try it, without keeping track of its status :blush:. bocoup/wpt-error-report/pull/3 will remove this non-standard API usage in favor of a fetch after @rwaldron's review.

boazsender avatar Aug 30 '17 14:08 boazsender

@plehegar

some tests aren't reporting results because of timeouts (such as not triggering the proper event or callback), so effectively the lack of result should be treated as a failure

some tests aren't executed because the test was ill-conceived.

Yes, both of these points are 100% correct. Unfortunately, the summary data that we're using to build this does not include the failure "output". As Boaz said, this is a first pass and we hope to flesh out and surface more interesting and useful data as it becomes available. For now, it's up to a hopeful contributor to investigate the cause of failure.

rwaldron avatar Aug 30 '17 14:08 rwaldron

@plehegar for now we're going to update the report to include links to the test runs at wpt.fyi, e.g. http://wpt.fyi/2dcontext/compositing/2d.composite.canvas.xor.html

rwaldron avatar Aug 30 '17 14:08 rwaldron

As an aside, I think we could consider guarding against tests that don't run to completion in all target browsers with the PR build bot

I think that probably adds an intolerable burden for test authors (and is kind of impossible in the general case; if a test crashes or times out due to a browser bug it's unreasonable to expect all the subtests to have results).

Honestly I think the size of the problem here is being overstated. I don't think the goal of this dashboard should be to provide camera-ready numbers for advertising pass rates. I think it should be to guide browser developers to areas of poor interoperability. So there are several things that might improve the situation:

  • Separate out parent tests and subtests for test types where this distinction makes sense (i.e. testharness tests). The number of parent tests is much more likely to be constant than the total number of subtests.
  • Don't separate out the results into PASS/NOT PASS, but put the actual number of each status. A CRASH is more serious than a TIMEOUT is more serious (often) than an ERROR is more serious than a FAIL. Knowing that tests are ERRORing in an implementation immediately suggests a reason for a lack of subtest results, and suggests that the feature is unimplemented (or another feature that happens to be used in the test is).
  • Develop a heuristic for interoperability (c.f. Hipmunk's "agony") that accounts for all the data, not just "percentage of tests passing", and also includes human input about the perceived completeness of the testsuite.
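The second suggestion above, tallying each harness status instead of collapsing to PASS/NOT PASS, is straightforward to sketch. The statuses and test names here are illustrative:

```python
from collections import Counter

def status_breakdown(results):
    """results: {test_name: status} for one browser's run (illustrative shape)."""
    # Count every distinct status so an ERROR or CRASH stays visible
    # instead of being lumped into a single NOT PASS bucket.
    return Counter(results.values())

results = {
    "t1": "PASS", "t2": "PASS", "t3": "ERROR",
    "t4": "TIMEOUT", "t5": "FAIL",
}
counts = status_breakdown(results)
print(counts)  # Counter({'PASS': 2, 'ERROR': 1, 'TIMEOUT': 1, 'FAIL': 1})
```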

jgraham avatar Aug 30 '17 18:08 jgraham

The use cases in https://github.com/GoogleChrome/wptdashboard/issues/98#issuecomment-324984162 are great, thanks @domenic! As we're planning future work for the dashboard, in particular the last 3 are ones at the top of my mind, together with tracking/understanding failures.

@rwaldron @boazsender, thanks for putting together https://github.com/GoogleChrome/wptdashboard/issues/98#issuecomment-325788897, that's very helpful.

We have at least two problems here:

  1. In some cases, different numbers of tests are due to bugs that can and should be fixed.
  2. Current numbers won't quite make sense unless the set of tests run are the same everywhere.

Unfortunately, I'm skeptical that we could fix this only by changing tests; there could be real and unfixable failures in "setup" tests that make it senseless to instantiate the rest of the tests. It'd almost always be possible to change the tests to do it anyway, but for the legit cases it would be better to handle it in the analysis stage instead, keeping the tests simple.

As for summary numbers, we might not keep the presentation we currently have on wpt.fyi, and some ways of presenting the data would create perverse incentives and are worth avoiding. Still, I think the suggested NOTRUN or some other special status as suggested in https://github.com/GoogleChrome/wptdashboard/issues/98#issuecomment-324789066 would make sense, so that the total number of tests is the union of all test names in all browsers. Comparing percentages could still be very misleading, but for an "interop" metric it'd be useful to know what the total number of unique tests is.
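The "union" denominator described above could be computed along these lines: the total for a revision becomes the number of unique test names seen in any browser's run at that revision. The test names are made up for illustration.

```python
def union_total(test_names_by_browser):
    """test_names_by_browser: {browser: set of test names from one revision}."""
    # The shared denominator: every test name any browser ran.
    union = set()
    for names in test_names_by_browser.values():
        union |= names
    return len(union)

names = {
    "chrome": {"a.html", "b.html", "c.html"},
    "edge": {"a.html", "d.html"},
}
print(union_total(names))  # 4
```

A browser's run would then report NOTRUN for each name in the union missing from its own results, so every run shares the same total. Note this only works when comparing runs of the same wpt commit.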

foolip avatar Aug 31 '17 13:08 foolip

Unfortunately, I'm skeptical that we could fix this only by changing tests,

That's a healthy skepticism :) it's not possible to fix all of these by changing tests—many will be fixed by fixing/changing infra and tools.

rwaldron avatar Aug 31 '17 16:08 rwaldron

Oops, credit actually goes to @jgraham for the list in https://github.com/GoogleChrome/wptdashboard/issues/98#issuecomment-324984162, sorry :)

foolip avatar Sep 01 '17 12:09 foolip

We have split this into different repos, and this issue arguably belongs in https://github.com/web-platform-tests/wpt.fyi now. However, I haven't moved it because I'm not quite sure how to summarize the problem.

The subtest names and counts might not add up, and IMHO it would not be time well spent to try to enforce that they do. Having one test create another, so that failing the first never creates the second, is sometimes silly, but not always, so instead we'll have to live with this situation.

Under the "live with it" assumption, we could either just show that the numbers are different, or create a unified view where any test run in any browser is counted. But that only works if comparing the runs for the same commit of wpt, and it will occasionally be useful to compare two different commits.

Concrete suggestions as issues in https://github.com/web-platform-tests/wpt.fyi appreciated.

foolip avatar Apr 17 '18 10:04 foolip

This seems like a web-platform-tests/wpt.fyi issue to me. It has to do with results presentation on the wpt.fyi web property.

mdittmer avatar Apr 17 '18 15:04 mdittmer

@lukebjerring can you move this issue to wpt.fyi?

foolip avatar Oct 22 '18 15:10 foolip