squad
TestComparison: apply transitions while fetching tests
TestComparison is the last bottleneck we currently have. It:
- causes workers to be killed due to OOM, which makes `ProjectStatus.create_or_update` fail, therefore causing the `celery_chord` in the tradefed plugin to fail as well
- causes Build Comparison timeouts
- and I'm sure it causes some serious delay when generating `Notification` objects, because a comparison object is given to it
The main source of this problem is that we load all tests into memory and then apply transitions (pass->fail, fail->pass, etc.); more details here. Bottom line: in most cases, we only need to load the tiny portion of tests that are actually useful.
I want to rework `TestComparison` so that we discard tests that don't fit the wanted transitions on the fly. I still don't know how yet :)
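A minimal sketch of the "on the fly" idea (the names and data shapes here are mine, not SQUAD's actual API): iterate over baseline/target result pairs coming from the database and keep only the wanted transitions, so the full test list never sits in memory at once.

```python
# Sketch only: stream over (name, baseline_result, target_result) tuples,
# e.g. from a server-side cursor, and keep just the wanted transitions.
WANTED_TRANSITIONS = {
    (True, False),   # pass -> fail: regression
    (False, True),   # fail -> pass: fix
}

def interesting_tests(pairs, wanted=WANTED_TRANSITIONS):
    """Yield only the tuples whose (baseline, target) transition is wanted.

    `pairs` is any iterable, so nothing forces the whole test list into RAM.
    """
    for name, baseline, target in pairs:
        if (baseline, target) in wanted:
            yield name, baseline, target
```

The point is that the filter is a generator: tests outside the wanted transitions are discarded as they are read, instead of after everything has been loaded.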
For future reference:
I used two builds in staging, baseline and target, as a PoC that a build reference in the Test table would make things significantly faster. Those builds contain a single testrun each (2840382 and 2840381), each containing 1.3M+ tests.
Normally a build contains many more testruns, and that's where the problem lies: we have no direct way of comparing tests from two different builds without joining the testrun table. By adding `build_id` to `Test`, we could use `build_id` directly in the snippet below, instead of `test_run_id`.
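As a sketch of what that would look like (the `build_id` column on `core_test` is hypothetical, it does not exist yet), the raw query could compare builds directly, with no testrun join:

```python
# Hypothetical: assumes a proposed build_id column on core_test, which does
# NOT exist yet. With it, the comparison no longer goes through the testrun
# table at all -- one query covers all testruns of both builds.
BUILD_COMPARISON_SQL = (
    'SELECT t1.* FROM core_test t1, core_test t2 '
    'WHERE t1.build_id = %s AND t2.build_id = %s '
    'AND t1.metadata_id = t2.metadata_id AND t1.result != t2.result'
)
```

This is the same shape as the testrun-level snippet below, just keyed by build instead of testrun.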
Currently it takes several minutes (maybe 5, maybe 10, depending on load) and almost 5GB of RAM just to run `ProjectStatus.create_or_update`.
I was able to run the same "query" to get fixes and regressions with the snippet below:
```python
from squad.core.models import Test

baseline_tr = 2840382
target_tr = 2840381

# Note: Django's raw() takes %s placeholders regardless of type; the params
# are passed separately to the database driver.
tests = Test.objects.raw(
    'SELECT t1.* FROM core_test t1, core_test t2 '
    'WHERE t1.test_run_id = %s AND t2.test_run_id = %s '
    'AND t1.metadata_id = t2.metadata_id AND t1.result != t2.result',
    [baseline_tr, target_tr],
)

# The query above only returns baseline tests whose result differs from the
# corresponding target test: if a baseline test with result=False is
# returned, the same test in the target run has result=True, so it is
# considered a regression. A fix is the other way around (I'm not taking
# intermittent tests into account yet).
# regressions_and_fixes[False] -> regressions
# regressions_and_fixes[True]  -> fixes
regressions_and_fixes = {True: [], False: []}
for test in tests:
    regressions_and_fixes[test.result].append(test)
```
Running this snippet takes about 5 seconds and uses about 22MB of disk space to sort things out:
```
stagingqareports=> explain analyze SELECT t1.*
FROM
    core_test t1,
    core_test t2
WHERE
    t1.test_run_id = 2840382 AND
    t2.test_run_id = 2840381 AND
    t1.metadata_id = t2.metadata_id AND
    t1.result != t2.result;
                                                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=11417.03..12082.59 rows=2134 width=199) (actual time=3747.691..3756.845 rows=1 loops=1)
   Workers Planned: 1
   Workers Launched: 1
   ->  Merge Join  (cost=10417.03..10869.19 rows=1255 width=199) (actual time=3048.631..3734.013 rows=0 loops=2)
         Merge Cond: (t1.metadata_id = t2.metadata_id)
         Join Filter: (t1.result <> t2.result)
         Rows Removed by Join Filter: 664846
         ->  Sort  (cost=5531.46..5581.40 rows=19976 width=199) (actual time=827.639..1089.421 rows=664847 loops=2)
               Sort Key: t1.metadata_id
               Sort Method: external merge  Disk: 23512kB
               Worker 0:  Sort Method: external merge  Disk: 22488kB
               ->  Parallel Index Scan using core_test_ba18909e on core_test t1  (cost=0.57..2190.07 rows=19976 width=199) (actual time=0.017..348.175 rows=664847 loops=2)
                     Index Cond: (test_run_id = 2840382)
         ->  Sort  (cost=4885.58..4970.47 rows=33959 width=5) (actual time=1518.442..1943.952 rows=1329638 loops=2)
               Sort Key: t2.metadata_id
               Sort Method: external sort  Disk: 24832kB
               Worker 0:  Sort Method: external sort  Disk: 24832kB
               ->  Index Scan using core_test_ba18909e on core_test t2  (cost=0.57..2329.90 rows=33959 width=5) (actual time=0.033..625.049 rows=1329693 loops=2)
                     Index Cond: (test_run_id = 2840381)
 Planning Time: 0.284 ms
 Execution Time: 3770.148 ms
(21 rows)
```
This timing could be improved by setting `work_mem = '32MB'` in Postgres, but I don't want to go there yet.
NOTE: Getting other transitions, e.g. pass -> n/a, would require different query designs, but I think they would still be very fast, and pagination would be much easier.
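For example, a pass -> n/a transition can't come out of the inner join above, since the test has no row in the target run at all. A sketch of one possible design (my assumption, not a settled query) is an anti-join, which also shows why per-transition pagination gets easy:

```python
# Sketch: find baseline tests that passed but have no counterpart in the
# target testrun ("pass -> n/a"). Table and column names match the snippet
# above; params order is [target_tr, baseline_tr, limit, offset].
PASS_TO_NA_SQL = """
SELECT t1.*
FROM core_test t1
LEFT JOIN core_test t2
  ON t2.metadata_id = t1.metadata_id AND t2.test_run_id = %s
WHERE t1.test_run_id = %s
  AND t1.result = true
  AND t2.id IS NULL
ORDER BY t1.id
LIMIT %s OFFSET %s
"""
```

Because each transition becomes its own small query, `LIMIT`/`OFFSET` (or keyset pagination on `t1.id`) applies naturally, instead of paginating an in-memory comparison object.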
I slept on it, and I also think having an environment reference in the Test table is fundamental for correct comparisons. There are cases where the same test is run multiple times for different environments.
There are also cases where the same test is run multiple times for the same environment. I think this can be solved with the "confidence result" that @mwasilew suggested a while ago.