
search feature often times out and results in error 500

Open sbourdeauducq opened this issue 3 years ago • 6 comments

e.g. https://hydra.nixos.org/search?query=sd_image_raspberrypi4

500 Internal Server Error DBIx::Class::Storage::DBI::_dbh_execute(): DBI Exception: DBD::Pg::st execute failed: ERROR: canceling statement due to statement timeout [for Statement "SELECT me.id, me.finished, me.timestamp, me.project, me.jobset, me.jobset_id, me.job, me.nixname, me.description, me.drvpath, me.system, me.license, me.homepage, me.maintainers, me.maxsilent, me.timeout, me.ischannel, me.iscurrent, me.priority, me.globalpriority, me.starttime, me.stoptime, me.iscachedbuild, me.buildstatus, me.size, me.closuresize, me.releasename, me.keep, me.notificationpendingsince FROM builds me JOIN projects project ON project.name = me.project JOIN jobsets jobset ON jobset.id = me.jobset_id WHERE ( ( iscurrent = ? AND job ILIKE ? AND jobset.hidden = ? AND project.hidden = ? ) ) ORDER BY project, jobset, job LIMIT ?" with ParamValues: 1='1', 2='%sd_image_raspberrypi4%', 3='0', 4='0', 5='11'] at /nix/store/ym9kzsdswmjlwrmj140jrq29lpgc5jjw-hydra-0.1.20210202.bc12fe1/libexec/hydra/lib/Hydra/Controller/Root.pm line 470

sbourdeauducq avatar Mar 28 '21 04:03 sbourdeauducq

Just had the same happen trying to look up https://hydra.nixos.org/search?query=chromium-unwrapped

deviant avatar Apr 19 '21 15:04 deviant

Encountered same issue with:

  • https://hydra.nixos.org/search?query=language-docker_9_3_0
  • https://hydra.nixos.org/search?query=docker_9_3_0

melg8 avatar Apr 30 '21 08:04 melg8

Same with https://hydra.nixos.org/search?query=libreoffice.x86_64-linux

Also, the jobs table after "The following builds match your query:" is shown twice! (And within the first table some jobs are shown twice, see screenshot.) This may of course be completely unrelated.

dschrempf avatar May 12 '21 20:05 dschrempf

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/consistent-500-from-search-on-hydra-nixos-org/15125/2

nixos-discourse avatar Sep 21 '21 07:09 nixos-discourse

These two queries are problematic: https://github.com/NixOS/hydra/blob/4f308b1f2f8b6298d1288cf63544c3354392624c/src/lib/Hydra/Controller/Root.pm#L472-L483

They're alike (pun intended) as far as constraints go:

  1. Buildoutputs.path ILIKE %query%
  2. Builds.drvPath ILIKE %query%

The first one is already timing out, so I can't tell whether the second one is any faster.
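
For reference, roughly what those two resultsets boil down to (my approximation as I read the Perl; the real queries also join projects/jobsets, filter hidden entries and apply a LIMIT, so take the exact shape with a grain of salt):

-- Approximation only, not the exact SQL DBIx::Class emits.
SELECT b.*
FROM builds b
JOIN buildoutputs o ON o.build = b.id
WHERE o.path ILIKE '%' || 'sd_image_raspberrypi4' || '%'
LIMIT 11;

SELECT b.*
FROM builds b
WHERE b.drvpath ILIKE '%' || 'sd_image_raspberrypi4' || '%'
LIMIT 11;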

Reproducer

I created a similar database layout and filled it with 10M entries in Builds as well as Buildoutputs:

  • PostgreSQL 14, Skylake i5, NVMe
  • Hydra has 163M entries in Buildoutputs (~x16)
  • random selection from 83k unique nix names
  • all path/drvPath are unique
  • ignored all rows not used in WHERE
  • added the indexes from hydra.sql that seem relevant
CREATE TABLE Builds (id serial primary key not null, nixName text, drvPath text not null);

create table BuildOutputs (
    build         integer not null,
    name          text not null,
    path          text not null,
    primary key   (build, name),
    foreign key   (build) references Builds(id) on delete cascade
);
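
Filling it, roughly (a sketch of how such a data set can be generated; md5() stands in for the nix base32 hash, and the name distribution is deterministic here for simplicity):

INSERT INTO Builds (nixName, drvPath)
SELECT 'name-' || (i % 83000),                                            -- ~83k distinct nix names
       '/nix/store/' || md5(i::text) || '-name-' || (i % 83000) || '.drv' -- unique drv paths
FROM generate_series(1, 10000000) AS s(i);

INSERT INTO BuildOutputs (build, name, path)
SELECT id, 'out',
       '/nix/store/' || md5((id + 7)::text) || '-name-' || (id % 83000)   -- unique output paths
FROM Builds;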

Indexes:
    "buildoutputs_pkey" PRIMARY KEY, btree (build, name)
    "indexbuildoutputsonpath" btree (path)
    "indexbuildoutputspath" hash (path)

The timed-out query, minimized:

test=# EXPLAIN ANALYZE SELECT * FROM buildoutputs b WHERE b.path ILIKE '%libreoffice.x86_64-linux%';
                                                             QUERY PLAN                                                              
-------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=1000.00..213088.33 rows=1000 width=98) (actual time=4616.444..4628.821 rows=0 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Seq Scan on buildoutputs b  (cost=0.00..211988.33 rows=417 width=98) (actual time=4593.026..4593.027 rows=0 loops=3)
         Filter: (path ~~* '%libreoffice.x86_64-linux%'::text)
         Rows Removed by Filter: 3333333
 Planning Time: 0.640 ms
 JIT:
   Functions: 6
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.390 ms, Inlining 0.000 ms, Optimization 1.003 ms, Emission 7.864 ms, Total 10.257 ms
 Execution Time: 4629.659 ms
(12 rows)

Time: 4631,313 ms (00:04,631)

So it performs a sequential scan. Had this been the full Hydra DB, it would take roughly 16 × 4631 ms ≈ 74 s.

As for why the indexes are useless:

  • indexbuildoutputspath (hash) only supports equality comparisons
  • indexbuildoutputsonpath (btree) only helps LIKE (not ILIKE), and only when the pattern is left-anchored and the index was created with text_pattern_ops or the server uses the C locale

I suppose the only reason the search sometimes works is that it happens to gather enough rows (20?) before hitting the statement timeout.
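
To illustrate the btree point: only a left-anchored, case-sensitive LIKE against a text_pattern_ops index can use an index range scan here (the index name below is just an example, not one from hydra.sql):

CREATE INDEX indexbuildoutputspath_tpo ON buildoutputs (path text_pattern_ops);

-- left-anchored and case sensitive: can use the btree
EXPLAIN SELECT * FROM buildoutputs WHERE path LIKE '/nix/store/ilprfhnzsq%';

-- unanchored and/or ILIKE: back to a sequential scan
EXPLAIN SELECT * FROM buildoutputs WHERE path ILIKE '%libreoffice%';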

Intermezzo: speeding up the sequential scan

I was curious to see whether we can do better than 4600 ms without indexes. Yes we can.

  • Using LIKE instead of ILIKE, since case insensitivity is really expensive: down to 940 ms
  • Using COLLATE "C": down to 1900 ms

Still prohibitively slow when scaled x16. A search query with a table scan is pointless.
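
Roughly what those two variants look like (a sketch; the exact statements may have differed):

-- case sensitive instead of ILIKE: ~940 ms here
EXPLAIN ANALYZE SELECT * FROM buildoutputs
WHERE path LIKE '%libreoffice.x86_64-linux%';

-- ILIKE, but forcing the C collation on the comparison: ~1900 ms here
EXPLAIN ANALYZE SELECT * FROM buildoutputs
WHERE path ILIKE '%libreoffice.x86_64-linux%' COLLATE "C";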

The drvPath case

If the first query completes, it'll perform a similar query on the Builds table (now looking for a derivation path). This query seems okay in terms of performance, but I still don't think it's the way to go.

That query works because there's a trigram index on Builds.drvPath, which works with unanchored ILIKEs as well:

    -- Provide an index used by LIKE operator on builds.drvpath (search query)
    create index IndexTrgmBuildsOnDrvpath on builds using gin (drvpath gin_trgm_ops);
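
(For anyone reproducing this: gin_trgm_ops comes from the pg_trgm extension, so that has to be enabled first.)

CREATE EXTENSION IF NOT EXISTS pg_trgm;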

As for why I still don't think this is the way to go:

test=# \d+
                                         List of relations
 Schema |     Name      |   Type   | Owner | Persistence | Access method |    Size    | Description 
--------+---------------+----------+-------+-------------+---------------+------------+-------------
 public | buildoutputs  | table    | test  | permanent   | heap          | 1250 MB    | 
 public | builds        | table    | test  | permanent   | heap          | 1289 MB    | 

test=# \di+
                                                     List of relations
 Schema |            Name             | Type  | Owner |    Table     | Persistence | Access method |  Size   | Descrip 
--------+-----------------------------+-------+-------+--------------+-------------+---------------+---------+---------
 public | buildoutputs_pkey           | index | test  | buildoutputs | permanent   | btree         | 483 MB  | 
 public | builds_pkey                 | index | test  | builds       | permanent   | btree         | 214 MB  | 
 public | indexbuildoutputsonpath     | index | test  | buildoutputs | permanent   | btree         | 925 MB  | 
 public | indexbuildoutputspath       | index | test  | buildoutputs | permanent   | hash          | 256 MB  | 
 public | indextrgmbuildsondrvpath    | index | test  | builds       | permanent   | gin           | 1973 MB | 

.. it's about 1.5x the size of the table itself (1973 MB vs 1289 MB), and roughly 8x larger than a simple hash index

The fact that it works is an interesting hack, but indexing the full path is a waste of space because there's no use for trigram matching on the nix hash part of the path.

If anything, such an index might be useful on the nixName?

The obvious solution

Both queries match against a store path (drv or output). I cannot think of a reason not to limit the search to exact matches only: extract the hash if needed (to also support searching for the full store path), lowercase it, and then use an expression index for the lookup.
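
A sketch of what that lookup could look like (the :query placeholder and the exact normalization are mine, just to illustrate; substring(path, 12, 32) skips the '/nix/store/' prefix and takes the 32-character hash):

-- :query stands for the user's search input (placeholder, not Hydra's binding)
SELECT *
FROM buildoutputs
WHERE substring(path, 12, 32) =
      lower(CASE WHEN :query LIKE '/nix/store/%'
                 THEN substring(:query, 12, 32)  -- full store path given
                 ELSE :query                     -- bare hash given
            END);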

After coming up with this, I took a look at the Hydra history and found some related PRs: the ILIKE (and with it the slowness) was introduced in https://github.com/NixOS/hydra/pull/654, the rationale being exactly what I've described, though it turned a nice exact match (which should be fast) into this sequential-scan mess.

Conclusion

I suppose the way to go is to reimplement it as described above, for example:

CREATE INDEX substr ON Buildoutputs USING HASH (substring(path, 12, 32));

test=# EXPLAIN ANALYZE SELECT * FROM buildoutputs WHERE substring(path, 12, 32) = 'ilprfhnzsq04h656b5f3vcsvp4ddv26x';
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on buildoutputs  (cost=1043.50..107297.78 rows=50000 width=98) (actual time=4.639..4.641 rows=1 loops=1)
   Recheck Cond: ("substring"(path, 12, 32) = 'ilprfhnzsq04h656b5f3vcsvp4ddv26x'::text)
   Heap Blocks: exact=1
   ->  Bitmap Index Scan on substr  (cost=0.00..1031.00 rows=50000 width=0) (actual time=0.012..0.012 rows=1 loops=1)
         Index Cond: ("substring"(path, 12, 32) = 'ilprfhnzsq04h656b5f3vcsvp4ddv26x'::text)
 Planning Time: 0.146 ms
 JIT:
   Functions: 2
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 0.897 ms, Inlining 0.000 ms, Optimization 0.420 ms, Emission 4.112 ms, Total 5.429 ms
 Execution Time: 5.595 ms
(11 rows)

Time: 7,266 ms

The drawback is that you can no longer (ab)use a search on the build path to query by name. It might make sense to index the name as well, for searches that do not contain a hash (note that it's a lot faster for longer search strings):

CREATE INDEX IndexTrgmBuildsOnName2 ON buildoutputs USING GIN (name gin_trgm_ops);

# EXPLAIN ANALYZE SELECT * FROM buildoutputs WHERE name LIKE '%lib%';
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Gather  (cost=5609.08..257328.80 rows=498721 width=98) (actual time=90.934..926.061 rows=602953 loops=1)
   Workers Planned: 2
   Workers Launched: 2
   ->  Parallel Bitmap Heap Scan on buildoutputs  (cost=4609.08..206456.70 rows=207800 width=98) (actual time=66.996..860.041 rows=200984 loops=3)
         Recheck Cond: (name ~~ '%lib%'::text)
         Rows Removed by Index Recheck: 1936034
         Heap Blocks: exact=17645 lossy=32877
         ->  Bitmap Index Scan on indextrgmbuildsonname  (cost=0.00..4484.40 rows=498721 width=0) (actual time=77.648..77.649 rows=620273 loops=1)
               Index Cond: (name ~~ '%lib%'::text)
 Planning Time: 0.668 ms
 JIT:
   Functions: 6
   Options: Inlining false, Optimization false, Expressions true, Deforming true
   Timing: Generation 1.345 ms, Inlining 0.000 ms, Optimization 0.569 ms, Emission 5.682 ms, Total 7.596 ms
 Execution Time: 944.716 ms
(15 rows)

Time: 946,846 ms

Note that this index is 1/8th the size of a trigram index on the full nix path (at least for me).

Closing remarks

Having written all of this, I noticed that all of the queries performed for search use similar ILIKEs, though the tables look small enough for it not to matter. Someone might want to check.

I don't have a Hydra instance to test on and am not familiar with Perl, so I wasn't planning on submitting a PR. It doesn't look like too much work for someone who is, though?

EDIT: Related notes by @grahamc https://github.com/NixOS/hydra/pull/853

ius avatar Jan 11 '22 16:01 ius

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/how-to-get-a-nixos-image-with-newer-kernel/17334/6

nixos-discourse avatar Jan 25 '22 19:01 nixos-discourse