
[uplink] Download failing for `JavaSystemProperties` with `error: missing chunk number 0 for toast value xx in pg_toast_xxx`

Open dduportal opened this issue 2 years ago • 18 comments

Service(s)

Uplink

Summary

As reported by @daniel-beck in IM discussion (ref. https://matrix.to/#/!JLUOInpEYmxJIYXlzs:matrix.org/$VlAjMDeZOPjLhMwkevPtM7P1zklTyH_wvJ9eXfpkHaE?via=g4v.dev&via=gitter.im&via=matrix.org):

Hi folks, I am unable to download the JavaSystemProperties telemetry trial data from uplink.jenkins.io, downloads abort. Other trials' data can be downloaded. Do we have data(base) problems? Anything interesting being logged? (...) Specifically, any date range including Dec 24 fails. Earlier and later work. (...) Correction, 2023-12-31 also fails.

Reproduction steps

  • With an authorized GitHub account, log in to uplink.jenkins.io
  • Open the logs in "follow" mode on publick8s (kubectl -n uplink logs -l 'app.kubernetes.io/name=uplink' -f)
  • Select the jenkins.telemetry.impl.JavaSystemProperties type and the 2023-12-24 to 2023-12-25 time range and click "Export"

dduportal avatar Jan 02 '24 13:01 dduportal

Reproduced the error:

  • The download stops on the client side with an error
  • The logs show the following error message:
error: missing chunk number 0 for toast value 21681973 in pg_toast_17895
    at Connection.parseE (/srv/uplink/node_modules/pg/lib/connection.js:553:11)
    at Connection.parseMessage (/srv/uplink/node_modules/pg/lib/connection.js:378:19)
    at Socket.<anonymous> (/srv/uplink/node_modules/pg/lib/connection.js:119:22)
    at Socket.emit (events.js:180:13)
    at Socket.emit (domain.js:422:20)
    at addChunk (_stream_readable.js:274:12)
    at readableAddChunk (_stream_readable.js:261:11)
    at Socket.Readable.push (_stream_readable.js:218:10)
    at TCP.onread (net.js:581:20)

dduportal avatar Jan 02 '24 13:01 dduportal

Triggered a REINDEX of the table events (will take some time and resources) from a screen session named helpdesk-3883 on the VPN private machine.

dduportal avatar Jan 08 '24 11:01 dduportal

Update:

  • REINDEX table events; finished with success (step 1/3) after ~6 hours
  • Started the pg_toast re-index now (step 2/3):
> select reltoastrelid::regclass from pg_class where relname = 'events';
      reltoastrelid      
-------------------------
 pg_toast.pg_toast_17895
> REINDEX table pg_toast.pg_toast_17895;

TODO:

  • Run an analysis of possible cleanups on the table events with VACUUM analyze events; (step 3/3)
  • Check if it works; otherwise run the same steps on the 4 other tables

dduportal avatar Jan 08 '24 18:01 dduportal

Update:

  • Started the VACUUM analyze events; step (in progress)

dduportal avatar Jan 09 '24 06:01 dduportal

Update:

  • VACUUM did fix some indexes. The download now goes further but still fails.
  • Following the steps from https://gist.github.com/supix/80f9a6111dc954cf38ee99b9dedf187a to identify the corrupted segment

dduportal avatar Jan 09 '24 11:01 dduportal

Update: still trying to find the corrupted table segment

dduportal avatar Jan 10 '24 13:01 dduportal

Update:

> select reltoastrelid::regclass from pg_class where relname = 'events';
reltoastrelid      
-------------------------
pg_toast.pg_toast_17895

> select count(*) from events;
  count
----------
 34673551
(1 row)

# Some retries
> select * from events order by id limit 5000000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

=> now using dichotomy (binary search over offsets) to locate the failing segment
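The offset bisection used in the following comments can be sketched as a generic binary search. A minimal sketch, assuming exactly one corrupted row in the searched range; `probe(limit, offset)` is a hypothetical stand-in for running `select * from events order by id limit ... offset ...` and reporting whether it raises the TOAST error (it is not part of the actual session):

```python
def find_failing_offset(probe, start, end):
    """Binary-search the smallest single-row offset that triggers the error.

    probe(limit, offset) -> True if `SELECT ... LIMIT limit OFFSET offset`
    raises the 'missing chunk number' error, False otherwise.
    Assumes exactly one corrupted row lies in [start, end).
    """
    lo, hi = start, end
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if probe(mid - lo, lo):   # does the first half [lo, mid) fail?
            hi = mid              # corruption is in the first half
        else:
            lo = mid              # corruption is in the second half
    return lo

# Example with a fake probe: pretend the row at offset 30630160 is corrupted.
BAD = 30630160
fake_probe = lambda limit, offset: offset <= BAD < offset + limit
print(find_failing_offset(fake_probe, 30000000, 35000000))  # -> 30630160
```

Each probe reads a large window, so halving the window each time (rather than scanning row by row) is what makes this tractable on a 34M-row table.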

dduportal avatar Jan 11 '24 07:01 dduportal

Update: dichotomy

> select * from events order by id limit 5000000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2500000 offset 30000000;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1250000 offset 30000000; # 30000000 -> 31250000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 625000 offset 30000000; # 30000000 -> 30625000
(no error)


> select * from events order by id limit 625000 offset 30625000; # 30625000 -> 31250000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 312500 offset 30625000; # 30625000 -> 30937500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

dduportal avatar Jan 11 '24 20:01 dduportal

Update:

> select * from events order by id limit 156250 offset 30625000; # 30625000 -> 30781250
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 78125 offset 30625000; # 30625000 -> 30703125
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 35000 offset 30625000; # 30625000 -> 30660000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 10000 offset 30625000; # 30625000 -> 30635000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 5000 offset 30625000; # 30625000 -> 30630000
No error

Almost there!

dduportal avatar Jan 16 '24 17:01 dduportal

Update:

> select * from events order by id limit 5000 offset 30630000; # 30630000 -> 30635000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895


> select * from events order by id limit 2500 offset 30630000; # 30630000 -> 30632500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1000 offset 30630000; # 30630000 -> 30631000
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 500 offset 30630000; # 30630000 -> 30630500
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 100 offset 30630000; # 30630000 -> 30630100
No error

> select * from events order by id limit 100 offset 30630100; # 30630100 -> 30630200
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 50 offset 30630100; # 30630100 -> 30630150
# In progress

dduportal avatar Jan 18 '24 08:01 dduportal

Finally!

> select * from events order by id limit 50 offset 30630100; # 30630100 -> 30630150
No error

> select * from events order by id limit 50 offset 30630150; # 30630150 -> 30630200
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 25 offset 30630150; # 30630150 -> 30630175
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 10 offset 30630150; # 30630150 -> 30630160
No error

> select * from events order by id limit 10 offset 30630160; # 30630160 -> 30630170
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 5 offset 30630160; # 30630160 -> 30630165
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 2 offset 30630160;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

> select * from events order by id limit 1 offset 30630160;
ERROR:  missing chunk number 0 for toast value 19019672 in pg_toast_17895

dduportal avatar Jan 19 '24 09:01 dduportal

Update:

  • Removed the line detected as corrupted:
> select id from events order by id limit 1 offset 30630160;
id
----------
 30648884
(1 row)

> delete from events where id = 30648884;
DELETE 1
uplink=> select id from events order by id limit 1 offset 30630160;
    id
----------
 30648885
(1 row)
  • The download goes further, up to ~3 MB (it was previously blocked around 1.4 MB), but still errors out. Let's check for other corrupted entries

dduportal avatar Jan 19 '24 13:01 dduportal

Here we go again, let's find the other corrupted records (same screen session, same VM):

> select * from events order by id limit 78125 offset 30625000; # 30625000 -> 30703125
# WiP
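Since several adjacent rows turned out to be corrupted, the remaining hunt could also be automated as a per-row scan over the suspect window. A minimal sketch; `fetch_row` is a hypothetical helper standing in for `select * from events where id = ...` (not from this session), expected to raise on a row whose TOAST data is corrupted:

```python
def scan_corrupted(fetch_row, ids):
    """Try to read each row individually; collect the ids that fail.

    fetch_row(row_id) stands in for `SELECT * FROM events WHERE id = %s`
    and should raise an exception when the row's TOAST data is corrupted.
    """
    bad = []
    for row_id in ids:
        try:
            fetch_row(row_id)
        except Exception:
            bad.append(row_id)
    return bad

# Fake fetcher: ids 30648884..30648888 are the corrupted ones found in this issue.
CORRUPTED = set(range(30648884, 30648889))
def fake_fetch(row_id):
    if row_id in CORRUPTED:
        raise RuntimeError("missing chunk number 0 for toast value ...")

print(scan_corrupted(fake_fetch, range(30648880, 30648895)))
# -> [30648884, 30648885, 30648886, 30648887, 30648888]
```

A linear scan is slower per row than the bisection above, but it finds every corrupted id in one pass once the window is small enough.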

dduportal avatar Jan 23 '24 09:01 dduportal

Deleted a few more (corrupted) records and ran a VACUUM:

> delete from events where id = 30648885;
DELETE 1
> delete from events where id = 30648886;
DELETE 1
> delete from events where id = 30648887;
DELETE 1
> delete from events where id = 30648888;
DELETE 1

> VACUUM analyze events;

=> download goes further but still errors. Had to reboot the machine (and lose the screen session), so the search for other corrupted records will have to wait until after FOSDEM.

dduportal avatar Jan 29 '24 07:01 dduportal

FWIW this particular data collection is now (mostly) finished. It was how I got file.encoding stats for the advisory. The other system properties might be interesting but not an immediate concern.

Since it's not just one record it might be interesting to find out what's happening here though, to prevent it from occurring in the future.

daniel-beck avatar Jan 29 '24 09:01 daniel-beck

Research for corrupted records is still running

dduportal avatar Feb 16 '24 13:02 dduportal

Update:

  • Found another corrupted record and deleted it:

    uplink=> select * from events order by id limit 1 offset 30630160;
    ERROR:  missing chunk number 0 for toast value 19019693 in pg_toast_17895
    uplink=> select id from events order by id limit 1 offset 30630160;
       id
    ----------
    30648910
    (1 row)
    
    uplink=> delete from events where id = 30648910;
    DELETE 1
    
  • However, a new error message appears:

    select * from events order by id limit 1 offset 30630160;
    ERROR:  unexpected chunk number 1 (expected 0) for toast value 19019698 in pg_toast_17895
    
  • Started a new reindex + vacuum yesterday around 18:16 (REINDEX table events; REINDEX table pg_toast.pg_toast_17895; VACUUM analyze events;) trying to fix this new error, by regenerating the TOAST index after the record deletions. This task usually takes 10 to 14 hours.

  • But today we had an incident on uplink.jenkins.io:

    • Service was up, but once logged in (using GitHub), an HTTP/500 error with an "Ooooops" message was shown
    • Both pods had error logs with error: TimeoutError: ResourceRequest timed out since Feb 19, 2024 at 18:18:33.264 (as per Datadog collected logs) => this looks like a consequence of the reindex process
    • Deleting pods only led to multiple table migrations during each pod startup... concurrently 😱
    • uplink has always been replicated (at least since https://github.com/jenkins-infra/kubernetes-management/commit/49f1e465ef3a11c01d7efe0d89f5692eb52093c7), but we decided to stop replication to ensure the view migrations during startup are not run concurrently, given the huge amount of data: https://github.com/jenkins-infra/kubernetes-management/pull/4999
    • We also manually edited the deployment definition to remove both the readiness and liveness probes: migration tasks such as https://github.com/jenkins-infra/uplink/blob/main/migrations/20181026220312-first-types-materialized-view.js take 5 to 15 minutes to run, plenty of time for the kubelet to restart the process as it is seen as "unhealthy" => we have to add startup probes to protect us from this, otherwise we'll get locked startups - https://kubernetes.io/docs/tasks/configure-pod-container/configure-liveness-readiness-startup-probes/#define-startup-probes
    • Until we've fixed the corrupted data, we'll keep running a single replica
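For illustration, a startup probe that would cover slow migrations could look like the fragment below. This is a sketch only: the health-check path, port, and thresholds are assumptions, not the actual uplink chart values.

```yaml
# Give the container up to ~20 minutes (120 x 10s) to finish DB migrations
# before liveness checks start killing it. Path/port/thresholds are assumed.
startupProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 10
  failureThreshold: 120
livenessProbe:
  httpGet:
    path: /healthz
    port: 3000
  periodSeconds: 10
```

While the startup probe is failing, the kubelet suppresses liveness and readiness checks, which is exactly the protection the migration window needs.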

dduportal avatar Feb 20 '24 13:02 dduportal

Of course:

> REINDEX table pg_toast.pg_toast_17895;
ERROR:  permission denied for schema pg_toast

because Azure Flexible Server does NOT provide superuser access.

Gotta check https://learn.microsoft.com/en-us/azure/postgresql/flexible-server/how-to-perform-fullvacuum-pg-repack

dduportal avatar Feb 20 '24 17:02 dduportal

Closing as it is not needed anymore and the problem no longer reproduces.

dduportal avatar Nov 07 '24 17:11 dduportal