pg_squeeze icon indicating copy to clipboard operation
pg_squeeze copied to clipboard

Final table swap fails with `cannot fetch toast data without an active snapshot` error

Open airhorns opened this issue 9 months ago • 6 comments

I'm trying to squeeze a large table with a lot of bloat. The worker seems to make it almost all the way through the squeeze process -- from squeeze.get_active_workers() I see ins_initial slowly rise, then ins / upd / del rise quickly, and but then a failure where the original table remains in place. In squeeze.errors, I see a row like this:

id occurred tabschema tabname sql_state err_msg err_detail
7 2025-03-19 12:43:16.197313+00 public app_background_actions XX000 cannot fetch toast data without an active snapshot

We've successfully squeezed other tables on this database. I've tried a few times at different times of day and with less and less active load on the table, but I've gotten the same error each time.

Here's the other system stats:

  • PostgreSQL 13.18 on x86_64-pc-linux-gnu, compiled by gcc (GCC) 8.5.0 20210514 (Red Hat 8.5.0-22), 64-bit
  • Running on Cruchybridge in GCP
  • pg_squeeze version v1.6 from pg_extension:
oid extname extowner extnamespace extrelocatable extversion extconfig extcondition
4064493083 pg_squeeze 10 4064493081 0 1.6 NULL NULL
  • other active logical replication slots on this database
  • default replica identity for this table:
SELECT CASE relreplident
          WHEN 'd' THEN 'default'
          WHEN 'n' THEN 'nothing'
          WHEN 'f' THEN 'full'
          WHEN 'i' THEN 'index'
       END AS replica_identity
FROM pg_class
WHERE oid = 'app_background_actions'::regclass;
replica_identity
default
  • I don't see any errors in the database log from around the time of the swap, but I am not exactly sure what I am looking for.

I'm not quite sure what other debugging information to share -- I know this isn't very much. Let me know and I can try to collect it to understand what's happening!

airhorns avatar Mar 19 '25 14:03 airhorns

Origin of error: https://github.com/postgres/postgres/blob/474aee3dfe8860994893959ef1566c51d3304aa3/src/backend/access/common/toast_internals.c#L646

NikolayS avatar Mar 19 '25 14:03 NikolayS

Why do you think the error happens during the swap phase? Does server log indicate so? In such a case, the only idea I have so far is that pg_class entry for your table has some column (relacl, reloptions, relpartbound) TOASTed. It's unusual, but seems possible. Can you please check?

ahouska avatar Mar 19 '25 16:03 ahouska

Why do you think the error happens during the swap phase? Does server log indicate so?

No hard evidence, just guessing based on watching the entry in the squeeze.get_active_workers() go through the normal phases. In the ~7 attempts I've ran, I always see ins_initial rise to about the row count of the table, with the other three values remaining 0, then them all rise, then the error. Is there a more verbose log I can access to share somehow?

In such a case, the only idea I have so far is that pg_class entry for your table has some column (relacl, reloptions, relpartbound) TOASTed. It's unusual, but seems possible. Can you please check?

We have a massive postgres catalog, so something like this might be happening. I did my best to check if any columns of pg_class are toasted for this table with this:

SELECT c.relname, 
       a.attname,
       pg_column_size(c.*) AS total_size,
       pg_column_size(c.relname) AS relname_size,
       pg_column_size(c.relacl) AS relacl_size,
       pg_column_size(c.reloptions) AS reloptions_size,
       pg_column_size(c.relpartbound) AS relpartbound_size
FROM pg_class c
JOIN pg_attribute a ON a.attrelid = 'pg_class'::regclass
WHERE c.relname = 'app_background_actions'
AND a.attnum > 0
AND a.attisdropped = false
ORDER BY a.attnum;

and got this out, which suggests that none are I think:

relname attname total_size relname_size relacl_size reloptions_size relpartbound_size
app_background_actions oid 452 64 93 184 NULL
app_background_actions relname 452 64 93 184 NULL
app_background_actions relnamespace 452 64 93 184 NULL
app_background_actions reltype 452 64 93 184 NULL
app_background_actions reloftype 452 64 93 184 NULL
app_background_actions relowner 452 64 93 184 NULL
app_background_actions relam 452 64 93 184 NULL
app_background_actions relfilenode 452 64 93 184 NULL
app_background_actions reltablespace 452 64 93 184 NULL
app_background_actions relpages 452 64 93 184 NULL
app_background_actions reltuples 452 64 93 184 NULL
app_background_actions relallvisible 452 64 93 184 NULL
app_background_actions reltoastrelid 452 64 93 184 NULL
app_background_actions relhasindex 452 64 93 184 NULL
app_background_actions relisshared 452 64 93 184 NULL
app_background_actions relpersistence 452 64 93 184 NULL
app_background_actions relkind 452 64 93 184 NULL
app_background_actions relnatts 452 64 93 184 NULL
app_background_actions relchecks 452 64 93 184 NULL
app_background_actions relhasrules 452 64 93 184 NULL
app_background_actions relhastriggers 452 64 93 184 NULL
app_background_actions relhassubclass 452 64 93 184 NULL
app_background_actions relrowsecurity 452 64 93 184 NULL
app_background_actions relforcerowsecurity 452 64 93 184 NULL
app_background_actions relispopulated 452 64 93 184 NULL
app_background_actions relreplident 452 64 93 184 NULL
app_background_actions relispartition 452 64 93 184 NULL
app_background_actions relrewrite 452 64 93 184 NULL
app_background_actions relfrozenxid 452 64 93 184 NULL
app_background_actions relminmxid 452 64 93 184 NULL
app_background_actions relacl 452 64 93 184 NULL
app_background_actions reloptions 452 64 93 184 NULL
app_background_actions relpartbound 452 64 93 184 NULL

airhorns avatar Mar 19 '25 18:03 airhorns

Hm, pg_class actually has no TOAST relation.

Can you please dump the table (without data), restore it in another database and try if the squeezing completes? Failure here would reveal that there is something specific about the table definition.

If that works, can you please get the backtrace of the error? (The PID to attach to can be retrieved from get_active_workers(), unless the worker exits too early.) Thanks.

ahouska avatar Mar 20 '25 11:03 ahouska

This is rather a wild idea, anyway: I wonder if you managed to "accidentally" add TOAST table to some system catalog. It's probably not very likely, but might be worth checking. Can you please share the output of this query?

select count(*) from pg_class where oid < 16384 and reltoastrelid > 0;

ahouska avatar Apr 02 '25 08:04 ahouska

The count returns 38 rows, here's the raw result:

oid relname relnamespace reltype reloftype relowner relam relfilenode reltablespace relpages reltuples relallvisible reltoastrelid relhasindex relisshared relpersistence relkind relnatts relchecks relhasrules relhastriggers relhassubclass relrowsecurity relforcerowsecurity relispopulated relreplident relispartition relrewrite relfrozenxid relminmxid relacl reloptions relpartbound
826 pg_default_acl 11 12042 0 10 2 826 0 1 20 1 4143 1 0 p r 5 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
1213 pg_tablespace 11 12027 0 10 2 0 1664 1 2 1 4185 1 1 p r 5 0 0 0 0 0 0 1 n 0 0 257483384 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
1247 pg_type 11 71 0 10 2 0 0 42415 862704 42415 4171 1 0 p r 31 0 0 0 0 0 0 1 n 0 0 257520402 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
1255 pg_proc 11 81 0 10 2 0 0 99 3270 99 2836 1 0 p r 29 0 0 0 0 0 0 1 n 0 0 257483384 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
1260 pg_authid 11 2842 0 10 2 0 1664 2 33 2 4175 1 1 p r 12 0 0 0 0 0 0 1 n 0 0 257260192 305707150 {postgres=arwdDxt/postgres} NULL NULL
1262 pg_database 11 1248 0 10 2 0 1664 1 5 1 4177 1 1 p r 14 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
1417 pg_foreign_server 11 12037 0 10 2 1417 0 0 0 0 4151 1 0 p r 8 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
1418 pg_user_mapping 11 12038 0 10 2 1418 0 0 0 0 4173 1 0 p r 4 0 0 0 0 0 0 1 n 0 0 257520402 305707158 {postgres=arwdDxt/postgres} NULL NULL
2328 pg_foreign_data_wrapper 11 12036 0 10 2 2328 0 1 2 1 4149 1 0 p r 7 0 0 0 0 0 0 1 n 0 0 257270919 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2396 pg_shdescription 11 12029 0 10 2 0 1664 1 3 1 2846 1 1 p r 3 0 0 0 0 0 0 1 n 0 0 257493699 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2600 pg_aggregate 11 12013 0 10 2 2600 0 2 141 2 4159 1 0 p r 22 0 0 0 0 0 0 1 n 0 0 257378384 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2604 pg_attrdef 11 12000 0 10 2 2604 0 42435 711242 42435 2830 1 0 p r 4 0 0 0 0 0 0 1 n 0 0 257427665 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2606 pg_constraint 11 12001 0 10 2 2606 0 6130 181951 6130 2832 1 0 p r 25 0 0 0 0 0 0 1 n 0 0 257427665 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2609 pg_description 11 12020 0 10 2 2609 0 42 4844 42 2834 1 0 p r 4 0 0 0 0 0 0 1 n 0 0 257427665 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2612 pg_language 11 12010 0 10 2 2612 0 1 4 1 4157 1 0 p r 9 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2615 pg_namespace 11 12023 0 10 2 2615 0 14 1083 14 4163 1 0 p r 4 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2618 pg_rewrite 11 12017 0 10 2 2618 0 13 136 13 2838 1 0 p r 8 0 0 0 0 0 0 1 n 0 0 257483384 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2619 pg_statistic 11 12016 0 10 2 2619 0 211831 2188087 211417 2840 1 0 p r 31 0 0 0 0 0 0 1 n 0 0 159478976 305695944 {postgres=arwdDxt/postgres} NULL NULL
2620 pg_trigger 11 12018 0 10 2 2620 0 6 262 6 2336 1 0 p r 19 0 0 0 0 0 0 1 n 0 0 257493699 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
2964 pg_db_role_setting 11 12026 0 10 2 0 1664 1 14 1 2966 1 1 p r 3 0 0 0 0 0 0 1 n 0 0 257260192 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3079 pg_extension 11 12035 0 10 2 3079 0 1 11 1 4147 1 0 p r 8 0 0 0 0 0 0 1 n 0 0 257270919 305707152 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3118 pg_foreign_table 11 12039 0 10 2 3118 0 0 0 0 4153 1 0 p r 3 0 0 0 0 0 0 1 n 0 0 257280233 305707152 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3256 pg_policy 11 12040 0 10 2 3256 0 0 0 0 4167 1 0 p r 8 0 0 0 0 0 0 1 n 0 0 257493699 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3350 pg_partitioned_table 11 12046 0 10 2 3350 0 0 0 0 4165 1 0 p r 8 0 0 0 0 0 0 1 n 0 0 257472497 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3381 pg_statistic_ext 11 12014 0 10 2 3381 0 1 19 1 3439 1 0 p r 8 0 0 0 0 0 0 1 n 0 0 257483384 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3394 pg_init_privs 11 12043 0 10 2 3394 0 3 193 3 4155 1 0 p r 5 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3429 pg_statistic_ext_data 11 12015 0 10 2 3429 0 1 19 1 3430 1 0 p r 4 0 0 0 0 0 0 1 n 0 0 257460173 305707158 {postgres=arwdDxt/postgres} NULL NULL
3456 pg_collation 11 12045 0 10 2 3456 0 25 779 25 4161 1 0 p r 10 0 0 0 0 0 0 1 n 0 0 257260192 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3466 pg_event_trigger 11 12019 0 10 2 3466 0 0 0 0 4145 1 0 p r 7 0 0 0 0 0 0 1 n 0 0 257148318 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3592 pg_shseclabel 11 4066 0 10 2 0 1664 0 0 0 4060 1 1 p r 4 0 0 0 0 0 0 1 n 0 0 257472497 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3596 pg_seclabel 11 12044 0 10 2 3596 0 0 0 0 3598 1 0 p r 5 0 0 0 0 0 0 1 n 0 0 257101805 305707150 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
3600 pg_ts_dict 11 12032 0 10 2 3600 0 1 23 1 4169 1 0 p r 6 0 0 0 0 0 0 1 n 0 0 257520402 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
6000 pg_replication_origin 11 12041 0 10 2 0 1664 0 0 0 4181 1 1 p r 2 0 0 0 0 0 0 1 n 0 0 257483384 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
6100 pg_subscription 11 6101 0 10 2 0 1664 0 0 0 4183 1 1 p r 9 0 0 0 0 0 0 1 n 0 0 257472497 305707158 {postgres=arwdDxt/postgres} NULL NULL
13284 sql_features 13150 13285 0 10 2 13284 0 8 712 8 13286 0 0 p r 7 0 0 0 0 0 0 1 d 0 0 257571453 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
13289 sql_implementation_info 13150 13290 0 10 2 13289 0 1 12 1 13291 0 0 p r 5 0 0 0 0 0 0 1 d 0 0 257581998 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL
13294 sql_parts 13150 13295 0 10 2 13294 0 1 10 1 13296 0 0 p r 5 0 0 0 0 0 0 1 d 0 0 257559073 305707158 NULL NULL NULL
13299 sql_sizing 13150 13300 0 10 2 13299 0 1 23 1 13301 0 0 p r 4 0 0 0 0 0 0 1 d 0 0 257565538 305707158 {postgres=arwdDxt/postgres,=r/postgres} NULL NULL

Also, we ended up fighting through the pain without squeezing and have now dropped the problematic table in production, so our actual production issue has been solved, but I've also lost the ability to reproduce this and debug, apologies.

airhorns avatar Apr 02 '25 13:04 airhorns