timescaledb-toolkit icon indicating copy to clipboard operation
timescaledb-toolkit copied to clipboard

Default collation does not work on Alpine

Open joostlammers opened this issue 4 years ago • 13 comments

Relevant system information:

  • "PostgreSQL 12.7 on x86_64-pc-linux-musl, compiled by gcc (Alpine 10.2.1_pre1) 10.2.1 20201203, 64-bit"
  • timescaledev/timescale-analytics:nightly
  • Running on a Kubernetes cluster

Describe the bug Getting the rows from our timescale db continuous Materialized view, we keep getting "ERROR: deserialization error invalid collation "pg_catalog"."en_US.utf8" CONTEXT: extension/src/hyperloglog.rs:106:5 SQL state: XX000"

To Reproduce Steps to reproduce the behavior:

  1. Create table:

CREATE TABLE some_table ( time TIMESTAMP WITH TIME ZONE NOT NULL, countryCode VARCHAR(2), clientIp VARCHAR(40) );

  1. Create Materialized view:

CREATE MATERIALIZED VIEW view_name WITH (timescaledb.continuous) AS SELECT time_bucket('01:00:00'::interval, some_table."time") AS "time", some_table.clientip, timescale_analytics_experimental.hyperloglog_count(timescale_analytics_experimental.hyperloglog(64, some_table.countryCode)) AS countrycodes, count(*) AS clientcalls FROM some_table GROUP BY (time_bucket('01:00:00'::interval, some_table."time")), some_table.countrycode, some_table.clientip;`

  1. Add some data to the table
  2. Try to grab some data from this view

SELECT * FROM public.view_name

  1. Error appears

Additional information: Using PGAdmin to retrieve the code of the view and executing it partially, it appears it's caused within:

_timescaledb_internal.finalize_agg('timescale_analytics_experimental.hyperloglog(integer,anyelement)'::text, 'pg_catalog'::name, 'default'::name, '{{pg_catalog,int4},{pg_catalog,varchar}}'::name[], _materialized_hypertable_6.agg_4_4, NULL::timescale_analytics_experimental.hyperloglog) AS countrycodes,

just before the inner-join.

joostlammers avatar Jun 03 '21 14:06 joostlammers

Thank you for the excellent bug report!

I was able to reproduce the bug locally, and will report back one we've determined the cause. (note for future repro'ers: you need to refresh the continuous aggregate once the data has been inserted to trigger the bug)

JLockerman avatar Jun 03 '21 16:06 JLockerman

Simpler repro of the same issue: install the extension and run

SELECT timescale_analytics_experimental.hyperloglog_count('{
  "version":1,
  "element_type":"VARCHAR",
  "collation":["pg_catalog","en_US.utf8"],
  "b":6,
  "registers":[0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3,0,0]
}');

This will cause an error like

ERROR:  called `Result::unwrap()` on an `Err` value: Error("invalid collation \"pg_catalog\".\"en_US.utf8\"", line: 4, column: 41)
LINE 1: SELECT timescale_analytics_experimental.hyperloglog_count('{
                                                                  ^
CONTEXT:  extension/src/hyperloglog.rs:130:1

which I believe is caused by the same underlying issue

JLockerman avatar Jun 03 '21 16:06 JLockerman

Interestingly the default collation for the database is en_US.utf8

postgres=# \l
                                 List of databases
   Name    |  Owner   | Encoding |  Collate   |   Ctype    |   Access privileges   
-----------+----------+----------+------------+------------+-----------------------
 postgres  | postgres | UTF8     | en_US.utf8 | en_US.utf8 | 
 template0 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
 template1 | postgres | UTF8     | en_US.utf8 | en_US.utf8 | =c/postgres          +
           |          |          |            |            | postgres=CTc/postgres
(3 rows)

but when I try to create a column collated on it

CREATE TABLE test (foo text collate "en_US.utf8");

I get an error

ERROR:  collation "en_US.utf8" for encoding "UTF8" does not exist
LINE 1: create table test (foo text collate "en_US.utf8");

Next I'm going to check if this problem exists in the upstream image out nightlies are based on.

JLockerman avatar Jun 03 '21 17:06 JLockerman

It looks this issue is present in the current timescale/timescaledb:latest-pg12. I will try to open an error report there.

Error Summary

For stability across machines in multinode, which may have different OIDs, we serialize collations as a (namespace, name) pair. The default collation (OID 100), is special, as it does not refer to a "real" collation but the default for the database, and since this may differ across databases, we look in the database catalog to discover what the collation actually is, and serialize that. Unfortunately, in the current timescale/timescaledb:latest-pg12 the default database collation is set to en_US.utf8, a collation not supported by that version of the database, so when we try to deserialize this collation the DB complains that we're trying to use a collation it does not know about. For now we have two takeaways:

  1. We should fix this in the upstream TimescaleDB image, it shouldn't be defaulting to a collation that does not exist.
  2. We need to determine if we can detect and handle this edge case when serializing collations.

JLockerman avatar Jun 03 '21 17:06 JLockerman

Thanks @JLockerman for the quick and good responses, is there a manual workaround that we can apply for now?

janfockaert avatar Jun 03 '21 18:06 janfockaert

For a short-term fix I'm changing the nightly image to be based on our debian image in PR https://github.com/timescale/timescale-analytics/pull/149; this will make the collation things work, and make the nightly image more similar to the release image, and switch to a more-tested code paths, hopefully preventing other issues in the future.

Longer term, we're planning to switch our HLL implementation for an HLL++ implementation, and while we doing that we'll add some detection for this case (probably by ignoring the default collation a treating it like the C collation; it looks like Postgres guarantees that the default collation will be byte-wise compatible, and text_hash() ignores the default collation anyway).

JLockerman avatar Jun 03 '21 23:06 JLockerman

@janfockaert switching to a different collation for the hyperloglog should work in the meantime, for instance

timescale_analytics_experimental.hyperloglog(buckets, data COLLATE "C")

JLockerman avatar Jun 03 '21 23:06 JLockerman

Thx, the workaround works as expected 👍 The new nightly build is not updated yet, was the build broken maybe?

janfockaert avatar Jun 04 '21 18:06 janfockaert

note: it probably pays to detect the input collation and if it's deterministic (or default?) just use C.

JLockerman avatar Jun 07 '21 16:06 JLockerman

Thx, the workaround works as expected 👍 The new nightly build is not updated yet, was the build broken maybe?

Nope CI builds are currently broken. It works locally so I pushed a manual build.

JLockerman avatar Jun 07 '21 16:06 JLockerman

Nightly builds should be fixed by PR https://github.com/timescale/timescale-analytics/pull/154

JLockerman avatar Jun 07 '21 21:06 JLockerman

Is it fixed ? or not I have same error

bikashthapacoding avatar Mar 23 '22 07:03 bikashthapacoding

latest docker image produces the error on schema from TimescaleDb getting started page:

select distinct_count(hyperloglog((2^18) :: int, city_name)) from weather_metrics;
select distinct_count(hyperloglog((2^18) :: int, city_name)) from weather_metrics;
ERROR:  deserialization error invalid collation "pg_catalog"."C.UTF-8"
CONTEXT:  extension/src/hyperloglog.rs:126:31

yaitskov avatar May 03 '22 16:05 yaitskov