
`sgr commit` fails on tables with just one jsonb column

Open ltrgoddard opened this issue 3 years ago • 2 comments

Hi folks,

I noticed that sgr fails when trying to commit a table with the following structure: uk_company_data_test.psc(data jsonb). The output below shows three attempts to commit the table: first with just the jsonb column, second with an all-null integer column called id added, and third with the first few rows of id filled in. This last attempt seemed successful, but when it was pushed to the DDN, it became apparent that only the rows with values for id had in fact been committed/pushed (rows with a NULL id were skipped completely).

I'm guessing this is something to do with sorting? Not a big deal as it's generally good practice to have an ID column in a JSONB table anyway, but wanted to flag!

louis@quwain ~ % sgr init uk_company_data_test
Initialized empty repository uk_company_data_test
louis@quwain ~ % sgr commit uk_company_data_test
Committing uk_company_data_test...
Processing table companies
Processing table psc
error: psycopg2.errors.SyntaxError: syntax error at or near ")"
error: LINE 1: ..._row_nums AS (SELECT(ROW_NUMBER() OVER (ORDER BY ()::text) -...
error:                                                              ^

louis@quwain ~ % sgr commit uk_company_data_test
Committing uk_company_data_test...
Processing table companies
Object o058e47b8dc5a66b16a81c8c75279982bacf62dc3bfc309b1615c31c1894dbf already exists, skipping
Processing table psc
error: TypeError: reduce() of empty sequence with no initial value

louis@quwain ~ % sgr commit uk_company_data_test
Committing uk_company_data_test...
Processing table companies
Object o058e47b8dc5a66b16a81c8c75279982bacf62dc3bfc309b1615c31c1894dbf already exists, skipping
Processing table psc
Committed uk_company_data_test as 1c5e6165edb6.

ltrgoddard avatar Dec 22 '21 23:12 ltrgoddard

Thanks for flagging! You've hit a corner case with sgr: it uses the primary key to partition tables and track overwrites when versioning. If the key doesn't exist, it assumes a surrogate PK made out of the whole row (cast to text), but ignores certain types that wouldn't make sense as a PK (like jsonb, geodata, ...).

In your first example, it tried doing that and failed because there were no columns that could serve as a PK or a surrogate PK, which isn't handled at all, so it crashed. In the second example, it used id::text as the PK, so all rows where id was NULL were treated as overwriting each other and effectively only one NULL row ended up in the dataset.
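The overwrite effect can be mimicked with a tiny sketch (hypothetical, not sgr's actual code): if rows are keyed by the text form of the surrogate PK, every NULL id maps to the same key, so only the last NULL row survives.

```python
# Hypothetical sketch of the overwrite tracking described above:
# rows are keyed by the text form of the surrogate PK, so every
# NULL id collapses onto the same key and only one such row survives.
rows = [
    {"id": 1, "data": '{"name": "a"}'},
    {"id": 2, "data": '{"name": "b"}'},
    {"id": None, "data": '{"name": "c"}'},
    {"id": None, "data": '{"name": "d"}'},  # overwrites the previous NULL row
]

keyed = {}
for row in rows:
    pk = str(row["id"])  # every NULL id becomes the same key
    keyed[pk] = row

print(len(rows), "->", len(keyed))  # 4 -> 3
```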

This no-PK case is often handled pretty poorly, so we're considering dropping it entirely and instead adding an autogenerated PK (like a hash of the whole row, or a row number like you did) automatically in sgr commit and actually storing it.
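As a rough illustration of that idea (this is a sketch, not sgr's actual or planned scheme), a surrogate key derived from a hash of the whole row keeps otherwise identical-keyed rows distinct:

```python
# Sketch of an autogenerated surrogate PK: hash the whole row, so even
# rows whose id is NULL get distinct, stable keys. Hypothetical scheme,
# not the one sgr would necessarily implement.
import hashlib
import json

rows = [
    {"id": None, "data": {"name": "c"}},
    {"id": None, "data": {"name": "d"}},
]

def row_key(row):
    # Stable serialization first, then hash the bytes.
    payload = json.dumps(row, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

keyed = {row_key(r): r for r in rows}
print(len(keyed))  # 2 distinct keys despite identical NULL ids
```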

One other solution would be to "normalize" the JSON into a table like psc(kind varchar, name varchar, links.self varchar, address.country varchar, address.locality varchar, ...). This might also make the dataset physically smaller (leveraging columnar storage and compressing better) and result in faster scans (since we won't need to read the entire JSON doc if you're only aggregating over e.g. name). You can do it either at data load time, after the PG table load but before the sgr commit, or on Splitgraph Cloud itself (we have a small example of how to do it with dbt at https://github.com/splitgraph/dbt-transform-example).
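The flattening step can be sketched like this (the sample document and the dotted column names such as links.self are illustrative, not taken from the actual PSC data):

```python
# Sketch of "normalizing" a JSON document into flat columns by joining
# nested keys with dots, e.g. {"links": {"self": ...}} -> "links.self".
import json

doc = json.loads(
    '{"kind": "individual", "name": "Jane Doe",'
    ' "links": {"self": "/psc/1"},'
    ' "address": {"country": "UK", "locality": "London"}}'
)

def flatten(obj, prefix=""):
    out = {}
    for k, v in obj.items():
        key = f"{prefix}{k}"
        if isinstance(v, dict):
            out.update(flatten(v, key + "."))
        else:
            out[key] = v
    return out

print(flatten(doc))
# {'kind': 'individual', 'name': 'Jane Doe', 'links.self': '/psc/1',
#  'address.country': 'UK', 'address.locality': 'London'}
```

Each flattened key then becomes a column in the target table, so queries that only touch e.g. name never have to parse the full document.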

mildbyte avatar Dec 23 '21 11:12 mildbyte

That makes sense—thanks for explaining the logic behind it. I did think after I'd loaded this particular data set that jsonb is a poor choice for getting the best out of cstore_fdw. I'll do a cut-down, normalised version for this particular project!

ltrgoddard avatar Dec 23 '21 15:12 ltrgoddard