Inserting a table with `Nothing` values into the database fails in a confusing way when defaults are used
I've just been trying to do some unrelated performance measurements and I've lost ~15 minutes investigating a weird and unexpected message:
There was an SQL error: NULL result in a non-nullable column. [Query was: INSERT INTO "My_Big_Table" ("X") VALUES (?)]
It turns out the problem is the following: by default, `select_into_database_table` uses the first column as the PRIMARY KEY for the created table. This imposes an implicit NOT NULL constraint on that column. But there is no guarantee that the column is free of nulls, so if it does contain them, the whole upload operation fails.
This occurs in Postgres and Snowflake (I did not test others, but I assume it is the same). Small example:
> c = connect_to_postgres
> (Table.new [["X", [1, Nothing, 2]]]).select_into_database_table c "my-table1" temporary=True
>>> (Error: There was an SQL error: Batch entry 1 INSERT INTO "my-table1" ("X") VALUES (NULL) was aborted: ERROR: null value in column "X" of relation "my-table1" violates not-null constraint
Details: Failing row contains (null). Call getNextException to see other errors in the batch.. [Query was: INSERT INTO "my-table1" ("X") VALUES (?)])
I think that defaults leading to errors on relatively simple tables mean that the defaults are wrong.
My proposed solution is to remove the default primary key setting altogether. I think that selecting a primary key is a conscious decision that should be made explicitly. While the first column may often be a decent candidate, just as often it may not be the right one.
Alternatively, we could use the first column as the primary key only if it does not contain NULL values. But that makes things even more complicated, and it makes the presence of a primary key (a structural property) depend not only on the table's structure but also on the actual data (some runs may contain nulls while others do not). So IMO the best solution is to just change the default to `primary_key=[]`.
Is it possible to skip checking the data first, but catch the "non-nullable column" error and convert it to a better error?
I'm also OK with requiring the primary key to be specified; most users understand the concept of a primary key and if they don't they're not going to understand the error they get with the default anyway.
> Is it possible to skip checking the data first, but catch the "non-nullable column" error and convert it to a better error?
We could do that, but it makes the whole logic more complicated; I'm not sure it is worth it. We'd need a robust way to distinguish primary-key-related errors from other errors, to avoid making things even worse by converting the wrong error.
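To make the trade-off concrete, here is a minimal sketch of the "catch and translate" approach, again using `sqlite3` as a stand-in. `NullValuesInKeyColumn` and `insert_rows` are hypothetical names invented for this illustration; the fragile part is exactly what is pointed out above — we have to pattern-match on the backend's error to tell key-related failures apart from unrelated integrity errors (a robust Postgres implementation would check the SQLSTATE code rather than message text):

```python
import sqlite3

class NullValuesInKeyColumn(Exception):
    """Hypothetical richer error: NULLs were found in a key column."""

def insert_rows(conn, insert_sql, rows, key_columns):
    # Run the insert; only if it fails with a NOT NULL violation, rewrap the
    # low-level driver error into something the user can act on.
    try:
        conn.executemany(insert_sql, rows)
    except sqlite3.IntegrityError as e:
        if "NOT NULL constraint failed" in str(e):
            raise NullValuesInKeyColumn(
                f"A key column among {key_columns} contains NULL values; "
                "choose a different primary key or clean the data."
            ) from e
        raise  # anything else is not ours to translate
```

This keeps the upload a single round-trip, at the cost of coupling the translation to backend-specific error text.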
@jdunkerley Would like your thoughts about this.
I think we should remove the default.
But even if we do, I realized that this does not fully solve the problem: the user can still explicitly choose a column containing `Nothing` values as the PRIMARY KEY. This will still result in the not-so-helpful `There was an SQL error: NULL result in a non-nullable column.` error. We should probably still improve the error handling here.
We can either:
- detect this error and replace it with a more meaningful one (it is not easy to correlate which column was the culprit)
- add a pre-check for NULL values in the primary key columns and report an error (a bit less efficient)
- OR combine both: if we detect such an error, check the primary key columns to refine it
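The second option (the pre-check) can be sketched as a single probing query. `find_null_key_rows` is a hypothetical helper invented for this illustration, and identifiers are quoted naively; a real implementation must use the backend's own quoting rules:

```python
import sqlite3

def find_null_key_rows(conn, table, key_columns, limit=5):
    # Probe for rows whose would-be key columns contain NULL, so the error
    # message can cite concrete offending rows instead of a bare constraint
    # violation. LIMIT keeps the extra round-trip cheap.
    predicate = " OR ".join(f'"{c}" IS NULL' for c in key_columns)
    sql = f'SELECT * FROM "{table}" WHERE {predicate} LIMIT {limit}'
    return conn.execute(sql).fetchall()

# Demo against an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE "t" ("X" INTEGER, "Y" TEXT)')
conn.executemany('INSERT INTO "t" VALUES (?, ?)', [(1, "a"), (None, "b")])
print(find_null_key_rows(conn, "t", ["X"]))  # the offending row(s)
```

In the combined variant, this query would run only after a NOT NULL violation is caught, so the happy path pays no extra cost.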
Greg Travis reports a new STANDUP for today (2024-09-10):
Progress: Wrote banker’s rounding builtin for Snowflake. Now handling primary key columns with null values and generating a better error. It should be finished by 2024-09-16.
Next Day: get tests passing
Greg Travis reports a new STANDUP for today (2024-09-11):
Progress: NULLs in primary keys: throw an error containing an example row; tests. It should be finished by 2024-09-16.
Next Day: remove default primary key from table upload
Greg Travis reports a new STANDUP for today (2024-09-12):
Progress: Travel; NULL primary key handling across all backends. It should be finished by 2024-09-16.
Next Day: remove default primary key from table upload
Greg Travis reports a new STANDUP for yesterday (2024-09-13):
Progress: Offsite; NULL primary key handling across all backends. It should be finished by 2024-09-16.
Next Day: remove default primary key from table upload
Greg Travis reports a new STANDUP for today (2024-09-16):
Progress: GUI testing; got Snowflake CI tests passing for nulls in primary key columns. It should be finished by 2024-09-17.
Next Day: remove default primary key from table upload
Greg Travis reports a new STANDUP for today (2024-09-17):
Progress: tested frontend, investigated stack overflow in date subtraction; second PR for nulls in primary key columns. It should be finished by 2024-09-18.
Next Day: rounding rounding rounding