SQLite.jl icon indicating copy to clipboard operation
SQLite.jl copied to clipboard

NaN is not handled correctly

Open ValentinKaisermayer opened this issue 5 years ago • 8 comments

import SQLite, DataFrames

vals = rand(10)
vals[2:3] .= NaN
df = DataFrames.DataFrame(:A => vals)

db = SQLite.DB("test.db")
SQLite.DBInterface.execute(db, "DROP TABLE IF EXISTS data;")
SQLite.load!(df, db, "data")

db = SQLite.DB("test.db")
df = SQLite.DBInterface.execute(db, "SELECT * FROM data") |> DataFrames.DataFrame

However, the two NaN are missing from the table.

8×1 DataFrame
│ Row │ A        │
│     │ Float64  │
├─────┼──────────┤
│ 1   │ 0.128066 │
│ 2   │ 0.936917 │
│ 3   │ 0.500213 │
│ 4   │ 0.939719 │
│ 5   │ 0.176642 │
│ 6   │ 0.41554  │
│ 7   │ 0.630911 │
│ 8   │ 0.887876 │

I'm not sure if this is how it should be.

Missing is handled correctly:

vals = convert(Vector{Union{Float64,Missing}}, rand(10))
vals[2:3] .= missing
df = DataFrames.DataFrame(:A => vals)

db = SQLite.DB("test2.db")
SQLite.DBInterface.execute(db, "DROP TABLE IF EXISTS data;")
SQLite.load!(df, db, "data")

db = SQLite.DB("test2.db")
df = SQLite.DBInterface.execute(db, "SELECT * FROM data") |> DataFrames.DataFrame
10×1 DataFrame
│ Row │ A        │
│     │ Float64? │
├─────┼──────────┤
│ 1   │ 0.438104 │
│ 2   │ missing  │
│ 3   │ missing  │
│ 4   │ 0.101527 │
│ 5   │ 0.83622  │
│ 6   │ 0.360448 │
│ 7   │ 0.310123 │
│ 8   │ 0.458411 │
│ 9   │ 0.569096 │
│ 10  │ 0.141622 │

ValentinKaisermayer avatar Jun 24 '20 14:06 ValentinKaisermayer

Hmmm, from some brief research, it sounds like sqlite doesn't really follow IEEE semantics at all, and hence doesn't really have a concept of NaN. A bigger issue here is that we aren't checking the return code on the sqlite_step when loading each row into the database, which could have at least issued a warning in this case.

For a Vector{Float64} column like this, we're currently "typing" it in sqlite as REAL NOT NULL, and sqlite returns a "constraint violation" when trying to insert NaN. If I take off the NOT NULL constraint, it successfully inserts as NULL, so the values are returned as missing.

Given all that, I don't see a clear solution for the original issue here; but I do plan on doing the return code check and warning when a row fails to insert. I'd love any ideas/research other people have here.

quinnj avatar Jun 29 '20 02:06 quinnj

The problem is that the Tables.Schema interprets a vector with NaN and Float64 as a Vector{Float64}, which it actually is. However, this than results in a column of REAL NOT NULL. If I create the schema myself as Union{Float64,Missing} I can get it to store the NaNs as NULL.

ValentinKaisermayer avatar Jun 30 '20 07:06 ValentinKaisermayer

A possible solution is to insert/update IEEE-754 special values, +/- Infty and NaNs, as 8-byte blobs. In Sqlite every value can have a different SQL type than the one declared for the table column. So a column declared as REAL can be Blobs in some rows. When selecting back from REAL columns, then of course the type needs to be checked for each row, sqlite3_column_type(). If it is 8 octets long blob, then convert to Float64 (and if it is NULL convert to missing).

stephancb avatar Oct 27 '20 11:10 stephancb

If SQLite doesn't support it, most likely the solution will be tied to this Julia driver. This is like deciding how to write a Char to a JSON, that only supports Strings.

In this situation, I usually opt to not support the data type at all, and throw an error if the user tries to insert the unsupported data type to the database. In this case, a NaN value. The user will then be forced to solve the issue in the application level, deciding on how to encode the data.

felipenoris avatar Oct 27 '20 16:10 felipenoris

Well, it is bit different than unsupported data type, unless you are suggesting to not support Float64 at all. In order to throw an error, every single Float64 value getting inserted would need to be checked for Infty or NaN values (which is of course also needed for my suggested solution).

On insert Sqlite just replaces NaNs and Infty silently with NULLs, there is no error code from sqlite3_step or so. The unpleasant consequence is that for Float64 you don't always get out what has been put into the database. The suggested solutions fixes this, albeit only within the particular driver.

Addendum: If the column is declared as REAL NOT NULL, an error is of course indicated. But the problem still exists if Union{Float64, Missing} is to be inserted.

stephancb avatar Oct 27 '20 17:10 stephancb

Yes, every value should be checked for NaN or Inf. Or, equivalently, one could define a SQLiteNumber datatype that exactly matches the driver's definition and create encode/decode values to/from Float. Like the Oracle database does with OCINumber.

If the datatype does not match exactly, there's already some conversion happening before sending the value to the database. Usually the driver's C code is full of these per-value checks, and thank God this is Julia because we can do the same before hitting the C code and get the same performance. I guess this check won't hurt performance a lot, given this is most likely an IO bound operation.

felipenoris avatar Oct 29 '20 12:10 felipenoris

Yeah, I agree that the best solution here is like what @felipenoris suggests: SQLite.jl could define a SQLFloat{T} type that properly respected IEEE-754 semantics. If someone feels like making a PR + docs, I'd appreciate it.

For reference, I found this old sqlite user mailing list thread where the core devs basically said they didn't really care about supporting IEEE 😞 http://sqlite.1065341.n5.nabble.com/Handling-of-IEEE-754-nan-and-inf-in-SQLite-td27874.html

quinnj avatar Oct 30 '20 06:10 quinnj

Pse see also my question at http://sqlite.1065341.n5.nabble.com/NaN-in-0-0-out-td19086.html and Dr. Hipp's reply.

The issue cannot be completely solved without changes in Sqlite which will not happen. Regarding a driver like Sqlite.jl, the options are:

  1. Accept how Sqlite is doing things. The columns corresponding to Julia Float64 should simply be declared REAL (not REAL NOT NULL). If a Float64 array has NaNs and/or Inftys, and it is inserted and selected back, it will have Sqlite NULLs in place of these. This simply maps to Union{Float64,Missing}. Alternatively Hipp's suggested sqlite3_column_double_v2 could be implemented in Julia, e.g. by constructing a type (SQLfloat?) that has value NaN for anything that is not REAL or INTEGER in Sqlite.

  2. Disallow insertion of NaN and Inftys into SQlite, raising an exception if attempted. It could be handled via a type SQLfloat.

  3. Workaround via Sqlite BLOBs. It can ensure that what goes into Sqlite also comes out even for NaNs and Inftys, but only within conforming drivers. A Julia type SQLfloat can be a vehicle to implement this for Julia.

So which option, what exactly should a type SQLfloat do?

Edit: I just saw that Inftys are handled normally in Sqlite, only NaNs get converted to NULL and would need special treatment in SQLite.jl

stephancb avatar Oct 30 '20 10:10 stephancb