sqlite-utils icon indicating copy to clipboard operation
sqlite-utils copied to clipboard

.extract() shouldn't extract null values

Open simonw opened this issue 5 years ago • 7 comments

This almost works, but it creates a rogue type record with a value of None.

In [1]: import sqlite_utils
In [2]: db = sqlite_utils.Database(memory=True)
In [5]: db["creatures"].insert_all([
    {"id": 1, "name": "Simon", "type": None},
    {"id": 2, "name": "Natalie", "type": None},
    {"id": 3, "name": "Cleo", "type": "dog"}], pk="id")
Out[5]: <Table creatures (id, name, type)>
In [7]: db["creatures"].extract("type")
Out[7]: <Table creatures (id, name, type_id)>
In [8]: list(db["creatures"].rows)
Out[8]: 
[{'id': 1, 'name': 'Simon', 'type_id': None},
 {'id': 2, 'name': 'Natalie', 'type_id': None},
 {'id': 3, 'name': 'Cleo', 'type_id': 2}]
In [9]: db["type"]
Out[9]: <Table type (id, type)>
In [10]: list(db["type"].rows)
Out[10]: [{'id': 1, 'type': None}, {'id': 2, 'type': 'dog'}]

simonw avatar Oct 16 '20 02:10 simonw

How should this work with extractions covering multiple columns?

If there's a single column then it makes sense that a null value would not be extracted into the lookup table, but would instead become stay as null.

For a multiple column extraction, provided at least one of those columns is not null It should map to a record in the lookup table. Only if ALL of the extracted columns are null should the lookup value stay null.

simonw avatar Oct 16 '20 03:10 simonw

The alternative solution here would be that a single null value DOES get extracted. To implement this I would need to add some logic that uses is null instead of =.

simonw avatar Oct 16 '20 03:10 simonw

Either way I think I'm going to need to add some SQL which uses where a = b or (a is null and b is null).

simonw avatar Oct 16 '20 03:10 simonw

Actually I think this should be an option to .extract() which controls if nulls are extracted or left alone. Maybe called extract_nulls=True/False.

simonw avatar Oct 16 '20 03:10 simonw

Failing test:

def test_extract_null_values(fresh_db):
    fresh_db["species"].insert({"id": 1, "species": "Wolf"}, pk="id")
    fresh_db["individuals"].insert_all(
        [
            {"id": 10, "name": "Terriana", "species": "Fox"},
            {"id": 11, "name": "Spenidorm", "species": None},
            {"id": 12, "name": "Grantheim", "species": "Wolf"},
            {"id": 13, "name": "Turnutopia", "species": None},
            {"id": 14, "name": "Wargal", "species": "Wolf"},
        ],
        pk="id",
    )
    fresh_db["individuals"].extract("species")
    assert fresh_db["species"].schema == (
        "CREATE TABLE [species] (\n"
        "   [id] INTEGER PRIMARY KEY,\n"
        "   [species] TEXT\n"
        ")"
    )
    assert fresh_db["individuals"].schema == (
        'CREATE TABLE "individuals" (\n'
        "   [id] INTEGER PRIMARY KEY,\n"
        "   [name] TEXT,\n"
        "   [species_id] INTEGER,\n"
        "   FOREIGN KEY(species_id) REFERENCES species(id)\n"
        ")"
    )
    assert list(fresh_db["species"].rows) == [
        {"id": 1, "species": "Wolf"},
        {"id": 2, "species": "Fox"},
    ]
    assert list(fresh_db["individuals"].rows) == [
        {"id": 10, "name": "Terriana", "species_id": 2},
        {"id": 11, "name": "Spenidorm", "species_id": None},
        {"id": 12, "name": "Grantheim", "species_id": 1},
        {"id": 13, "name": "Turnutopia", "species_id": None},
        {"id": 14, "name": "Wargal", "species_id": 1},
    ]

simonw avatar Oct 16 '20 16:10 simonw

I think I ran into an issue that's perhaps related with extract()

I have a case where I want to create a lookup table for all the related title data where there are possibly multiple null values in the related columns ....

sql = """\
INSERT INTO "circulation_info" ("item_id", "bib_title", "bib_creator", "bib_format", "bib_pub_year", "checkout_date")
VALUES
(1, "title one", "creator one", "Book", 2018, "2021-08-12 00:01"),
(2, "title two", "creator one", "Book", 2019, "2021-08-12 00:02"),
(3, "title three", NULL, "DVD", 2020, "2021-08-12 00:03"),
(4, "title four", NULL, "DVD", NULL, "2021-08-12 00:04"),
(5, "title five", NULL, "DVD", NULL, "2021-08-12 00:05")
"""

with sqlite3.connect('test_bib.db') as con:
    con.execute(sql)

when I run the extract() method ...

db["circulation_info"].extract(
    [
        "bib_title",
        "bib_creator",
        "bib_format"        
    ],
    table="bib_info", 
    fk_column="bib_info_id"
)

db = sqlite_utils.Database("test_bib.db")

for row in db["circulation_info"].rows:
    print(row)

print("\n---\n")

for row in db["bib_info"].rows:
    print(row)

results in this ..

{'id': 1, 'item_id': 1, 'bib_info_id': 1, 'bib_pub_year': 2018, 'checkout_date': '2021-08-12 00:01'}
{'id': 2, 'item_id': 2, 'bib_info_id': 2, 'bib_pub_year': 2019, 'checkout_date': '2021-08-12 00:02'}
{'id': 3, 'item_id': 3, 'bib_info_id': None, 'bib_pub_year': 2020, 'checkout_date': '2021-08-12 00:03'}
{'id': 4, 'item_id': 4, 'bib_info_id': None, 'bib_pub_year': None, 'checkout_date': '2021-08-12 00:04'}
{'id': 5, 'item_id': 5, 'bib_info_id': None, 'bib_pub_year': None, 'checkout_date': '2021-08-12 00:05'}

---

{'id': 1, 'bib_title': 'title one', 'bib_creator': 'creator one', 'bib_format': 'Book'}
{'id': 2, 'bib_title': 'title two', 'bib_creator': 'creator one', 'bib_format': 'Book'}
{'id': 3, 'bib_title': 'title three', 'bib_creator': None, 'bib_format': 'DVD'}
{'id': 4, 'bib_title': 'title four', 'bib_creator': None, 'bib_format': 'DVD'}
{'id': 5, 'bib_title': 'title five', 'bib_creator': None, 'bib_format': 'DVD'}

Seems like it's correctly generating the row data for those lookups, but it's not correctly updating the foreign key back to the primary table? Looks like it just results in a NULL value in that original table.

Any ideas on why? Thanks again!

rayvoelker avatar Aug 12 '21 12:08 rayvoelker

Actually, I forgot to include the bib_pub_year in the extract ...

But also, I tried again with empty string values instead of NULL values and it seems to place the foreign key properly / correctly...

sql = """\
INSERT INTO "circulation_info" ("item_id", "bib_title", "bib_creator", "bib_format", "bib_pub_year", "checkout_date")
VALUES
(1, "title one", "creator one", "Book", 2018, "2021-08-12 00:01"),
(2, "title two", "creator one", "Book", 2019, "2021-08-12 00:02"),
(3, "title three", "", "DVD", 2020, "2021-08-12 00:03"),
(4, "title four", "", "DVD", "", "2021-08-12 00:04"),
(5, "title five", "", "DVD", "", "2021-08-12 00:05")
"""

with sqlite3.connect('test_bib_2.db') as con:
    con.execute(sql)
db["circulation_info"].extract(
    [
        "bib_title",
        "bib_creator",
        "bib_format",
        "bib_pub_year"
    ],
    table="bib_info", 
    fk_column="bib_info_id"
)
{'id': 1, 'item_id': 1, 'bib_info_id': 1, 'bib_pub_year': 2018, 'checkout_date': '2021-08-12 00:01'}
{'id': 2, 'item_id': 2, 'bib_info_id': 2, 'bib_pub_year': 2019, 'checkout_date': '2021-08-12 00:02'}
{'id': 3, 'item_id': 3, 'bib_info_id': 3, 'bib_pub_year': 2020, 'checkout_date': '2021-08-12 00:03'}
{'id': 4, 'item_id': 4, 'bib_info_id': 4, 'bib_pub_year': '', 'checkout_date': '2021-08-12 00:04'}
{'id': 5, 'item_id': 5, 'bib_info_id': 5, 'bib_pub_year': '', 'checkout_date': '2021-08-12 00:05'}

---

{'id': 1, 'bib_title': 'title one', 'bib_creator': 'creator one', 'bib_format': 'Book'}
{'id': 2, 'bib_title': 'title two', 'bib_creator': 'creator one', 'bib_format': 'Book'}
{'id': 3, 'bib_title': 'title three', 'bib_creator': '', 'bib_format': 'DVD'}
{'id': 4, 'bib_title': 'title four', 'bib_creator': '', 'bib_format': 'DVD'}
{'id': 5, 'bib_title': 'title five', 'bib_creator': '', 'bib_format': 'DVD'}

rayvoelker avatar Aug 12 '21 12:08 rayvoelker