sqlite-utils
sqlite-utils copied to clipboard
Mechanism for turning nested JSON into foreign keys / many-to-many
The GitHub JSON APIs have a really interesting convention with respect to related objects.
Consider https://api.github.com/repos/simonw/sqlite-utils/issues - here's a truncated subset:
{
"id": 449818897,
"node_id": "MDU6SXNzdWU0NDk4MTg4OTc=",
"number": 24,
"title": "Additional Column Constraints?",
"user": {
"login": "IgnoredAmbience",
"id": 98555,
"node_id": "MDQ6VXNlcjk4NTU1",
"avatar_url": "https://avatars0.githubusercontent.com/u/98555?v=4",
"gravatar_id": ""
},
"labels": [
{
"id": 993377884,
"node_id": "MDU6TGFiZWw5OTMzNzc4ODQ=",
"url": "https://api.github.com/repos/simonw/sqlite-utils/labels/enhancement",
"name": "enhancement",
"color": "a2eeef",
"default": true
}
],
"state": "open"
}
The user column lists a complete user. The labels column has a list of labels.
Since both user and label have populated id field this is actually enough information for us to create records for them AND set up the corresponding foreign key (for user) and m2m relationships (for labels).
It would be really neat if sqlite-utils had some kind of mechanism for correctly processing these kind of patterns.
Thanks to jq there's not much need for extra customization of the shape here - if we support a narrowly defined structure users can use jq to reshape arbitrary JSON to match.
Two challenges here:
-
We need a way to specify which tables should be used - e.g. "put records from the
"user"key in auserstable, put multiple records from the"labels"key in a table calledlabels" (we can pick an automatic name for the m2m table, though it might be nice to have an option to customize it) -
Should we deal with nested objects? Consider https://api.github.com/repos/simonw/datasette/pulls for example:
Here we have head.user as a user, head.repo as a repo, and head.repo.owner as another user.
Ideally our mechanism for specifying which table things should be pulled out into would handle this, but it's getting a bit complicated.
I think I can do something here with a very simple head.repo.owner path syntax. Normally this kind of syntax would have to take the difference between dictionaries and lists into account but I don't think that matters here.
There's an interesting difference here between nested objects with a primary-key style ID and nested objects without.
If a nested object does not have a primary key, we could still shift it out to another table but it would need to be in a context where it has an automatic foreign key back to our current record.
A good example of something where that would be useful is the outageDevices key in https://github.com/simonw/pge-outages/blob/d890d09ff6e2997948028528e06c82e1efe30365/pge-outages.json#L13-L25
{
"outageNumber": "407367",
"outageStartTime": "1560355216",
"crewCurrentStatus": "PG&E repair crew is on-site working to restore power.",
"currentEtor": "1560376800",
"cause": "Our preliminary determination is that your outage was caused by scheduled maintenance work.",
"estCustAffected": "3",
"lastUpdateTime": "1560355709",
"hazardFlag": "0",
"latitude": "37.35629",
"longitude": "-119.70469",
"outageDevices": [
{
"latitude": "37.35409",
"longitude": "-119.70575"
},
{
"latitude": "37.35463",
"longitude": "-119.70525"
},
{
"latitude": "37.35562",
"longitude": "-119.70467"
}
],
"regionName": "Ahwahnee"
}
These could either be inserted into an outageDevices table that uses rowid... or we could have a mechanism where we automatically derive a primary key for them based on a hash of their data, hence avoiding creating duplicates even though we don't have a provided primary key.
I like the term "extract" for what we are doing here, partly because that's the terminology I used in csvs-to-sqlite.
Another complexity from the https://api.github.com/repos/simonw/datasette/pulls example:
We don't actually want head and base to be pulled out into a separate table. Our ideal table design would probably look something like this:
url: ...id:285698310- ...
user_id:9599-> refsusershead_label:simonw:travis-38devhead_ref:travis-38devhead_sha:f274f9004302c5ca75ce89d0abfd648457957e31head_user_id:9599-> refsusershead_repo_id:107914493-> refsreposbase_label:simonw:masterbase_ref:masterbase_sha:5e8fbf7f6fbc0b63d0479da3806dd9ccd6aaa945base_user_id:9599-> refsusersbase_repo_id:107914493-> refsrepos
So the nested head and base sections here, instead of being extracted into another table, were flattened into their own columns.
So perhaps we need a flatten-nested-into-columns mechanism which can be used in conjunction with a extract-to-tables mechanism.
So maybe something like this:
curl https://api.github.com/repos/simonw/datasette/pulls?state=all | \
sqlite-utils insert git.db pulls - \
--flatten=base \
--flatten=head \
--extract=user:users:id \
--extract=head_repo.license:licenses:key \
--extract=head_repo.owner:users \
--extract=head_repo
--extract=base_repo.license:licenses:key \
--extract=base_repo.owner:users \
--extract=base_repo
Is the order of those nested --extract lines significant I wonder? It would be nice if the order didn't matter and the code figured out the right execution plan on its own.
So for --extract the format is path-to-property:table-to-extract-to:primary-key
If we find an array (as opposed to a direct nested object) at the end of the dotted path we do a m2m table.
And if primary-key is omitted maybe we do the rowid thing with a foreign key back to ourselves.
I may ignore --flatten for the moment - users can do their own flattening using jq if they need that.
curl https://api.github.com/repos/simonw/datasette/pulls?state=all | jq "
[.[] | . + {
base_label: .base.label,
base_ref: .base.ref,
base_sha: .base.sha,
base_user: .base.user,
base_repo: .base.repo,
head_label: .head.label,
head_ref: .head.ref,
head_sha: .head.sha,
head_user: .head.user,
head_repo: .head.repo
} | del(.base, .head, ._links)]
"
Output: https://gist.github.com/simonw/2703ed43fcfe96eb8cfeee7b558b61e1
I think the implementation of this will benefit from #23 (syntactic sugar for creating m2m records)
Problem: extract means something else now, see #47 and the upcoming work in #42.
i was just reaching for a tool to do this this morning
if you are curious about prior art, https://github.com/jsnell/json-to-multicsv is really good!
I ran into this. My use case has a JSON file with array of book objects with a key called reviews which is also an array of objects. My JSON is human-edited and does not specify IDs for either books or reviews. Because sqlite-utils does not support inserting nested objects, I instead have to maintain two separate CSV files with id column in books.csv and book_id column in reviews.csv.
I think the right way to declare the relationship while inserting a JSON might be to describe the relationship:
sqlite-utils insert data.db books mydata.json --hasmany reviews --hasone author --manytomany tags
This is relying on the assumption that foreign keys can point to rowid primary key.
Have you seen MakeTypes? Not the exact same thing but it may be relevant.
And it's inspired by the paper "Types from Data: Making Structured Data First-Class Citizens in F#".