sqlite-utils Mechanism for turning nested JSON into foreign keys / many-to-many

trafficstars

The GitHub JSON APIs have a really interesting convention with respect to related objects.

Consider https://api.github.com/repos/simonw/sqlite-utils/issues - here's a truncated subset:

  {
    "id": 449818897,
    "node_id": "MDU6SXNzdWU0NDk4MTg4OTc=",
    "number": 24,
    "title": "Additional Column Constraints?",
    "user": {
      "login": "IgnoredAmbience",
      "id": 98555,
      "node_id": "MDQ6VXNlcjk4NTU1",
      "avatar_url": "https://avatars0.githubusercontent.com/u/98555?v=4",
      "gravatar_id": ""
    },
    "labels": [
      {
        "id": 993377884,
        "node_id": "MDU6TGFiZWw5OTMzNzc4ODQ=",
        "url": "https://api.github.com/repos/simonw/sqlite-utils/labels/enhancement",
        "name": "enhancement",
        "color": "a2eeef",
        "default": true
      }
    ],
    "state": "open"
  }

The user column lists a complete user. The labels column has a list of labels.

Since both user and label have populated id field this is actually enough information for us to create records for them AND set up the corresponding foreign key (for user) and m2m relationships (for labels).

It would be really neat if sqlite-utils had some kind of mechanism for correctly processing these kind of patterns.

Thanks to jq there's not much need for extra customization of the shape here - if we support a narrowly defined structure users can use jq to reshape arbitrary JSON to match.

Jun 13 '19 00:06 simonw

Two challenges here:

We need a way to specify which tables should be used - e.g. "put records from the "user" key in a users table, put multiple records from the "labels" key in a table called labels" (we can pick an automatic name for the m2m table, though it might be nice to have an option to customize it)
Should we deal with nested objects? Consider https://api.github.com/repos/simonw/datasette/pulls for example:

Here we have head.user as a user, head.repo as a repo, and head.repo.owner as another user.

Ideally our mechanism for specifying which table things should be pulled out into would handle this, but it's getting a bit complicated.

Jun 13 '19 00:06 simonw

I think I can do something here with a very simple head.repo.owner path syntax. Normally this kind of syntax would have to take the difference between dictionaries and lists into account but I don't think that matters here.

Jun 13 '19 03:06 simonw

There's an interesting difference here between nested objects with a primary-key style ID and nested objects without.

If a nested object does not have a primary key, we could still shift it out to another table but it would need to be in a context where it has an automatic foreign key back to our current record.

A good example of something where that would be useful is the outageDevices key in https://github.com/simonw/pge-outages/blob/d890d09ff6e2997948028528e06c82e1efe30365/pge-outages.json#L13-L25

  {
    "outageNumber": "407367",
    "outageStartTime": "1560355216",
    "crewCurrentStatus": "PG&E repair crew is on-site working to restore power.",
    "currentEtor": "1560376800",
    "cause": "Our preliminary determination is that your outage was caused by scheduled maintenance work.",
    "estCustAffected": "3",
    "lastUpdateTime": "1560355709",
    "hazardFlag": "0",
    "latitude": "37.35629",
    "longitude": "-119.70469",
    "outageDevices": [
      {
        "latitude": "37.35409",
        "longitude": "-119.70575"
      },
      {
        "latitude": "37.35463",
        "longitude": "-119.70525"
      },
      {
        "latitude": "37.35562",
        "longitude": "-119.70467"
      }
    ],
    "regionName": "Ahwahnee"
  }

These could either be inserted into an outageDevices table that uses rowid... or we could have a mechanism where we automatically derive a primary key for them based on a hash of their data, hence avoiding creating duplicates even though we don't have a provided primary key.

Jun 13 '19 03:06 simonw

I like the term "extract" for what we are doing here, partly because that's the terminology I used in csvs-to-sqlite.

Jun 13 '19 03:06 simonw

Another complexity from the https://api.github.com/repos/simonw/datasette/pulls example:

We don't actually want head and base to be pulled out into a separate table. Our ideal table design would probably look something like this:

url: ...
id: 285698310
...
user_id: 9599 -> refs users
head_label: simonw:travis-38dev
head_ref: travis-38dev
head_sha: f274f9004302c5ca75ce89d0abfd648457957e31
head_user_id: 9599 -> refs users
head_repo_id: 107914493 -> refs repos
base_label: simonw:master
base_ref: master
base_sha: 5e8fbf7f6fbc0b63d0479da3806dd9ccd6aaa945
base_user_id: 9599 -> refs users
base_repo_id: 107914493 -> refs repos

So the nested head and base sections here, instead of being extracted into another table, were flattened into their own columns.

So perhaps we need a flatten-nested-into-columns mechanism which can be used in conjunction with a extract-to-tables mechanism.

Jun 13 '19 03:06 simonw

So maybe something like this:

curl https://api.github.com/repos/simonw/datasette/pulls?state=all | \
  sqlite-utils insert git.db pulls - \
    --flatten=base \
    --flatten=head \
    --extract=user:users:id \
    --extract=head_repo.license:licenses:key \
    --extract=head_repo.owner:users \
    --extract=head_repo
    --extract=base_repo.license:licenses:key \
    --extract=base_repo.owner:users \
    --extract=base_repo

Is the order of those nested --extract lines significant I wonder? It would be nice if the order didn't matter and the code figured out the right execution plan on its own.

Jun 13 '19 04:06 simonw

So for --extract the format is path-to-property:table-to-extract-to:primary-key

If we find an array (as opposed to a direct nested object) at the end of the dotted path we do a m2m table.

And if primary-key is omitted maybe we do the rowid thing with a foreign key back to ourselves.

Jun 13 '19 04:06 simonw

I may ignore --flatten for the moment - users can do their own flattening using jq if they need that.

curl https://api.github.com/repos/simonw/datasette/pulls?state=all | jq "
    [.[] | . + {
        base_label: .base.label,
        base_ref: .base.ref,
        base_sha: .base.sha,
        base_user: .base.user,
        base_repo: .base.repo,
        head_label: .head.label,
        head_ref: .head.ref,
        head_sha: .head.sha,
        head_user: .head.user,
        head_repo: .head.repo
    } | del(.base, .head, ._links)]
"

Output: https://gist.github.com/simonw/2703ed43fcfe96eb8cfeee7b558b61e1

Jun 13 '19 04:06 simonw

I think the implementation of this will benefit from #23 (syntactic sugar for creating m2m records)

Jun 30 '19 17:06 simonw

Problem: extract means something else now, see #47 and the upcoming work in #42.

Sep 22 '20 07:09 simonw

i was just reaching for a tool to do this this morning

Nov 09 '21 14:11 fgregg

if you are curious about prior art, https://github.com/jsnell/json-to-multicsv is really good!

Feb 08 '22 01:02 fgregg

I ran into this. My use case has a JSON file with array of book objects with a key called reviews which is also an array of objects. My JSON is human-edited and does not specify IDs for either books or reviews. Because sqlite-utils does not support inserting nested objects, I instead have to maintain two separate CSV files with id column in books.csv and book_id column in reviews.csv.

I think the right way to declare the relationship while inserting a JSON might be to describe the relationship:

sqlite-utils insert data.db books mydata.json --hasmany reviews --hasone author --manytomany tags

This is relying on the assumption that foreign keys can point to rowid primary key.

May 31 '22 06:05 nileshtrivedi

Have you seen MakeTypes? Not the exact same thing but it may be relevant.

And it's inspired by the paper "Types from Data: Making Structured Data First-Class Citizens in F#".

Jun 29 '22 23:06 izzues

sqlite-utils sqlite-utils copied to clipboard

Mechanism for turning nested JSON into foreign keys / many-to-many

sqlite-utils
sqlite-utils copied to clipboard