sqlite-utils icon indicating copy to clipboard operation
sqlite-utils copied to clipboard

Mechanism for turning nested JSON into foreign keys / many-to-many

Open simonw opened this issue 5 years ago • 14 comments

The GitHub JSON APIs have a really interesting convention with respect to related objects.

Consider https://api.github.com/repos/simonw/sqlite-utils/issues - here's a truncated subset:

  {
    "id": 449818897,
    "node_id": "MDU6SXNzdWU0NDk4MTg4OTc=",
    "number": 24,
    "title": "Additional Column Constraints?",
    "user": {
      "login": "IgnoredAmbience",
      "id": 98555,
      "node_id": "MDQ6VXNlcjk4NTU1",
      "avatar_url": "https://avatars0.githubusercontent.com/u/98555?v=4",
      "gravatar_id": ""
    },
    "labels": [
      {
        "id": 993377884,
        "node_id": "MDU6TGFiZWw5OTMzNzc4ODQ=",
        "url": "https://api.github.com/repos/simonw/sqlite-utils/labels/enhancement",
        "name": "enhancement",
        "color": "a2eeef",
        "default": true
      }
    ],
    "state": "open"
  }

The user column lists a complete user. The labels column has a list of labels.

Since both user and label have populated id field this is actually enough information for us to create records for them AND set up the corresponding foreign key (for user) and m2m relationships (for labels).

It would be really neat if sqlite-utils had some kind of mechanism for correctly processing these kind of patterns.

Thanks to jq there's not much need for extra customization of the shape here - if we support a narrowly defined structure users can use jq to reshape arbitrary JSON to match.

simonw avatar Jun 13 '19 00:06 simonw

Two challenges here:

  1. We need a way to specify which tables should be used - e.g. "put records from the "user" key in a users table, put multiple records from the "labels" key in a table called labels" (we can pick an automatic name for the m2m table, though it might be nice to have an option to customize it)

  2. Should we deal with nested objects? Consider https://api.github.com/repos/simonw/datasette/pulls for example:

Mozilla_Firefox

Here we have head.user as a user, head.repo as a repo, and head.repo.owner as another user.

Ideally our mechanism for specifying which table things should be pulled out into would handle this, but it's getting a bit complicated.

simonw avatar Jun 13 '19 00:06 simonw

I think I can do something here with a very simple head.repo.owner path syntax. Normally this kind of syntax would have to take the difference between dictionaries and lists into account but I don't think that matters here.

simonw avatar Jun 13 '19 03:06 simonw

There's an interesting difference here between nested objects with a primary-key style ID and nested objects without.

If a nested object does not have a primary key, we could still shift it out to another table but it would need to be in a context where it has an automatic foreign key back to our current record.

A good example of something where that would be useful is the outageDevices key in https://github.com/simonw/pge-outages/blob/d890d09ff6e2997948028528e06c82e1efe30365/pge-outages.json#L13-L25

  {
    "outageNumber": "407367",
    "outageStartTime": "1560355216",
    "crewCurrentStatus": "PG&E repair crew is on-site working to restore power.",
    "currentEtor": "1560376800",
    "cause": "Our preliminary determination is that your outage was caused by scheduled maintenance work.",
    "estCustAffected": "3",
    "lastUpdateTime": "1560355709",
    "hazardFlag": "0",
    "latitude": "37.35629",
    "longitude": "-119.70469",
    "outageDevices": [
      {
        "latitude": "37.35409",
        "longitude": "-119.70575"
      },
      {
        "latitude": "37.35463",
        "longitude": "-119.70525"
      },
      {
        "latitude": "37.35562",
        "longitude": "-119.70467"
      }
    ],
    "regionName": "Ahwahnee"
  }

These could either be inserted into an outageDevices table that uses rowid... or we could have a mechanism where we automatically derive a primary key for them based on a hash of their data, hence avoiding creating duplicates even though we don't have a provided primary key.

simonw avatar Jun 13 '19 03:06 simonw

I like the term "extract" for what we are doing here, partly because that's the terminology I used in csvs-to-sqlite.

simonw avatar Jun 13 '19 03:06 simonw

Another complexity from the https://api.github.com/repos/simonw/datasette/pulls example:

Mozilla_Firefox

We don't actually want head and base to be pulled out into a separate table. Our ideal table design would probably look something like this:

  • url: ...
  • id: 285698310
  • ...
  • user_id: 9599 -> refs users
  • head_label: simonw:travis-38dev
  • head_ref: travis-38dev
  • head_sha: f274f9004302c5ca75ce89d0abfd648457957e31
  • head_user_id: 9599 -> refs users
  • head_repo_id: 107914493 -> refs repos
  • base_label: simonw:master
  • base_ref: master
  • base_sha: 5e8fbf7f6fbc0b63d0479da3806dd9ccd6aaa945
  • base_user_id: 9599 -> refs users
  • base_repo_id: 107914493 -> refs repos

So the nested head and base sections here, instead of being extracted into another table, were flattened into their own columns.

So perhaps we need a flatten-nested-into-columns mechanism which can be used in conjunction with a extract-to-tables mechanism.

simonw avatar Jun 13 '19 03:06 simonw

So maybe something like this:

curl https://api.github.com/repos/simonw/datasette/pulls?state=all | \
  sqlite-utils insert git.db pulls - \
    --flatten=base \
    --flatten=head \
    --extract=user:users:id \
    --extract=head_repo.license:licenses:key \
    --extract=head_repo.owner:users \
    --extract=head_repo
    --extract=base_repo.license:licenses:key \
    --extract=base_repo.owner:users \
    --extract=base_repo

Is the order of those nested --extract lines significant I wonder? It would be nice if the order didn't matter and the code figured out the right execution plan on its own.

simonw avatar Jun 13 '19 04:06 simonw

So for --extract the format is path-to-property:table-to-extract-to:primary-key

If we find an array (as opposed to a direct nested object) at the end of the dotted path we do a m2m table.

And if primary-key is omitted maybe we do the rowid thing with a foreign key back to ourselves.

simonw avatar Jun 13 '19 04:06 simonw

I may ignore --flatten for the moment - users can do their own flattening using jq if they need that.

curl https://api.github.com/repos/simonw/datasette/pulls?state=all | jq "
    [.[] | . + {
        base_label: .base.label,
        base_ref: .base.ref,
        base_sha: .base.sha,
        base_user: .base.user,
        base_repo: .base.repo,
        head_label: .head.label,
        head_ref: .head.ref,
        head_sha: .head.sha,
        head_user: .head.user,
        head_repo: .head.repo
    } | del(.base, .head, ._links)]
"

Output: https://gist.github.com/simonw/2703ed43fcfe96eb8cfeee7b558b61e1

simonw avatar Jun 13 '19 04:06 simonw

I think the implementation of this will benefit from #23 (syntactic sugar for creating m2m records)

simonw avatar Jun 30 '19 17:06 simonw

Problem: extract means something else now, see #47 and the upcoming work in #42.

simonw avatar Sep 22 '20 07:09 simonw

i was just reaching for a tool to do this this morning

fgregg avatar Nov 09 '21 14:11 fgregg

if you are curious about prior art, https://github.com/jsnell/json-to-multicsv is really good!

fgregg avatar Feb 08 '22 01:02 fgregg

I ran into this. My use case has a JSON file with array of book objects with a key called reviews which is also an array of objects. My JSON is human-edited and does not specify IDs for either books or reviews. Because sqlite-utils does not support inserting nested objects, I instead have to maintain two separate CSV files with id column in books.csv and book_id column in reviews.csv.

I think the right way to declare the relationship while inserting a JSON might be to describe the relationship:

sqlite-utils insert data.db books mydata.json --hasmany reviews --hasone author --manytomany tags

This is relying on the assumption that foreign keys can point to rowid primary key.

nileshtrivedi avatar May 31 '22 06:05 nileshtrivedi

Have you seen MakeTypes? Not the exact same thing but it may be relevant.

And it's inspired by the paper "Types from Data: Making Structured Data First-Class Citizens in F#".

izzues avatar Jun 29 '22 23:06 izzues