JSON-AD Importer - atomic data publishing imports

Open joepio opened this issue 3 years ago • 3 comments

see https://github.com/ontola/atomic-data-docs/issues/93

  • [x] Allow JSON-AD importers to deal with
    • [x] localID
    • [ ] globalId
    • [x] References to other (internal) resources
    • [ ] Nested resources
  • [x] Authorization checks
  • [x] Create Importer Class Resource (and / or Endpoint?)
  • [x] Add a plugin for the Importer Class.
  • [ ] Periodic runner
  • [x] Front-end for Importer (update JS assets)
  • [ ] Webhook Parser (maybe do this later)
  • [x] CLI option: atomic-server importer ./my-file --to https://localhost/imports/1, or parse piped input (STDIN)
  • [ ] Parallelizable (would be awesome)

Implementation thoughts

The process of importing things can be initiated in various ways:

  • User manually imports some resource.
  • Periodic pull: The server initiates, e.g. auto-import of some external URL that is checked periodically.
  • Push: External service initiates. e.g. WebHooks. This makes tokens relevant.

We want a front-end that:

  • Easily instantiates Imports. Press the plus icon, create an import
  • Allows for manual refresh or automatic / periodic refresh configuration (e.g. every 24 hours) of external URLs
  • Allows pasting JSON-AD into a field.
  • Allows setting rights / tokens. Ideally, you'd get a WebHook URL that you can simply copy/paste into some WebHook client that sends (POSTs?) items.
  • Shows recently imported items.

The back-end:

  • Needs an extended JSON-AD Parser. I think adding an optional parent argument should suffice. This is the context: the Resource that is set as the parent for everything. Every time a resource is encountered without an @id but with a localId, its parent is set to this resource. During URL generation, the path is created as a child of the Importer's path. So if the parent is https://example.com/importers/twitter, the new ID will be https://example.com/importers/twitter/local_id_1.
  • Background job worker, which periodically fires to update things. Atomic-Server has the runtime, but Atomic-Lib has the Db. We could spin up some tokio periodic runtime from the Db, but this would mean that it may be cloned across threads. I think this should be a server thing. In any case, I'd prefer this to be designed as just another Plugin, which has some sort of periodic function handle.
  • WebHook parser. This should be handled by get_extended_resource. I think we're going to have to send the POST body to this function, too... We already parse query params, now we're also gonna parse the body. And it would probably not take very long until we also allow plugins to use HTTP headers. It would definitely make plugins more powerful, but it could also lead to a lower degree of standardization between plugins. Currently, they all work with query parameters, similar to Endpoints. This leads to a standardized API and interactive frontends that can be auto-generated. Maybe we should limit it to accept only a body if you POST and not support HTTP headers.
  • Token-based auth. Relates to webhook parsing. So we want to allow some sort of system to post things to an Importer (or some child of the importer).
  • CLI option. Sending imports over HTTP is fine for small files, but larger ones require a more performant option. Having an importer option in atomic-server cli seems logical. I guess we should allow piping JSON-AD resources here.
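
The parent-based URL generation from the first back-end bullet could look roughly like the sketch below. It is an illustration only, not the actual Atomic-Lib parser: the function name resolve_subject and its signature are hypothetical. It assumes only what is described above: a resource with an @id keeps that URL, and a resource with only a localId gets a URL that is a child of the Importer's path. Rejecting an empty localId is an extra assumption of this sketch.

```rust
/// Hypothetical helper (not part of Atomic-Lib): decide the subject URL
/// for a parsed JSON-AD object, given the Importer that acts as parent.
fn resolve_subject(
    parent: &str,
    id: Option<&str>,
    local_id: Option<&str>,
) -> Result<String, String> {
    match (id, local_id) {
        // An explicit @id wins; the resource keeps its own URL.
        (Some(id), _) => Ok(id.to_string()),
        // Only a localId: build a child URL under the Importer's path.
        (None, Some(local)) => {
            // Assumption of this sketch: an empty localId is an error.
            if local.is_empty() {
                return Err("localId must not be empty".to_string());
            }
            Ok(format!("{}/{}", parent.trim_end_matches('/'), local))
        }
        // Neither identifier: this sketch refuses to guess a subject.
        (None, None) => Err("resource has neither @id nor localId".to_string()),
    }
}
```

Called with the parent https://example.com/importers/twitter and the localId local_id_1, this yields https://example.com/importers/twitter/local_id_1, matching the example in the bullet.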

joepio, Apr 20 '22 11:04

I need to re-consider how importing happens.

So right now, the parse_json_ad_array function actually adds resources to the store. I think that fails when we try to import a resource which also includes resources with either @id or localId. So maybe adding to the store should happen far deeper.

joepio, Jul 15 '22 11:07

Currently, the server CLI import command needs an explicit --parent URL if you're parsing new resources. This is kind of cumbersome. I think we may need a default importer, which is created as a step in populate. An alternative is to create a new importer for every import. Maybe also acceptable?

joepio, Jul 20 '22 15:07

I'm finding it difficult to implement logic for authorization checks.

Attack scenarios that I want to cover:

JSON-AD containing existing resource

An attacker creates a JSON-AD file that seems normal, but which includes some existing resource (e.g. the Victim's Agent profile). The Victim imports the JSON-AD, which overwrites their existing resource (e.g. gives Read + Write rights to the Attacker, or edits the public key of the Agent).

I think the solution is to - by default - only allow importing items that do not overwrite resources that are outside of the hierarchy.
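
A rough sketch of such a default check, under stated assumptions: the store is modelled as a plain map from each existing subject to its optional parent, and the names Parents, is_descendant_of and check_import are hypothetical, not Atomic-Lib APIs. New subjects are allowed, while an existing subject may only be overwritten if it is a descendant of the Importer.

```rust
use std::collections::{HashMap, HashSet};

/// Toy store for this sketch: maps every existing subject to its parent,
/// or to None for a resource without a parent.
type Parents = HashMap<String, Option<String>>;

/// Walks up the parent chain of `subject` and reports whether `ancestor`
/// is reached along the way.
fn is_descendant_of(store: &Parents, subject: &str, ancestor: &str) -> bool {
    let mut seen: HashSet<&str> = HashSet::new();
    let mut current = subject;
    while let Some(Some(parent)) = store.get(current) {
        if parent.as_str() == ancestor {
            return true;
        }
        // Guard against cycles in malformed parent data.
        if !seen.insert(parent.as_str()) {
            return false;
        }
        current = parent.as_str();
    }
    false
}

/// Rejects an import that would overwrite an existing resource which is
/// not a descendant of the Importer. Subjects that do not exist yet pass.
fn check_import(store: &Parents, importer: &str, subjects: &[&str]) -> Result<(), String> {
    for &subject in subjects {
        let exists = store.contains_key(subject);
        if exists && !is_descendant_of(store, subject, importer) {
            return Err(format!(
                "refusing to overwrite {} (outside of {})",
                subject, importer
            ));
        }
    }
    Ok(())
}
```

With this check, the scenario above fails early: the Victim's Agent profile already exists and does not have the Importer as an ancestor, so the whole import is rejected before anything is written.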

joepio, Aug 08 '22 14:08

Currently, Importer is a Class Extender. This means that you can instantiate multiple Importers, and they all have URLs.

The alternative approach is to have one single /import Endpoint. This has some advantages:

  • Code is cleaner
  • Predictable URL

But it also has limitations, because it is not stateful / does not store any values:

  • Can't use periodic runners. Those would need some instance that stores values.
  • Doesn't have children. We'd have to require a parent target, as well as the JSON-AD itself.

joepio, Feb 14 '23 11:02

I'm not planning to do the open tasks, as I don't have a clear use case for them now. Perhaps later!

joepio, Sep 29 '23 08:09