Add superuser API endpoints to export and import org data
Fixes #890
This PR introduces new org import and export API endpoints, as well as new Administrator deployment documentation on how to manage the process of exporting and importing orgs, including copying files between S3 buckets or from an S3 bucket to a local directory as necessary.
The import endpoint supports query parameters for ignoring mismatches between the database version recorded in the export JSON and the new cluster's database version, as well as for updating org and file storage refs to point to a new storage name.
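For context, driving the new endpoints from a script might look roughly like the following; the paths, query parameter names (`ignoreVersion`, `storageName`), and auth header here are illustrative assumptions rather than the exact API added in this PR:

```python
import requests

API = "https://app.example.com/api"          # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <superuser-jwt>"}

# Export an org to JSON, streaming the response body to disk
resp = requests.get(f"{API}/orgs/<org-id>/export/json", headers=HEADERS, stream=True)
resp.raise_for_status()
with open("org-export.json", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=1024 * 1024):
        fh.write(chunk)

# Import into the new cluster, ignoring a DB version mismatch and
# rewriting file storage refs to a different storage name
with open("org-export.json", "rb") as fh:
    requests.post(
        f"{API}/orgs/import/json",
        params={"ignoreVersion": "true", "storageName": "new-storage"},
        data=fh,                              # streamed request body
        headers=HEADERS,
    ).raise_for_status()
```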
A sample JSON export is attached. An accompanying zipped directory of S3 objects that match the export is available on request (it's too large to attach here).
1798648a-d717-45e3-a717-23132ed4030b-export.json
I am leaving testing instructions intentionally bare to see if the new docs can stand on their own.
We likely want to eventually move export/import to async processes kicked off by the API endpoint, rather than handling them within the request/response cycle. I haven't gone down that road yet, as I wanted to see how the current implementation fares against existing larger organizations before committing to the additional development.
Assigning @Shrinks99 for docs copy check :)
Docs updates complete:
- Org import/export moved into an `admin` directory
- Language updated to match rest of documentation
- Merged main into this branch, fixed conflicts.
- Fixed the docs issues to align with #1476
Rebased on latest main
@ikreymer This will need to be merged prior to org deletion, as the latter is dependent on it. Has been rebased against latest main.
Nice! Great work. With the latest changes, we also need to add pages to it.
One concern is the size of the export file, since it all needs to be assembled in memory (especially with the page list included). I wonder if JSON is the right format for this. Perhaps we version this with `/export/json`, in case we want to add other options down the line?
One option we could consider is CBOR via https://cbor2.readthedocs.io/en/latest/usage.html, which seems to support streaming output (though not necessarily asyncio-friendly), but maybe that doesn't matter if we're writing one object at a time? Needs a bit more thought on how we'd structure the data to be fully streaming-download and streaming-upload friendly...
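For a sense of what that could look like, here is a minimal sketch of writing and reading one CBOR object per record with cbor2; the record shapes and file layout are made up for illustration:

```python
import cbor2

def fetch_crawls():
    # placeholder standing in for a DB cursor over the org's crawls
    yield {"id": "crawl-1", "state": "complete"}
    yield {"id": "crawl-2", "state": "complete"}

# Writing: one top-level CBOR object per record, so the export is never
# assembled in memory as a single document
with open("org-export.cbor", "wb") as fh:
    cbor2.dump({"org": {"name": "example-org"}}, fh)
    for crawl in fetch_crawls():
        cbor2.dump(crawl, fh)

# Reading: pull the records back one at a time, in the same order
with open("org-export.cbor", "rb") as fh:
    org = cbor2.load(fh)
    try:
        while True:
            crawl = cbor2.load(fh)
            print("would import", crawl["id"])
    except cbor2.CBORDecodeEOF:
        pass  # stream exhausted
```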
Or, perhaps we just keep this as is, but consider switching to https://pypi.org/project/json-stream/ without changing the format. But perhaps versioning under `/json` still makes sense?
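A rough sketch of the json-stream read side, assuming a top-level layout with an `org` object followed by a `crawls` array (an assumed shape, not the actual export format):

```python
import json_stream

# Reading a large export without materializing the whole document.
# In json-stream's default transient mode, keys must be visited in
# document order, so "org" is read before "crawls" here.
with open("org-export.json", "r") as fh:
    data = json_stream.load(fh)
    org_name = data["org"]["name"]
    for crawl in data["crawls"]:
        record = json_stream.to_standard_types(crawl)  # materialize one record
        print(org_name, record.get("id"))
```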
I was starting with the easy stuff before moving on to serialization. I do think versioning makes sense, but also want to spend a little time testing/considering the other possible serialization formats.
Converting to draft while working on streaming
@ikreymer This is rebased on main and tests added, with streaming support for import and export! Ready for review.
Nice work!! Tested with very large export from dev, import locally. It worked, though ran into some minor issues:
- max scale on import can be lower than on export, so just need to clamp it to `MAX_CRAWL_SCALE`
- some crawls were missing `crawlerChannel` and that caused a validation error, should set it to `default` on import (a quick sketch of both fixes is below)
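A minimal sketch of both fixes on the import path, assuming crawls arrive as plain dicts before validation; the function name and the `MAX_CRAWL_SCALE` value here are illustrative:

```python
# Illustrative cap; the real value comes from the backend configuration
MAX_CRAWL_SCALE = 3

def normalize_imported_crawl(crawl: dict) -> dict:
    # clamp scale so an export from a cluster with a higher cap still validates
    crawl["scale"] = min(crawl.get("scale", 1), MAX_CRAWL_SCALE)
    # older exports may lack crawlerChannel; fall back to the default channel
    if not crawl.get("crawlerChannel"):
        crawl["crawlerChannel"] = "default"
    return crawl
```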
Also had the local nginx time out on import, but that can be fixed by adding:
```nginx
proxy_http_version 1.1;
proxy_read_timeout 600;
proxy_request_buffering off;
```
Maybe we should also do that for the ingress (again, this is for larger imports). Another option is gzip content-encoding for the export, though that can be added later.
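If gzip content-encoding is added later, a minimal sketch with a FastAPI `StreamingResponse` might look like the following; the endpoint path and export generator are placeholders, not this PR's code:

```python
import zlib
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def export_chunks():
    # placeholder for the real streaming JSON export
    yield b'{"org": {"name": "example-org"}, "crawls": ['
    yield b'{"id": "crawl-1"}'
    yield b"]}"

async def gzip_stream(chunks):
    # wbits=31 selects the gzip container, matching Content-Encoding: gzip
    compressor = zlib.compressobj(wbits=31)
    async for chunk in chunks:
        data = compressor.compress(chunk)
        if data:
            yield data
    yield compressor.flush()

@app.get("/orgs/{oid}/export/json")
async def export_org(oid: str):
    return StreamingResponse(
        gzip_stream(export_chunks()),
        media_type="application/json",
        headers={"Content-Encoding": "gzip"},
    )
```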
> Nice work!! Tested with very large export from dev, import locally. It worked, though ran into some minor issues:
>
> * max scale on import can be lower than on export, so just need to clamp it to `MAX_CRAWL_SCALE`
> * some crawls were missing `crawlerChannel` and that caused a validation error, should set it to `default` on import
Should be good now as of latest commit!