Add superuser API endpoints to export and import org data
Fixes #890
This PR introduces new org import and export API endpoints, as well as new Administrator deployment documentation on how to manage the process of exporting and importing orgs, including copying files between S3 buckets or from an S3 bucket to a local directory as necessary.
The import endpoint supports query parameters for ignoring mismatches between the database version recorded in the export JSON and the new cluster's database version, as well as for updating org and file storage refs to point to a new storage name.
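For context, driving the new endpoints from a script might look roughly like the following; the paths, query parameter names (`ignoreVersion`, `storageName`), and auth header here are illustrative assumptions rather than the exact API added in this PR:

```python
import requests

API = "https://app.example.com/api"          # hypothetical deployment URL
HEADERS = {"Authorization": "Bearer <superuser-jwt>"}

# Export an org to JSON, streaming the response body to disk
resp = requests.get(f"{API}/orgs/<org-id>/export/json", headers=HEADERS, stream=True)
resp.raise_for_status()
with open("org-export.json", "wb") as fh:
    for chunk in resp.iter_content(chunk_size=1024 * 1024):
        fh.write(chunk)

# Import into the new cluster, ignoring a DB version mismatch and
# rewriting file storage refs to a different storage name
with open("org-export.json", "rb") as fh:
    requests.post(
        f"{API}/orgs/import/json",
        params={"ignoreVersion": "true", "storageName": "new-storage"},
        data=fh,                              # streamed request body
        headers=HEADERS,
    ).raise_for_status()
```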
A sample JSON export is attached. An accompanying zipped directory of S3 objects that match the export is available on request (it's too large to attach here).
1798648a-d717-45e3-a717-23132ed4030b-export.json
I am leaving testing instructions intentionally bare to see if the new docs can stand on their own.
We likely want to eventually move export/import to async processes kicked off by the API endpoint, rather than handling them within the request/response cycle. I haven't gone down that road yet, as I wanted to see how the current implementation fares against existing larger organizations before committing to the additional development.
Assigning @Shrinks99 for docs copy check :)
Docs updates complete:
- Org import/export moved into an `admin` directory
- Language updated to match rest of documentation
- Merged main into this branch, fixed conflicts.
- Fixed the docs issues to align with #1476
Rebased on latest main
@ikreymer This will need to be merged prior to org deletion, as the latter is dependent on it. Has been rebased against latest main.
Nice! Great work. With the latest changes, we also need to add pages to it.
One concern is the size of the export file, since it all needs to be assembled in memory (especially with the page list included). I wonder if JSON is the right format for this. Perhaps we version this with `/export/json`, in case we want to add other options down the line?
One option we could consider is CBOR via https://cbor2.readthedocs.io/en/latest/usage.html, which seems to support streaming output (though not necessarily asyncio-friendly), but maybe that doesn't matter if we're writing one object at a time? Needs a bit more thought on how we'd structure the data to be fully streaming-download and streaming-upload friendly...
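For a sense of what that could look like, here is a minimal sketch of writing and reading one CBOR object per record with cbor2; the record shapes and file layout are made up for illustration:

```python
import cbor2

def fetch_crawls():
    # placeholder standing in for a DB cursor over the org's crawls
    yield {"id": "crawl-1", "state": "complete"}
    yield {"id": "crawl-2", "state": "complete"}

# Writing: one top-level CBOR object per record, so the export is never
# assembled in memory as a single document
with open("org-export.cbor", "wb") as fh:
    cbor2.dump({"org": {"name": "example-org"}}, fh)
    for crawl in fetch_crawls():
        cbor2.dump(crawl, fh)

# Reading: pull the records back one at a time, in the same order
with open("org-export.cbor", "rb") as fh:
    org = cbor2.load(fh)
    try:
        while True:
            crawl = cbor2.load(fh)
            print("would import", crawl["id"])
    except cbor2.CBORDecodeEOF:
        pass  # stream exhausted
```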
Or, perhaps we just keep this as is, but consider switching to https://pypi.org/project/json-stream/ without changing the format. But perhaps versioning under `/json` still makes sense?
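A rough sketch of the json-stream read side, assuming a top-level layout with an `org` object followed by a `crawls` array (an assumed shape, not the actual export format):

```python
import json_stream

# Reading a large export without materializing the whole document.
# In json-stream's default transient mode, keys must be visited in
# document order, so "org" is read before "crawls" here.
with open("org-export.json", "r") as fh:
    data = json_stream.load(fh)
    org_name = data["org"]["name"]
    for crawl in data["crawls"]:
        record = json_stream.to_standard_types(crawl)  # materialize one record
        print(org_name, record.get("id"))
```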
I was starting with the easy stuff before moving on to serialization. I do think versioning makes sense, but also want to spend a little time testing/considering the other possible serialization formats.
Converting to draft while working on streaming
@ikreymer This is rebased on main and tests added, with streaming support for import and export! Ready for review.
Nice work!! Tested with very large export from dev, import locally. It worked, though ran into some minor issues:
- max scale on import can be lower than on export, so just need to clamp it to `MAX_CRAWL_SCALE`
- some crawls were missing `crawlerChannel` and that caused a validation error, should set it to `default` on import (a quick sketch of both fixes is below)
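A minimal sketch of both fixes on the import path, assuming crawls arrive as plain dicts before validation; the function name and the `MAX_CRAWL_SCALE` value here are illustrative:

```python
# Illustrative cap; the real value comes from the backend configuration
MAX_CRAWL_SCALE = 3

def normalize_imported_crawl(crawl: dict) -> dict:
    # clamp scale so an export from a cluster with a higher cap still validates
    crawl["scale"] = min(crawl.get("scale", 1), MAX_CRAWL_SCALE)
    # older exports may lack crawlerChannel; fall back to the default channel
    if not crawl.get("crawlerChannel"):
        crawl["crawlerChannel"] = "default"
    return crawl
```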
Also had the local nginx time out on import, but that can be fixed by adding:
```nginx
proxy_http_version 1.1;
proxy_read_timeout 600;
proxy_request_buffering off;
```
Maybe we should also do that for the ingress (again, this is for larger imports). Another option is gzip content-encoding for the export, though that can be added later.
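If gzip content-encoding is added later, a minimal sketch with a FastAPI `StreamingResponse` might look like the following; the endpoint path and export generator are placeholders, not this PR's code:

```python
import zlib
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

async def export_chunks():
    # placeholder for the real streaming JSON export
    yield b'{"org": {"name": "example-org"}, "crawls": ['
    yield b'{"id": "crawl-1"}'
    yield b"]}"

async def gzip_stream(chunks):
    # wbits=31 selects the gzip container, matching Content-Encoding: gzip
    compressor = zlib.compressobj(wbits=31)
    async for chunk in chunks:
        data = compressor.compress(chunk)
        if data:
            yield data
    yield compressor.flush()

@app.get("/orgs/{oid}/export/json")
async def export_org(oid: str):
    return StreamingResponse(
        gzip_stream(export_chunks()),
        media_type="application/json",
        headers={"Content-Encoding": "gzip"},
    )
```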
> Nice work!! Tested with very large export from dev, import locally. It worked, though ran into some minor issues:
>
> * max scale on import can be lower than on export, so just need to clamp it to `MAX_CRAWL_SCALE`
> * some crawls were missing `crawlerChannel` and that caused a validation error, should set it to `default` on import
Should be good now as of latest commit!