
Run ingestors on the CLI to generate "ftm-bundles"

Open sunu opened this issue 3 years ago • 0 comments

What is an ftm-bundle?

An ftm-bundle is a zip file containing structured FtM entities and document blobs. The structure of the zip file may look something like:

bundle.zip/
  entities.json
  index.json
  archive/
    ab/cd/..
    a1/b2/..
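
As a rough illustration of the layout above, here is a minimal Python sketch that writes such a zip. The issue does not specify the file formats, so several details are assumptions: `entities.json` is written as one JSON object per line, `index.json` is a small manifest, and blobs are sharded under `archive/` by the leading characters of their SHA-1 hash. The names `blob_path` and `write_bundle` are hypothetical.

```python
import hashlib
import json
import zipfile
from pathlib import Path


def blob_path(content_hash: str) -> str:
    """Shard a blob under archive/ by its hash prefix (assumed layout)."""
    return f"archive/{content_hash[:2]}/{content_hash[2:4]}/{content_hash}"


def write_bundle(dest: Path, entities: list[dict], blobs: list[bytes]) -> list[str]:
    """Write entities and blobs into a zip at dest; return the blob hashes."""
    hashes: list[str] = []
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as bundle:
        # Assumption: one JSON-encoded entity per line.
        bundle.writestr("entities.json", "\n".join(json.dumps(e) for e in entities))
        for data in blobs:
            content_hash = hashlib.sha1(data).hexdigest()
            if content_hash in hashes:
                continue  # identical content is stored only once
            bundle.writestr(blob_path(content_hash), data)
            hashes.append(content_hash)
        # Assumption: index.json is a manifest of what the bundle contains.
        manifest = {"entities": len(entities), "blobs": hashes}
        bundle.writestr("index.json", json.dumps(manifest))
    return hashes
```

Addressing blobs by content hash means the same file is stored once even if several entities refer to it.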

Aleph should know how to load an ftm-bundle into a collection without needing to pass the files to ingest-file. Ideally, when an ftm-bundle is uploaded to Aleph, Aleph will copy the entities into ftm-store and the document blobs into the document archive (GCS, S3, etc.) and trigger a reindex.
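
The loading step could be sketched roughly as follows. This is not Aleph's actual API: a plain list stands in for ftm-store and a dict stands in for the document archive, `load_bundle` is a hypothetical name, and `entities.json` is assumed to hold one JSON object per line. The reindex trigger is left out.

```python
import json
import zipfile


def load_bundle(bundle_path, entity_store: list, blob_archive: dict) -> int:
    """Copy entities and blobs out of a bundle; return the number of entities loaded."""
    loaded = 0
    with zipfile.ZipFile(bundle_path) as bundle:
        for name in bundle.namelist():
            # Blobs live under archive/; skip directory entries, which end in "/".
            if name.startswith("archive/") and not name.endswith("/"):
                content_hash = name.rsplit("/", 1)[-1]
                blob_archive[content_hash] = bundle.read(name)
        # Assumption: entities.json holds one JSON object per line.
        with bundle.open("entities.json") as fh:
            for line in fh:
                if line.strip():
                    entity_store.append(json.loads(line))
                    loaded += 1
    # A real implementation would trigger a reindex of the collection here.
    return loaded
```

Because nothing in this path calls an ingestor, loading a bundle only copies data.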

The ingestor script that generates an ftm-bundle from a directory may look something like this:

./ingest-bundle.sh --parallel 6 --source-dir my_data --dest-path . --dest-file my_data.ftmbundle
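
For illustration, the same command-line interface could be sketched in Python as below. The flags mirror the command above. Everything else is an assumption: `ingest_file` is a placeholder that only records the file name and size instead of running a real ingestor, and a thread pool stands in for whatever parallelism the real script would use.

```python
import argparse
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the flags used in the example command."""
    parser = argparse.ArgumentParser(prog="ingest-bundle")
    parser.add_argument("--parallel", type=int, default=1, help="number of workers")
    parser.add_argument("--source-dir", type=Path, required=True)
    parser.add_argument("--dest-path", type=Path, default=Path("."))
    parser.add_argument("--dest-file", required=True)
    return parser


def ingest_file(path: Path) -> dict:
    """Placeholder for a real ingestor: records only the file name and size."""
    return {"file_name": path.name, "size": path.stat().st_size}


def run(args: argparse.Namespace) -> list[dict]:
    """Process every file under --source-dir with --parallel workers."""
    files = sorted(p for p in args.source_dir.rglob("*") if p.is_file())
    with ThreadPoolExecutor(max_workers=args.parallel) as pool:
        # pool.map returns results in the same order as the input files.
        return list(pool.map(ingest_file, files))
```

Writing the results to `--dest-path`/`--dest-file` is omitted here; it would package the output into the bundle zip.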

Why is this useful?

  • If we can run ingestors on the CLI to generate ftm-bundles, it should help with debugging the ingest process and with processing large amounts of data incrementally. When we encounter bugs in ingestors, it will also help us iterate quickly, without needing to reupload the data to Aleph or wait for the deployment of a new Aleph version.
  • ftm-bundles will enable us to share both structured and unstructured data across Aleph instances. See https://github.com/alephdata/aleph/discussions/1523
  • If we print a nice summary of errors after an ingestor run on the CLI, it will help us figure out which files are problematic in large-scale data processing.
  • It will be helpful for organizations that run Aleph but don't have the infrastructure to monitor the logs of the ingest-file service or to run many parallel workers for faster processing. They will be able to process the data offline and upload the artifacts to Aleph, which should make the process a bit easier to monitor and manage.
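
The error summary mentioned in the list above might look something like this minimal sketch. It assumes the ingestor run collects failures as `(path, exception)` pairs; `summarize_errors` is a hypothetical helper, not an existing Aleph or ingest-file function.

```python
from collections import Counter


def summarize_errors(errors: list[tuple[str, Exception]], total_files: int) -> str:
    """Format a summary of failed files, grouped by exception type."""
    by_type = Counter(type(exc).__name__ for _, exc in errors)
    lines = [f"{len(errors)} of {total_files} files failed"]
    # Most frequent error types first, so systematic problems stand out.
    for name, count in by_type.most_common():
        lines.append(f"  {name}: {count}")
    # Then list each problematic file with its error message.
    for path, exc in errors:
        lines.append(f"  {path}: {exc}")
    return "\n".join(lines)
```

Grouping by exception type first makes it easier to tell one broken ingestor from a handful of corrupt files.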

sunu, Feb 11 '22 06:02