Add manifest file for MSQ export
This PR adds the capability for MSQ export to create a manifest file at the destination.
Motivation
Currently, export creates the files at the provided destination. The manifest file added by this PR lists the files created as part of the export. This makes the data exported from Druid easier to consume, especially for automated data pipelines. There is still a safety check that requires the destination to be empty, but the manifest would be especially helpful if that condition is relaxed in the future. Druid currently does not support reading from a manifest file.
Structure
The manifest file created is in the symlink manifest format. The file is created at the path <export destination>/_symlink_format_manifest/manifest. Normally, this would be <export destination>/_symlink_format_manifest/<partition path>/manifest, but since Druid does not support partitioning, the manifest is always created in the _symlink_format_manifest folder itself. Each line of the file contains an absolute path to a file created by the export.
The path is prefixed by file: if the destination is on a local disk.
Additionally, a file _symlink_format_manifest/druid_export_meta is created. The file contains additional information about the export. Currently, this only contains the manifest file version, to track which version of the manifest file was created by the export.
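As a rough illustration of how a downstream pipeline might consume this layout, the sketch below reads the manifest from a local export directory and strips the file: prefix from local entries. The function names read_manifest and strip_local_prefix are hypothetical and not part of this PR, and the code assumes the manifest is plain UTF-8 text with one path per line, as described above.

```python
from pathlib import Path

# Folder and file names taken from the PR description.
MANIFEST_DIR = "_symlink_format_manifest"
MANIFEST_FILE = "manifest"


def read_manifest(export_dir):
    """Return the non-empty lines of <export_dir>/_symlink_format_manifest/manifest.

    Hypothetical helper: assumes one absolute path per line.
    """
    manifest = Path(export_dir) / MANIFEST_DIR / MANIFEST_FILE
    lines = manifest.read_text(encoding="utf-8").splitlines()
    return [line.strip() for line in lines if line.strip()]


def strip_local_prefix(entry):
    """Drop the leading 'file:' from a local-disk entry; leave other entries as-is."""
    prefix = "file:"
    if entry.startswith(prefix):
        return entry[len(prefix):]
    return entry
```

For example, [strip_local_prefix(e) for e in read_manifest("/path/to/export")] would give plain filesystem paths for a local export, while S3 entries would pass through unchanged.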
Example
Local storage:
└[~/export]> cat _symlink_format_manifest/manifest
file:/Users/adarshsanjeev/export/query-293c1f4c-d5ed-4b04-9690-d7d2d9db4995-worker2-partition23.csv
file:/Users/adarshsanjeev/export/query-293c1f4c-d5ed-4b04-9690-d7d2d9db4995-worker1-partition13.csv
file:/Users/adarshsanjeev/export/query-293c1f4c-d5ed-4b04-9690-d7d2d9db4995-worker0-partition24.csv
...
file:/Users/adarshsanjeev/export/query-293c1f4c-d5ed-4b04-9690-d7d2d9db4995-worker1-partition1.csv
S3 export file:
File created at s3://export-bucket/export/_symlink_format_manifest/manifest
s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker2-partition2.csv
s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker1-partition1.csv
s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker0-partition0.csv
...
s3://export-bucket/export/query-6564a32f-2194-423a-912e-eead470a37c4-worker0-partition24.csv
druid_export_meta:
version: 1
Export is still an experimental feature, and the structure of the file could be changed in the future.
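A consumer that wants to guard against future format changes could check the version recorded in druid_export_meta before reading the manifest. The sketch below is only an assumption about how that might look: parse_export_meta is a hypothetical helper, and it assumes the file consists of "key: value" lines like the example above.

```python
def parse_export_meta(text):
    """Parse 'key: value' lines into a dict of strings.

    Hypothetical helper: the line format is assumed from the example
    in this description and may change while export is experimental.
    """
    meta = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or unrecognized lines
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta


def is_supported_version(meta, supported=("1",)):
    """Return True if the recorded manifest version is one the consumer knows."""
    return meta.get("version") in supported
```

A pipeline could refuse to proceed when is_supported_version returns False, rather than guessing at an unknown manifest layout.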
Upgrade issues
- During a rolling update, workers on older versions do not return a list of exported files, and controllers on older versions do not create a manifest file. Therefore, export queries run during this time might have incomplete or missing manifests.
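For exports to local disk, one way to detect an incomplete manifest after a rolling update is to compare the files present in the destination with the entries in the manifest. The sketch below is a hypothetical check, not something this PR provides. It assumes exported files sit directly in the destination folder, as in the example above, and compares by file name only.

```python
from pathlib import Path


def find_unlisted_files(export_dir):
    """Return names of files in export_dir that the manifest does not list.

    Hypothetical check for local exports; assumes exported files are
    written directly under export_dir and the manifest has one path per line.
    """
    root = Path(export_dir)
    manifest = root / "_symlink_format_manifest" / "manifest"

    listed = set()
    for line in manifest.read_text(encoding="utf-8").splitlines():
        entry = line.strip()
        if not entry:
            continue
        if entry.startswith("file:"):
            entry = entry[len("file:"):]
        listed.add(Path(entry).name)

    # Only regular files at the top level; the manifest folder itself is skipped.
    on_disk = {p.name for p in root.iterdir() if p.is_file()}
    return sorted(on_disk - listed)
```

An empty result suggests the manifest covers every exported file. A missing manifest raises FileNotFoundError, which corresponds to the case of an older controller.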
Release notes
- Export queries will also create a manifest file at the destination, which lists the files created by the query.
This PR has:
- [ ] been self-reviewed.
- [ ] using the concurrency checklist (Remove this item if the PR doesn't have any relation to concurrency.)
- [ ] added documentation for new or modified features or behaviors.
- [ ] a release note entry in the PR description.
- [ ] added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
- [ ] added or updated version, license, or notice information in licenses.yaml
- [ ] added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
- [ ] added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
- [ ] added integration tests.
- [ ] been tested in a test Druid cluster.
To check that the created export file is in the symlink format, I generated manifest files using Apache Spark with Delta Lake. The generated files have a similar format for both local disk and S3. The only difference is that Delta Lake uses s3a while writing, so the paths in its manifest file use the s3a scheme.
Local:
NBuser@c46fe1cca55f:/tmp/delta-table$ cat _symlink_format_manifest/manifest
file:/tmp/delta-table/part-00003-463556da-8423-41f9-a25a-0b68a51e0fff-c000.snappy.parquet
file:/tmp/delta-table/part-00005-f1682aee-f29b-41ad-8ac5-9b49d8de1394-c000.snappy.parquet
file:/tmp/delta-table/part-00007-486d1cda-5871-43f8-86cc-3a44e4adede7-c000.snappy.parquet
file:/tmp/delta-table/part-00001-50edffa4-06c3-4d70-bc8b-f67da8ef4195-c000.snappy.parquet
file:/tmp/delta-table/part-00009-dfd23cfe-e6c2-4001-ae42-8e964ec8f197-c000.snappy.parquet
S3:
s3a://export-bucket/delta_test_table2/part-00000-2c8c8389-e5a6-47f9-8394-6730c474357f-c000.snappy.parquet
s3a://export-bucket/delta_test_table2/part-00002-f897b11d-692a-427d-a1ca-9b15ef218d83-c000.snappy.parquet
...
s3a://export-bucket/delta_test_table2/part-00004-a4fc5555-6fa3-46da-a4f7-ccb6b3e8b8eb-c000.snappy.parquet