bug/Discrepancy between CLI and Python Runner for Box to Azure Cognitive Search Ingestion
Describe the bug
There is a discrepancy between the CLI and Python when using the download_dir parameter in unstructured-ingest when running Box -> Azure Cognitive Search. The CLI correctly downloads files to the specified directory, while the Python implementation attempts to write files to the root directory, resulting in a "Read-only file system" error.
To Reproduce
- CLI (working):

    unstructured-ingest box \
      --box-app-config box_config_test.json \
      --remote-url box://12345 \
      --work-dir ./unstructured/ \
      --output-dir ./unstructured/ \
      --download-dir ./unstructured/ \
      --num-processes 1 \
      --raise-on-error \
      --verbose \
      --recursive \
      --re-download
- Python Runner (throws an error):

    from unstructured.ingest.connector.box import BoxAccessConfig, SimpleBoxConfig
    from unstructured.ingest.interfaces import (
        PartitionConfig,
        ProcessorConfig,
        ReadConfig,
    )
    from unstructured.ingest.runner import BoxRunner

    runner = BoxRunner(
        processor_config=ProcessorConfig(
            work_dir="./unstructured/",
            verbose=True,
            raise_on_error=True,
            output_dir="./unstructured/",
            num_processes=1,
        ),
        read_config=ReadConfig(
            download_dir="./unstructured/",
            re_download=True,
        ),
        partition_config=PartitionConfig(),
        connector_config=SimpleBoxConfig(
            remote_url="box://12345",
            recursive=True,
            access_config=BoxAccessConfig(
                box_app_config="./box_config_test.json",
            ),
        ),
    )
    runner.run()
Error message: "unstructured.ingest.error.SourceConnectionError: Error in getting data from upstream data source: [Errno 30] Read-only file system: '/{here is the folder as in the box itself}'"
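The path in the error starts at the filesystem root, which is what you would see if a remote path with a leading slash were joined onto download_dir. This is only a guess at the cause and has not been checked against the unstructured source. The sketch below just shows the standard-library behavior; `remote_path` and both helper functions are hypothetical and are not part of unstructured.

```python
import posixpath

download_dir = "./unstructured/"
# Hypothetical remote path, standing in for the Box folder path in the error.
remote_path = "/some_box_folder/report.pdf"


def join_naive(base: str, remote: str) -> str:
    # posixpath.join discards `base` when `remote` is absolute.
    return posixpath.join(base, remote)


def join_under(base: str, remote: str) -> str:
    # Stripping leading slashes keeps the result under `base`.
    return posixpath.join(base, remote.lstrip("/"))


naive = join_naive(download_dir, remote_path)
safe = join_under(download_dir, remote_path)
print(naive)
print(safe)
```

If the runner builds its local path the first way, the write would target the root directory regardless of download_dir, which would match the "Read-only file system" error.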
Expected behavior
The Python implementation should respect the download_dir parameter in the ReadConfig and download files to the specified directory, just like the CLI does.
Environment Info
unstructured: 0.14.9 (issue also present in version 0.12.x)