
bug/Discrepancy between CLI and Python Runner for Box to Azure Cognitive Search Ingestion

Open ron-unstructured opened this issue 7 months ago • 0 comments

Describe the bug

There is a discrepancy between the CLI and the Python runner in how unstructured-ingest handles the download_dir parameter when running Box -> Azure Cognitive Search. The CLI correctly downloads files to the specified directory, while the Python runner attempts to write files to the root directory, resulting in a "Read-only file system" error.

To Reproduce

  • CLI (working):

      unstructured-ingest box \
        --box-app-config box_config_test.json \
        --remote-url box://12345 \
        --work-dir ./unstructured/ \
        --output-dir ./unstructured/ \
        --download-dir ./unstructured/ \
        --num-processes 1 \
        --raise-on-error \
        --verbose \
        --recursive \
        --re-download

  • Python Runner (throws an error):

      runner = BoxRunner(
          processor_config=ProcessorConfig(
              work_dir="./unstructured/",
              verbose=True,
              raise_on_error=True,
              output_dir="./unstructured/",
              num_processes=1,
          ),
          read_config=ReadConfig(
              download_dir="./unstructured/",
              re_download=True,
          ),
          partition_config=PartitionConfig(),
          connector_config=SimpleBoxConfig(
              remote_url="box://12345",
              recursive=True,
              access_config=BoxAccessConfig(
                  box_app_config="./box_config_test.json",
              ),
          ),
      )
      runner.run()

Error message: "unstructured.ingest.error.SourceConnectionError: Error in getting data from upstream data source: [Errno 30] Read-only file system: '/{here is the folder as in the box itself}'"

Expected behavior

The Python implementation should respect the download_dir parameter in the ReadConfig and download files to the specified directory, just like the CLI does.
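
The path in the error message starts at the filesystem root, which would be consistent with a download directory being joined to a remote path that begins with "/". This is only a guess about the cause and has not been checked against the unstructured source. The sketch below uses nothing but the standard library to show how such a join discards the base directory; naive_join, safe_join, and the example folder name are hypothetical and are not part of unstructured.

```python
from pathlib import PurePosixPath


def naive_join(download_dir: str, remote_path: str) -> PurePosixPath:
    # Joining onto an absolute path discards everything to its left,
    # so download_dir is silently dropped when remote_path starts with "/".
    return PurePosixPath(download_dir) / remote_path


def safe_join(download_dir: str, remote_path: str) -> PurePosixPath:
    # Stripping the leading "/" keeps the result under download_dir.
    return PurePosixPath(download_dir) / remote_path.lstrip("/")


# Hypothetical remote path, standing in for a folder path reported by Box.
remote = "/some-box-folder/file.pdf"

print(naive_join("./unstructured/", remote))  # /some-box-folder/file.pdf
print(safe_join("./unstructured/", remote))   # unstructured/some-box-folder/file.pdf
```

If the runner builds its local path this way, writing to the first result would fail on any system where the root directory is read-only, which matches the [Errno 30] seen here.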

Environment Info

unstructured: 0.14.9 (issue also present in version 0.12.x)

ron-unstructured, Jul 02 '24 18:07