data-prep-kit
[Bug] Example notebook finding no input files in ededup
### Search before asking
- [X] I searched the issues and found no similar issues.
### Component
Other
### What happened + What you expected to happen
The ededup section of the example notebook logs an error saying there are no input files to process.
### Reproduction script
Download zip from data-prep-kit repo into /Users/dawood/Downloads/data-prep-kit-dev.zip
mkdir /tmp/example
cp data-prep-kit-dev.zip /tmp/example
git clone ...
cd data-prep-kit/examples
make venv
make jupyter
Edit the notebook to set:
zip_input_folder = "/tmp/example"
Run the notebook through the ededup section; the logged messages say there are no input files:
13:26:29 INFO - Running locally
13:26:29 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
13:26:29 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
13:26:29 INFO - data factory data_ max_files -1, n_sample -1
13:26:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
13:26:29 INFO - number of workers 3 worker options {'num_cpus': 0.8}
13:26:29 INFO - pipeline id pipeline_id; number workers 3
13:26:29 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:26:29 INFO - code location None
13:26:29 INFO - actor creation delay 0
2024-05-14 13:26:31,387 INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265
(orchestrate pid=24207) 13:26:32 INFO - orchestrator started at 2024-05-14 13:26:32
(orchestrate pid=24207) 13:26:32 ERROR - No input files to process - exiting
13:26:42 INFO - Completed execution in 0.21104646523793538 min, execution result 0
### Anything else
_No response_
### OS
MacOS (limited support)
### Python
3.10.x
### Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
I do not think this is a bug. The input folder is not configured correctly in this case. The execution is using:
data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
whereas your input folder should be /tmp/example.
@blublinsky Agreed, this is user error. However, we should expect users to make this sort of mistake and help them fix it.
@daw3rd Agreed, but this is not a bug. We can request an enhancement for better error handling, but it should not be classified as a bug.
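For reference, a minimal sketch of the kind of pre-flight check such an enhancement could add, using only the Python standard library. The helper name `check_input_folder` and the message wording are hypothetical and not part of data-prep-kit. The idea is to report the resolved folder and the expected extensions instead of only "No input files to process".

```python
from pathlib import Path


def check_input_folder(input_folder, extensions=(".parquet",)):
    """Return the matching input files, or raise with an actionable message.

    Hypothetical helper for illustration; not part of data-prep-kit.
    """
    # Resolve relative paths so the message shows where the code actually looked.
    folder = Path(input_folder).resolve()
    if not folder.is_dir():
        raise FileNotFoundError(
            f"input_folder '{input_folder}' resolves to '{folder}', "
            "which is not an existing directory"
        )
    # Collect files (recursively) whose suffix is one of the expected extensions.
    files = sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix in extensions
    )
    if not files:
        raise FileNotFoundError(
            f"no files with extensions {list(extensions)} found under '{folder}' "
            f"(configured input_folder: '{input_folder}')"
        )
    return files
```

Called with the folder from the log above, for example `check_input_folder("test-data/parquet_input")`, a check like this would raise before any workers start and name the directory that was searched, which makes a misconfigured input folder easier to spot.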
@shivdeep-singh-ibm Has this been done?
Closing this one since the notebooks have been redesigned and tested.