[Bug] Example notebook finding no input files in ededup

Open daw3rd opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Other

What happened + What you expected to happen

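Running the example notebook produces error messages during the ededup section saying there are no input files. I expected the notebook to find the input files and run ededup on them.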

Reproduction script

Download the zip from the data-prep-kit repo to /Users/dawood/Downloads/data-prep-kit-dev.zip, then run:

mkdir /tmp/example
cp data-prep-kit-dev.zip /tmp/example
git clone ...
cd data-prep-kit/examples
make venv
make jupyter

Edit the notebook to set:

zip_input_folder = "/tmp/example"

Run the notebook through the ededup section. The logged messages say there are no input files:

13:26:29 INFO - Running locally
13:26:29 INFO - exact dedup params are {'hash_cpu': 0.5, 'num_hashes': 2, 'doc_column': 'contents'}
13:26:29 INFO - data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out
13:26:29 INFO - data factory data_ max_files -1, n_sample -1
13:26:29 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.parquet']
13:26:29 INFO - number of workers 3 worker options {'num_cpus': 0.8}
13:26:29 INFO - pipeline id pipeline_id; number workers 3
13:26:29 INFO - job details {'job category': 'preprocessing', 'job name': 'ededup', 'job type': 'ray', 'job id': 'job_id'}
13:26:29 INFO - code location None
13:26:29 INFO - actor creation delay 0
2024-05-14 13:26:31,387	INFO worker.py:1715 -- Started a local Ray instance. View the dashboard at 127.0.0.1:8265 
(orchestrate pid=24207) 13:26:32 INFO - orchestrator started at 2024-05-14 13:26:32
(orchestrate pid=24207) 13:26:32 ERROR - No input files to process - exiting
13:26:42 INFO - Completed execution in 0.21104646523793538 min, execution result 0

Anything else

No response

OS

MacOS (limited support)

Python

3.10.x

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

daw3rd avatar May 14 '24 17:05 daw3rd

I do not think this is a bug. The input folder is not configured correctly in this case. The execution reports:

data factory data_ is using local data access: input_folder - test-data/parquet_input output_folder - test-data/ededup_out

whereas your input folder should be /tmp/example
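
One quick way to confirm this from a notebook cell is to resolve the configured folder and count the matching files. The sketch below is only an illustration: `describe_input_folder` is a hypothetical helper, not part of data-prep-kit, and it assumes the relative path shown in the log is resolved against the current working directory.

```python
from pathlib import Path


def describe_input_folder(folder: str, extension: str = ".parquet") -> dict:
    """Report where a (possibly relative) folder resolves to and what it holds.

    Hypothetical diagnostic helper; not a data-prep-kit API.
    """
    # Relative paths are resolved against the current working directory.
    path = Path(folder).expanduser().resolve()
    exists = path.is_dir()
    # Only look at files directly inside the folder with the given extension.
    files = sorted(path.glob(f"*{extension}")) if exists else []
    return {"resolved": str(path), "exists": exists, "n_files": len(files)}


# The folder name below is the one reported in the log above.
print(describe_input_folder("test-data/parquet_input"))
```

If `exists` is false or `n_files` is 0, the run has nothing to read from that folder.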

blublinsky avatar May 14 '24 18:05 blublinsky

@blublinsky Agreed, this is user error. However, we should expect users to make this sort of mistake and help them fix it.

daw3rd avatar May 17 '24 16:05 daw3rd

@daw3rd Agreed, but this is not a bug. We can request an enhancement for better error handling, but it should not be classified as a bug.
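
As a rough illustration of what such an enhancement might look like, the sketch below fails fast with a message that names the resolved folder and the file types actually found there. `find_input_files` and `NoInputFilesError` are hypothetical names, not data-prep-kit APIs, and the check is an assumption about where validation could happen, not a description of how the orchestrator works.

```python
from pathlib import Path


class NoInputFilesError(FileNotFoundError):
    """Hypothetical error type for a missing or empty input folder."""


def find_input_files(input_folder, extensions=(".parquet",)):
    """Return matching files, or raise an error that explains what went wrong.

    Illustrative sketch only; not part of data-prep-kit.
    """
    folder = Path(input_folder).expanduser().resolve()
    if not folder.is_dir():
        raise NoInputFilesError(
            f"Input folder {str(input_folder)!r} resolves to {folder}, "
            "which does not exist. Check the input folder setting."
        )
    # Search recursively for files with one of the expected extensions.
    files = sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix in extensions
    )
    if not files:
        # Tell the user what is in the folder instead of what was expected.
        found = sorted(
            {p.suffix or "<no extension>" for p in folder.iterdir() if p.is_file()}
        )
        seen = ", ".join(found) if found else "nothing (the folder is empty)"
        raise NoInputFilesError(
            f"No files with extensions {list(extensions)} in {folder}; "
            f"found: {seen}."
        )
    return files


# Usage: a misconfigured folder produces an actionable message.
try:
    find_input_files("this-folder-does-not-exist")
except NoInputFilesError as err:
    print(err)
```

A message of this kind would point directly at the configured folder, rather than leaving the user to spot it in the data factory log line.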

blublinsky avatar May 17 '24 17:05 blublinsky

@shivdeep-singh-ibm Has this been done?

Bytes-Explorer avatar Jun 06 '24 11:06 Bytes-Explorer

Closing this one since the notebooks have been redesigned and tested.

daw3rd avatar Sep 12 '24 16:09 daw3rd