[Bug] Web2Parquet DataAccessLocal Update needed

Open rajeshsirsikar-bq opened this issue 4 months ago • 2 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/web2parquet

What happened + What you expected to happen

In web2parquet's transform.py, this section:

    #############################################################################
    ## The same transform can also be used to store crawled files to local folder
    if self.folder:
        dao=DataAccessLocal(local_config={'output_folder':self.folder,'input_folder':'.'})
        for x in self.docs:
            dao.save_file(self.folder+'/'+x['filename'], x['contents'])

The DataAccessLocal constructor no longer accepts local_config; the parameter has been renamed to config.

This code should be modified accordingly.
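One possible shape for the fix, as a minimal sketch rather than the actual patch: a small helper that checks which keyword the constructor accepts and passes the settings under that name, so the call works before and after the rename. The helper name build_local_dao and the two stand-in classes are hypothetical and exist only to make the example runnable without data-prep-kit installed. It also assumes the renamed config parameter takes the same output_folder/input_folder keys as before, which this issue does not confirm.

```python
import inspect


def build_local_dao(dao_cls, folder):
    """Instantiate dao_cls with whichever settings keyword its constructor accepts.

    Hypothetical helper: in transform.py, dao_cls would be DataAccessLocal.
    """
    # Same keys as the existing call in transform.py (assumed unchanged by the rename).
    settings = {"output_folder": folder, "input_folder": "."}
    params = inspect.signature(dao_cls).parameters
    if "config" in params:
        return dao_cls(config=settings)
    if "local_config" in params:
        return dao_cls(local_config=settings)
    raise TypeError(f"{dao_cls.__name__} accepts neither 'config' nor 'local_config'")


# Stand-ins for the renamed and the original constructor (illustration only,
# not the real DataAccessLocal).
class NewStyleDao:
    def __init__(self, config=None):
        self.settings = config


class OldStyleDao:
    def __init__(self, local_config=None):
        self.settings = local_config


new_dao = build_local_dao(NewStyleDao, "dpk_input")
old_dao = build_local_dao(OldStyleDao, "dpk_input")
```

If only the current library version needs to be supported, the simpler change is to rename the keyword in the existing call from local_config to config and skip the signature check.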

Reproduction script

from dpk_web2parquet.transform import Web2Parquet
from utils.config import CONFIG
import os

Web2Parquet(
    urls=['https://thealliance.ai/'],
    folder='dpk_input',
    depth=1,
    downloads=1,
    mime_types=["text/html"]
).transform()

print("Web crawl completed. Downloaded %d files into '%s'" % (len(os.listdir(CONFIG.INPUT_DIR)), CONFIG.INPUT_DIR))

Anything else

No response

OS

MacOS

Python

3.12

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

rajeshsirsikar-bq, Aug 07 '25 20:08

Hi @rajeshsirsikar-bq, thanks for reporting this issue. Yes, it seems you are right:

https://github.com/data-prep-kit/data-prep-kit/blob/80dbab8830ca7ac7ab62131df1dba1ad487df97e/data-processing-lib/python/src/data_processing/data_access/data_access_local.py#L34
https://github.com/data-prep-kit/data-prep-kit/blob/80dbab8830ca7ac7ab62131df1dba1ad487df97e/transforms/universal/web2parquet/dpk_web2parquet/transform.py#L110-L114

Hi @shahrokhDaijavad, could you please assign this issue to me? Thanks.

Raghav-Bell, Oct 18 '25 17:10

@touma-I and @swith005, what do you think?

shahrokhDaijavad, Oct 18 '25 20:10