Hynek Kydlíček
Hi, yes, it's possible. We have just added `UrlDedupSignature` for that. Something like this should do:
```
url_dedup_config = UrlDedupConfig(
    only_dedup_in_index=True,
)
INPUT_FOLDER_1 = "data/url_dedup/input_1.jsonl"
INPUT_FOLDER_2 = "data/url_dedup/input_2.jsonl"
FINDER_WORKERS =...
```
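In case it helps, here is a rough end-to-end sketch of the three dedup stages wired together with a local executor. The folder paths, reader setup, and executor settings below are assumptions for illustration only, so double-check the constructor arguments against the datatrove version you have installed:
```
# Sketch only: folder paths and executor settings are illustrative assumptions.
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup.url_dedup import (
    UrlDedupConfig,
    UrlDedupFilter,
    UrlDedupSignature,
    UrlFindDedups,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

url_dedup_config = UrlDedupConfig(only_dedup_in_index=True)

INPUT_FOLDER = "data/url_dedup/input"          # assumed: folder with the jsonl inputs
SIGNATURES_FOLDER = "data/url_dedup/sigs"      # assumed scratch folder
DUPLICATE_IDS_FOLDER = "data/url_dedup/dups"   # assumed scratch folder
OUTPUT_FOLDER = "data/url_dedup/output"        # assumed output folder

# Stage 1: hash the URL of every document into signature files.
stage_1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(INPUT_FOLDER),
        UrlDedupSignature(output_folder=SIGNATURES_FOLDER, config=url_dedup_config),
    ],
)

# Stage 2: scan the signatures and record which document ids are duplicates.
stage_2 = LocalPipelineExecutor(
    pipeline=[
        UrlFindDedups(
            data_folder=SIGNATURES_FOLDER,
            output_folder=DUPLICATE_IDS_FOLDER,
            config=url_dedup_config,
        ),
    ],
)

# Stage 3: re-read the input, drop the flagged duplicates, write the rest.
stage_3 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(INPUT_FOLDER),
        UrlDedupFilter(data_folder=DUPLICATE_IDS_FOLDER, config=url_dedup_config),
        JsonlWriter(OUTPUT_FOLDER),
    ],
)

if __name__ == "__main__":
    stage_1.run()
    stage_2.run()
    stage_3.run()
```
The stages have to run in this order, since each one consumes the files produced by the previous one.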
Hi, I can't see it from the screenshot, but what's the value of `MAIN_OUTPUT_PATH`? The resulting files should be saved in `{MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}`, not in the logs folder.
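To make the distinction concrete, the writer step at the end of the pipeline is presumably what points at that path, roughly like this (the bucket and dump names below are assumptions, so substitute your own):
```
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Assumed values for illustration only; replace with your own bucket and dump.
MAIN_OUTPUT_PATH = "s3://data-refine/base_processing"
DUMP_TO_PROCESS = "CC-MAIN-2023-50"

# The writer saves the processed documents here; the executor's logging_dir
# is a separate location and only receives logs and stats.
output_writer = JsonlWriter(f"{MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}")
```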
Then can you check s3://data-refine/base_processing/base_processing/output/ and see if it contains any folders?
Strange, so if you run `aws s3 ls s3://data-refine/base_processing//base_processing/output/` you get no results? (notice the double `//`)
Hey, we don't have a community forum as of right now. Could you send the logs you got, please? (not screenshots)
Ahh, okay, it seems like none of the files gets through extraction. Could you try increasing the timeout to 1 sec? See https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/extractors/trafilatura.py#L26
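Concretely, that means passing a larger `timeout` when you build the `Trafilatura` step; a minimal sketch, assuming the rest of your pipeline stays the same:
```
from datatrove.pipeline.extractors import Trafilatura

# Raise the per-document extraction timeout to 1 second (the linked line shows
# the default); all other arguments are left at their defaults here.
extractor = Trafilatura(timeout=1.0)
```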
I am good, thank you for asking :) It's not a Slurm problem. How did you install datatrove? From pip or from source? Can you run the following command...
Yeah, we haven't released on PyPI for a while, so we don't have a locked dependency for numpy. Can you try installing datatrove like this? (from source) `pip install...`
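(For reference, installing from source typically looks like `pip install git+https://github.com/huggingface/datatrove.git`; treat the exact URL/branch as an assumption and adjust as needed.)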
3.10+ should be fine
Hi, could you try processing more samples? 10k+? (by setting the `limit` parameter in the reader)
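For example, with a `JsonlReader` (the data path here is just a placeholder):
```
from datatrove.pipeline.readers import JsonlReader

# `limit` caps how many documents the reader yields; the path is illustrative.
reader = JsonlReader("data/my_dataset/", limit=10_000)
```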