Hynek Kydlíček
Hi, yes, it's possible. We have just added `UrlDedupSignature` for that. Something like this should do:
```
url_dedup_config = UrlDedupConfig(
    only_dedup_in_index=True,
)
INPUT_FOLDER_1 = "data/url_dedup/input_1.jsonl"
INPUT_FOLDER_2 = "data/url_dedup/input_2.jsonl"
FINDER_WORKERS =...
```
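In case it helps, here is a rough end-to-end sketch of the three dedup stages wired together with a local executor. The folder paths, reader setup, and executor settings below are assumptions for illustration only, so double-check the constructor arguments against the datatrove version you have installed:
```
# Sketch only: folder paths and executor settings are illustrative assumptions.
from datatrove.executor.local import LocalPipelineExecutor
from datatrove.pipeline.dedup.url_dedup import (
    UrlDedupConfig,
    UrlDedupFilter,
    UrlDedupSignature,
    UrlFindDedups,
)
from datatrove.pipeline.readers import JsonlReader
from datatrove.pipeline.writers.jsonl import JsonlWriter

url_dedup_config = UrlDedupConfig(only_dedup_in_index=True)

INPUT_FOLDER = "data/url_dedup/input"          # assumed: folder with the jsonl inputs
SIGNATURES_FOLDER = "data/url_dedup/sigs"      # assumed scratch folder
DUPLICATE_IDS_FOLDER = "data/url_dedup/dups"   # assumed scratch folder
OUTPUT_FOLDER = "data/url_dedup/output"        # assumed output folder

# Stage 1: hash the URL of every document into signature files.
stage_1 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(INPUT_FOLDER),
        UrlDedupSignature(output_folder=SIGNATURES_FOLDER, config=url_dedup_config),
    ],
)

# Stage 2: scan the signatures and record which document ids are duplicates.
stage_2 = LocalPipelineExecutor(
    pipeline=[
        UrlFindDedups(
            data_folder=SIGNATURES_FOLDER,
            output_folder=DUPLICATE_IDS_FOLDER,
            config=url_dedup_config,
        ),
    ],
)

# Stage 3: re-read the input, drop the flagged duplicates, write the rest.
stage_3 = LocalPipelineExecutor(
    pipeline=[
        JsonlReader(INPUT_FOLDER),
        UrlDedupFilter(data_folder=DUPLICATE_IDS_FOLDER, config=url_dedup_config),
        JsonlWriter(OUTPUT_FOLDER),
    ],
)

if __name__ == "__main__":
    stage_1.run()
    stage_2.run()
    stage_3.run()
```
The stages have to run in this order, since each one consumes the files produced by the previous one.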
Hi, I can't see it from the screenshot, but what's the value of `MAIN_OUTPUT_PATH`? The resulting files should be saved in `{MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}`, not in the logs folder.
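To make the distinction concrete, the writer step at the end of the pipeline is presumably what points at that path, roughly like this (the bucket and dump names below are assumptions, so substitute your own):
```
from datatrove.pipeline.writers.jsonl import JsonlWriter

# Assumed values for illustration only; replace with your own bucket and dump.
MAIN_OUTPUT_PATH = "s3://data-refine/base_processing"
DUMP_TO_PROCESS = "CC-MAIN-2023-50"

# The writer saves the processed documents here; the executor's logging_dir
# is a separate location and only receives logs and stats.
output_writer = JsonlWriter(f"{MAIN_OUTPUT_PATH}/base_processing/output/{DUMP_TO_PROCESS}")
```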
Then can you check s3://data-refine/base_processing/base_processing/output/ and see if it contains any folders?
Strange, so if you run `aws s3 ls s3://data-refine/base_processing//base_processing/output/` you get no results? (notice the double `//`)
Hey, we don't have a community forum as of right now. Could you send the logs you got, please? (not screenshots)
Ahh, okay, it seems like none of the files gets through extraction. Could you try increasing the timeout to 1 sec? See https://github.com/huggingface/datatrove/blob/main/src/datatrove/pipeline/extractors/trafilatura.py#L26
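Concretely, that means passing a larger `timeout` when you build the `Trafilatura` step; a minimal sketch, assuming the rest of your pipeline stays the same:
```
from datatrove.pipeline.extractors import Trafilatura

# Raise the per-document extraction timeout to 1 second (the linked line shows
# the default); all other arguments are left at their defaults here.
extractor = Trafilatura(timeout=1.0)
```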
I am good, thank you for asking :) It's not a Slurm problem. How did you install datatrove? From pip or from source? Can you run the following command...
Yeah, we haven't released on PyPI for a while, so we don't have a locked dependency for numpy. Can you try installing datatrove like this? (from source) `pip install...`
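(For reference, installing from source typically looks like `pip install git+https://github.com/huggingface/datatrove.git`; treat the exact URL/branch as an assumption and adjust as needed.)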
3.10+ should be fine
Hi, could you try processing more samples? 10k+? (by setting the `limit` parameter in the reader)
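For example, with a `JsonlReader` (the data path here is just a placeholder):
```
from datatrove.pipeline.readers import JsonlReader

# `limit` caps how many documents the reader yields; the path is illustrative.
reader = JsonlReader("data/my_dataset/", limit=10_000)
```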