Fuzzy Dedupe fail

Open shahrokhDaijavad opened this issue 5 months ago • 7 comments

Discussed in https://github.com/data-prep-kit/data-prep-kit/discussions/1351

Originally posted by CaiYiGitHub on July 1, 2025: I deployed data-prep-kit on Windows. Running Step-7: Fuzzy Dedupe in the example pdf-processe-1 always reports the following error. I don't know what causes it or how to handle it. (screenshot: 2025-07-01_181829)

shahrokhDaijavad avatar Jul 11 '25 16:07 shahrokhDaijavad

@MaryamZahiri, @ShiroYasha18, and @dobromiriiliev: does any of you work on a Windows machine and could look into this?

shahrokhDaijavad avatar Jul 11 '25 17:07 shahrokhDaijavad

Hi @shahrokhDaijavad, I have a Windows machine. Could you please give me the steps to reproduce? I am happy to submit a PR to close this issue. Thanks.

Raghav-Bell avatar Sep 28 '25 11:09 Raghav-Bell

Hi, @Raghav-Bell. Thanks. I will explain more tomorrow.

shahrokhDaijavad avatar Sep 28 '25 22:09 shahrokhDaijavad

Hi @shahrokhDaijavad, thanks for the quick discussion about this issue. I have reproduced it at step 4.1. It is a known issue on Windows with some Hugging Face models, in our case the installation of ds4sd--docling-layout-heron. The cause is that creating symlinks on Windows requires a privilege that the user does not hold by default, so the symlink step in the Hugging Face cache fails.
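To check for the same condition before running the notebook, a small probe can try to create a symlink and report whether it is allowed. This is a minimal sketch using only the Python standard library. The function name `can_create_symlinks` is hypothetical and is not part of data-prep-kit or huggingface_hub.

```python
import os
import tempfile


def can_create_symlinks(directory: str) -> bool:
    """Return True if the current user can create a symlink inside `directory`.

    Hypothetical helper for illustration; it probes in a scratch subdirectory
    that is removed afterwards.
    """
    with tempfile.TemporaryDirectory(dir=directory) as scratch:
        target = os.path.join(scratch, "target.txt")
        link = os.path.join(scratch, "link.txt")
        with open(target, "w", encoding="utf-8") as handle:
            handle.write("probe")
        try:
            # On Windows without the required privilege this raises
            # OSError with WinError 1314, as in the traceback in this thread.
            os.symlink("target.txt", link)
        except (OSError, NotImplementedError):
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    probe_dir = tempfile.gettempdir()
    print(f"symlinks allowed under {probe_dir}: {can_create_symlinks(probe_dir)}")
```

To probe the location that actually matters here, pass the drive or parent directory that holds the Hugging Face cache instead of the temp directory.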

https://github.com/data-prep-kit/data-prep-kit/blob/851a61f9fed7ff7785ccf730f11689d2908bae48/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_orchestrator.py#L132

The tracebacks follow:

ERROR - Exception creating transform  [WinError 1314] A required privilege is not held by the client: '..\\..\\blobs\\a6344aac8c09253b3b630fb776ae94478aa0275b' -> 'C:\\Users\\Test\\.cache\\huggingface\\hub\\models--ds4sd--docling-layout-heron\\snapshots\\bdb7099d742220552d703932cc0ce0a26a7a8da8\\.gitattributes'

  File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\huggingface_hub\file_download.py", line 1184, in _hf_hub_download_to_cache_dir
    _create_symlink(blob_path, pointer_path, new_blob=True)
  File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\huggingface_hub\file_download.py", line 734, in _create_symlink
    os.symlink(src_rel_or_abs, abs_dst)
OSError: [WinError 1314] A required privilege is not held by the client: '..\\..\\blobs\\a6344aac8c09253b3b630fb776ae94478aa0275b' -> 'C:\\Users\\Test\\.cache\\huggingface\\hub\\models--ds4sd--docling-layout-heron\\snapshots\\bdb7099d742220552d703932cc0ce0a26a7a8da8\\.gitattributes'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_orchestrator.py", line 111, in orchestrate
    _process_transforms(
  File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_orchestrator.py", line 183, in _process_transforms
    executor = PythonTransformFileProcessor(
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_file_processor.py", line 55, in __init__
    raise UnrecoverableException("failed creating transform")
data_processing.utils.unrecoverable.UnrecoverableException: failed creating transform
23:28:05 ERROR - Exception during execution failed creating transform: None
2025-09-29 23:28:05,000 - ERROR - Exception during execution failed creating transform: None
Traceback (most recent call last):
  File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_orchestrator.py", line 132, in orchestrate
    stats["processing_time"] = round(stats["processing_time"], 3)
                                     ~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'processing_time'
23:28:05 ERROR - Exception during execution 'processing_time': None
2025-09-29 23:28:05,004 - ERROR - Exception during execution 'processing_time': None
23:28:05 INFO - Completed execution in 3.004 min, execution result 1
2025-09-29 23:28:05,005 - INFO - Completed execution in 3.004 min, execution result 1
CPU times: total: 24.3 s
Wall time: 3min 14s
---------------------------------------------------------------------------
Exception                                 Traceback (most recent call last)
Cell In[8], line 1
----> 1 get_ipython().run_cell_magic('time', '', '\nfrom dpk_docling2parquet.transform_python import Docling2Parquet\nfrom dpk_docling2parquet.transform import docling2parquet_contents_types\n\nSTAGE = 1\nprint (f"🏃🏼 STAGE-{STAGE}: Processing input=\'{input_dir}\' --> output=\'{output_docling2pq_dir}\'\\n", flush=True)\n\nresult = Docling2Parquet(input_folder= input_dir,\n                    output_folder= output_docling2pq_dir,\n                    data_files_to_use=[\'.pdf\'],\n                    docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN,   # markdown\n                    ).transform()\n\nif result == 0:\n    print (f"✅ Stage:{STAGE} completed successfully")\nelse:\n    raise Exception (f"❌ Stage:{STAGE}  failed")\n')

OS: Windows 11
python: 3.11
data_prep_toolkit_transforms: 1.1.4

See also: docling-project/docling#961, pyannote/pyannote-audio#1473, and "OSError: [WinError 1314] when doing sentiment analysis using flair".
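The log above also shows a second error: after the transform fails to be created, the orchestrator raises `KeyError: 'processing_time'` at line 132, which can distract from the root cause. A minimal sketch of a guarded version of that rounding step follows. The function `finalize_stats` is hypothetical and is not the project's code or an agreed fix; it only illustrates rounding the value when it was actually recorded.

```python
def finalize_stats(stats: dict) -> dict:
    """Return a copy of `stats` with processing_time rounded, if present.

    Hypothetical helper: when the run failed before any timing was recorded,
    the key is missing and the stats are returned unchanged instead of
    raising KeyError.
    """
    finalized = dict(stats)
    if "processing_time" in finalized:
        finalized["processing_time"] = round(finalized["processing_time"], 3)
    return finalized


if __name__ == "__main__":
    print(finalize_stats({"processing_time": 1.23456}))
    print(finalize_stats({}))
```

With a guard like this, a failed run would surface only the original `UnrecoverableException` rather than a follow-on `KeyError`.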

Raghav-Bell avatar Sep 29 '25 18:09 Raghav-Bell

@Raghav-Bell Thanks for your investigation of this issue and finding the symlink problem in the Windows environment. Are there any workarounds to try?

shahrokhDaijavad avatar Sep 29 '25 18:09 shahrokhDaijavad

Hi @shahrokhDaijavad, according to the huggingface_hub documentation, the user should enable Developer Mode on Windows. Alternatively, and preferably, use Windows Subsystem for Linux (WSL2). Please check the following links for more details: "Manage huggingface_hub cache-system", huggingface/huggingface_hub#1062, huggingface/huggingface_hub#2284, and the Windows Subsystem for Linux documentation. Thanks.
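To tell which of the two setups a notebook is running in, the environment can be classified from `platform.uname()`. This is a rough sketch: the function name `classify_environment` is hypothetical, and detecting WSL by looking for "microsoft" in the kernel release string is a common heuristic, not an official API.

```python
import platform


def classify_environment(system: str, release: str) -> str:
    """Classify the runtime as 'native-windows', 'wsl', or 'other'.

    Hypothetical helper. `system` and `release` are the values reported by
    platform.uname(); WSL kernels usually include 'microsoft' in the release.
    """
    if system == "Windows":
        return "native-windows"
    if system == "Linux" and "microsoft" in release.lower():
        return "wsl"
    return "other"


if __name__ == "__main__":
    info = platform.uname()
    print(classify_environment(info.system, info.release))
```

On native Windows, the symlink step may still fail unless Developer Mode is enabled, as described above.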

Raghav-Bell avatar Sep 30 '25 06:09 Raghav-Bell

Thanks, @Raghav-Bell. We know that WSL works. We will come back to native Windows support if it becomes a priority for a class of users. For now, please see if you can find other issues that are easier to address.

shahrokhDaijavad avatar Sep 30 '25 21:09 shahrokhDaijavad