Fuzzy Dedupe fail
Discussed in https://github.com/data-prep-kit/data-prep-kit/discussions/1351
Originally posted by CaiYiGitHub July 1, 2025
I deployed Data Prep Kit on Windows. Running Step-7: Fuzzy Dedupe in the example pdf-processe-1 always reports the following error. I don't know whether this is a known problem or how to handle it.
@MaryamZahiri, @ShiroYasha18 and @dobromiriiliev, does any of you work on a Windows machine and can look into this?
Hi @shahrokhDaijavad, I have a Windows machine. Could you please share the steps to reproduce? I am happy to submit a PR to close this issue. Thanks.
Hi, @Raghav-Bell. Thanks. I will explain more tomorrow.
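Hi @shahrokhDaijavad
Thanks for the quick discussion about this issue.
I have reproduced this issue at step 4.1. It is a known issue with some Hugging Face models on Windows; in our case it occurs during the installation of ds4sd--docling-layout-heron.
The cause is that creating symlinks is not permitted by default on Windows: huggingface_hub calls os.symlink, which fails with WinError 1314 when the process lacks the required privilege.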
https://github.com/data-prep-kit/data-prep-kit/blob/851a61f9fed7ff7785ccf730f11689d2908bae48/data-processing-lib/python/src/data_processing/runtime/pure_python/transform_orchestrator.py#L132
The tracebacks follow:
ERROR - Exception creating transform [WinError 1314] A required privilege is not held by the client: '..\\..\\blobs\\a6344aac8c09253b3b630fb776ae94478aa0275b' -> 'C:\\Users\\Test\\.cache\\huggingface\\hub\\models--ds4sd--docling-layout-heron\\snapshots\\bdb7099d742220552d703932cc0ce0a26a7a8da8\\.gitattributes'
File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\huggingface_hub\file_download.py", line 1184, in _hf_hub_download_to_cache_dir
_create_symlink(blob_path, pointer_path, new_blob=True)
File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\huggingface_hub\file_download.py", line 734, in _create_symlink
os.symlink(src_rel_or_abs, abs_dst)
OSError: [WinError 1314] A required privilege is not held by the client: '..\\..\\blobs\\a6344aac8c09253b3b630fb776ae94478aa0275b' -> 'C:\\Users\\Test\\.cache\\huggingface\\hub\\models--ds4sd--docling-layout-heron\\snapshots\\bdb7099d742220552d703932cc0ce0a26a7a8da8\\.gitattributes'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_orchestrator.py", line 111, in orchestrate
_process_transforms(
File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_orchestrator.py", line 183, in _process_transforms
executor = PythonTransformFileProcessor(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_file_processor.py", line 55, in __init__
raise UnrecoverableException("failed creating transform")
data_processing.utils.unrecoverable.UnrecoverableException: failed creating transform
23:28:05 ERROR - Exception during execution failed creating transform: None
2025-09-29 23:28:05,000 - ERROR - Exception during execution failed creating transform: None
Traceback (most recent call last):
File "C:\Users\Test\anaconda3\envs\data-prep-kit-1\Lib\site-packages\data_processing\runtime\pure_python\transform_orchestrator.py", line 132, in orchestrate
stats["processing_time"] = round(stats["processing_time"], 3)
~~~~~^^^^^^^^^^^^^^^^^^^
KeyError: 'processing_time'
23:28:05 ERROR - Exception during execution 'processing_time': None
2025-09-29 23:28:05,004 - ERROR - Exception during execution 'processing_time': None
23:28:05 INFO - Completed execution in 3.004 min, execution result 1
2025-09-29 23:28:05,005 - INFO - Completed execution in 3.004 min, execution result 1
CPU times: total: 24.3 s
Wall time: 3min 14s
---------------------------------------------------------------------------
Exception Traceback (most recent call last)
Cell In[8], line 1
----> 1 get_ipython().run_cell_magic('time', '', '\nfrom dpk_docling2parquet.transform_python import Docling2Parquet\nfrom dpk_docling2parquet.transform import docling2parquet_contents_types\n\nSTAGE = 1\nprint (f"🏃🏼 STAGE-{STAGE}: Processing input=\'{input_dir}\' --> output=\'{output_docling2pq_dir}\'\\n", flush=True)\n\nresult = Docling2Parquet(input_folder= input_dir,\n output_folder= output_docling2pq_dir,\n data_files_to_use=[\'.pdf\'],\n docling2parquet_contents_type=docling2parquet_contents_types.MARKDOWN, # markdown\n ).transform()\n\nif result == 0:\n print (f"✅ Stage:{STAGE} completed successfully")\nelse:\n raise Exception (f"❌ Stage:{STAGE} failed")\n')
OS: Windows 11
Python: 3.11
data_prep_toolkit_transforms: 1.1.4
See also: docling-project/docling#961, pyannote/pyannote-audio#1473, and "OSError: [WinError 1314] when doing sentiment analysis using flair".
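A side note on the second traceback: the KeyError on 'processing_time' is raised only because the transform was never created, so that key was never recorded, and the secondary error makes the log harder to read. Below is a minimal sketch of a guard that would avoid it. `finalize_stats` is a hypothetical helper written for illustration; it is not part of data-prep-kit and not necessarily how the maintainers would fix the orchestrator.

```python
def finalize_stats(stats):
    """Return a copy of stats with 'processing_time' rounded, if it was recorded.

    Hypothetical helper for illustration only; not a data-prep-kit API.
    """
    result = dict(stats)
    # Round only when the key exists, so a failed run (no timing recorded)
    # does not raise a KeyError on top of the original exception.
    if "processing_time" in result:
        result["processing_time"] = round(result["processing_time"], 3)
    return result


if __name__ == "__main__":
    print(finalize_stats({"processing_time": 1.23456}))
    print(finalize_stats({}))  # failed run: no timing was recorded
```

With a guard like this, the log would show only the original WinError 1314 and the UnrecoverableException, which are the errors that matter here.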
@Raghav-Bell Thanks for your investigation of this issue and finding the symlink problem in the Windows environment. Are there any workarounds to try?
Hi @shahrokhDaijavad
According to the huggingface_hub documentation, the user should enable Developer Mode on Windows.
Alternatively, and preferably, use Windows Subsystem for Linux (WSL2).
Please check the following links for more details:
Manage huggingface_hub cache-system
huggingface/huggingface_hub#1062
huggingface/huggingface_hub#2284
Windows Subsystem for Linux Documentation
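To check whether a given environment is affected (for example, after enabling Developer Mode), a quick probe like the one below may help. It is a minimal sketch using only the Python standard library; `can_create_symlinks` is a name chosen here for illustration and is not a huggingface_hub or data-prep-kit function. It attempts the same kind of relative os.symlink call that fails in the traceback.

```python
import os
import tempfile


def can_create_symlinks(base_dir=None):
    """Return True if this process can create a symlink under base_dir.

    base_dir must already exist; None means the system temp directory.
    """
    with tempfile.TemporaryDirectory(dir=base_dir) as tmp:
        target = os.path.join(tmp, "target.txt")
        link = os.path.join(tmp, "link.txt")
        with open(target, "w") as fh:
            fh.write("probe")
        try:
            # Relative link target, similar to the '..\\..\\blobs\\...' link
            # in the traceback. On Windows this raises OSError (WinError 1314)
            # when the process lacks the symlink privilege.
            os.symlink("target.txt", link)
        except (OSError, NotImplementedError):
            return False
        return os.path.islink(link)


if __name__ == "__main__":
    print("symlinks supported:", can_create_symlinks())
```

If this reports that symlinks are not supported on the Windows machine, the model download in step 4.1 can be expected to hit the same WinError 1314. Passing the Hugging Face cache directory as `base_dir` would test the drive where the cache actually lives, assuming that directory already exists.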
Thanks
Thanks, @Raghav-Bell. We know that WSL would work, but for Native Windows, if it becomes a priority for a class of users, we will come back to this. For now, please see if you can find other issues that are easier to address.