Added support for ASR
Why are these changes needed?
Bridges Docling's ASR (Automatic Speech Recognition) feature into DPK. Currently supported models: WHISPER_TINY, WHISPER_SMALL, WHISPER_MEDIUM, WHISPER_BASE, WHISPER_LARGE, WHISPER_TURBO.
These are all the ASR models that Docling supports as of 3/07/2025.
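For illustration only, here is a minimal sketch of how a user-supplied model name could be checked against the list above. It is not the code in this PR: the helper `resolve_asr_model` and the normalization rules (case-insensitive, optional `WHISPER_` prefix, `-` treated as `_`) are assumptions made for this example.

```python
# Model names copied from the list in this PR description.
SUPPORTED_ASR_MODELS = (
    "WHISPER_TINY",
    "WHISPER_SMALL",
    "WHISPER_MEDIUM",
    "WHISPER_BASE",
    "WHISPER_LARGE",
    "WHISPER_TURBO",
)


def resolve_asr_model(name: str) -> str:
    """Hypothetical helper: normalize a model name and validate it.

    Accepts forms such as "tiny", "whisper-tiny", or "WHISPER_TINY" and
    returns the canonical upper-case name, or raises ValueError.
    """
    normalized = name.strip().upper().replace("-", "_")
    if not normalized.startswith("WHISPER_"):
        normalized = f"WHISPER_{normalized}"
    if normalized not in SUPPORTED_ASR_MODELS:
        raise ValueError(
            f"Unsupported ASR model {name!r}; "
            f"expected one of: {', '.join(SUPPORTED_ASR_MODELS)}"
        )
    return normalized
```

Failing early with the full list of accepted names keeps a typo in a configuration value from surfacing later as an opaque model-loading error.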
Related issue number (if any).
Related to issue #1346
Hi, do I need to add a test file, a sample data file, or anything else for this?
@ShiroYasha18 Any change in docling2parquet requires generating updated "expected" files, for the test-src to pass. Please run make generate-expected (for reference, see https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/language/docling2parquet/Makefile). If you are adding ASR files for testing in the test-data/input directory, the corresponding expected output files are needed.
@shahrokhDaijavad What is the use case for this?
Just to be up-to-date with the latest capabilities of Docling.
@shahrokhDaijavad: No need to be up-to-date. Let's discuss. I need a viable use case before we can proceed.
Hi @touma-I
Thanks for the feedback!
The idea behind integrating ASR (Automatic Speech Recognition) support is to allow docling2parquet to transcribe and process audio data directly. This is especially useful in projects dealing with oral histories, interviews, podcasts, or user feedback recordings. These types of datasets are becoming increasingly common in research and user experience domains.
This integration makes the DPK pipeline compatible with speech data workflows, enabling users to extract structured insights from spoken content with minimal setup. It aligns with Docling's existing support and helps bridge that capability into DPK for broader utility.
Example real-world use case: Companies often conduct video calls (e.g., via Zoom or Google Meet) with users. These are saved as .mp4 files. This ASR integration allows automatic transcription and ingestion of those interviews within the data-prep-kit pipeline.
@ShiroYasha18 Sure. It is good for DPK to keep up with the latest Docling capabilities, but for us, it only makes sense to add ASR features when there is a specific use case (or client need) in which the processing of sound files is followed by one or more DPK transforms in a real application recipe, either in pre-training or post-training LLM applications. As soon as we can find such a use case, we can come back to this PR.