DPK's Docling2parquet CSV ingestion
Search before asking
- [x] I searched the issues and found no similar issues.
Component
Python Runtime
What happened + What you expected to happen
I was testing CSV-to-parquet conversion with docling2parquet and realised that CSV files are not identified by the MIME type check, even though Docling has backend code and support for them. I saw a similar error in an issue raised by @shahrokhDaijavad, where I think the problem was with HTML ingestion and was probably solved by updating the MIME type. I am also updating the MIME logic and this notebook, which uses pandas for CSV conversion: examples/notebooks/hap/generate_hap_score_csv.ipynb
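To illustrate the kind of gap described above, here is a minimal sketch of an extension-to-MIME lookup in which CSV is missing. The table `SUPPORTED_MIME_TYPES` and the helper `detect_mime` are hypothetical names invented for this example; they are not the actual docling2parquet code, whose mapping may look quite different.

```python
from pathlib import Path

# Hypothetical allow-list for illustration only; the real mapping in
# docling2parquet may use different keys, values, or a different mechanism.
SUPPORTED_MIME_TYPES = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".md": "text/markdown",
}


def detect_mime(filename, table=SUPPORTED_MIME_TYPES):
    """Return the MIME type for a filename, or None if the extension is unknown."""
    return table.get(Path(filename).suffix.lower())


# Without a CSV entry, the lookup finds nothing and the file would be skipped.
before = detect_mime("input.csv")

# Adding a ".csv" entry is the kind of change a MIME fix would involve.
patched = {**SUPPORTED_MIME_TYPES, ".csv": "text/csv"}
after = detect_mime("input.csv", patched)

print(before, after)
```

The lookup lowercases the suffix, so `REPORT.CSV` would resolve the same way as `report.csv` once the entry exists.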
Reproduction script
.
Anything else
No response
OS
MacOS
Python
3.10
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
@ShiroYasha18 Despite the fact that I added the mime type fix in #1381 and tested that it works correctly, we will probably not merge #1381, for two reasons:
- Support for csv to parquet is a couple of lines of code using pandas
import pandas as pd
df = pd.read_csv("input.csv")
df.to_parquet("output.parquet")
- We are not saying anything about supporting csv files in the README file for docling2parquet in DPK.
So, unless docling is doing something special with csv files, this is not a bug.
Hello @shahrokhDaijavad sir, thank you so much for replying.
> - Support for csv to parquet is a couple of lines of code using pandas
Yes, that’s absolutely true. However, while exploring Docling’s code, I noticed that it has a separate backend for CSV ingestion: https://github.com/docling-project/docling/blob/main/docling/backend/csv_backend.py
From what I understand, this backend provides more structured control compared to plain pandas — especially around metadata handling, MIME typing, and uniform parsing through DoclingDocument. This kind of control and consistency seems aligned with Docling’s goal of supporting multiple formats through a common structure.
That made me curious — if Docling is already supporting CSVs at that level, would DPK follow the same path and treat CSVs as a first-class format, using the same abstraction?
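As a rough illustration of the difference being discussed, the sketch below parses a CSV into a single record that carries the table together with some metadata, rather than just a bare data frame. It uses only the Python standard library; the function `csv_to_record` and the record fields are assumptions made for this example and do not reflect Docling's CSV backend or the `DoclingDocument` schema.

```python
import csv
import io


def csv_to_record(name, text):
    """Parse CSV text into one record holding the table plus simple metadata.

    Hypothetical structure for illustration; not Docling's actual output.
    """
    rows = list(csv.reader(io.StringIO(text)))
    # Treat the first row as the header when the file is not empty.
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "filename": name,
        "mimetype": "text/csv",
        "num_rows": len(body),
        "columns": header,
        "rows": body,
    }


record = csv_to_record("input.csv", "id,name\n1,alpha\n2,beta\n")
print(record["columns"], record["num_rows"])
```

Whether this extra structure is worth having over the two-line pandas conversion is exactly the open question in this thread.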
> - We are not saying anything about supporting csv files in the README file for docling2parquet in DPK.
Totally fair. I just wanted to ask because DPK seems to rely on Docling for ingestion, and since Docling already supports .csv (and even lists it in its official docs), I was curious whether DPK plans to reflect that too. That's all!
ok, thanks, @ShiroYasha18. We will consider whether there is truly any advantage in the CSV backend of Docling. cc: @touma-I