
DPK's Docling2parquet CSV ingestion

Open ShiroYasha18 opened this issue 5 months ago • 3 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

Python Runtime

What happened + What you expected to happen

I was testing CSV-to-parquet conversion with docling2parquet and realised that CSV files are not identified by the MIME type check, even though Docling has backend code and support for them. I saw a similar error in an issue raised by @shahrokhDaijavad, where I think the problem was with HTML ingestion and was probably solved by updating the MIME type. I am also updating the MIME logic and examples/notebooks/hap/generate_hap_score_csv.ipynb, in which pandas is used for the CSV conversion.
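To make the kind of gap described above concrete, here is a minimal sketch of an extension-based MIME lookup. It is not DPK's actual code: the table `EXTENSION_TO_MIME` and the helper `guess_mime` are hypothetical names used only for illustration, and the real mapping in docling2parquet may be structured differently.

```python
from pathlib import Path
from typing import Optional

# Hypothetical extension-to-MIME table (an assumption, not DPK's real mapping).
EXTENSION_TO_MIME = {
    ".pdf": "application/pdf",
    ".html": "text/html",
    ".csv": "text/csv",  # without an entry like this, CSV files go unrecognised
}


def guess_mime(path: str) -> Optional[str]:
    """Return the MIME type for a path based on its suffix, or None if unknown."""
    suffix = Path(path).suffix.lower()  # normalise so ".CSV" matches ".csv"
    return EXTENSION_TO_MIME.get(suffix)


print(guess_mime("input.csv"))
print(guess_mime("notes.xyz"))
```

If the `.csv` entry is missing from a table like this, the lookup returns `None` and the file is skipped before any backend gets a chance to parse it.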

Reproduction script

.

Anything else

No response

OS

MacOS

Python

3.10

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

ShiroYasha18 avatar Jul 14 '25 11:07 ShiroYasha18

@ShiroYasha18 Despite the fact that I added the mime type fix in #1381 and tested that it works correctly, we will probably not merge #1381, for two reasons:

  1. Support for csv to parquet is a couple of lines of code using pandas:
     import pandas as pd
     # Read the CSV into a DataFrame, then write it out as parquet
     # (to_parquet needs pyarrow or fastparquet installed).
     df = pd.read_csv("input.csv")
     df.to_parquet("output.parquet")
  2. We are not saying anything about supporting csv files in the README file for docling2parquet in DPK.

So, unless docling is doing something special with csv files, this is not a bug.

shahrokhDaijavad avatar Jul 14 '25 23:07 shahrokhDaijavad

Hello @shahrokhDaijavad sir, thank you so much for replying.

  1. Support for csv to parquet is a couple of lines of code using pandas

Yes, that’s absolutely true. However, while exploring Docling’s code, I noticed that it has a separate backend for CSV ingestion: https://github.com/docling-project/docling/blob/main/docling/backend/csv_backend.py

From what I understand, this backend provides more structured control compared to plain pandas — especially around metadata handling, MIME typing, and uniform parsing through DoclingDocument. This kind of control and consistency seems aligned with Docling’s goal of supporting multiple formats through a common structure.

That made me curious — if Docling is already supporting CSVs at that level, would DPK follow the same path and treat CSVs as a first-class format, using the same abstraction?
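As a rough illustration of the difference being described, the sketch below wraps parsed CSV content in a record that carries its origin and MIME type alongside the table. This is not Docling's API or the DoclingDocument schema: `csv_to_record` and the field names are hypothetical, and only the standard-library `csv` module is used.

```python
import csv
import io


def csv_to_record(filename: str, text: str) -> dict:
    """Parse CSV text into a hypothetical structured record with metadata."""
    rows = list(csv.reader(io.StringIO(text)))
    # Treat the first row as the header; an empty file yields an empty table.
    header, body = (rows[0], rows[1:]) if rows else ([], [])
    return {
        "origin": {"filename": filename, "mimetype": "text/csv"},
        "table": {"columns": header, "rows": body, "num_rows": len(body)},
    }


record = csv_to_record("input.csv", "name,score\nalice,0.1\nbob,0.9\n")
print(record["origin"]["mimetype"], record["table"]["num_rows"])
```

A plain `pd.read_csv` call gives only the table; a wrapper of this kind keeps the same envelope for every input format, which is the consistency argument made above.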

  2. We are not saying anything about supporting csv files in the README file for docling2parquet in DPK.

Totally fair. I just wanted to ask because DPK seems to rely on Docling for ingestion, and since Docling already supports .csv (and even lists it in its official docs), I was curious whether DPK plans to reflect that too. That's all!

ShiroYasha18 avatar Jul 14 '25 23:07 ShiroYasha18

ok, thanks, @ShiroYasha18. We will consider whether there is truly any advantage in the CSV backend of Docling. cc: @touma-I

shahrokhDaijavad avatar Jul 15 '25 00:07 shahrokhDaijavad