data-prep-kit [Bug] pdf2parquet must calculate hash and size on the file

Search before asking

[X] I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

I had duplicate documents (see attached) and expected the identical files to have the same size and hash. However, the hash appears to be calculated on 'contents', which is the actual content plus metadata (such as the file name).

I think the hash and size should be calculated on the actual file/document, not on the metadata.

Expected Behaviour

hash should be identical for identical files
size should be the physical file size in bytes
to avoid confusion, these columns can be renamed (or new columns can be created) with names like file_hash and file_size
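
The expected behaviour above can be sketched as follows. This is only an illustration: `file_hash_and_size` is a hypothetical helper, not part of data-prep-kit, and SHA-256 is an assumed algorithm since the report does not name one.

```python
import hashlib
import os


def file_hash_and_size(path, chunk_size=1 << 20):
    """Return (hex digest, size in bytes), both computed on the raw file.

    Hypothetical helper for illustration; SHA-256 is an assumption.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest(), os.path.getsize(path)
```

Because only the bytes on disk are read, a byte-for-byte copy stored under a different file name would yield the same pair of values.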

Reproduction script

earth.pdf

Create a copy of the above file.

Execute the pdf2parquet section here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

[ ] Yes I am willing to submit a PR!

Sep 20 '24 08:09 sujee

At the moment, the hash column contains the hash of the contents column. That column holds the JSON representation of the output, which includes the property file-info.filename, so identical files with different filenames produce different contents and therefore different hashes.

Internally, the JSON has a property file-info.document-hash which is the actual hash of the binary input file.
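
To make the difference concrete, here is a small sketch. The JSON is a simplified stand-in that contains only the two properties named above, and SHA-256 is an assumption; the real output and hash algorithm may differ.

```python
import hashlib
import json

# Stand-in for the bytes of the binary input file (same for both copies).
raw_bytes = b"same binary PDF bytes"
document_hash = hashlib.sha256(raw_bytes).hexdigest()


def contents_for(filename):
    # Simplified, hypothetical JSON output: only the two properties discussed.
    return json.dumps(
        {"file-info": {"filename": filename, "document-hash": document_hash}}
    )


def hash_of_contents(contents):
    # Hash of the serialized JSON, which includes the file name.
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


contents_a = contents_for("earth.pdf")
contents_b = contents_for("earth-copy.pdf")

# The hash of the contents differs, because the file name is part of the JSON.
print(hash_of_contents(contents_a) == hash_of_contents(contents_b))

# The embedded document hash is the same, because the input bytes are the same.
hash_in_a = json.loads(contents_a)["file-info"]["document-hash"]
hash_in_b = json.loads(contents_b)["file-info"]["document-hash"]
print(hash_in_a == hash_in_b)
```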

It could indeed make sense to expose that one as well. Where? Should it be the document_id? Another field? Happy for an open discussion here.

Sep 20 '24 14:09 dolfim-ibm

I do see document_hash in the contents.

I would like to see this propagated up as a top-level column in the output parquet, along with the actual file size.
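
Roughly what I have in mind, as a sketch only. The row is modelled as a plain dict, `promote_document_hash` is a hypothetical helper, and the column names `document_hash` and `file_size` are placeholders until the naming is agreed.

```python
import json


def promote_document_hash(row, file_size):
    """Return a copy of the row with the hash and size as top-level fields.

    Hypothetical helper: `row` is a dict with a JSON string under "contents";
    `file_size` is the physical size of the input file in bytes.
    """
    file_info = json.loads(row["contents"])["file-info"]
    return {
        **row,
        "document_hash": file_info["document-hash"],  # placeholder column name
        "file_size": file_size,  # placeholder column name
    }
```

The original row is left untouched, so the existing hash of the contents column stays available alongside the new fields.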

Sep 20 '24 17:09 sujee

@dolfim-ibm with the new Docling integration, will this be addressed as well?

Oct 29 '24 04:10 sujee

Reading the thread again, there were some open questions about which field to expose and under which names. Exposing both is certainly a good idea, since they serve different purposes.

Oct 29 '24 07:10 dolfim-ibm

Should be fixed in https://github.com/IBM/data-prep-kit/pull/756.

Nov 01 '24 07:11 dolfim-ibm

@sujee Can you test and see if this can be closed?

Nov 06 '24 11:11 Bytes-Explorer

pdf2pq now blocked on #767

Nov 07 '24 06:11 sujee

please reopen if you need more help

Mar 24 '25 08:03 agoyal26
