[Bug] pdf2parquet must calculate hash and size on the file

Open sujee opened this issue 1 year ago • 2 comments
Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Tools/ingest2parquet

What happened + What you expected to happen

I had duplicate documents (see attached) and expected exact copies of the same file to have the same size and hash. However, the hash appears to be calculated on 'contents', which is the actual content plus metadata (such as the file name).

I think the hash and size should be calculated on the actual file/document, not on metadata.

image

Expected Behaviour

  • hash should be identical for identical files
  • size should be physical file size in bytes
  • to avoid confusion, these columns can be renamed (or new columns can be created) with names like file_hash and file_size
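A minimal sketch of the expected behaviour, computing both values from the raw bytes of the file. The helper name `file_hash_and_size` and the choice of SHA-256 are assumptions made for illustration; the issue does not say which hash algorithm pdf2parquet uses. Only the column names `file_hash` and `file_size` come from the suggestion above.

```python
import hashlib


def file_hash_and_size(path, chunk_size=1 << 20):
    """Return the hash and byte size of the file at `path`.

    Hypothetical helper: SHA-256 is an assumed algorithm, not necessarily
    the one used by pdf2parquet.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large PDFs are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return {"file_hash": digest.hexdigest(), "file_size": size}
```

Because only the file bytes are read, two byte-identical copies stored under different names yield the same `file_hash` and `file_size`.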

Reproduction script

earth.pdf

Create a copy of the above file.

Execute the pdf2parquet section here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb

Anything else

No response

OS

Ubuntu

Python

3.11.x

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

sujee avatar Sep 20 '24 08:09 sujee

At the moment the hash column contains the hash of the actual contents column. The contents column holds the JSON representation of the output, which includes the property file-info.filename, so files with different names produce different contents and therefore different hashes.

Internally, the JSON has a property file-info.document-hash which is the actual hash of the binary input file.
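To illustrate the difference, here is a rough sketch and not the actual pdf2parquet code. The function `hash_of_contents`, the placeholder `text` field, and the use of SHA-256 are assumptions; only the property names `file-info.filename` and `file-info.document-hash` come from the description above.

```python
import hashlib
import json


def hash_of_contents(binary: bytes, filename: str) -> str:
    """Hash a JSON 'contents' string that embeds the filename.

    Hypothetical stand-in for the converted document; the real JSON
    has many more fields.
    """
    doc = {
        "file-info": {
            "filename": filename,
            # Hash of the binary input, independent of the filename.
            "document-hash": hashlib.sha256(binary).hexdigest(),
        },
        "text": "placeholder for the converted document body",
    }
    contents = json.dumps(doc, sort_keys=True)
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


pdf_bytes = b"same binary input"
same_binary_different_name = (
    hash_of_contents(pdf_bytes, "earth.pdf")
    != hash_of_contents(pdf_bytes, "earth-copy.pdf")
)
```

Here `same_binary_different_name` is true: the hash of the contents changes with the filename, while the embedded `document-hash` stays the same for both copies.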

It could indeed make sense to expose that one as well. Where? Should it be the document_id? Another field? Happy for an open discussion here.

dolfim-ibm avatar Sep 20 '24 14:09 dolfim-ibm

I do see document_hash in the contents.

I would like to see this propagated up as a top-level column in the output parquet, along with the actual file size.

image

sujee avatar Sep 20 '24 17:09 sujee

@dolfim-ibm with the new Docling integration, will this be addressed as well?

sujee avatar Oct 29 '24 04:10 sujee

Reading the thread again, there were some open questions about which field to expose and under which name. Exposing both is certainly a good idea, since they serve different purposes.

dolfim-ibm avatar Oct 29 '24 07:10 dolfim-ibm

Should be fixed in https://github.com/IBM/data-prep-kit/pull/756.

dolfim-ibm avatar Nov 01 '24 07:11 dolfim-ibm

@sujee Can you test and see if this can be closed?

Bytes-Explorer avatar Nov 06 '24 11:11 Bytes-Explorer

pdf2pq now blocked on #767

sujee avatar Nov 07 '24 06:11 sujee

please reopen if you need more help

agoyal26 avatar Mar 24 '25 08:03 agoyal26