data-prep-kit
[Bug] pdf2parquet must calculate hash and size on the file
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Tools/ingest2parquet
What happened + What you expected to happen
I had duplicate documents (see attached). I expected exact duplicates of a file to have the same size and hash, but the hash seems to be calculated on 'contents', which is the actual content plus metadata (like the file name).
I think the hash and size should be calculated on the actual file/document, not on metadata.
Expected Behaviour
- hash should be identical for identical files
- size should be the physical file size in bytes
- to avoid confusion, these columns can be renamed (or new columns can be created) with names like file_hash and file_size
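A minimal sketch of the expected behaviour, using only the Python standard library. The helper name `file_hash_and_size` is hypothetical, and SHA-256 is an assumption (the issue does not say which hash algorithm pdf2parquet uses). The point is that the hash and size come from the raw bytes on disk, so a copy under a different name yields identical values:

```python
import hashlib
import os
import shutil
import tempfile


def file_hash_and_size(path, chunk_size=1 << 20):
    """Return (SHA-256 hex digest, size in bytes) of the raw file at `path`.

    Hypothetical helper; SHA-256 is an assumed choice of algorithm.
    """
    digest = hashlib.sha256()
    size = 0
    with open(path, "rb") as f:
        # Read in chunks so large documents are not loaded into memory at once.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
            size += len(chunk)
    return digest.hexdigest(), size


# Demo with placeholder bytes standing in for a real PDF.
payload = b"%PDF-1.4 placeholder bytes, not a real PDF"
with tempfile.TemporaryDirectory() as tmp:
    original = os.path.join(tmp, "doc.pdf")
    duplicate = os.path.join(tmp, "doc-copy.pdf")
    with open(original, "wb") as f:
        f.write(payload)
    shutil.copyfile(original, duplicate)

    original_info = file_hash_and_size(original)
    duplicate_info = file_hash_and_size(duplicate)

# Both files have different names but identical bytes.
print(original_info == duplicate_info)
```

Because the file name never enters the digest, renaming or copying a document does not change either value.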
Reproduction script
Create a copy of the above file.
Execute the pdf2parquet section here: https://github.com/sujee/data-prep-kit-examples/blob/main/dpk-intro/dpk_intro_1_python.ipynb
Anything else
No response
OS
Ubuntu
Python
3.11.x
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
At the moment the hash column contains the hash of the contents column. That column holds the JSON representation of the output, which includes the property file-info.filename, so files with different names produce different contents and therefore different hashes.
Internally, the JSON has a property file-info.document-hash which is the actual hash of the binary input file.
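To illustrate the difference, here is a small sketch. The JSON layout is a simplified, hypothetical stand-in for the real output (only the `file-info.filename` and `file-info.document-hash` properties come from this thread), and SHA-256 is an assumed algorithm:

```python
import hashlib
import json

# The same binary input, stored under two different file names.
raw_bytes = b"identical binary input"
document_hash = hashlib.sha256(raw_bytes).hexdigest()


def build_contents(filename):
    """Build a simplified, hypothetical version of the JSON contents."""
    return json.dumps(
        {
            "file-info": {"filename": filename, "document-hash": document_hash},
            "text": "same extracted text",
        },
        sort_keys=True,
    )


def contents_hash(contents):
    """Hash the serialized JSON, as the hash column is described as doing."""
    return hashlib.sha256(contents.encode("utf-8")).hexdigest()


contents_a = build_contents("report.pdf")
contents_b = build_contents("report-copy.pdf")

# The file name is part of the JSON, so the contents hashes differ...
hashes_differ = contents_hash(contents_a) != contents_hash(contents_b)

# ...while the embedded hash of the binary input is the same for both.
doc_hashes_match = (
    json.loads(contents_a)["file-info"]["document-hash"]
    == json.loads(contents_b)["file-info"]["document-hash"]
)

print(hashes_differ, doc_hashes_match)
```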
It could indeed make sense to expose that one as well. Where? Should it be the document_id? Another field? Happy for an open discussion here.
I do see document_hash in the contents.
I would like to see this propagated up as a top-level column in the output parquet, along with the actual file size.
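As a rough sketch of that request, the snippet below promotes the nested hash to a top-level field of a row. Rows are modelled as plain dicts rather than a real parquet table, and the helper name `promote_file_info`, the column names `file_hash` and `file_size`, and the exact JSON key spelling are all assumptions for illustration:

```python
import json


def promote_file_info(row, file_size):
    """Return a copy of `row` with file_hash and file_size as top-level fields.

    `row["contents"]` is assumed to be a JSON string with a
    file-info.document-hash property; `file_size` is the physical size in
    bytes, supplied by the caller (for example from os.path.getsize).
    """
    info = json.loads(row["contents"]).get("file-info", {})
    promoted = dict(row)  # leave the input row untouched
    promoted["file_hash"] = info.get("document-hash")
    promoted["file_size"] = file_size
    return promoted


# Hypothetical row; "abc123" is a placeholder, not a real digest.
row = {
    "filename": "report.pdf",
    "contents": json.dumps(
        {"file-info": {"filename": "report.pdf", "document-hash": "abc123"}}
    ),
}
result = promote_file_info(row, file_size=2048)
print(result["file_hash"], result["file_size"])
```

Keeping the existing columns and adding new ones avoids changing the meaning of hash and size for anyone already relying on them.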
@dolfim-ibm with the new Docling integration, will this be addressed as well?
Rereading the thread above, there were some open questions about which field to expose and under which names. Exposing both is certainly a good idea, since they serve different purposes.
Should be fixed in https://github.com/IBM/data-prep-kit/pull/756.
@sujee Can you test and see if this can be closed?
pdf2pq now blocked on #767
Please reopen if you need more help.