# Verify file integrity of downloaded files by hash sum
Verify the integrity of downloaded files by checking their hash values. Mentioned in a call by @atrisovic.
### Prepare
- [x] check Python hash implementation
  - [x] md5
  - [x] sha-1
  - [x] sha-256
  - [x] sha-512
- [x] check what has to be hashed from the response: can `resp.content` be hashed directly, or does the response need to be saved to a temporary file before hashing? -> `requests.Response.content` can be hashed directly (see the streaming sketch below)
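One nuance worth noting: `resp.content` buffers the whole file in memory, so hashing it directly is fine for typical datafiles but not for very large ones. Below is a minimal sketch of the streaming alternative, using `requests` directly against Dataverse's `/api/access/datafile/{id}` endpoint; the base URL and file ID are just the example values from this issue.

```python
import hashlib

import requests

BASE_URL = "https://data.aussda.at"  # example installation from this issue
FILE_ID = 3702                       # example file ID from this issue

# Stream the response so the whole file never sits in memory at once;
# feed each chunk into the hash as it arrives.
m = hashlib.md5()
with requests.get(f"{BASE_URL}/api/access/datafile/{FILE_ID}", stream=True) as resp:
    resp.raise_for_status()
    for chunk in resp.iter_content(chunk_size=8192):
        m.update(chunk)

print(m.hexdigest())
```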
### Implementation
- [ ] Write tests (see the pytest sketch after this list)
- [ ] add an argument to enable/disable checksum verification
- [ ] add an argument to pass the checksum algorithm to be used: default = MD5; others = SHA-1, SHA-256, or SHA-512 (see the helper sketch after this list)
- [ ] Update code: `get_datafile()`

  ```python
  import hashlib

  from pyDataverse.api import NativeApi

  api = NativeApi("https://data.aussda.at")
  resp = api.get_datafile(3702)

  # Pick the digest that matches the file's published checksum type.
  m = hashlib.md5()
  # m = hashlib.sha1()
  # m = hashlib.sha256()
  # m = hashlib.sha512()
  m.update(resp.content)
  print(m.hexdigest())
  ```
- [ ] Update Docs
- [ ] Update Docstrings
- [ ] Run pytest
- [ ] Run tox
- [ ] Run pylint
- [ ] Run mypy
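Here is a hedged sketch of how the enable/disable and algorithm arguments above could fit together. `get_datafile_checksum` and its signature are hypothetical, not part of pyDataverse; `hashlib.new()` maps the algorithm name straight to the right constructor, so no if/elif chain over the four algorithms is needed.

```python
import hashlib

from pyDataverse.api import NativeApi


def get_datafile_checksum(api: NativeApi, file_id: int, algorithm: str = "md5") -> str:
    """Hypothetical helper: download a datafile and return its hex digest.

    `algorithm` is any name hashlib accepts: "md5" (default), "sha1",
    "sha256", or "sha512".
    """
    resp = api.get_datafile(file_id)
    resp.raise_for_status()
    # hashlib.new() resolves the algorithm name at runtime.
    m = hashlib.new(algorithm)
    m.update(resp.content)
    return m.hexdigest()


api = NativeApi("https://data.aussda.at")
print(get_datafile_checksum(api, 3702, algorithm="sha256"))
```

An enable/disable flag on `get_datafile()` itself could then call a helper like this internally and raise when the digest does not match an expected value.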
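And a hedged pytest sketch for the "Write tests" item, assuming the hypothetical `get_datafile_checksum` helper above is importable; `FakeApi` and `FakeResponse` are test doubles that stub out the network call.

```python
import hashlib

import pytest


class FakeResponse:
    """Test double mimicking the parts of requests.Response the helper uses."""

    def __init__(self, content: bytes):
        self.content = content

    def raise_for_status(self):
        pass


class FakeApi:
    """Test double standing in for NativeApi; returns a fixed payload."""

    def get_datafile(self, file_id):
        return FakeResponse(b"hello world")


@pytest.mark.parametrize("algorithm", ["md5", "sha1", "sha256", "sha512"])
def test_checksum_matches_hashlib(algorithm):
    # The helper's digest must equal hashlib's digest of the same bytes.
    expected = hashlib.new(algorithm, b"hello world").hexdigest()
    assert get_datafile_checksum(FakeApi(), 1, algorithm=algorithm) == expected
```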
### Review
### Follow-Ups
Hey @skasberger!
This is how I solved the problem for checking the checksum error in my previous project: https://github.com/atrisovic/dataverse-r-study/blob/0fc1c223ed0a0777633f94f9b7cad699003aaf7a/docker/download_dataset.py#L32-L39
I tried playing with the client to incorporate the code, but I think it's quite awkward to do it the same way. I can still share the code if you think it would be helpful, but I think there needs to be another approach x)
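Building on that idea, here is a hedged sketch of checking a locally computed digest against the checksum Dataverse records at ingest. The persistent ID is a placeholder, and the JSON paths (`latestVersion` → `files` → `dataFile` → `checksum`) reflect the native API's response shape as I understand it.

```python
import hashlib

from pyDataverse.api import NativeApi

BASE_URL = "https://data.aussda.at"  # example installation from this issue
PID = "doi:10.11587/EXAMPLE"         # hypothetical dataset persistent ID

api = NativeApi(BASE_URL)

# Each file entry in the dataset metadata carries the checksum that
# Dataverse computed at ingest, e.g. {"type": "MD5", "value": "..."}.
dataset = api.get_dataset(PID).json()
for file_meta in dataset["data"]["latestVersion"]["files"]:
    datafile = file_meta["dataFile"]
    expected = datafile["checksum"]["value"]
    # Map Dataverse's type names ("MD5", "SHA-1", ...) to hashlib names.
    algorithm = datafile["checksum"]["type"].lower().replace("-", "")

    resp = api.get_datafile(datafile["id"])
    m = hashlib.new(algorithm)
    m.update(resp.content)

    status = "OK" if m.hexdigest() == expected else "MISMATCH"
    print(f'{datafile["filename"]}: {status}')
```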
As discussed during the 2024-02-14 meeting of the pyDataverse working group, we are closing old milestones in favor of a new project board at https://github.com/orgs/gdcc/projects/1 and removing issues (like this one) from those old milestones. Please feel free to join the working group! You can find us at https://py.gdcc.io and https://dataverse.zulipchat.com/#narrow/stream/377090-python