MemoryError on computing checksums for large files
When creating a command that depends on a large file (one that cannot fit into memory), weasel still tries to load the whole file, which results in a MemoryError.
The traceback for such a run:
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/bin/weasel", line 8, in <module> sys.exit(app())
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 42, in project_run_cli project_run(
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 88, in project_run
project_run(
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 113, in project_run
update_lockfile(current_dir, cmd)
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 270, in update_lockfile
data[command["name"]] = get_lock_entry(project_dir, command)
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 286, in get_lock_entry
deps = get_fileinfo(project_dir, command.get("deps", []))
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/cli/run.py", line 308, in get_fileinfo
md5 = get_checksum(file_path) if file_path.exists() else None
File "/home/gorosz/workspace/ascend/invoice-ai-models/.venv/lib/python3.10/site-packages/weasel/util/hashing.py", line 33, in get_checksum
return hashlib.md5(Path(path).read_bytes()).hexdigest()
File "/home/gorosz/Applications/miniconda3/lib/python3.10/pathlib.py", line 1127, in read_bytes
return f.read()
MemoryError
This happens because Weasel inspects a command's dependencies to check whether they have changed. To prevent it, simply don't list the large file as an input or output of the command - then it won't be processed or validated.
Thanks. I just discovered this workaround for myself as well. Do you think it would be feasible to use the last modification date instead of hashes? Alternatively, would computing the hash over file chunks solve this issue (see https://stackoverflow.com/questions/1131220/get-the-md5-hash-of-big-files-in-python)?
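For illustration, a chunked variant along the lines of that StackOverflow answer might look roughly like the sketch below. This is not weasel's actual code; the function name get_checksum_chunked and the chunk size are just placeholders.

import hashlib
from pathlib import Path
from typing import Union

def get_checksum_chunked(path: Union[str, Path], chunk_size: int = 1 << 20) -> str:
    # Feed the file to the hasher in fixed-size chunks so the whole
    # file never has to be held in memory at once.
    md5 = hashlib.md5()
    with Path(path).open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()

This produces the same digest as hashing the full bytes in one call, so swapping it in would not invalidate existing lockfile entries.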