Stream directly to `packed` during `verdi archive import`
This is a very early proof-of-concept implementation of streaming repository files directly to `packed` storage during an archive import.

For large archives, in particular those with many (small) repository files, this could provide a significant speed-up, as it reduces the number of file open and close operations required when writing to the target repository. While many of the operations needed to obtain the streams from the `ZipfileBackendRepository` can probably not be avoided, they can be significantly reduced by streaming to `packed` instead of `loose`: the latter requires as many open/close operations for writing as for reading the contents, while the former requires just one per pack (a ~4 GB file).
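To make the open/close argument concrete, here is a minimal stdlib-only sketch (not aiida-core or `disk_objectstore` code; `write_loose` and `write_packed` are hypothetical stand-ins): writing N objects loose costs one file open per object, while appending the same objects to a single pack file costs one open per batch.

```python
# Hedged sketch: contrast the number of file opens for "loose" vs "packed" writes.
import hashlib
import os
import tempfile


def write_loose(objects: list, repo_dir: str) -> int:
    """Write each object to its own file; return the number of file opens."""
    opens = 0
    for content in objects:
        name = hashlib.sha256(content).hexdigest()
        with open(os.path.join(repo_dir, name), "wb") as handle:  # one open per object
            handle.write(content)
        opens += 1
    return opens


def write_packed(objects: list, pack_path: str) -> int:
    """Append all objects to one pack file; return the number of file opens."""
    with open(pack_path, "ab") as pack:  # a single open for the whole batch
        for content in objects:
            pack.write(content)  # a real pack would also index (hash, offset, length)
    return 1


with tempfile.TemporaryDirectory() as tmp:
    objects = [b"a" * 10, b"b" * 20, b"c" * 30]
    loose_opens = write_loose(objects, tmp)
    packed_opens = write_packed(objects, os.path.join(tmp, "pack_0"))

print(loose_opens, packed_opens)  # 3 vs 1
```

A real pack additionally maintains an index of hash keys to byte ranges, but the open/close arithmetic is the same.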
I ran the command for the `acwf-verification_unaries-verification-PBE-v1-results_quantum_espresso-SSSP-1.3-PBE-precision.aiida` archive (2 GB, ~60k repository files) and the `MC3D-provenance.aiida` archive (12 GB, ~840k repository files). In the first case, the total runtime of the `verdi archive import` command was 176 s when streaming to `loose` and 107 s when streaming to `packed`; in the second case, the runtimes were 787 s and 728 s, respectively. Here I was expecting a larger speed-up, but the smaller gap might originate from the non-ideal preliminary implementation and benchmarking. Overall, I think this could be a promising feature when done properly.
A few notes regarding the implementation:

- I first obtain all hashes from the source repository (`ZipfileBackendRepository`) via `repository_from.list_objects()` and then retrieve the contents as `BytesIO` via `repository_from.get_object_content()`. I didn't use `repository_from.iter_object_streams()`, as that returns `ZipExtFile`s, which cannot be opened outside the scope of a context manager, and, importantly, not when passed to `Container.add_streamed_objects_to_pack()`.
- Although absolutely necessary, no validation of the total size is currently being done, so all streams are opened and held in memory. This works on the PSI workstation with ~250 GB of memory, but would crash most normal workstations. I haven't found a method of the `ZipfileBackendRepository` that returns the hashes and contents in batches (rather than one by one, such as `iter_object_streams()`), so that might be something we have to implement from scratch for this feature, batching either by the number of files or by their total size, using sensible defaults.
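The batching mentioned above could look roughly like the following sketch (the function name, the `(hashkey, size)` entry shape, and the 1 GiB default are all hypothetical, not an existing repository API): group entries into batches capped by a byte budget, so only one batch of streams needs to be in memory at a time.

```python
# Hedged sketch of size-based batching for repository objects.
from typing import Iterable, Iterator, List, Tuple


def batch_by_size(
    entries: Iterable[Tuple[str, int]],
    max_batch_bytes: int = 1 * 1024**3,  # hypothetical default: 1 GiB per batch
) -> Iterator[List[Tuple[str, int]]]:
    """Yield lists of (hashkey, size) whose sizes sum to at most max_batch_bytes.

    An entry larger than the budget is yielded in a batch of its own
    rather than dropped.
    """
    batch: List[Tuple[str, int]] = []
    total = 0
    for hashkey, size in entries:
        if batch and total + size > max_batch_bytes:
            yield batch
            batch, total = [], 0
        batch.append((hashkey, size))
        total += size
    if batch:
        yield batch


# Example: a 100-byte budget splits five objects into three batches.
entries = [("h1", 40), ("h2", 50), ("h3", 30), ("h4", 80), ("h5", 10)]
batches = list(batch_by_size(entries, max_batch_bytes=100))
print([len(b) for b in batches])  # [2, 1, 2]
```

A batch count limit could be added on top of the byte budget for archives dominated by many tiny files.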
The remaining to-dos for a proper implementation (ties in with the notes above):

- [ ] Use more suitable `ZipfileBackendRepository` and `disk_objectstore` `Container` methods where applicable and available (e.g. I saw there is also `add_objects_to_pack` in addition to `add_streamed_objects_to_pack`).
- [ ] Return hashes and file contents from the source repository in batches to avoid memory overloads and improve performance, and implement safety measures.
- [ ] Proper benchmarking of the time savings that streaming to `packed` can provide (timing only the repository file additions).
- [ ] Logic that automatically determines where to stream based on the total file size and/or number of repository files being imported (while still allowing a manual override?).
- [ ] Properly check whether the profile is locked during the operation (I have ignored this until now; it might already be taken care of automatically by the existing infrastructure?).
- [ ] Ensure that all repository connections are properly opened and closed, e.g. by using the relevant context managers.
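On the last two points, one stdlib pattern that may help with the `ZipExtFile` scope problem is `contextlib.ExitStack`: it keeps every member stream open until the whole batch has been consumed, then closes them all deterministically. The sketch below is hypothetical; the loop that drains the streams stands in for a call like `Container.add_streamed_objects_to_pack()`, which is not invoked here.

```python
# Hedged sketch: keep ZIP member streams open for a whole batch via ExitStack.
import contextlib
import io
import os
import tempfile
import zipfile


def streams_to_pack(zip_path, names, pack):
    """Copy the named ZIP members into a single pack stream; return bytes written."""
    written = 0
    with zipfile.ZipFile(zip_path) as archive, contextlib.ExitStack() as stack:
        # All ZipExtFile streams stay open for the lifetime of the stack,
        # so a consumer can read them outside each individual `open` call.
        streams = [stack.enter_context(archive.open(name)) for name in names]
        for stream in streams:  # stand-in for add_streamed_objects_to_pack
            written += pack.write(stream.read())
    return written  # every stream and the archive are closed by now


# Example with a small on-disk archive and an in-memory "pack".
with tempfile.TemporaryDirectory() as tmp:
    zip_path = os.path.join(tmp, "source.zip")
    with zipfile.ZipFile(zip_path, "w") as archive:
        archive.writestr("obj1", b"hello")
        archive.writestr("obj2", b"world!")
    pack = io.BytesIO()
    total = streams_to_pack(zip_path, ["obj1", "obj2"], pack)

print(total)  # 11
```

If the batching to-do above is implemented, the `ExitStack` scope would naturally coincide with one batch, bounding both open file handles and memory.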
Also note that I added the `verdi profile flush` command as a convenience to clean the data from a profile, which I now use during development. This can be moved into a separate PR if we deem it useful, or removed from this PR if not.
Pinging @khsrali, as he voiced his interest in being involved in the `disk_objectstore`; @agoscinski, as we looked into this together for a while; @mbercx, to keep in the loop; and @sphuber and @giovannipizzi, who have expertise on the backend implementations and can likely provide crucial pointers. I'm still familiarizing myself with the backend repository implementations and the `disk_objectstore`, so any such pointers are more than welcome. I'll keep working on these to-dos on my branch, updating the PR here along the way.