
Stream directly to `packed` during `verdi archive import`

GeigerJ2 opened this issue 2 months ago · 3 comments

This is a very early proof-of-concept implementation of streaming repository files directly to packed during an archive import.

Especially for large archives with many (small) repository files, this could provide a significant speed-up, as it reduces the number of file open and close operations required when writing to the target repository. While many such operations cannot be avoided when obtaining the streams from the ZipfileBackendRepository, the write side can be reduced significantly by streaming to packed instead of loose: writing to loose requires the same number of open/close operations again as reading the contents, whereas writing to packed requires only one per pack (a ~4 GB file).
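To make the difference concrete, here is a rough sketch of the two write paths, assuming a target disk_objectstore `Container` and `BytesIO` streams already obtained from the source repository. `add_streamed_objects_to_pack` is the method used in this PR; the per-object call shown for the loose path is my assumption about the `Container` API, not something taken from this implementation.

```python
import io
from typing import Iterable, List

def import_to_loose(container, streams: Iterable[io.BytesIO]) -> List[str]:
    """Loose path: one file in the object store is created (opened and closed) per object."""
    # NOTE: per-object call name is an assumption about the Container API.
    return [container.add_streamed_object(stream) for stream in streams]

def import_to_packed(container, streams: Iterable[io.BytesIO]) -> List[str]:
    """Packed path: streams are appended to the current pack file, so only one
    target file handle is needed per ~4 GB pack, independent of the object count."""
    return list(container.add_streamed_objects_to_pack(list(streams)))
```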

I just ran the command for the acwf-verification_unaries-verification-PBE-v1-results_quantum_espresso-SSSP-1.3-PBE-precision.aiida archive (2 GB, ~60k repository files) and the MC3D-provenance.aiida archive (12 GB, ~840k repository files). In the first case, the total runtime of the verdi archive import command was 176 s when streaming to loose and 107 s when streaming to packed; in the second case, the runtimes were 787 s and 728 s, respectively. I was expecting a larger speed-up for the second archive, but this might be due to the non-ideal preliminary implementation and benchmarking. Overall, I think this could be a promising feature when done properly.

A few notes regarding the implementation:

  • I first obtain all hashes from the source repository (ZipFileBackendRepository) via repository_from.list_objects() and then retrieve the contents as BytesIO via repository_from.get_object_content(). I didn't use repository_from.iter_object_streams(), as it returns ZipExtFiles, which cannot be used outside the scope of their context manager and, importantly, cannot be passed to Container.add_streamed_objects_to_pack().
  • No validation on the size is currently done, although it is absolutely necessary: all streams are opened and held in memory at once. This works on the PSI workstation with ~250 GB of memory, but would crash most normal workstations. I haven't found a method of the ZipfileBackendRepository that returns the hashes and contents in batches (rather than one by one, like iter_object_streams()), so that might be something we have to implement from scratch for this feature, batching either by the number of files or by their total size, with sensible defaults (see the sketch after this list).
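A minimal sketch of what such batching could look like, assuming `repository_from` is the source ZipfileBackendRepository and `container` the target disk_objectstore `Container`; the `BATCH_SIZE` constant and the `batched` helper are illustrative, not existing API:

```python
import io

BATCH_SIZE = 10_000  # objects per batch; could also be bounded by total bytes

def batched(iterable, size):
    """Yield lists of at most `size` items from `iterable`."""
    batch = []
    for item in iterable:
        batch.append(item)
        if len(batch) >= size:
            yield batch
            batch = []
    if batch:
        yield batch

def import_in_batches(repository_from, container):
    for keys in batched(repository_from.list_objects(), BATCH_SIZE):
        # Materialise each object as BytesIO so the streams remain usable outside
        # the zipfile's context manager (the ZipExtFile limitation noted above).
        streams = [io.BytesIO(repository_from.get_object_content(key)) for key in keys]
        container.add_streamed_objects_to_pack(streams)
```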

The remaining to-dos for a proper implementation (tying in with the notes above):

  • [ ] Use more suitable ZipFileBackendRepository and disk_objectstore Container methods if applicable and available (e.g. I saw there is also add_objects_to_pack in addition to add_streamed_objects_to_pack)
  • [ ] Return hashes and file contents from the source repository in batches to avoid memory overloads and improve performance, as well as implement safety measures
  • [ ] Proper benchmarking of the time savings that streaming to packed can provide (timing only the repository file additions)
  • [ ] Logic that automatically determines where to stream, based on the total file size and/or the number of repository files being imported (while still allowing a manual override?); see the sketch after this list
  • [ ] Properly check whether the profile is locked during the operation (I have ignored this until now; it might already be taken care of automatically by the existing infrastructure?)
  • [ ] Ensure that all repository connections are properly opened and closed, e.g. by using the relevant context managers
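For the streaming-target heuristic, something like the following minimal sketch could work; the function name, thresholds, and `force` override are purely hypothetical and not part of aiida-core:

```python
from typing import Optional

def should_stream_to_packed(
    num_objects: int,
    total_bytes: int,
    force: Optional[bool] = None,
    min_objects: int = 10_000,
    min_bytes: int = 1024**3,  # 1 GiB
) -> bool:
    """Prefer packed when an import brings many objects or a large total size.

    ``force`` stands in for the manual override mentioned in the to-do above.
    """
    if force is not None:
        return force
    return num_objects >= min_objects or total_bytes >= min_bytes
```

The thresholds would of course need to be calibrated against the proper benchmarks mentioned in the third to-do.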

Also note that I added the verdi profile flush command as a convenience command to clear the data from a profile, which I now use during development. This can be moved into a separate PR if we deem it useful, or removed from this PR if not.

Pinging @khsrali as he voiced his interest in being involved in the disk_objectstore, @agoscinski as we looked into this together for a while, @mbercx to keep him in the loop, and @sphuber and @giovannipizzi, who have expertise on the backend implementations and can likely provide crucial pointers. I'm still familiarizing myself with the backend repository implementations and the disk_objectstore, so any such pointers are more than welcome. I'll keep working on these to-dos on my branch, updating the PR along the way.

GeigerJ2 · May 24 '24, 16:05