Improvements to data dumping
This is a meta-issue to track possible improvements and further to-dos for the mirror/dump feature of AiiDA data (PR #6723):
priority/critical-blocking
- [x] Rather than always exposing and passing around the full `DumpConfig` object, add the relevant parameters to each `dump` method of the AiiDA entities as kwargs
- [x] Use relative paths in the `Logger` and the log JSON
  -> Achieved by converting the absolute path entries to relative paths when the log file is written, and by converting them back to absolute paths when it is read again (a minimal illustration follows this list). Directly storing the relative paths as entries of the logger seems non-trivial, as every mirror has its own `MirrorPaths` attached.
- [x] Group modifications:
  - [x] Adding a node to a group should be reflected on successive runs of the command (the node's `mtime` doesn't change).
  - [x] What is the behavior when a group is deleted, but not its nodes? This deletes the group directory, and, as the nodes are now ungrouped, they
    - [x] shouldn't be in the dump output directory anymore at all, or
    - ~be automatically moved to the `ungrouped` directory~ -> not chosen/implemented right now
  - [x] A sub-calculation/workflow that is part of a group should be dumped, even if `only-top-level` is selected
  - [x] When a group is copied, it should also be dumped (even if none of the nodes are new, and they are the same nodes that are also in the other group).
- [x] What is the behavior for new calculations that were taken from the cache? -> Should be the same, as the resulting nodes are new nodes with a new `mtime`
- [x] Dumping all the profile data (as is basically currently the case) is possibly not the desired behavior. Instead, similar to `verdi archive create`, the default should be "empty", and the user specifies which entities are to be dumped (options here: groups, time range, specific time, process type (or entry point?)). If nothing is provided, the command could tell the user about the various options.
- [x] Don't create empty directories for unselected groups when running `verdi profile dump -G`
- [ ] Allow adding groups to the config file incrementally (how should the command pick this up? Keep a config version history via hidden files, and compare if a new group was added?)
- Tests
  - [ ] Unit tests
  - [x] Integration tests
  - [ ] Benchmark tests
- [ ] For the first dump operation, create a hidden sandbox folder, and only move it once the operation has completed and run through successfully
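A minimal illustration of the relative-path round trip described above; the function names are hypothetical, not the actual implementation:

```python
from pathlib import Path

def serialize_entry(path: Path, mirror_root: Path) -> str:
    # Store paths relative to the mirror root when writing the log JSON
    return str(path.relative_to(mirror_root))

def deserialize_entry(entry: str, mirror_root: Path) -> Path:
    # Restore absolute paths against the current mirror root when reading
    return mirror_root / entry
```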
priority/nice-to-have
- [x] Keep track of symlinks and/or dumped node types in `MirrorLogger`?
- [x] Store the modified time and size of directories in a metadata info file -> This can be checked in the verify option to pick up modification/deletion of files/directories (currently done in the main dump log JSON)
- [x] Write a YAML file of all the config options in the top-level mirror directory after the mirror operation (and allow reading from it), such that the same command can be applied reproducibly.
- [ ] Add data dumping (raw/rich)
- [ ] Have some additional verify options (could be at different levels/complexities, with different time requirements)
- [ ] What happens when a user deletes a sub-directory in the dumped output directory tree?
- [ ] One could also set the default permissions to read-only. However, the top-level directory should remain easy to `rm`, `mv`, etc.
- [ ] Use graph traversal rules rather than a recursive function for the process mirror -> Not sure if the use cases exactly overlap
- [ ] For duplicate groups, one could issue a report about the size of data that is duplicated, or de-duplicated if symlinking is enabled
- [ ] The user should be able to specify via a global option what to do when duplicate entities are encountered, e.g., `--handle-duplicates`
- [ ] Add a `--recursive-groups` option: When we have groups `a/b/c` and `a/b/d`, and the user specifies that option and `a/b`, it should match `a/b` explicitly and all subgroups `a/b/*`; however, it should not match `a/b*`. Also, passing `a/b/` should give the same behavior (see the matching sketch below this list).
- [ ] For the `Using config file settings ONLY` message, add a statement why this is the case
- [ ] If the user wants to dump a group via `verdi group dump` inside a profile directory, check one (or two) levels up for the config of a profile dump, and append to that log file rather than the group one? Might get difficult when using the same manager/orchestrator (the profile one, chosen to avoid code repetition).
- [x] More fine-grained exception handling, and better behavior on capture (e.g., fail fully rather than returning empty lists?)
- [ ] Use a tree data structure to represent the state of the dumping, rather than a flat JSON file
- [ ] Add a `verdi data core.remote dump` CLI endpoint to retrieve a `RemoteData`
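A hypothetical sketch of the `--recursive-groups` matching semantics described above; the function name and signature are assumptions, not existing code:

```python
def matches_recursive_group(spec: str, label: str) -> bool:
    """Match a group label against a --recursive-groups specification.

    `a/b` (or `a/b/`) matches `a/b` itself and all subgroups `a/b/*`,
    but not labels like `a/bc` that merely share the string prefix.
    """
    base = spec.rstrip('/')
    return label == base or label.startswith(base + '/')

assert matches_recursive_group('a/b', 'a/b')
assert matches_recursive_group('a/b/', 'a/b/c')
assert not matches_recursive_group('a/b', 'a/bc')
```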
requires-discussion
- [ ] Possibly use `GraphTraversalRules` to obtain sub-processes for `verdi process dump`
Notes
- There exist `create_file_hierarchy` and `serialize_file_hierarchy` fixtures
Adding some future architectural improvements here for when I revisit the implementation
High Priority
1. Path Management Complexity
Problem: The DumpPaths class violates the Single Responsibility Principle by handling path generation, directory I/O, deletion, stats calculation, and business logic.
Origin: src/aiida/tools/_dumping/utils.py - DumpPaths class
Solution: Split into focused classes:
- `PathGenerator`: Pure functions for path generation
- `FileSystemManager`: Handles mkdir, delete, stats operations
- Use Strategy pattern for different path determination strategies (grouped vs ungrouped nodes)
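A rough sketch of the split; the class and method names are illustrative, not existing code:

```python
from pathlib import Path

class PathGenerator:
    """Pure path computation, no filesystem side effects."""

    def group_path(self, base: Path, group_label: str) -> Path:
        return base / 'groups' / group_label

class FileSystemManager:
    """Owns all filesystem I/O: mkdir, delete, stats."""

    def prepare(self, path: Path) -> None:
        path.mkdir(parents=True, exist_ok=True)
```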
2. Executor Hierarchy Issues
Problem: Awkward inheritance with CollectionDumpExecutor as base for both GroupDumpExecutor and ProfileDumpExecutor. ProcessDumpExecutor contains multiple unrelated responsibilities.
Origin: src/aiida/tools/_dumping/executors/ directory structure
Solution: Replace inheritance with composition:
```python
class ProcessDumper:
    def __init__(self, metadata_writer, repo_dumper, readme_generator):
        ...

class CollectionDumper:
    def __init__(self, process_dumper, group_manager):
        ...
```
3. DumpChangeDetector Doing Too Much
Problem: Single class handles node querying, change detection, filtering, tracker interaction, and group membership logic.
Origin: src/aiida/tools/_dumping/detect.py - DumpChangeDetector class
Solution: Break into focused services:
- `NodeQueryService`: Database queries
- `ChangeDetectionService`: Compare current vs tracked state
- `NodeFilterService`: Apply behavioral filters
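A minimal sketch of what the composition could look like; the service names follow the list above, everything else is illustrative:

```python
class NodeQueryService:
    """Database queries only."""

class ChangeDetectionService:
    """Compare the current state against the tracked state."""

class NodeFilterService:
    """Apply behavioral filters to queried nodes."""

class DumpChangeDetector:
    """Thin coordinator that composes the three services."""

    def __init__(self, queries: NodeQueryService, changes: ChangeDetectionService, filters: NodeFilterService):
        self.queries = queries
        self.changes = changes
        self.filters = filters
```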
Medium Priority
4. DumpTracker Mixing Concerns
Problem: Combines registry management, file I/O, path updates, mapping storage, and time tracking in one class.
Origin: src/aiida/tools/_dumping/tracking.py - DumpTracker class
Solution: Separate persistence from domain logic:
- Keep `DumpRegistry` as an in-memory structure
- Create `DumpRegistryPersistence` for save/load operations
- Make `DumpTracker` coordinate registries without handling I/O
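A sketch of the separation; `DumpRegistryPersistence` and the method signatures are assumptions:

```python
from pathlib import Path

class DumpRegistry:
    """Pure in-memory registry of dumped entities; no I/O."""

class DumpRegistryPersistence:
    """The only class that touches disk for registry data."""

    def __init__(self, log_path: Path):
        self.log_path = log_path

    def save(self, registry: DumpRegistry) -> None: ...
    def load(self) -> DumpRegistry: ...

class DumpTracker:
    """Coordinates registries and delegates persistence."""

    def __init__(self, registry: DumpRegistry, persistence: DumpRegistryPersistence):
        self.registry = registry
        self.persistence = persistence
```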
5. Type System Not Used Effectively
Problem: Heavy use of `assert isinstance(self.config, ...)` throughout the codebase suggests config types aren't properly enforced.
Origin: Multiple files - engine.py, detect.py, executors
Solution: Use separate classes instead of Union types:
```python
class GroupDumper:
    def __init__(self, config: GroupDumpConfig):  # Type guaranteed
        self.config: GroupDumpConfig = config
```
6. DumpEngine as God Object
Problem: Engine orchestrates everything, knows all implementation details, makes all decisions.
Origin: src/aiida/tools/_dumping/engine.py - DumpEngine class
Solution: Use builder/factory pattern:
```python
class DumpEngineBuilder:
    def build_for_process(self, node, config) -> ProcessDumper: ...
    def build_for_group(self, group, config) -> GroupDumper: ...

class DumpEngine:
    def __init__(self, dumper: Dumper):
        self.dumper = dumper
```
Low Priority
7. Config Inheritance Complexity
Problem: Multiple mixins create a complex inheritance tree, making it hard to understand which options apply where.
Origin: src/aiida/tools/_dumping/config.py - mixin classes
Solution: Use composition with explicit config objects:
```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ProfileDumpConfig:
    time_filters: Optional[TimeFilters]
    entity_filters: Optional[EntityFilters]
    dump_mode: DumpMode
```
8. Inconsistent Error Handling
Problem: Some methods raise, others log and return None, some continue silently. No consistent error handling strategy.
Origin: Throughout codebase
Solution:
- Define custom exception hierarchy
- Document error handling strategy (critical: raise, recoverable: log + continue)
- Apply consistently across all modules
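A minimal sketch of such a hierarchy; the exception names are assumptions, not existing AiiDA exceptions:

```python
class DumpError(Exception):
    """Base class for all dump-related errors."""

class DumpCriticalError(DumpError):
    """Unrecoverable; abort the whole dump operation."""

class DumpRecoverableError(DumpError):
    """Recoverable; log the failure and continue with the next entity."""
```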
9. State Management Unclear
Problem: Unclear when state is cached vs recomputed, what current_mapping vs previous_mapping means, where state lives.
Origin: engine.py, detect.py, tracking.py
Solution: Make state explicit with immutable dataclasses:
```python
from dataclasses import dataclass
from typing import Optional, Set

@dataclass(frozen=True)  # frozen makes the state immutable
class DumpState:
    current_mapping: GroupNodeMapping
    previous_mapping: Optional[GroupNodeMapping]
    dump_times: DumpTimes
    tracked_nodes: Set[str]
```
10. Testing Challenges
Problem: Tight coupling makes unit testing difficult; most tests require a full database setup.
Origin: All classes with direct ORM dependencies
Solution: Use dependency injection with protocols:
```python
from typing import List, Protocol

class NodeRepository(Protocol):
    def get_calculations(self, filters) -> List[CalculationNode]: ...

class DumpChangeDetector:
    def __init__(self, node_repo: NodeRepository):  # Can mock
        ...
```
11. Magic Strings and Constants
Problem: Registry names ('calculations', 'workflows', 'groups') used as strings throughout code.
Origin: Multiple files
Solution: Use enums:
```python
from enum import Enum

class RegistryType(Enum):
    CALCULATIONS = 'calculations'
    WORKFLOWS = 'workflows'
    GROUPS = 'groups'
```
12. Inconsistent Naming Conventions
Problem: Mixed use of `orm_type`, `store_type`, and `registry_name` for the same concept. Inconsistent method prefixes (`_detect_*` vs `_get_*` vs `_query_*`).
Origin: Throughout codebase
Solution: Establish and document naming conventions:
- Use consistent terminology for registries
- Define when to use `query_*`, `get_*`, `detect_*`, `find_*`
13. Long Methods
Problem: Many methods exceed 50 lines (e.g., `_dump_node_content`, `_check_log_and_determine_action`, `get_nodes`).
Origin: executors/process.py, detect.py
Solution: Extract smaller, focused methods with clear single purposes and descriptive names.