
Improvements to data dumping

GeigerJ2 opened this issue 8 months ago · 1 comment

This is a meta-issue to track possible improvements and further to-dos for the mirror/dump feature of AiiDA data (PR #6723):

priority/critical-blocking

  • [x] Rather than always exposing and passing around the full DumpConfig object, add the relevant parameters to each dump method of the AiiDA entities as kwargs
  • [x] Use relative paths in the Logger and the log JSON -> Achieved by converting the absolute path entries to relative paths when the log file is written, and converting them back to absolute paths when it is read again (see the sketch after this list). Directly storing relative paths as logger entries seems non-trivial, as every mirror has its own MirrorPaths attached.
  • [x] Group modifications:
    • [x] Adding a node to a group should be reflected on successive runs of the command (Node mtime doesn't change).
    • [x] What is the behavior when a group is deleted, but not its nodes? This deletes the group directory, and, as the nodes are now ungrouped, they
      • [x] should no longer be in the dump output directory at all, or ~be automatically moved to the ungrouped directory -> not chosen/implemented right now~
    • [x] Sub-calculation/workflow that is part of a group should be dumped, even if only-top-level is selected
    • [x] When a group is copied, the copy should also be dumped (even if none of its nodes are new and they are the same nodes that are also in the other group).
  • [x] What is the behavior for new calculations that were taken from the cache? -> Should be the same, as the resulting nodes are new nodes with new mtime
  • [x] Dumping all the profile data (as is basically the case currently) is possibly not the desired behavior. Instead, similar to verdi archive create, the default should be "empty", and the user specifies which entities to dump (options here: groups, time range, specific time, process type (or entry point?)). If nothing is provided, the command could tell the user about the various options.
  • [x] Don't create empty directories for unselected groups when running verdi profile dump -G
  • [ ] Allow adding groups to the config file incrementally (how should the command pick this up? Keep config version history via hidden files, and compare if new group was added?)
  • Tests
    • [ ] Unit tests
    • [x] Integration tests
    • [ ] Benchmark tests
  • [ ] For the first dump operation, create a hidden sandbox folder, and only move it once the operation has completed and run through successfully
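
A minimal sketch of the relative-path handling for the log JSON mentioned above (the write_log/read_log helper names and the dict shape are assumptions, not the actual implementation):

import json
from pathlib import Path


def write_log(entries: dict, log_file: Path, base: Path) -> None:
    # Convert absolute path entries to paths relative to the mirror base before writing
    serializable = {key: str(Path(path).relative_to(base)) for key, path in entries.items()}
    log_file.write_text(json.dumps(serializable, indent=2))


def read_log(log_file: Path, base: Path) -> dict:
    # Resolve the stored relative paths back to absolute paths when reading
    return {key: base / rel for key, rel in json.loads(log_file.read_text()).items()}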

priority/nice-to-have

  • [x] Keep track of symlinks and/or dumped node types in MirrorLogger?
  • [x] Store modified time and size of directories in a metadata info file -> This can be checked by the verify option to pick up modification/deletion of files/directories (currently done in the main dump log JSON)
  • [x] Write YAML file of all the config options in the top-level mirror directory after mirror operation (and allow reading from it), such that the same command can be applied reproducibly.
  • [ ] Add data dumping (raw/rich)
  • [ ] Have some additional verify options (could be at different levels/complexities, with different time requirements)
  • [ ] What happens when a user deletes a sub-directory in the dumped output directory tree?
  • [ ] One could also set the default permissions to read-only. However, the top-level directory should be easy to rm, mv, etc.
  • [ ] Use graph traversal rules rather than a recursive function for the process mirror -> Not sure if the use cases exactly overlap
  • [ ] For duplicate groups, one could issue a report about the size of data that is duplicated, or de-duplicated if symlinking is enabled
  • [ ] User should be able to specify via a global option what to do when duplicate entities are encountered, e.g., --handle-duplicates
  • [ ] Add a --recursive-groups option: when we have groups a/b/c and a/b/d and the user specifies that option together with a/b, it should match a/b explicitly and all subgroups a/b/*, but not a/b*. Passing a/b/ should give the same behavior (see the sketch after this list).
  • [ ] When config file settings ONLY are being used, add a statement explaining why this is the case
  • [ ] If the user wants to dump a group via verdi group dump inside a profile directory, check one (or two) levels up for the config of a profile dump, and append to that log file rather than to the group one? Might get difficult when using the same manager/orchestrator (the profile one, chosen to avoid code repetition).
  • [x] More fine-grained exception handling, and better behavior on capture (e.g., fail fully rather than returning empty lists?)
  • [ ] Use a tree data structure to represent the state of the dumping, rather than a flat JSON file
  • [ ] Add verdi data core.remote dump CLI endpoint to retrieve a RemoteData
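
A possible matching rule for the --recursive-groups idea above (the function name is hypothetical; the semantics follow the item):

def matches_recursive(selected: str, label: str) -> bool:
    # 'a/b' (or 'a/b/') matches 'a/b' itself and any subgroup 'a/b/*', but not 'a/b*' siblings like 'a/bc'
    selected = selected.rstrip('/')
    return label == selected or label.startswith(selected + '/')


# matches_recursive('a/b', 'a/b/c')  -> True
# matches_recursive('a/b', 'a/bc')   -> False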

requires-discussion

  • [ ] Possibly use GraphTraversalRules to obtain sub-processes for verdi process dump

Notes

  • There exist create_file_hierarchy and serialize_file_hierarchy fixtures

GeigerJ2 · Apr 10 '25 14:04

Adding some future architectural improvements here for when I revisit the implementation

High Priority

1. Path Management Complexity

Problem: The DumpPaths class violates the Single Responsibility Principle by handling path generation, directory I/O, deletion, stats calculation, and business logic.

Origin: src/aiida/tools/_dumping/utils.py - DumpPaths class

Solution: Split into focused classes:

  • PathGenerator: Pure functions for path generation
  • FileSystemManager: Handles mkdir, delete, stats operations
  • Use Strategy pattern for different path determination strategies (grouped vs ungrouped nodes)
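
A rough sketch of that split (the strategies play the role of PathGenerator's pure functions; method names are hypothetical):

import shutil
from pathlib import Path
from typing import Optional, Protocol


class PathStrategy(Protocol):
    # Strategy interface: decide where a node ends up, without touching the filesystem
    def node_path(self, base: Path, node_label: str, group_label: Optional[str]) -> Path: ...


class GroupedPathStrategy:
    def node_path(self, base: Path, node_label: str, group_label: Optional[str]) -> Path:
        return base / 'groups' / (group_label or 'unknown') / node_label


class UngroupedPathStrategy:
    def node_path(self, base: Path, node_label: str, group_label: Optional[str]) -> Path:
        return base / 'ungrouped' / node_label


class FileSystemManager:
    # All directory I/O (mkdir, delete, stats) lives here, keeping the strategies side-effect free
    def prepare(self, path: Path) -> Path:
        path.mkdir(parents=True, exist_ok=True)
        return path

    def delete(self, path: Path) -> None:
        shutil.rmtree(path, ignore_errors=True)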

2. Executor Hierarchy Issues

Problem: Awkward inheritance with CollectionDumpExecutor as base for both GroupDumpExecutor and ProfileDumpExecutor. ProcessDumpExecutor contains multiple unrelated responsibilities.

Origin: src/aiida/tools/_dumping/executors/ directory structure

Solution: Replace inheritance with composition:

class ProcessDumper:
    def __init__(self, metadata_writer, repo_dumper, readme_generator):
        ...

class CollectionDumper:
    def __init__(self, process_dumper, group_manager):
        ...

3. DumpChangeDetector Doing Too Much

Problem: Single class handles node querying, change detection, filtering, tracker interaction, and group membership logic.

Origin: src/aiida/tools/_dumping/detect.py - DumpChangeDetector class

Solution: Break into focused services:

  • NodeQueryService: Database queries
  • ChangeDetectionService: Compare current vs tracked state
  • NodeFilterService: Apply behavioral filters
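
One way the split could look; the service names are from the list above, while method names and signatures are assumptions:

class NodeQueryService:
    # Database access only: the QueryBuilder calls would live here
    def nodes_modified_since(self, timestamp):
        ...


class ChangeDetectionService:
    # Pure comparison of the current state against the tracked state
    def __init__(self, queries: NodeQueryService):
        self.queries = queries

    def new_or_modified(self, timestamp, tracked_uuids: set) -> set:
        current = {node.uuid for node in self.queries.nodes_modified_since(timestamp) or ()}
        return current - tracked_uuids


class NodeFilterService:
    # Behavioral filters (e.g. only-top-level) applied after detection
    def __init__(self, only_top_level: bool = True):
        self.only_top_level = only_top_level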

Medium Priority

4. DumpTracker Mixing Concerns

Problem: Combines registry management, file I/O, path updates, mapping storage, and time tracking in one class.

Origin: src/aiida/tools/_dumping/tracking.py - DumpTracker class

Solution: Separate persistence from domain logic:

  • Keep DumpRegistry as in-memory structure
  • Create DumpRegistryPersistence for save/load operations
  • Make DumpTracker coordinate registries without handling I/O
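
A sketch of the persistence split; DumpRegistryPersistence is the name proposed above, and the dict-based serialization is an assumption:

import json
from pathlib import Path


class DumpRegistryPersistence:
    # Save/load only; DumpRegistry itself stays a pure in-memory structure
    def __init__(self, log_file: Path):
        self.log_file = log_file

    def save(self, registry_data: dict) -> None:
        self.log_file.write_text(json.dumps(registry_data, indent=2, default=str))

    def load(self) -> dict:
        if not self.log_file.exists():
            return {}
        return json.loads(self.log_file.read_text())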

5. Type System Not Used Effectively

Problem: Heavy use of assert isinstance(self.config, ...) throughout the codebase suggests config types aren't properly enforced.

Origin: Multiple files - engine.py, detect.py, executors

Solution: Use separate classes instead of Union types:

class GroupDumper:
    def __init__(self, config: GroupDumpConfig):  # Type guaranteed
        self.config: GroupDumpConfig = config

6. DumpEngine as God Object

Problem: Engine orchestrates everything, knows all implementation details, makes all decisions.

Origin: src/aiida/tools/_dumping/engine.py - DumpEngine class

Solution: Use builder/factory pattern:

class DumpEngineBuilder:
    def build_for_process(self, node, config) -> ProcessDumper: ...
    def build_for_group(self, group, config) -> GroupDumper: ...

class DumpEngine:
    def __init__(self, dumper: Dumper):
        self.dumper = dumper

Low Priority

7. Config Inheritance Complexity

Problem: Multiple mixins create complex inheritance tree, making it hard to understand which options apply where.

Origin: src/aiida/tools/_dumping/config.py - mixin classes

Solution: Use composition with explicit config objects:

@dataclass
class ProfileDumpConfig:
    time_filters: Optional[TimeFilters]
    entity_filters: Optional[EntityFilters]
    dump_mode: DumpMode

8. Inconsistent Error Handling

Problem: Some methods raise, others log and return None, some continue silently. No consistent error handling strategy.

Origin: Throughout codebase

Solution:

  • Define custom exception hierarchy
  • Document error handling strategy (critical: raise, recoverable: log + continue)
  • Apply consistently across all modules
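
For example, a small custom hierarchy with one documented rule per level (exception and helper names are hypothetical):

class DumpError(Exception):
    """Base class for all dump-related errors."""


class DumpCriticalError(DumpError):
    """Unrecoverable (e.g. broken config or unwritable target path): always raise."""


class DumpNodeError(DumpError):
    """Recoverable and scoped to a single node: log, skip, and continue."""


# Usage in an executor loop (dump_single_node and logger are placeholders):
# try:
#     dump_single_node(node)
# except DumpNodeError as exc:
#     logger.warning('Skipping node %s: %s', node.pk, exc)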

9. State Management Unclear

Problem: It is unclear when state is cached vs recomputed, what current_mapping vs previous_mapping mean, and where state lives.

Origin: engine.py, detect.py, tracking.py

Solution: Make state explicit with immutable dataclasses:

@dataclass
class DumpState:
    current_mapping: GroupNodeMapping
    previous_mapping: Optional[GroupNodeMapping]
    dump_times: DumpTimes
    tracked_nodes: Set[str]

10. Testing Challenges

Problem: Tight coupling makes unit testing difficult; most tests require a full database setup.

Origin: All classes with direct ORM dependencies

Solution: Use dependency injection with protocols:

from typing import List, Protocol

from aiida.orm import CalculationNode

class NodeRepository(Protocol):
    def get_calculations(self, filters) -> List[CalculationNode]: ...

class DumpChangeDetector:
    def __init__(self, node_repo: NodeRepository):  # Can mock
        ...

11. Magic Strings and Constants

Problem: Registry names ('calculations', 'workflows', 'groups') used as strings throughout code.

Origin: Multiple files

Solution: Use enums:

from enum import Enum

class RegistryType(Enum):
    CALCULATIONS = 'calculations'
    WORKFLOWS = 'workflows'
    GROUPS = 'groups'

12. Inconsistent Naming Conventions

Problem: Mixed use of orm_type, store_type, registry_name for the same concept. Inconsistent method prefixes (_detect_* vs _get_* vs _query_*).

Origin: Throughout codebase

Solution: Establish and document naming conventions:

  • Use consistent terminology for registries
  • Define when to use query_*, get_*, detect_*, find_*

13. Long Methods

Problem: Many methods exceed 50 lines (e.g., _dump_node_content, _check_log_and_determine_action, get_nodes).

Origin: executors/process.py, detect.py

Solution: Extract smaller, focused methods with clear single purposes and descriptive names.

GeigerJ2 · Oct 06 '25 13:10