
feat(fennol): add FeNNol format support with MultiSystems capability for ML training data

Open Copilot opened this issue 4 months ago • 0 comments

This PR adds support for the FeNNol format, so dpdata users can export either a single LabeledSystem or several systems (via MultiSystems) to FeNNol's pickle format for machine-learning training.

Overview

FeNNol is a machine learning framework that requires data in a specific pickle format with training/validation splits. This implementation adds a new format plugin that converts dpdata systems to the required structure, with support for combining multiple different systems into a single training file.

Key Features

  • Format Registration: Adds fennol format to dpdata's format registry
  • Single System Export: Enables system.to("fennol", "output.pkl") for any LabeledSystem
  • MultiSystems Support: Enables multi_systems.to("fennol", "output.pkl") to combine different systems into the same file
  • Proper Data Structure: Generates pickle files with the required structure:
    {
        'training': [...],      # List of training structures
        'validation': [...],    # List of validation structures  
        'description': '...'    # Metadata description
    }
    
  • Required Fields: Each structure contains the FeNNol-required fields:
    • species: List of atomic species/elements
    • coordinates: Atomic positions in Å
    • formation_energy: Energy in kcal/mol
    • shifted_energy: Energy in kcal/mol (same as formation_energy)
    • forces: Atomic forces in kcal/mol/Å
    • system_name: Name of the originating system (for MultiSystems tracking)
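
As a rough illustration of the layout described above, the following sketch assembles one per-frame dictionary and the top-level dictionary by hand. The helper names `make_structure` and `build_payload` are hypothetical and are not taken from the plugin's code; values are assumed to be in FeNNol's units already, and plain nested lists are used only to keep the sketch dependency-free.

```python
import pickle


def make_structure(species, coordinates, energy, forces, system_name):
    """Build one per-frame dict with the fields listed above (hypothetical helper)."""
    return {
        "species": list(species),
        "coordinates": [list(xyz) for xyz in coordinates],  # Å
        "formation_energy": float(energy),                  # kcal/mol
        "shifted_energy": float(energy),                    # same value as formation_energy
        "forces": [list(f) for f in forces],                # kcal/mol/Å
        "system_name": system_name,
    }


def build_payload(training, validation, description):
    """Wrap the per-frame dicts in the top-level training/validation layout."""
    return {
        "training": list(training),
        "validation": list(validation),
        "description": description,
    }


# Two made-up three-atom frames, purely for illustration.
zeros = [[0.0, 0.0, 0.0]] * 3
frames = [
    make_structure(["O", "H", "H"], zeros, -1.5 * (i + 1), zeros, "water")
    for i in range(2)
]
payload = build_payload(frames[:1], frames[1:], "toy example")

# The payload is an ordinary picklable object.
blob = pickle.dumps(payload)
```

Keeping `system_name` on every frame is what lets frames from different systems share one flat list while remaining traceable.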

Unit Conversions

The plugin automatically handles unit conversions to match FeNNol's expected units:

  • Energy: eV → kcal/mol (factor: ~23.06)
  • Forces: eV/Å → kcal/mol/Å (factor: ~23.06)
  • Coordinates: Å → Å (no conversion needed)
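
For reference, the ~23.06 factor can be reproduced from physical constants. This is a minimal sketch, not the plugin's code: the constant and function names are illustrative, and the exact value the plugin uses may differ in its trailing digits.

```python
# Exact SI values since the 2019 redefinition.
ELEMENTARY_CHARGE_C = 1.602176634e-19  # joules per electronvolt
AVOGADRO_PER_MOL = 6.02214076e23       # particles per mole
JOULES_PER_KCAL = 4184.0               # thermochemical kilocalorie

# eV per particle -> kcal per mole of particles.
EV_TO_KCAL_PER_MOL = ELEMENTARY_CHARGE_C * AVOGADRO_PER_MOL / JOULES_PER_KCAL


def convert_energy(energy_ev):
    """Convert an energy from eV to kcal/mol."""
    return energy_ev * EV_TO_KCAL_PER_MOL


def convert_forces(forces_ev_per_ang):
    """Convert per-atom force vectors from eV/Å to kcal/mol/Å.

    The length unit is unchanged, so the same factor applies.
    """
    return [[c * EV_TO_KCAL_PER_MOL for c in atom] for atom in forces_ev_per_ang]


print(f"1 eV = {EV_TO_KCAL_PER_MOL:.4f} kcal/mol")
```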

Usage Examples

import dpdata

# Single system export
ls = dpdata.LabeledSystem("OUTCAR", fmt="vasp/outcar")
ls.to("fennol", "data.pkl")

# Multiple systems combined into single file
ls1 = dpdata.LabeledSystem("system1/OUTCAR", fmt="vasp/outcar")
ls2 = dpdata.LabeledSystem("system2/OUTCAR", fmt="vasp/outcar")
ms = dpdata.MultiSystems(ls1, ls2)
ms.to("fennol", "combined_data.pkl")

# Custom training/validation split
ms.to("fennol", "data.pkl", train_size=0.9)
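
The `train_size` argument above is a fraction of the frames. One plausible way such a split could work is sketched below; the function `split_structures`, the seeded shuffle, and the rounding rule are all assumptions for illustration, since the PR description does not say how the plugin partitions frames.

```python
import random


def split_structures(structures, train_size, seed=0):
    """Split frames into (training, validation) by fraction (hypothetical helper).

    Assumptions: frames are shuffled with a fixed seed, and the training
    count is rounded to the nearest integer.
    """
    if not 0.0 < train_size <= 1.0:
        raise ValueError("train_size must be in (0, 1]")
    shuffled = list(structures)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(train_size * len(shuffled)))
    # Keep at least one training frame whenever there is any data.
    if shuffled and n_train == 0:
        n_train = 1
    return shuffled[:n_train], shuffled[n_train:]


frames = [{"system_name": f"frame{i}"} for i in range(10)]
training, validation = split_structures(frames, train_size=0.9)
```

With this rule, a single-frame system ends up entirely in the training list, and `train_size=1.0` leaves the validation list empty.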

Testing

The test suite covers:

  • Basic export functionality with synthetic data
  • Custom training/validation split ratios
  • Edge cases (single frame, all training data)
  • MultiSystems export combining multiple different systems
  • Unit conversion verification
  • Integration testing with real system data

All tests pass and the implementation follows project linting standards.
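
A structural check along the lines of the tests listed above might look like the following sketch. It uses only the standard library; the function `check_fennol_file` is hypothetical and is not part of dpdata or of this PR's test suite.

```python
import os
import pickle
import tempfile

REQUIRED_FIELDS = (
    "species",
    "coordinates",
    "formation_energy",
    "shifted_energy",
    "forces",
    "system_name",
)


def check_fennol_file(path):
    """Load a pickle and verify the layout described in this PR (hypothetical helper).

    Returns (number of training frames, number of validation frames).
    """
    with open(path, "rb") as fh:
        payload = pickle.load(fh)
    for key in ("training", "validation", "description"):
        if key not in payload:
            raise ValueError(f"missing top-level key: {key}")
    for split in ("training", "validation"):
        for index, frame in enumerate(payload[split]):
            missing = [name for name in REQUIRED_FIELDS if name not in frame]
            if missing:
                raise ValueError(f"{split}[{index}] is missing {missing}")
            n_atoms = len(frame["species"])
            if len(frame["coordinates"]) != n_atoms or len(frame["forces"]) != n_atoms:
                raise ValueError(f"{split}[{index}] has inconsistent atom counts")
    return len(payload["training"]), len(payload["validation"])


# Write a toy single-frame file and check it.
frame = {
    "species": ["H", "H"],
    "coordinates": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.74]],
    "formation_energy": -1.0,
    "shifted_energy": -1.0,
    "forces": [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]],
    "system_name": "H2",
}
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "toy.pkl")
    with open(path, "wb") as fh:
        pickle.dump({"training": [frame], "validation": [], "description": "toy"}, fh)
    counts = check_fennol_file(path)
```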

Fixes #876.



Copilot, Aug 29 '25 10:08