
Improvements to `species`

Open arm61 opened this issue 5 months ago • 11 comments

Create a species type that stores the indices and charges of the particles (charges are only needed for conductivity, obviously).

This is a prototype that might work:

species = sc.DataGroup({'jeff': sc.DataArray(data=sc.array(values=[[0, 1], [2, 3], [4, 5]], dims=['particles', 'atom']), coords={'charges': sc.ones(dims=['particles'], shape=(3,))}),
                        'amy': sc.DataArray(data=sc.array(values=[[6, 7, 8], [9, 10, 11], [12, 13, 14]], dims=['particles', 'atom']), coords={'charges': sc.ones(dims=['particles'], shape=(3,)) * 2})})

If the user passes a string as the specie, kinisi should build this sort of object and work with it.

arm61 avatar Oct 16 '25 09:10 arm61

This would lead to a 2.1 release.

arm61 avatar Oct 16 '25 09:10 arm61

Waiting on #185

arm61 avatar Oct 16 '25 09:10 arm61

Breaking down:

  • [ ] Species class
  • [ ] Data group for different species
  • [ ] refactor of internals to do the center of mass
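
For the centre-of-mass item, a minimal sketch of what the refactored internals would need to compute for one frame, given indices shaped like the prototype above (one row of atom indices per particle). The function name and argument layout are hypothetical, plain Python lists stand in for scipp arrays, and the positions are assumed to be already unwrapped (no periodic-boundary handling):

```python
def centers_of_mass(positions, indices, masses):
    """Return one centre of mass per particle for a single frame.

    positions: list of (x, y, z) tuples, one per atom in the system
    indices:   list of atom-index lists, one per particle
    masses:    masses of the atoms within one particle, in index order
    """
    total = sum(masses)
    coms = []
    for atoms in indices:
        # Mass-weighted mean of the member atoms, computed per axis.
        com = tuple(
            sum(m * positions[i][axis] for i, m in zip(atoms, masses)) / total
            for axis in range(3)
        )
        coms.append(com)
    return coms


# Two diatomic particles with equal masses (made-up coordinates).
positions = [(0.0, 0.0, 0.0), (2.0, 0.0, 0.0), (0.0, 0.0, 0.0), (0.0, 4.0, 0.0)]
print(centers_of_mass(positions, [[0, 1], [2, 3]], [1.0, 1.0]))
```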

arm61 avatar Oct 16 '25 10:10 arm61

Few considerations that came to my mind.

  • we could call Specie a ParticleGroup ?
  • how to treat e.g. dynamic charges that come from the trajectory data?

Issue with data availability.

We were thinking about these

Species.from_type("Li")              # Lookup from database
Species.from_indices([0,1,2], ...)   # Manual specification  
Species.from_data_array(...)         # Already-structured data

but the issue is that the internal state of these objects would differ a lot: for from_type we can't get to the indices without going through the data, so we would have to store all these internals in some obfuscated way. An alternative would be to have different classes:

class ParticleGroup: ...

class AtomicSpecies:
    element: str
    charge: float|None

class MolecularSpecies:
   indices: list[int]
   charges: float | list[float] | None

and internally convert them via AtomicSpecies(...).to_particle_group(data: ...) -> ParticleGroup, where data would be some trajectory data from which we can get the indices for AtomicSpecies.

For the actual configuration setup, we would still have a union type, e.g. particle_groups = {"Li": AtomicSpecies("Li"), "CO32-": MolecularSpecies(indices=[[1, 2, 3, 4], ...], charges=-2), "H2O": ParticleGroup(...)}, where ParticleGroup could (or should) be a scipp object.

or written out

config = {
    "particle_groups": {
        "Li": AtomicSpecies("Li", charge=1.0),
        "CO3": MolecularSpecies(
            indices=[[1, 2, 3, 4], [5, 6, 7, 8]],  # 2 molecules
            masses=[12.01, 16.0, 16.0, 16.0],
            charges=[-2.0, -0.8, -0.6, -0.6]  # Per-atom
        ),
        "H2O": MolecularSpecies(
            indices=[[9, 10, 11], [12, 13, 14]],
            masses=[16.0, 1.008, 1.008],
            charges=None
        ),
    }
}
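
A minimal sketch of how the conversion step could look, using plain dataclasses instead of scipp so that it stays self-contained. Everything here is an assumption for illustration rather than kinisi API: the `ParticleGroup` fields, passing the trajectory's element symbols as a plain list, and treating a scalar charge as the net charge per molecule while a per-atom list is summed to a net charge.

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class ParticleGroup:
    """Resolved form: explicit atom indices and an optional net charge per particle."""
    indices: list[list[int]]
    charges: Optional[list[float]] = None


@dataclass
class AtomicSpecies:
    element: str
    charge: Optional[float] = None

    def to_particle_group(self, symbols: list[str]) -> ParticleGroup:
        # The indices are only known once the trajectory's symbols are available.
        indices = [[i] for i, s in enumerate(symbols) if s == self.element]
        if not indices:
            raise ValueError(f"no atoms of type {self.element!r} found")
        charges = None if self.charge is None else [self.charge] * len(indices)
        return ParticleGroup(indices, charges)


@dataclass
class MolecularSpecies:
    indices: list[list[int]]
    charges: Union[float, list[float], None] = None

    def to_particle_group(self, symbols: Optional[list[str]] = None) -> ParticleGroup:
        # Indices are given up front, so the trajectory data is not needed.
        if self.charges is None:
            charges = None
        elif isinstance(self.charges, (int, float)):
            # Assumed meaning: one net charge shared by every molecule.
            charges = [float(self.charges)] * len(self.indices)
        else:
            # Assumed meaning: per-atom charges, summed to a net charge.
            charges = [float(sum(self.charges))] * len(self.indices)
        return ParticleGroup(self.indices, charges)


symbols = ["Li", "C", "O", "O", "O", "Li"]  # made-up six-atom system
print(AtomicSpecies("Li", charge=1.0).to_particle_group(symbols))
print(MolecularSpecies(indices=[[1, 2, 3, 4]], charges=-2.0).to_particle_group())
```

The point of the split is that both user-facing classes stay small and declarative, and only `ParticleGroup` has to carry the resolved indices.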

PythonFZ avatar Oct 16 '25 11:10 PythonFZ

I had it in my head that this:

Species.from_type("Li")              # Lookup from database
Species.from_indices([0,1,2], ...)   # Manual specification  
Species.from_data_array(...)         # Already-structured data

would be internal. So, if the user provided a string, kinisi would do the lookup (which is how it works currently). What is the reason not to use that approach?

arm61 avatar Oct 21 '25 12:10 arm61

So, if the user provided a string, kinisi would do the lookup (which is how it works currently). What is the reason not to use that approach?

What would that look like in the input dict?

PythonFZ avatar Oct 21 '25 17:10 PythonFZ

The same as it is at the moment, but specie_indices would go away and one could pass the data array directly as species. This might be different from what @jd15489 had in mind though.

arm61 avatar Oct 23 '25 08:10 arm61

Could you give a dict example?

From

molecules = [[288, 289, 290, 291, 292, 293],
             [284, 295, 296, 297, 298, 299]]
params = {
   'specie': None,
   'specie_indices': sc.array(dims=['particle', 'atoms in particle'], values=molecules, unit=sc.Unit('dimensionless')),
...
}

to

molecules = [[288, 289, 290, 291, 292, 293],
             [284, 295, 296, 297, 298, 299]]
params = {
   "particle_groups": {"mol": sc.array(dims=['particle', 'atoms in particle'], values=molecules, unit=sc.Unit('dimensionless'))}
...
}

OR

molecules = [[288, 289, 290, 291, 292, 293],
             [284, 295, 296, 297, 298, 299]]
params = {
   "particle_groups": {"mol": "Li"}  # should the Union str|sc.array be used to infer the type? Have it implicit instead of explicit.
...
}

How would the charges be provided? Should "Li" also be an sc.array?

Small note on the suggested API above: the charge and mass should be sc.scalar or sc.array, respectively

AtomicSpecies("Li", charge=sc.scalar(1, unit='charge'))
MolecularSpecies(..., charges=sc.array(...), masses=sc.array(...))

Having classes like AtomicSpecies would also allow us to introduce further logic that the union type can't express, e.g. MoleculeSpeciesFromSmiles(smiles="CCO") to compute the diffusion for all molecules that match the SMILES string CCO, i.e. ethanol.

PythonFZ avatar Oct 23 '25 08:10 PythonFZ

Ahhh, I wasn't thinking about charges. Sorry, not enough sleep evidently. Let me think about it this week.

arm61 avatar Oct 28 '25 11:10 arm61

I have been thinking about this.

My thought is that we should consolidate these arguments to Parser: [coords, specie_indices, drift_indices, & masses]. Instead, we would pass two lists of ParticleGroup objects, one for diffusion species and one for drift species.

We then change the Parser subclasses to handle these ParticleGroup objects. For example, MDAnalysisParser should have methods for creating ParticleGroup objects from the universe and the user's input. This subParser would also accept ParticleGroup objects that don't contain coords, and pass them through adding coords along the way.

This might be a substantial rewrite, but I think it adds flexibility and reduces the number of objects we have to manually pass between classes.

jd15489 avatar Nov 04 '25 10:11 jd15489

Some notes on the above implementation, from a whiteboard conversation between @jd15489 and me (SubParser is the inherited Parser, i.e., MDAnalysisParser).

Possible input types are:

  • str
  • list (backwards compatibility for specie_indices; both this list input and the specie_indices keyword would be immediately deprecated, for removal in a future point release)
  • new ParticleGroup which is a superclass of sc.DataGroup

In the SubParser, these are constructed into a ParticleGroup; the SubParser has access to the trajectory, so this is fine. This approach would mean that a ParticleGroup could be constructed with a NumPy array as the coords, which is an outstanding issue.

This ParticleGroup should be a sc.DataArray; if there are two molecule types with different numbers of atoms, these will be two ParticleGroups stored together in a sc.DataGroup.
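
A minimal sketch of the normalisation step the SubParser would perform on the three input types listed above. The `as_particle_group` helper, the simplified `ParticleGroup` stand-in (a dataclass rather than the proposed scipp object), and the use of a plain list of element symbols in place of the trajectory are all hypothetical:

```python
import warnings
from dataclasses import dataclass


@dataclass
class ParticleGroup:
    # Hypothetical stand-in; the notes above propose a scipp-based object.
    indices: list[list[int]]


def as_particle_group(specie, symbols):
    """Turn a str, a list of indices, or a ParticleGroup into a ParticleGroup."""
    if isinstance(specie, ParticleGroup):
        return specie  # already in the target form, pass through unchanged
    if isinstance(specie, str):
        # The sub-parser has access to the trajectory, so it can do the lookup.
        return ParticleGroup([[i] for i, s in enumerate(symbols) if s == specie])
    if isinstance(specie, list):
        warnings.warn(
            "passing index lists is deprecated; pass a ParticleGroup instead",
            DeprecationWarning,
            stacklevel=2,
        )
        # Accept flat [0, 2] (one atom per particle) or nested [[0, 1], [2, 3]].
        return ParticleGroup(
            [list(p) if isinstance(p, (list, tuple)) else [p] for p in specie]
        )
    raise TypeError(f"unsupported specie type: {type(specie).__name__}")


symbols = ["Li", "O", "Li"]  # made-up three-atom system
print(as_particle_group("Li", symbols))
```

Keeping the dispatch in one place means the rest of the parser only ever sees `ParticleGroup` objects, which is what makes the deprecated list path easy to remove later.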

arm61 avatar Nov 06 '25 12:11 arm61