
MD return values -- what should they be

Open zerothi opened this issue 1 year ago • 13 comments

The problem is that we need to make the returned values consistent in some way.

Currently, some methods return

  • lists
  • collection
  • geometrycollection

Now, what should we do about multiple returns? See below for two ideas; which one would we prefer?

Things to consider:

  • easy to work with

  • get performant code (ideally they could be (md, ...) arrays)

  • consistency across the outputs

I've addressed all comments except this one, as I am not sure we want/need to unpack. For instance, the current behaviour seems logical to me:

traj = stdoutSileVASP(f).read_trajectory[:3]()
for i in range(3):
    print(traj[i].force)
# or
for step in traj:
    print(step.force)

You would prefer this?

traj = stdoutSileVASP(f).read_trajectory[:3]()
for i in range(3):
    print(traj.force[i])
# or
for F in traj.force:
    print(F)

Originally posted by @tfrederiksen in https://github.com/zerothi/sisl/pull/573#discussion_r1235576392

List of actions:

  • [x] #589 remove Collection and friends; users now need to handle multiple geometries manually
  • [ ] Dataset attempt

zerothi avatar Jun 20 '23 19:06 zerothi

@pfebrer, your input here would be valuable. I know you have commented on something like this before; here is a more targeted issue.

zerothi avatar Jun 20 '23 19:06 zerothi

I think lists are easy and intuitive for multiple read_* calls. As long as the individual functions are clear about what a single call produces (Geometry, PropertyDict, ndarray, etc.), I would let the user do the postprocessing (or provide a postprocessing function). Are there any particular concerns with this?

tfrederiksen avatar Jun 20 '23 19:06 tfrederiksen

I think the main idea behind a container would be that it allows applying arbitrary functions to all elements.

For instance one can do:

gcoll = something.read_geometry[:]()
sub_gcoll = gcoll.applymap(Geometry.sub, atoms=[1, 2, 3])

to extract some atoms for all geometries in a new collection.
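
For reference, such a container can be a very thin wrapper around a list. A minimal sketch (this is not the actual Collection from #546, just an illustration of the applymap idea):

from collections.abc import Sequence

class GeometryCollection(Sequence):
    """Thin list wrapper that can map a function over all contained items."""

    def __init__(self, items):
        self._items = list(items)

    def __getitem__(self, idx):
        return self._items[idx]

    def __len__(self):
        return len(self._items)

    def applymap(self, func, *args, **kwargs):
        # apply func(item, *args, **kwargs) to every item, returning a new collection
        return self.__class__(func(item, *args, **kwargs) for item in self._items)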

The Collection was added in #546; should we revert it and simply return a list?

zerothi avatar Jun 20 '23 19:06 zerothi

I guess I do not see a clear advantage, rather some additional complexity without a clear use case.

With lists one can achieve the mapping with

[g.sub([1, 2, 3]) for g in gcoll]

tfrederiksen avatar Jun 20 '23 20:06 tfrederiksen

For trajectories I think the best would be to introduce an extra dimension (MD step) in arrays like the coordinates. It would make everything simpler and more efficient. But as you said perhaps sisl shouldn't focus on supporting trajectories and it should be managed externally.
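
As a rough illustration of the extra-dimension idea (hypothetical variable names; it assumes the number of atoms is constant along the trajectory):

import numpy as np
import sisl

# in practice geoms would come from e.g. read_geometry[:](); dummy geometries here for illustration
geoms = [sisl.geom.graphene() for _ in range(3)]
step_forces = [np.zeros((g.na, 3)) for g in geoms]   # hypothetical per-step forces

# stacking introduces the MD-step dimension
xyz = np.stack([g.xyz for g in geoms])   # shape (md, na, 3)
forces = np.stack(step_forces)           # shape (md, na, 3)

# operations across MD steps become trivial, e.g. displacement relative to the first step
displacement = xyz - xyz[0]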

For general collections of geometries, I don't know, because I have no use case in mind. Unless you have a clear use case, I believe lists are fine for now. Whatever Collection class is implemented in the future will probably have all the functionality that a list has (i.e. it will be iterable, you will be able to retrieve items using indices...), so the probability of backward-compatibility problems is very low.

I agree that if the only purpose of this collection is to map functions, it is probably not worth it. I would wait until we have some more interesting usage in mind before converting this list to a custom class. The list can always be upgraded in the future, whereas needing to "downgrade" from a custom class to a list will most likely cause problems.

pfebrer avatar Jun 20 '23 21:06 pfebrer

So, in the first message of this issue, for trajectories I would prefer something similar to the second snippet. But perhaps that should come as an output of a method called read_trajectory, e.g. read_trajectory[:](), not just the general read_geometry[:]().

Every time I deal with trajectories I like to organize my data in xarray.Datasets, because they allow me to do lots of operations easily. So traj in the snippet could be a Dataset; then the amount of things that need to be implemented in sisl is greatly reduced. For example, with a Dataset you could also easily loop over all the data by MD step, as in the first snippet, if you want.
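
To make this concrete, a trajectory Dataset could look roughly like this (a sketch with made-up variable and dimension names, not an existing sisl API; dummy arrays stand in for parsed data):

import numpy as np
import xarray as xr

# dummy data for illustration: 3 MD steps, 2 atoms
xyz = np.zeros((3, 2, 3))
forces = np.zeros((3, 2, 3))
energies = np.zeros(3)

traj = xr.Dataset(
    data_vars={
        "xyz": (("md", "atom", "cart"), xyz),
        "force": (("md", "atom", "cart"), forces),
        "energy": ("md", energies),
    },
    coords={"md": np.arange(xyz.shape[0])},
)

# loop per MD step, as in the first snippet
for step in traj.md:
    print(traj.sel(md=step).force)

# or operate across all steps at once, as in the second snippet
max_force = np.abs(traj.force).max(dim=("atom", "cart"))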

pfebrer avatar Jun 20 '23 21:06 pfebrer

Ok, I see. Let's delete the Collection class.

I can see the benefit of xarray; perhaps now is the time to step into that regime. The only problem with a dataset is when requesting data from MD + SCF: the number of SCF steps varies for each MD step, which might be a bit problematic. In that case some other solution is necessary, perhaps a list [MD] of datasets [SCF].

My proposal would then be to:

  • read_geometry returns a list
  • read_* returns a dataset where applicable, otherwise a list of datasets; the datasets should ideally be complete, with attributes describing their origin, full details, etc. (roughly as sketched below)
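
A rough sketch of what the second point could look like for MD + SCF data (hypothetical variable and attribute names, not an actual sisl API):

import numpy as np
import xarray as xr

# hypothetical parser output: one dict of per-iteration arrays per MD step (dummy numbers here)
parsed_scf = [
    {"dE": np.array([1e-2, 1e-4]), "dDmax": np.array([1e-1, 1e-3])},
    {"dE": np.array([2e-2, 5e-4, 1e-5]), "dDmax": np.array([2e-1, 5e-3, 1e-4])},
]

# a list [MD] of datasets [SCF], each carrying attributes describing its origin
scf_data = [
    xr.Dataset(
        {"dE": ("iscf", arrays["dE"]), "dDmax": ("iscf", arrays["dDmax"])},
        attrs={"source": "stdout", "md_step": md_step},
    )
    for md_step, arrays in enumerate(parsed_scf)
]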

zerothi avatar Jun 21 '23 05:06 zerothi

I didn't know this, but xarray can have sparse arrays as variables.

See the example here: https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html#integrating-with-duck-arrays

And in principle one can generate the sparse dataset from a dataframe: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.from_dataframe.html#xarray.Dataset.from_dataframe

So that would be suitable for the SCF data.
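
For instance, starting from a DataFrame with a (MD step, SCF step) MultiIndex, something along these lines should give a sparse Dataset (a sketch with made-up column names; it needs the sparse package installed):

import pandas as pd
import xarray as xr

# hypothetical SCF records with a varying number of SCF iterations per MD step
records = [
    {"md": 0, "iscf": 0, "dE": 1.0e-2},
    {"md": 0, "iscf": 1, "dE": 1.0e-4},
    {"md": 1, "iscf": 0, "dE": 2.0e-2},
]
df = pd.DataFrame(records).set_index(["md", "iscf"])

# missing (md, iscf) combinations become fill values in a sparse COO array
# instead of a densely padded array
ds = xr.Dataset.from_dataframe(df, sparse=True)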

When I have time I can try to play with it in outSileSiesta.read_scf, because I don't know how it would actually work.

pfebrer avatar Jun 21 '23 09:06 pfebrer

I am not sure that would be great going forward. It would mean that all SCF data is contained in a sparse array, which isn't particularly fast.

I think it would be wiser to use a list of Datasets, for easier interaction. I might be wrong, but it just feels weird to use a sparse array to hold consecutive data that is truncated at different lengths...

zerothi avatar Jun 21 '23 09:06 zerothi

Hmm I don't know, because with a list of Datasets it is not easy to do operations across MD steps.

I don't know if the sparse dataset is a good choice, but a pandas DataFrame, for example, seems better than a list of datasets.
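
As an illustration of why a single DataFrame is convenient here, operations across MD steps become simple groupby calls on a (md, iscf) MultiIndex (made-up column names again):

import pandas as pd

# dummy SCF data indexed by (md, iscf), as in the sketch above
df = pd.DataFrame(
    {"md": [0, 0, 1], "iscf": [0, 1, 0], "dE": [1e-2, 1e-4, 2e-2]}
).set_index(["md", "iscf"])

n_scf_per_md = df.groupby(level="md").size()     # SCF iterations per MD step
final_dE = df.groupby(level="md")["dE"].last()   # last (converged) dE of every MD step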

pfebrer avatar Jun 21 '23 10:06 pfebrer

Perhaps one should simply not be able to extract MD + SCF at the same time: either MD or SCF.

zerothi avatar Jun 21 '23 10:06 zerothi

Hmm I don't know, I have used the outSileSiesta.read_scf(as_dataframe=True) and it worked great for me.

pfebrer avatar Jun 21 '23 10:06 pfebrer

Hmm.. I don't know, yeah probably fine with that then. :) We should just have some pretty simple rules so users don't need to read documentation all the time.

zerothi avatar Jun 21 '23 10:06 zerothi