sisl
sisl copied to clipboard
MD return values -- what should they be
The problem is that we need to converge the returned values in some way.
Currently some methods returns
- lists
- collection
- geometrycollection
Now what should we do about multiple returns. See below for two ideas, which would we prefer.
Things to consider:
-
easy to work with
-
get performant code (ideally they could be
(md, ...)
arrays) -
consistency across the outputs
I've addressed all comments except this one as I am not sure we want/need to unpack. For instance, the current behaviour seems logical to me:
traj = stdoutSileVASP(f).read_trajectory[:3]()
for i in range(3):
print(traj[i].force)
# or
for step in traj:
print(step.force)
You would prefer this?
traj = stdoutSileVASP(f).read_trajectory[:3]()
for i in range(3):
print(traj.force[i])
# or
for F in traj.force:
print(F)
Originally posted by @tfrederiksen in https://github.com/zerothi/sisl/pull/573#discussion_r1235576392
List of actions:
- [x] #589 remove
Collection
and friends, users now need to manually write many geometries - [ ] Dataset attempt
@pfebrer your inputs here would be valuable, I know you have commented on something like this before, here a more targeted issue.
I think lists are easy and intuitive for multiple read_*
calls. As long as the individual functions are clear what a single call produces (Geometry
, PropertyDict
, ndarray
, etc) I would let the user do the postprocessing (or provide the postprocess
function). Are there any particular concerns with this?
I think the main idea about a container would be that it can allow some arbitrary functions.
For instance one can do:
gcoll = something.read_geometry[:]()
sub_gcoll = gcol.applymap(Geometry.sub, atoms=[1, 2, 3])
to extract some atoms for all geometries in a new collection.
The Collection was added in #546, should we revert it and simply return the list?
I guess I do not see a clear advantage, rather some additional complexity without a clear use case.
With lists one can achieve the mapping with
[g.sub([1, 2, 3]) for g in gcoll]
For trajectories I think the best would be to introduce an extra dimension (MD step) in arrays like the coordinates. It would make everything simpler and more efficient. But as you said perhaps sisl
shouldn't focus on supporting trajectories and it should be managed externally.
For general collections of geometries, I don't know, because I have no use case in mind. Unless you have a clear use case, I believe lists are fine for now. Whatever Collection
class that is implemented in the future will probably have all the functionality that a list has (i.e. it will be iterable, you will be able to retreive items using indices...) so the probability of backward compatibility problems is very low.
I agree that if the only purpose of this collection is to map functions it is probably not worth it. I would wait to have some more interesting usage in mind before converting this list to a custom class. The list can always be upgraded in the future, instead needing to "downgrade" from custom class to list will most likely cause problems.
So, in the first message of this issue, for trajectories I would prefer something similar to the second snippet. But perhaps that should come as an output of a method called read_trajectory
, e.g. read_trajectory[:]()
, not just the general read_geometry[:]()
.
Every time I deal with trajectories I like to organize my data in xarray.Dataset
s because they allow me to do lots of operations easily. So traj
in the snippet could be a Dataset
, then the amount of things you need to implement in sisl
gets very reduced. For example, with a Dataset
you could also easily loop over all the data by MD step as in the first snippet if you want.
Ok, I see. Lets delete the Collection class.
I can see the benefit of xarray, perhaps now is the time to step into that regime. The only problem with the dataset is when requesting data from MD + SCF, then there are variable SCF steps for each MD, and hence it might be a bit problematic in that case? There some other solution is necessary. Perhaps a list [MD] of datasets [SCF].
My proposal would then be to:
- read_geometry returns a list
- read_* returns a dataset where applicable, otherwise a list of dataset, datasets should ideally be complete with attributes describing origin, full details etc.
I didn't know, but xarray
can have sparse arrays as variables.
See the example here: https://docs.xarray.dev/en/stable/internals/duck-arrays-integration.html#integrating-with-duck-arrays
And in principle one can generate the sparse dataset from a dataframe: https://docs.xarray.dev/en/stable/generated/xarray.Dataset.from_dataframe.html#xarray.Dataset.from_dataframe
So that would be suitable for the SCF data.
When I have time I can try to play with it in outSileSiesta.read_scf
, because I don't know how it would actually work.
I am not sure that would be great going forward. That would mean that all scf data is contained in a sparse array, which isn't particularly fast.
I think it would be wiser to do a list of Dataset, for easier interaction. I might be wrong, but it just feels weird to use a sparse array to have consecutive data that is truncated at different levels...
Hmm I don't know, because with a list of Datasets it is not easy to do operations across MD steps.
I don't know if the sparse dataset is a good choice, but a pandas dataframe for example seems better than a list of datasets.
Perhaps one should just not be able to extract MD + SCF at the same time, either MD or SCF.
Hmm I don't know, I have used the outSileSiesta.read_scf(as_dataframe=True)
and it worked great for me.
Hmm.. I don't know, yeah probably fine with that then. :) We should just have some pretty simple rules so users don't need to read documentation all the time.