cmlkit
cmlkit copied to clipboard
Edge case in Dataset.compute_dataset_info
This is a minor concern, especially if Dataset is being phased out. When creating a Dataset with only single-atom geometries, dists
is a list of empty arrays and min_distance
and max_distance
fail. A real-world example of this might be energy-volume curves with FCC primitive cells.
https://github.com/sirmarcel/cmlkit/blob/539021520de316e94c1490f957f923ab41bb18c3/cmlkit/dataset/dataset.py#L548-L552
Toy example:
from ase import Atoms
from cmlkit.dataset import Dataset
geometries = [Atoms('Au', cell=[3, 3, 3], pbc=True),
Atoms('Au', cell=[4, 4, 4], pbc=True),
Atoms('Au', cell=[5, 5, 5], pbc=True),]
Dataset.from_Atoms(geometries)
ValueError Traceback (most recent call last)
<ipython-input-64-25473c66c8a8> in <module>
5 Atoms('Au', cell=[3, 3, 3], pbc=True),
6 Atoms('Au', cell=[4, 4, 4], pbc=True),]
----> 7 Dataset.from_Atoms(geometries)
~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in from_Atoms(cls, atoms, p, name, desc, splits)
288 name=name,
289 desc=desc,
--> 290 splits=splits,
291 )
292
~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in __init__(self, z, r, b, p, name, desc, splits, _info, _hash, _geom_hash)
169 self.info = _info
170 else:
--> 171 self.info = self.get_info()
172
173 # compute auxiliary info that we need to convert properties
~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in get_info(self)
218 def get_info(self):
219 """Compute information on dataset."""
--> 220 return compute_dataset_info(self)
221
222 def get_hash(self):
~/PycharmProjects/cmlkit/cmlkit/dataset/dataset.py in compute_dataset_info(dataset)
549 qmmlpack.lower_triangular_part(qmmlpack.distance_euclidean(rr), -1) for rr in r
550 ]
--> 551 i["min_distance"] = min([min(d) for d in dists if len(d) > 0])
552 i["max_distance"] = max([max(d) for d in dists if len(d) > 0])
553
ValueError: min() arg is an empty sequence
Thanks for reporting! This is indeed an edge case, but I'll try to fix it when I overhaul Dataset
.