Possible additions and modifications for matbench v1.0
migrated from https://github.com/hackingmaterials/automatminer/issues/294
Additions
- Composition/UV-Vis measurements from 179072 metal oxides from here
- ~~165 thermoelectric zT measurements at 300 K (ID 150888) mentioned here, seemingly gathered from here~~
- ~~160 thermoelectric power factor (PF) measurements compared with experiment (ID: 152463) from here~~
- `ucsb_thermoelectrics`, which was recently added to matminer (2021.06.07)
- 786 materials, 26 with Curie point above 400 K, for 2D ferromagnets from here
- ~~`expt_formation_enthalpy` from `matminer.datasets` - @rkingsbury has already done a lot of work on cleaning this for another project!~~
- `expt_formation_enthalpy_kingsbury` from matminer
- Composition/hardness (HV) set from @sgbaird
- Ternary nitrides set known as `tholander_nitrides` in matminer
Amendments
- `dielectric`: remove n = 1 entries, which have since been deprecated/removed by MP
- `expt_gap` and `expt_is_metal`: the Cd2Te entry should be examined
- `matbench_mp_*` datasets should be remade to reflect the SCAN functional and new energies
- remove `log10(*_VRH) = 0.0` (`*_VRH = 1.0`) entries from the `matbench_log_*_vrh` datasets, as per @CompRhys's comment; update the data with the newest from MP
- remove isolated atoms from MP datasets, as per @CompRhys's comment
- Update `steels_yield` to include more information about the samples (are they multiphase or single-phase austenite?) as well as the temperature as an extra variable for prediction (for more info, see original source dataset no. 153092 and potential predecessor dataset no. 114165)
Structural changes
- create tests that reflect the use case: e.g., `jdft2d` should be evaluated on its ability to identify materials with low exfoliation energy, `glass` should be organized according to chemical space, etc.
- formation energy validation should be changed to reflect this article
- allow for use of extra features (which can and should be used for prediction, esp. for experimental samples, e.g. temperature in an updated `steels_yield`)
Evaluation changes
- Include LOCO-CV or some other method of evaluating on held-out clusters/groups (see the sketch below)
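A minimal sketch of what LOCO-CV could look like, assuming composition- or structure-derived features in `X`; the cluster count, model, and random data here are purely illustrative, not part of any matbench protocol:

```python
# Sketch of leave-one-cluster-out CV (LOCO-CV): cluster the feature space,
# then hold out one whole cluster per fold instead of using a random split.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

def loco_cv(X, y, n_clusters=5, seed=0):
    # Assign each sample to a feature-space cluster.
    groups = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    maes = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = RandomForestRegressor(random_state=seed).fit(X[train_idx], y[train_idx])
        maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return np.mean(maes), np.std(maes)

# Example call with random data just to show the signature:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
print(loco_cv(X, y))
```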
Major extras and/or new benchmarks
- Adaptive design / in-the-loop benchmarks
- Structure generation
- Include time of evaluation as a metric (as extremely expensive algorithms can be counterproductive, especially for in-the-loop methods)
- Addition of new, more "industrial" datasets including more diverse classes of materials, e.g. MOFs, as well as degrees of freedom or extra information (e.g., temperature)
@janosh and I have been looking carefully at the `log_kvrh` and `log_gvrh` datasets and there are a few edge cases we found. There are some materials in both of these datasets where the relevant moduli are zero, i.e. they're incompressible fluids?
We stumbled across these zero-modulus materials whilst trying to debug a CGCNN implementation with a small cut-off radius (~4 Å) and found that in some structures none of the sites had any neighbours within this cut-off. Strangely, the fact that all the atoms were isolated did not cause our model to crash (the model does crash if a single atom is isolated, which is what we were initially looking at). As 4 Å is the default in MEGNet (which has a benchmark result for these datasets), it might be important for the structure-based datasets to specify a minimum cut-off radius to be able to get valid crystal graphs for all the entries.
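For reference, a quick way to flag such structures is sketched below, using pymatgen's neighbour lookup; the 4 Å cutoff just mirrors MEGNet's default and the column name `structure` is the matminer convention:

```python
# Flag structures in which at least one site has no neighbours within the
# graph cutoff radius - these produce invalid crystal graphs.
from matminer.datasets import load_dataset

def isolated_sites(structure, cutoff=4.0):
    neighbors = structure.get_all_neighbors(cutoff)
    return [i for i, nbrs in enumerate(neighbors) if len(nbrs) == 0]

df = load_dataset("matbench_log_kvrh")
for idx, structure in df["structure"].items():
    bad = isolated_sites(structure)
    if bad:
        print(idx, structure.composition.reduced_formula, "isolated sites:", bad)
```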
@CompRhys @janosh it's great you noticed this! Could you paste in the matbench IDs where this is the case? I believe they were checked, but I may have missed some strange cases.
Strangely, we did not run into the same issues with CGCNN/MEGNet (at least to the best of my knowledge right now, @Qi-max actually ran the *graphnet training). Let's investigate this more?
@ardunn The indices of entries with zero bulk modulus in `log_kvrh` (14 in total) are
1149, 1163, 2116, 2186, 3851, 4776, 4816, 4822, 6631, 8446, 9420, 10024, 10676, 10912
and with zero shear modulus in `log_gvrh` (31 in total):
58, 1149, 1163, 1282, 1440, 1548, 1931, 2116, 2186, 2221, 2729, 4659, 4776, 4816, 4820, 4822, 6032, 6631, 6632, 8231, 8377, 9107, 9420, 9440, 9458, 9550, 9723, 9762, 9978, 10024, 10912
Here's the Colab notebook that looks at the data:
https://colab.research.google.com/drive/19QOM8i8ScM1fQGAt53SIMIEG6gn09RjN
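For anyone without the notebook handy, a minimal sketch that reproduces the check (assuming the matminer target columns are named `log10(K_VRH)` and `log10(G_VRH)`):

```python
# Entries whose log10 modulus target is exactly zero in the two elasticity datasets.
from matminer.datasets import load_dataset

for name, target in [("matbench_log_kvrh", "log10(K_VRH)"),
                     ("matbench_log_gvrh", "log10(G_VRH)")]:
    df = load_dataset(name)
    zero_idx = df.index[df[target] == 0].tolist()
    print(name, len(zero_idx), zero_idx)
```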
@ardunn It would be nice if matbench datasets had a 3rd column `source_id` as that would make it much easier to connect the composition/structure to other available properties.
> @ardunn It would be nice if matbench datasets had a 3rd column `source_id` as that would make it much easier to connect the composition/structure to other available properties.

+1 for this!
> * create tests that reflect use case: e.g., jdft2d should be evaluated on ability to identify mats with low exfoliation energy, glass should be organized according to chemical space, etc.
> * formation energy validation should be changed to reflect [this article](https://www.nature.com/articles/s41524-020-00362-y)
Both great ideas; once we have sample-level predictions for the leaderboard there is plenty of nice exploration that could be done here, with automatic ranking against task-specific metrics as well as the boring standard ones.
We made a mistake here - the data is the log modulus, so a log modulus of zero corresponds to a modulus of 1 GPa, which isn't unphysical (or at least not in the way I thought). However, it might be worth excluding them anyway, going by the original workflow manuscript (https://www.nature.com/articles/sdata20159):
> Conditions i) and ii) are selected based on an empirical observation that the most compliant known metals have shear and bulk moduli larger than approximately 2 GPa. Hence if our calculations yield results below 2 GPa for either the Reuss averages [50] (a lower bound estimate) of K or G, these results might be correct but deserve additional attention.
The point about minimum radius to ensure that there are no isolated atoms still stands. There are some problematic examples from MP that could potentially be in the MB MP datasets i.e. https://materialsproject.org/materials/mp-1093989/
> The point about minimum radius to ensure that there are no isolated atoms still stands. There are some problematic examples from MP that could potentially be in the MB MP datasets i.e. https://materialsproject.org/materials/mp-1093989/

I agree.
> @ardunn It would be nice if matbench datasets had a 3rd column `source_id` as that would make it much easier to connect the composition/structure to other available properties.
In principle I don't have any problem adding these, but ...
I need to think some more about this, because these datasets serve as a "snapshot" of various online repositories such as MP. So for MP entries, a specific property is tied to a specific computation (not an mp-id per se), and MP is continuously updating their computations. For example, I think many of the energies gathered for the `mp_e_form` dataset have changed in MP; so if someone were to look at a matbench dataset, see `mp-XXX`, and then go to MP to see more properties, they would find a difference. So it will need to be made very clear to everyone that the numbers in MB are from a specific task-id (or a specific date), and MPID != MBID.
@CompRhys @ml-evs @janosh I think the best way forward is this:
- The current matbench datasets + benchmarking procedure (v0.1, as I've been calling it) will remain as they are, even with the possibly unphysical log K/G entries and lack of source-ids. This is to maintain provenance with the paper.
- The infrastructure I'm building here is extensible to more datasets and benchmarking procedures and will be fairly easy to extend; the suggestions in this thread will be incorporated into matbench v1.
@ardunn There appears to be a typo in the `matbench_mp_e_form` description. The cutoff energy is said to be at 3 eV:

> Removed entries having formation energy more than 3.0eV and those containing noble gases.
Based on this histogram, it actually appears to be 2.5 eV:
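(The histogram itself isn't reproduced here, but the same check can be made directly on the dataframe; a minimal sketch, assuming the target column is named `e_form`:)

```python
# Quick check of the de facto formation-energy cutoff in matbench_mp_e_form.
from matminer.datasets import load_dataset

df = load_dataset("matbench_mp_e_form")
print(df["e_form"].max())  # largest retained formation energy, ~2.5 eV per the histogram
```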
@janosh you are right! Thanks for noticing. I think 2.5 eV was needed to remove the ~1500 1-D misconverged half-Heuslers in MP at the time of collection. I think it has since been corrected with MP's SCAN workflow, so I will fix the description, and in the next update we can rethink the energy cutoff without worrying about these 1500 entries; predicting highly unphysical entities should be one of the goals of the MP `e_form` set.
Also, an update to this thread: @rkingsbury has cleaned and created a nice `expt_formation_enthalpy` dataset with corresponding MPIDs (i.e., source IDs) cross-referenced from ICSD's experimental DB and corroborated with MP's convex hulls. We are planning on adding his raw dataset to matminer, at which time we can start creating a matbench dataset with it for matbench v1.0!
Also, @rkingsbury has similarly added MP source IDs to the expt_gaps dataset, but I have yet to add them or incorporate them here. Thanks, Ryan!
Happy to contribute, @ardunn . See https://github.com/hackingmaterials/matminer/pull/602 for the new datasets.
@ardunn `matbench_perovskites` shows an outlier (mb-perovskites-00701, contrib ID: 5f6953e517892ff2440e9d0c) with `e_form` of 760 eV in the interactive view. Perhaps you're already aware, since the dataframe returned by `matminer.datasets.load_dataset('matbench_perovskites')` lists the same entry with 0.76 eV.
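A minimal sketch to verify the matminer copy (the target column name `e_form` is assumed from the matbench conventions):

```python
# Confirm the matminer copy of matbench_perovskites has no ~760 eV outlier.
from matminer.datasets import load_dataset

df = load_dataset("matbench_perovskites")
print(df["e_form"].nlargest(3))  # largest formation energies should be O(1) eV, not O(100) eV
```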
Hey @janosh, thanks for finding this! Something must have gone wrong in the upload script for the perovskites. I'm pinging @tschaume in case he has an easy way to fix this single entry and can explain how(?) the order of magnitude got changed.
@janosh that's a great catch! You found the ONE entry in `matbench_perovskites` saved with `meV` as the unit instead of `eV`. Might be a remnant of a previous upload that failed to overwrite. It's fixed now, though.
An update to this thread: @rkingsbury's datasets and @janosh's suggestion for a new version of Ricci et al.'s BoltzTraP dataset have been added to matminer. The typo @janosh mentioned above in `matbench_mp_e_form` has been fixed.
I am hesitant to use the BoltzTraP dataset as an addition to matbench for multiple reasons (thank you @janosh for creating it though, it saved me a lot of time :D).
I think both of the Kingsbury datasets can be incorporated into matbench sometime in the future.
Composition/hardness dataset (~1000 points) scraped from the literature. GitHub (see `hv_comp_load.xlsx`), paper.
@sgbaird great! We should include this on matminer first as a fully fledged dataset.
It's probably not pragmatic to add every single dataset, but there may be some that would be well-suited for matbench (disclaimer: my research group compiled these datasets together). https://github.com/anhender/mse_ML_datasets/issues/2
Henderson, A. N.; Kauwe, S. K.; Sparks, T. D. Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics. Data in Brief 2021, 37, 107262. https://doi.org/10.1016/j.dib.2021.107262. https://github.com/anhender/mse_ML_datasets
See discussion on a stability dataset in https://github.com/materialsproject/matbench/issues/104
> It's probably not pragmatic to add every single dataset, but there may be some that would be well-suited for matbench (disclaimer: my research group compiled these datasets together). anhender/mse_ML_datasets#2
>
> Henderson, A. N.; Kauwe, S. K.; Sparks, T. D. Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics. Data in Brief 2021, 37, 107262. https://doi.org/10.1016/j.dib.2021.107262. https://github.com/anhender/mse_ML_datasets
I think we had previously discussed including some of these datasets into matbench with Prof. Sparks, though I never got around to actually doing it.
In practice, these datasets would likely be their own separate benchmark (e.g., "Matbench Option 2" or something) since the matbench website/code is already extensible to any number of benchmarks with similar format. We just need to decide on which benchmark datasets and evaluation criteria are actually needed for a new benchmark.
> In practice, these datasets would likely be their own separate benchmark (e.g., "Matbench Option 2" or something) since the matbench website/code is already extensible to any number of benchmarks with similar format. We just need to decide on which benchmark datasets and evaluation criteria are actually needed for a new benchmark.
Ok, good to know! It sounds like the worry isn't so much about having too many datasets contained in matbench and more about keeping them organized/compartmentalized as the number increases, correct?
> Ok, good to know! It sounds like the worry isn't so much about having too many datasets contained in matbench and more about keeping them organized/compartmentalized as the number increases, correct?
Yeah, that is mostly true. We do want to keep the benchmarks generally minimal though. But I have no problem with adding another separate benchmark with its own 1-~20 tasks or so.
What we should aim for is to craft a set of tasks to most accurately reflect the breadth of our field - in the fewest tasks possible. Something like "the most bang for your buck".
> Yeah, that is mostly true. We do want to keep the benchmarks generally minimal though. But I have no problem with adding another separate benchmark with its own 1-~20 tasks or so.
>
> What we should aim for is to craft a set of tasks to most accurately reflect the breadth of our field - in the fewest tasks possible. Something like "the most bang for your buck".
Ok, I think I'm on the same page, and I like the phrasing "most accurately reflect the breadth of our field - in the fewest tasks possible". A collection of adaptive design tasks from the literature seems pretty compelling to me (`matbench_adapt` or something like that), such as the two tasks from the paper you mentioned in https://github.com/sparks-baird/mat_discover/discussions/44#discussioncomment-2129894. If this kind of benchmark were already available, I'm pretty sure I'd be running `mat_discover` on all the ones that I could 😅.
The two tasks you mentioned fall into the category of "real data in a predefined list", as opposed to continuous or semi-continuous validation functions like the tests you did on Branin/Rosenbrock/Hartmann. It's been on my mind a lot whether there's a continuous, inexpensive validation function that would mimic a true materials science objective well enough. I've seen cases where people used one of their trained neural network models as the "true" function, but I couldn't help but feel a bit suspicious.
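To make the "continuous validation function" idea concrete, here is a minimal sketch of a sequential-learning loop on the Branin test function; this is purely illustrative (greedy acquisition with a GP surrogate), not any particular matbench or mat_discover protocol:

```python
# Toy sequential-learning loop on the Branin function: fit a surrogate on the
# points evaluated so far, then query the candidate with the lowest predicted value.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def branin(x1, x2):
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5 / np.pi
    r, s, t = 6.0, 10.0, 1 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

rng = np.random.default_rng(0)
candidates = np.column_stack([rng.uniform(-5, 10, 2000), rng.uniform(0, 15, 2000)])
observed = list(rng.choice(len(candidates), 5, replace=False))  # random seed points

for _ in range(20):
    X = candidates[observed]
    y = branin(X[:, 0], X[:, 1])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    pred = gp.predict(candidates)
    pred[observed] = np.inf                # don't re-query known points
    observed.append(int(np.argmin(pred)))  # greedy exploitation for brevity

best = min(branin(candidates[i, 0], candidates[i, 1]) for i in observed)
print("best value found:", best)  # Branin's global minimum is ~0.3979
```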
There's the somewhat unrealistic alternative of: why not just use the true, expensive DFT calculation? I've played around with the idea of whether or not `matbench` could integrate with some paid-compute service (e.g. AWS, paid by the submitter of the algorithm, of course) so that it's doing a real DFT simulation in a much larger candidate space, i.e. the "benchmark" produces real iterations.
Yeah, having matbench integrate with some DFT-in-the-loop option might be nice. But at the same time I am trying to keep it relatively simple while still serving some useful purpose. A benchmark that is difficult to understand or highly stochastic is not the goal. Definitely warrants further thought though.
Three generative model benchmark datasets and some metrics introduced in http://arxiv.org/abs/2110.06197 (see section 5. Experiments)
> Tasks. We focus on 3 tasks for material generation. 1) Reconstruction evaluates the ability of the model to reconstruct the original material from its latent representation z. 2) Generation evaluates the validity, property statistics, and diversity of material structures generated by the model. 3) Property optimization evaluates the model's ability to generate materials that are optimized for a specific property.
Figured it was worth mentioning in this thread.
Would love to have Matbench for generative models. @ardunn @txie-93 and anyone else, thoughts? Playing around with the idea of forking `matbench` as `matbench-generative` with visualizations similar to those of http://arxiv.org/abs/2110.06197
Thanks, @sgbaird. I think it is totally possible to have a `matbench-generative`. We had 3 different tasks: 1) reconstruction; 2) generation; 3) property optimization. Not all existing generative models can perform all 3 tasks. From my perspective, most existing models can do 2), so it can be used as a main task for `matbench-generative`. Each model will generate 10,000 crystals and they can be evaluated using https://github.com/txie-93/cdvae/blob/main/scripts/compute_metrics.py. However, it would take some effort to port existing models into the same repo.
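For context, a hedged sketch of the structural-validity part of such generation metrics: the CDVAE paper counts a generated structure as structurally valid if all pairwise interatomic distances exceed 0.5 Å. This is an illustration of that idea, not the actual `compute_metrics.py` code:

```python
# Rough structural-validity check for generated crystals: a structure is counted
# as valid only if no two sites sit closer than a small distance threshold.
import numpy as np
from pymatgen.core import Lattice, Structure

def is_structurally_valid(structure: Structure, min_dist: float = 0.5) -> bool:
    dists = structure.distance_matrix
    # Ignore the zero diagonal (distance of each site to itself).
    off_diag = dists[~np.eye(len(structure), dtype=bool)]
    return bool(off_diag.min() > min_dist) if len(off_diag) else True

# Example on a simple rock-salt NaCl cell:
nacl = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.64), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)
print(is_structurally_valid(nacl))  # True
```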