Possible additions and modifications for matbench v1.0
migrated from https://github.com/hackingmaterials/automatminer/issues/294
Additions
- Composition/UV-Vis measurements from 179072 metal oxides from here
- ~~165 thermoelectric zT measurements at 300 K (ID 150888) mentioned here, seemingly gathered from here~~
- ~~160 thermoelectric power factor (PF) measurements compared with experiment (ID: 152463) from here~~
- `ucsb_thermoelectrics`, which was recently added to matminer (2021.06.07)
- 786 materials, 26 with Curie point above 400 K, for 2D ferromagnets from here
- ~~`expt_formation_enthalpy` from `matminer.datasets` - @rkingsbury has already done a lot of work on cleaning this for another project!~~
- `expt_formation_enthalpy_kingsbury` from matminer
- Composition/hardness (HV) set from @sgbaird
- Ternary nitrides set known as `tholander_nitrides` in matminer
Amendments
- `dielectric`: remove n = 1 entries, which have since been deprecated/removed by MP
- `expt_gap` and `expt_is_metal`: the Cd2Te entry should be examined
- `matbench_mp_*` datasets should be remade to reflect the SCAN functional and new energies
- remove `log10(*_VRH) = 0.0` (`*_VRH = 1.0`) entries from the `matbench_log_*_vrh` datasets, as per @CompRhys's comment; update the data with the newest from MP
- remove isolated atoms from MP datasets, as per @CompRhys's comment
- Update `steels_yield` to include more information about the samples (are they multiphase or single-phase austenite?) as well as the temperature as an extra variable for prediction (for more info, see original source dataset no. 153092 and potential predecessor dataset no. 114165)
Structural changes
- create tests that reflect the use case: e.g., `jdft2d` should be evaluated on its ability to identify materials with low exfoliation energy, `glass` should be organized according to chemical space, etc.
- formation energy validation should be changed to reflect this article
- allow for use of extra features (which can and should be used for prediction, esp. for experimental samples, e.g. temperature in an updated `steels_yield`)
Evaluation changes
- Include LOCO-CV or some other method of evaluating on held-out clusters/groups (see the sketch below)
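A minimal sketch of what LOCO-CV could look like, assuming composition- or structure-derived features in `X`; the cluster count, model, and random data here are purely illustrative, not part of any matbench protocol:

```python
# Sketch of leave-one-cluster-out CV (LOCO-CV): cluster the feature space,
# then hold out one whole cluster per fold instead of using a random split.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

def loco_cv(X, y, n_clusters=5, seed=0):
    # Assign each sample to a feature-space cluster.
    groups = KMeans(n_clusters=n_clusters, random_state=seed).fit_predict(X)
    maes = []
    for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups):
        model = RandomForestRegressor(random_state=seed).fit(X[train_idx], y[train_idx])
        maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))
    return np.mean(maes), np.std(maes)

# Example call with random data just to show the signature:
rng = np.random.default_rng(0)
X, y = rng.normal(size=(200, 10)), rng.normal(size=200)
print(loco_cv(X, y))
```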
Major extras and/or new benchmarks
- Adaptive design / in-the-loop benchmarks
- Structure generation
- Include time of evaluation as a metric (as extremely expensive algorithms can be counterproductive, especially for in-the-loop methods)
- Addition of new, more "industrial" datasets including more diverse classes of materials, e.g. MOFs, as well as degrees of freedom or extra information (e.g., temperature)
@janosh and I have been looking carefully at the `log_kvrh` and `log_gvrh` datasets and there are a few edge cases we found. There are some materials in both of these datasets where the relevant moduli are zero, i.e. they're incompressible fluids?
We stumbled across these zero-modulus materials whilst trying to debug a CGCNN implementation with a small cut-off radius (~4 Å) and found that in some structures none of the sites had any neighbours within this cut-off. Strangely, the fact that all the atoms were isolated did not cause our model to crash (the model does crash if a single atom is isolated, which is what we were initially looking at). As 4 Å is the default in MEGNet (which has a benchmark result for these datasets), it might be important for the structure-based datasets to specify a minimum cut-off radius to be able to get valid crystal graphs for all the entries.
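For reference, a quick way to flag such structures is sketched below, using pymatgen's neighbour lookup; the 4 Å cutoff just mirrors MEGNet's default and the column name `structure` is the matminer convention:

```python
# Flag structures in which at least one site has no neighbours within the
# graph cutoff radius - these produce invalid crystal graphs.
from matminer.datasets import load_dataset

def isolated_sites(structure, cutoff=4.0):
    neighbors = structure.get_all_neighbors(cutoff)
    return [i for i, nbrs in enumerate(neighbors) if len(nbrs) == 0]

df = load_dataset("matbench_log_kvrh")
for idx, structure in df["structure"].items():
    bad = isolated_sites(structure)
    if bad:
        print(idx, structure.composition.reduced_formula, "isolated sites:", bad)
```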
@CompRhys @janosh it's great you noticed this! Could you paste in the matbench IDs where this is the case? I believe they were checked, but I may have missed some strange cases.
Strangely, we did not run into the same issues with CGCNN/MEGNet (at least to the best of my knowledge right now, @Qi-max actually ran the *graphnet training). Let's investigate this more?
@ardunn The indices of entries with zero bulk modulus in `log_kvrh` (14 in total) are
1149, 1163, 2116, 2186, 3851, 4776, 4816, 4822, 6631, 8446, 9420, 10024, 10676, 10912
and with zero shear modulus in `log_gvrh` (31 in total):
58, 1149, 1163, 1282, 1440, 1548, 1931, 2116, 2186, 2221, 2729, 4659, 4776, 4816, 4820, 4822, 6032, 6631, 6632, 8231, 8377, 9107, 9420, 9440, 9458, 9550, 9723, 9762, 9978, 10024, 10912
Here's the Colab notebook that looks at the data:
https://colab.research.google.com/drive/19QOM8i8ScM1fQGAt53SIMIEG6gn09RjN
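For anyone without the notebook handy, a minimal sketch that reproduces the check (assuming the matminer target columns are named `log10(K_VRH)` and `log10(G_VRH)`):

```python
# Entries whose log10 modulus target is exactly zero in the two elasticity datasets.
from matminer.datasets import load_dataset

for name, target in [("matbench_log_kvrh", "log10(K_VRH)"),
                     ("matbench_log_gvrh", "log10(G_VRH)")]:
    df = load_dataset(name)
    zero_idx = df.index[df[target] == 0].tolist()
    print(name, len(zero_idx), zero_idx)
```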
@ardunn It would be nice if matbench datasets had a 3rd column `source_id` as that would make it much easier to connect the composition/structure to other available properties.
> @ardunn It would be nice if matbench datasets had a 3rd column `source_id` as that would make it much easier to connect the composition/structure to other available properties.

+1 for this!
> * create tests that reflect use case: e.g., jdft2d should be evaluated on ability to identify mats with low exfoliation energy, glass should be organized according to chemical space, etc.
> * formation energy validation should be changed to reflect [this article](https://www.nature.com/articles/s41524-020-00362-y)
Both great ideas; once we have sample-level predictions for the leaderboard there is plenty of nice exploration that could be done here, with automatic ranking against task-specific metrics as well as the boring standard ones.
We made a mistake here - the data is the log modulus, so a log modulus of zero corresponds to a modulus of 1 GPa, which isn't unphysical (or at least not in the way I thought). However, it might be worth excluding them anyway, going by the original workflow manuscript (https://www.nature.com/articles/sdata20159):
> Conditions i) and ii) are selected based on an empirical observation that the most compliant known metals have shear and bulk moduli larger than approximately 2 GPa. Hence if our calculations yield results below 2 GPa for either the Reuss averages [50] (a lower bound estimate) of K or G, these results might be correct but deserve additional attention.
The point about minimum radius to ensure that there are no isolated atoms still stands. There are some problematic examples from MP that could potentially be in the MB MP datasets i.e. https://materialsproject.org/materials/mp-1093989/
> The point about minimum radius to ensure that there are no isolated atoms still stands. There are some problematic examples from MP that could potentially be in the MB MP datasets i.e. https://materialsproject.org/materials/mp-1093989/

I agree.
> @ardunn It would be nice if matbench datasets had a 3rd column `source_id` as that would make it much easier to connect the composition/structure to other available properties.
In principle I don't have any problem adding these, but ...
I need to think some more about this, because these datasets serve as a "snapshot" of various online repositories such as MP. So for MP entries, a specific property is tied to a specific computation (not an mp-id per se), and MP is continuously updating their computations. For example, I think many of the energies gathered for the `mp_e_form` dataset have changed in MP; so if someone were to look at a matbench dataset, see `mp-XXX`, and then go to MP to see more properties, they would find a difference. So it will need to be made very clear to everyone that the numbers in MB are from a specific task-id (or a specific date), and MPID != MBID.
@CompRhys @ml-evs @janosh I think the best way forward is this:
- The current matbench datasets + benchmarking procedure (v0.1, as I've been calling it) will remain as they are, even with the possibly unphysical log K/G entries and lack of source-ids. This is to maintain provenance with the paper.
- The infrastructure I'm building here is extensible to more datasets and benchmarking procedures and will be fairly easy to extend; the suggestions in this thread will be incorporated into matbench v1.
@ardunn There appears to be a typo in the `matbench_mp_e_form` description. The cutoff energy is said to be at 3 eV:

> Removed entries having formation energy more than 3.0eV and those containing noble gases.
Based on this histogram, it actually appears to be 2.5 eV:
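(The histogram itself isn't reproduced here, but the same check can be made directly on the dataframe; a minimal sketch, assuming the target column is named `e_form`:)

```python
# Quick check of the de facto formation-energy cutoff in matbench_mp_e_form.
from matminer.datasets import load_dataset

df = load_dataset("matbench_mp_e_form")
print(df["e_form"].max())  # largest retained formation energy, ~2.5 eV per the histogram
```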
@janosh you are right! Thanks for noticing. I think 2.5 eV was needed to remove the ~1500 1-D misconverged half-Heuslers in MP at the time of collection. I think it has since been corrected with MP's SCAN workflow, so I will fix the description, and in the next update we can rethink the energy cutoff without worrying about these 1500 entries; predicting highly unphysical entities should be one of the goals of the MP `e_form` set.
Also, an update to this thread: @rkingsbury has cleaned and created a nice `expt_formation_enthalpy` dataset with corresponding MPIDs (i.e., source IDs) cross-referenced from ICSD's experimental DB and corroborated with MP's convex hulls. We are planning on adding his raw dataset to matminer, at which time we can start creating a matbench dataset with it for matbench v1.0!
Also, @rkingsbury has similarly added MP source IDs to the expt_gaps dataset, but I have yet to add them or incorporate them here. Thanks, Ryan!
Happy to contribute, @ardunn . See https://github.com/hackingmaterials/matminer/pull/602 for the new datasets.
@ardunn `matbench_perovskites` shows an outlier (mb-perovskites-00701, contrib ID: 5f6953e517892ff2440e9d0c) with `e_form` of 760 eV in the interactive view. Perhaps you're already aware, since the dataframe returned by `matminer.datasets.load_dataset('matbench_perovskites')` lists the same entry with 0.76 eV.
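A minimal sketch to verify the matminer copy (the target column name `e_form` is assumed from the matbench conventions):

```python
# Confirm the matminer copy of matbench_perovskites has no ~760 eV outlier.
from matminer.datasets import load_dataset

df = load_dataset("matbench_perovskites")
print(df["e_form"].nlargest(3))  # largest formation energies should be O(1) eV, not O(100) eV
```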
Hey @janosh, thanks for finding this! Something must have gone wrong in the upload script for the perovskites. I'm pinging @tschaume in case he has an easy way to fix this single entry and can explain how(?) the order of magnitude got changed.
@janosh that's a great catch! You found the ONE entry in `matbench_perovskites` saved with `meV` as the unit instead of `eV`. Might be a remnant of a previous upload that failed to overwrite. It's fixed now, though.
An update to this thread: @rkingsbury's datasets and @janosh's suggestion for a new version of Ricci et al.'s BoltzTraP dataset have been added to matminer. The typo @janosh mentioned above in `matbench_mp_e_form` has been fixed.
I am hesitant to use the BoltzTraP dataset as an addition to matbench for multiple reasons (thank you @janosh for creating it though, it saved me a lot of time :D).
I think both of the Kingsbury datasets can be incorporated into matbench sometime in the future.
Composition/hardness dataset (~1000 points) scraped from the literature. GitHub (see `hv_comp_load.xlsx`), paper.
@sgbaird great! We should include this on matminer first as a fully fledged dataset.
It's probably not pragmatic to add every single dataset, but there may be some that would be well-suited for matbench (disclaimer: my research group compiled these datasets together). https://github.com/anhender/mse_ML_datasets/issues/2
Henderson, A. N.; Kauwe, S. K.; Sparks, T. D. Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics. Data in Brief 2021, 37, 107262. https://doi.org/10.1016/j.dib.2021.107262. https://github.com/anhender/mse_ML_datasets
See discussion on a stability dataset in https://github.com/materialsproject/matbench/issues/104
> It's probably not pragmatic to add every single dataset, but there may be some that would be well-suited for matbench (disclaimer: my research group compiled these datasets together). anhender/mse_ML_datasets#2
>
> Henderson, A. N.; Kauwe, S. K.; Sparks, T. D. Benchmark Datasets Incorporating Diverse Tasks, Sample Sizes, Material Systems, and Data Heterogeneity for Materials Informatics. Data in Brief 2021, 37, 107262. https://doi.org/10.1016/j.dib.2021.107262. https://github.com/anhender/mse_ML_datasets
I think we had previously discussed including some of these datasets into matbench with Prof. Sparks, though I never got around to actually doing it.
In practice, these datasets would likely be their own separate benchmark (e.g., "Matbench Option 2" or something) since the matbench website/code is already extensible to any number of benchmarks with similar format. We just need to decide on which benchmark datasets and evaluation criteria are actually needed for a new benchmark.
> In practice, these datasets would likely be their own separate benchmark (e.g., "Matbench Option 2" or something) since the matbench website/code is already extensible to any number of benchmarks with similar format. We just need to decide on which benchmark datasets and evaluation criteria are actually needed for a new benchmark.
Ok, good to know! It sounds like the worry isn't so much about having too many datasets contained in matbench and more about keeping them organized/compartmentalized as the number increases, correct?
> Ok, good to know! It sounds like the worry isn't so much about having too many datasets contained in matbench and more about keeping them organized/compartmentalized as the number increases, correct?
Yeah, that is mostly true. We do want to keep the benchmarks generally minimal though. But I have no problem with adding another separate benchmark with its own 1-~20 tasks or so.
What we should aim for is to craft a set of tasks to most accurately reflect the breadth of our field - in the fewest tasks possible. Something like "the most bang for your buck".
> Yeah, that is mostly true. We do want to keep the benchmarks generally minimal though. But I have no problem with adding another separate benchmark with its own 1-~20 tasks or so.
>
> What we should aim for is to craft a set of tasks to most accurately reflect the breadth of our field - in the fewest tasks possible. Something like "the most bang for your buck".
Ok, I think I'm on the same page, and I like the phrasing "most accurately reflect the breadth of our field - in the fewest tasks possible". A collection of adaptive design tasks from the literature seems pretty compelling to me (`matbench_adapt` or something like that), such as the two tasks from the paper you mentioned in https://github.com/sparks-baird/mat_discover/discussions/44#discussioncomment-2129894. If this kind of benchmark were already available, I'm pretty sure I'd be running `mat_discover` on all the ones that I could 😅.
The two tasks you mentioned fall into the category of "real data in a predefined list", as opposed to continuous or semi-continuous validation functions like the tests you did on Branin/Rosenbrock/Hartmann. It's been on my mind a lot whether there's a continuous, inexpensive validation function that would mimic a true materials science objective well enough. I've seen cases where people used one of their trained neural network models as the "true" function, but I couldn't help but feel a bit suspicious.
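To make the "continuous validation function" idea concrete, here is a minimal sketch of a sequential-learning loop on the Branin test function; this is purely illustrative (greedy acquisition with a GP surrogate), not any particular matbench or mat_discover protocol:

```python
# Toy sequential-learning loop on the Branin function: fit a surrogate on the
# points evaluated so far, then query the candidate with the lowest predicted value.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

def branin(x1, x2):
    a, b, c = 1.0, 5.1 / (4 * np.pi**2), 5 / np.pi
    r, s, t = 6.0, 10.0, 1 / (8 * np.pi)
    return a * (x2 - b * x1**2 + c * x1 - r) ** 2 + s * (1 - t) * np.cos(x1) + s

rng = np.random.default_rng(0)
candidates = np.column_stack([rng.uniform(-5, 10, 2000), rng.uniform(0, 15, 2000)])
observed = list(rng.choice(len(candidates), 5, replace=False))  # random seed points

for _ in range(20):
    X = candidates[observed]
    y = branin(X[:, 0], X[:, 1])
    gp = GaussianProcessRegressor(normalize_y=True).fit(X, y)
    pred = gp.predict(candidates)
    pred[observed] = np.inf                # don't re-query known points
    observed.append(int(np.argmin(pred)))  # greedy exploitation for brevity

best = min(branin(candidates[i, 0], candidates[i, 1]) for i in observed)
print("best value found:", best)  # Branin's global minimum is ~0.3979
```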
There's the somewhat unrealistic alternative of: why not just use the true, expensive DFT calculation? I've played around with the idea of whether or not `matbench` could integrate with some paid-compute service (e.g. AWS, paid by the submitter of the algorithm, of course) so that it's doing a real DFT simulation in a much larger candidate space, i.e. the "benchmark" produces real iterations.
Yeah, having matbench integrate with some DFT-in-the-loop option might be nice. But at the same time I am trying to keep it relatively simple while still serving some useful purpose. A benchmark that is difficult to understand or highly stochastic is not the goal. Definitely warrants further thought though.
Three generative model benchmark datasets and some metrics introduced in http://arxiv.org/abs/2110.06197 (see section 5. Experiments)
> Tasks. We focus on 3 tasks for material generation. 1) Reconstruction evaluates the ability of the model to reconstruct the original material from its latent representation z. 2) Generation evaluates the validity, property statistics, and diversity of material structures generated by the model. 3) Property optimization evaluates the model's ability to generate materials that are optimized for a specific property.
Figured it was worth mentioning in this thread.
Would love to have Matbench for generative models. @ardunn @txie-93 and anyone else, thoughts? Playing around with the idea of forking `matbench` as `matbench-generative` with visualizations similar to those of http://arxiv.org/abs/2110.06197
Thanks, @sgbaird. I think it is totally possible to have a `matbench-generative`. We had 3 different tasks: 1) reconstruction; 2) generation; 3) property optimization. Not all existing generative models can perform all 3 tasks. From my perspective, most existing models can do 2), so it can be used as a main task for `matbench-generative`. Each model will generate 10,000 crystals and they can be evaluated using https://github.com/txie-93/cdvae/blob/main/scripts/compute_metrics.py. However, it would take some effort to port existing models into the same repo.
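For context, a hedged sketch of the structural-validity part of such generation metrics: the CDVAE paper counts a generated structure as structurally valid if all pairwise interatomic distances exceed 0.5 Å. This is an illustration of that idea, not the actual `compute_metrics.py` code:

```python
# Rough structural-validity check for generated crystals: a structure is counted
# as valid only if no two sites sit closer than a small distance threshold.
import numpy as np
from pymatgen.core import Lattice, Structure

def is_structurally_valid(structure: Structure, min_dist: float = 0.5) -> bool:
    dists = structure.distance_matrix
    # Ignore the zero diagonal (distance of each site to itself).
    off_diag = dists[~np.eye(len(structure), dtype=bool)]
    return bool(off_diag.min() > min_dist) if len(off_diag) else True

# Example on a simple rock-salt NaCl cell:
nacl = Structure.from_spacegroup(
    "Fm-3m", Lattice.cubic(5.64), ["Na", "Cl"], [[0, 0, 0], [0.5, 0.5, 0.5]]
)
print(is_structurally_valid(nacl))  # True
```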