
Might be nice to see MEGNet on matbench

sgbaird opened this issue 2 years ago · 7 comments

Matbench is an ImageNet for materials science: a curated set of 13 supervised, pre-cleaned, ready-to-use ML tasks for benchmarking and fair comparison. The tasks span a wide domain of inorganic materials science applications including electronic, thermodynamic, mechanical, and thermal properties among crystals, 2D materials, disordered metals, and more.

The Matbench python package provides everything needed to use Matbench with your ML algorithm in ~10 lines of code or less.

https://matbench.materialsproject.org/ (I'm unaffiliated)
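Roughly, the intended workflow looks like the following. This is just a sketch based on my reading of the Matbench docs; the mean-prediction stand-in "model" and the single-task subset are placeholders, and exact names/arguments are from memory and may differ between versions:

```python
# Minimal sketch of the Matbench workflow (based on my reading of the docs).
# The "model" here just predicts the training-set mean; swap in a real model.
import numpy as np
from matbench.bench import MatbenchBenchmark

# Restrict to one regression task for illustration (placeholder choice).
mb = MatbenchBenchmark(autoload=False, subset=["matbench_mp_e_form"])

for task in mb.tasks:
    task.load()
    for fold in task.folds:
        train_inputs, train_outputs = task.get_train_and_val_data(fold)
        mean_prediction = float(np.mean(train_outputs))  # placeholder "training"
        test_inputs = task.get_test_data(fold, include_target=False)
        predictions = [mean_prediction] * len(test_inputs)
        task.record(fold, predictions)

mb.to_file("my_model_benchmark.json.gz")
```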

sgbaird avatar Jan 22 '22 19:01 sgbaird

We don't have an objection to MEGNet being on the website; in fact, MEGNet results are actually reported in the matbench paper. I have no idea why the results are not reported on the website itself, since those results clearly exist. See https://www.nature.com/articles/s41524-020-00406-3

In any case, I have a philosophical ambivalence toward this kind of comparison since I don't think pure MAE or classification performance is the most important metric for materials science. A MAE of 20 meV/atom or 30 meV/atom in the formation energy is not significant at all, and different CV splits can easily cause the errors to fluctuate by that amount.

shyuep avatar Jan 23 '22 00:01 shyuep

@shyuep I have a vague memory of seeing MEGNet on matbench before, so I'm glad you mentioned it being in the original matbench paper. Digging a bit further, this matbench issue explains that MEGNet is (temporarily) no longer on matbench due to file corruption, but most of the conversation seems to be in an internal email thread between @ardunn and @chc273.

In any case, I have a philosophical ambivalence toward this kind of comparison since I don't think pure MAE or classification performance is the most important metric for materials science.

I think you have a great point about using a single error-based metric (e.g., MAE) to quantify performance in a materials science context. The creator of matbench would probably agree that it doesn't tell the full story, but rather that it gives a general sense of performance across a wide range of small and large problems spanning different domains and modalities. As one of the leaders in the field, what metrics do you think are most important for materials science? (if you'll excuse me asking a rather broad question)

Related to this, I opened up some discussion about incorporating uncertainty quantification quality metrics, which can have a big effect on "suggest the next best experiment" algorithms such as Bayesian optimization.

A MAE of 20 meV/atom or 30 meV/atom in the formation energy is not significant at all, ...

One application that seems popular in a materials discovery context is the use of formation energy to calculate decomposition energy (or similar) as a measure of stability (Bartel et al.). If I remember correctly, I've seen articles with stability filtering criteria for the related metric, e_above_hull, as large as 500 meV/atom and as small as 50 meV/atom. Many of these articles use arc-melting as the synthesis procedure, which, if I understand correctly, is more likely to produce materials in a thermodynamically stable state. In other words, I'd imagine that e_above_hull is a reasonable stability measure for arc-melting. I'd assume spark-plasma sintering is similar in terms of tending toward thermodynamically stable products, but please correct me if I'm wrong. Based on Fig. 7 from the Bartel paper linked above, the MAE for formation energy gets magnified somewhat when calculating decomposition energy. With BOWSR (very nice work), you used an intermediate filtering criterion of 100 meV/atom followed by DFT calculations and then experimental synthesis. In this context, is the idea that a 10 meV/atom difference in MAE is insignificant because a slightly larger filtering criterion could just as easily be used?
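For reference, here is a minimal sketch of the kind of e_above_hull screening step I'm describing, using pymatgen; the reference entries, the candidate's predicted energy, and the 100 meV/atom cutoff below are purely illustrative:

```python
# Sketch of an e_above_hull screening step with pymatgen.
# Reference energies, the candidate's ML-predicted energy, and the cutoff
# are all placeholders, not values from this thread.
from pymatgen.core import Composition
from pymatgen.analysis.phase_diagram import PhaseDiagram, PDEntry

# Reference entries for the Li-O system (normally pulled from a database
# such as the Materials Project); total energies in eV are placeholders.
entries = [
    PDEntry(Composition("Li"), 0.0),
    PDEntry(Composition("O2"), 0.0),
    PDEntry(Composition("Li2O"), -6.0),
]
pd = PhaseDiagram(entries)

# Hypothetical candidate with an ML-predicted total energy (placeholder).
candidate = PDEntry(Composition("Li2O2"), -5.9)
e_hull = pd.get_e_above_hull(candidate)  # eV/atom above the convex hull

if e_hull < 0.100:  # 100 meV/atom screening cutoff (illustrative)
    print(f"Keep for DFT validation (e_above_hull = {e_hull:.3f} eV/atom)")
```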

... and different CV splits can easily cause the errors to fluctuate by that amount.

I agree that the choice of CV splits often causes fluctuations of the same order of magnitude. Interestingly, in the case of the matbench formation energy task, the fluctuation across 5 CV splits seems to be fairly controlled for some models and large for others. I'm not necessarily advocating for 10 meV/atom being a significant difference, but as I'm preparing to do some synthesis based on "suggested next experiments" from mat_discover, I've noticed there are quite a few choices for stability filtering criteria, training databases, candidate databases (or generative models), and synthesis routes. Happy to hear any insights you might have.

sgbaird avatar Jan 23 '22 06:01 sgbaird

I think there are many questions that are important for ML in materials science. But for me, the most practical question now is how can it be used for actually discovering and designing new materials? Despite lots of papers claiming incremental improvements in accuracy of ML algos, you will find that actual papers with new discoveries and experimental confirmation are few and far between. That is why I prefer to focus on things like BOWSR that handle a critical bottleneck for that purpose, i.e., how do you get structures when all you have is a theoretical ensemble of atoms? We have been working on better options than just Bayesian optimization from the energy.

Another important question is basically ensuring the extrapolability and "physicality" of ML models. You mentioned matbench being the "ImageNet" of materials. Fundamentally, ML in materials science in most cases cannot be compared to ML in image recognition or other domains. In MatSci, we know there are inviolable physical and chemical laws. We know extrapolation limits imposed by them, e.g., what happens when you have an ideal homogeneous electron gas and the relationship with electron density, what happens when you pull two atoms far apart, etc. The same can't be said for things like "how do I tell an image of a cat from a dog".

  1. Yes, the e_hull error is magnified by errors in energies. But the paper also showed that graph-based models (I would argue, structure-based models generally) actually get you reasonable predictions for guessing stability. Composition models, which are the majority of what you find in the literature, result in far larger errors, as shown by Bartel (unless you are constraining the structure to begin with, say, if you are only interested in perovskite ABO3 structures).

  2. Uncertainty quantification is definitely important. That is why I prefer all MAEs or classification metrics to come with std deviations, minimally between CV splits but ideally across repeated runs over different shuffles and perhaps across different stratified sampling procedures where applicable. For your synthesis procedure, I can only suggest that you look at multiple metrics of prediction if possible before attempting synthesis. You can of course always have a DFT-based validation of the ML predictions before synthesis. Ultimately, DFT is still the best tool we have at the moment for predicting energies. ML is a surrogate that offers speed, but at the moment, not absolute accuracy.
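As a concrete illustration of reporting an error metric with its spread across repeated, reshuffled CV splits, something along these lines with scikit-learn would do (the model and data below are placeholders, not anything specific to MEGNet or matbench):

```python
# Sketch: MAE reported as mean ± std across repeated, reshuffled CV splits.
# Random data and a generic regressor stand in for a real featurized dataset.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RepeatedKFold, cross_val_score

rng = np.random.default_rng(0)
X = rng.random((1000, 30))   # placeholder features
y = rng.random(1000)         # placeholder target, e.g. formation energy

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=0)
maes = -cross_val_score(RandomForestRegressor(random_state=0), X, y,
                        scoring="neg_mean_absolute_error", cv=cv)
print(f"MAE = {maes.mean():.4f} ± {maes.std():.4f} (5-fold CV, 5 repeats)")
```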

shyuep avatar Jan 23 '22 16:01 shyuep

Hi @shyuep @sgbaird

But for me, the most practical question now is how can it be used for actually discovering and designing new materials? Despite lots of papers claiming incremental improvements in accuracy of ML algos, you will find that actual papers with new discoveries and experimental confirmation are few and far between.

I agree 100%.

The primary purpose of matbench is not to blindly chase lower and lower errors, but rather to provide some sort of common platform for comparing the strengths and weaknesses of various models to be used downstream in real applications. Of course, a single benchmark can't cover all, or even a majority, of use cases, but we hoped to make it broad enough to at least be of some research use (e.g., "I want to predict some formation energies, I wonder how various models perform on the same datasets").

Another important question is basically ensuring the extrapolability and "physicality" of ML models. You mentioned matbench being the "ImageNet" of materials. Fundamentally, ML in materials science in most cases cannot be compared to ML in image recognition or other domains. In MatSci, we know there are inviolable physical and chemical laws. We know extrapolation limits imposed by them, e.g., what happens when you have an ideal homogeneous electron gas and the relationship with electron density, what happens when you pull two atoms far apart, etc. The same can't be said for things like "how do I tell an image of a cat from a dog".

Yet our field regularly utilizes and modifies model architectures/training procedures/etc. from other domains in matsci ML work. Similarly, matbench uses the general idea of an ML benchmark and applies it to the materials domain. I'd argue that the lack of a direct correlate between task="discriminate between a cat and a dog" and task="determine E_gap of this atom ensemble" doesn't mean the entire idea of a benchmark for matsci-ML problems is invalid, but rather that it needs to be specialized and adapted for our particular domain (for example, through more incorporation of inviolable physical laws). You could make the totally reasonable case that the current matbench is not well adapted enough, and we could probably use some recommendations from other groups (e.g., the MEGNet devs) on how to improve that.

That is why I prefer all MAEs or classification metrics to come with std deviations, minimally between CV splits but ideally across repeated runs over different shuffles and perhaps across different stratified sampling procedures where applicable.

Matbench has the former (see the Full Benchmark Data pages). The latter would of course be better, though more computationally expensive. Even better is some additional UQ on each prediction, which, as @sgbaird mentioned, we are considering putting into matbench. Of course, exactly how that is done is open to revision... I think the best case is having some easily accessible community benchmark that is as representative of real matsci engineering problems as possible; whether it's matbench or something else doesn't really matter.

ardunn avatar Jan 23 '22 21:01 ardunn

Digging a bit further, this matbench issue explains that MEGNet is (temporarily) no longer on matbench due to file corruption, but most of the conversation seems to be in an internal email thread between

Yeah, so to clarify: the original megnet results were done by a postdoc who left our group a couple of years ago, and when I was putting the results onto the leaderboard, the only file she still had access to was corrupted... I know, I am equally disappointed and surprised lol. Adding a newer version of megnet to matbench has been something I've wanted to do myself for a while but haven't gotten around to :/

ardunn avatar Jan 23 '22 21:01 ardunn

@ardunn Just to clarify, I am not denying matbench is useful. I am merely stating my own ambivalence towards chasing performance metrics. In the end, current models are "good enough" on certain properties and there are bigger problems to deal with. I would also argue that certain datasets are nowhere near large or diverse enough to be a useful basis for comparison, e.g., datasets with only hundreds of data points. I am pretty sure if you dive into the details of the data, you will see that the dataset is biased in some way.

shyuep avatar Jan 24 '22 03:01 shyuep

@shyuep @ardunn, thank you for your comments!

@shyuep I appreciate you mentioning the benefits to the field of focusing less on incremental improvements in accuracy and more on actual materials discovery campaigns (and I would add, successful or not). I'm excited to hear about the follow-up work to BOWSR when it becomes available 🙂

Extrapolability, interpretability, and physicality: these certainly seem to be (at least a few of) the differentiators between other domains ("cats vs. dogs", Netflix movie recommenders) and materials informatics. For extrapolability, it seems like some evaluation schemes can be implemented, such as leave-one-cluster-out cross-validation from Meredig et al., a holdout of the top 1% from Kauwe et al. (disclaimer: from my group), adaptive design from a list of candidates, or a made-up "ground truth" model (forgive the oxymoron). For the last case, there were some interesting, albeit limited, results (in my opinion) claiming that a Gaussian process had better adaptive design results over 100 iterations than other, more accurate models (e.g., a neural network ensemble and a random forest; interestingly, the ground truth was chosen to be the trained neural network ensemble).
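For the leave-one-cluster-out case, a minimal sketch of what I have in mind is below; the clustering, featurization, model, and data are all placeholders (as I understand it, Meredig et al. form clusters in composition/feature space):

```python
# Sketch of leave-one-cluster-out CV: cluster in feature space, then hold out
# one whole cluster at a time. Features, model, and cluster count are placeholders.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)
X = rng.random((500, 20))   # placeholder composition/structure features
y = rng.random(500)         # placeholder target property

clusters = KMeans(n_clusters=10, n_init=10, random_state=0).fit_predict(X)

maes = []
for train_idx, test_idx in LeaveOneGroupOut().split(X, y, groups=clusters):
    model = RandomForestRegressor(random_state=0).fit(X[train_idx], y[train_idx])
    maes.append(mean_absolute_error(y[test_idx], model.predict(X[test_idx])))

print(f"LOCO-CV MAE: {np.mean(maes):.3f} ± {np.std(maes):.3f} across clusters")
```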

For interpretability, the more common approaches seem to be either symbolic regression or determination of feature importances based on physical descriptors.
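For the feature-importance route, a sketch with scikit-learn's permutation importance is below; the descriptor names, model, and data are placeholders:

```python
# Sketch: feature importance of physical descriptors via permutation importance.
# Descriptor names, model, and data are placeholders.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
names = ["mean_electronegativity", "mean_atomic_radius",
         "n_valence_electrons", "mean_melting_T", "density"]
X = rng.random((800, len(names)))  # placeholder physical descriptors
y = rng.random(800)                # placeholder target property

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
model = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)

result = permutation_importance(model, X_te, y_te, n_repeats=10, random_state=0)
for name, imp in sorted(zip(names, result.importances_mean), key=lambda t: -t[1]):
    print(f"{name}: {imp:.3f}")
```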

I'm glad you bring up the physicality aspect, especially the consideration of physical laws. If you know of MatSci work that explicitly incorporates physical laws into an ML model, rather than relying on physical descriptors alone, I'd be really interested to hear about it.

The Bartel paper and the comments in this thread have gotten me thinking more about structure vs. composition. Structure-based formation energy ML models have gotten really accurate (e.g., MEGNet and ALIGNN, down to ~20 meV/atom), and like you said, are "good enough" to be used in downstream practical applications. Composition-based results are (as might be expected) really poor for e_above_hull, which is maybe more of an indicator of the wide range of possible e_above_hull values for a given composition. BOWSR stuck out to me as a tool that could help "push the Pareto front" on the trade-off between structure- and composition-based materials discovery campaigns, and I've been promoting it and thinking about how I might be able to use it in a more general way, i.e., "input an arbitrary composition, output a CIF".

Again, thank you for the discussion!

sgbaird avatar Jan 25 '22 06:01 sgbaird