
WBM filtering fails assert on entries_old_corr

jackwebersdgr opened this issue · 3 comments

I'm attempting to compile the filtered WBM dataset in order to test a new model, but ran into this assert:

https://github.com/janosh/matbench-discovery/blob/c1f34dac228ef7f0da7598e99fe9be6db36198f5/data/wbm/compile_wbm_test_set.py#L531

assert len(entries_old_corr) == 76_390, f"{len(entries_old_corr)=}, expected 76,390"
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

AssertionError: len(entries_old_corr)=256963, expected 76,390

Is the fully filtered WBM dataset stored anywhere else or must it be computed on the fly using this script? Thanks!

jackwebersdgr · Sep 09 '24

strange, i reran that whole script just last month following https://github.com/janosh/matbench-discovery/issues/121 and it was back to a working state after https://github.com/janosh/matbench-discovery/pull/122. either way, no need for you to run that file yourself. the data files it creates are listed here, also up on figshare and will be auto-downloaded if you access the corresponding DataFiles attributes. e.g.

import pandas as pd

from matbench_discovery.data import DataFiles

# summary table covering all ~257k WBM entries (auto-downloads from figshare on first access)
df_summary = pd.read_csv(DataFiles.wbm_summary.path)
# computed structure entries plus initial structures
df_wbm_init_structs = pd.read_json(DataFiles.wbm_cses_plus_init_structs.path)

janosh · Sep 09 '24

Ah, I see. I was under the impression that this script would perform the filtering based on matching structure prototypes in MP, but it seems it also results in ~257k datapoints. Is there a simple way to obtain the filtered 215.5k set, perhaps via some set of material_ids?

Additionally, it seems the documentation on the site is out of date and could be updated with the code above: https://matbench-discovery.materialsproject.org/contribute#--direct-download

jackwebersdgr · Sep 10 '24

have a look at the subset kwarg in load_df_wbm_with_preds

https://github.com/janosh/matbench-discovery/blob/c1f34dac228ef7f0da7598e99fe9be6db36198f5/matbench_discovery/preds.py#L90-L111

https://github.com/janosh/matbench-discovery/blob/c1f34dac228ef7f0da7598e99fe9be6db36198f5/matbench_discovery/preds.py#L180-L182

and

https://github.com/janosh/matbench-discovery/blob/c1f34dac228ef7f0da7598e99fe9be6db36198f5/matbench_discovery/enums.py#L141-L148
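
putting those together, something like this should get you the 215k subset directly (untested sketch: i'm writing the enum class and member names from memory of the linked code, so double check them against enums.py):

from matbench_discovery.enums import TestSubset  # assumed class name, see linked enums.py
from matbench_discovery.preds import load_df_wbm_with_preds

# the subset kwarg should restrict the returned dataframe to the ~215k WBM
# entries whose structure prototype doesn't overlap with MP
df_preds = load_df_wbm_with_preds(subset=TestSubset.uniq_protos)  # member name assumed
print(len(df_preds))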

janosh · Sep 10 '24

@jackwebersdgr so this script does not produce the ~215k datapoints? The script crashes at the same point for me. Additionally, to clarify, it is this reduced dataset that is being shown on the main page of Matbench Discovery, correct?

It also seems that the link for the WBM summary is not working. What I would like to do is download the summary and filter it by unique prototype so I can play around with the validation set locally.

Update: the Figshare link worked and I can see all of the data from the summary ❤️.

@janosh

rydeveraumn · Dec 08 '24

> @jackwebersdgr so this script does not produce the ~215k datapoints?

the compile_wbm_test_set.py script calculates the structure prototypes using get_protostructure_label_from_spglib

https://github.com/janosh/matbench-discovery/blob/93d1d5c50d99ae65f00cf255d8ffa74d2163d18c/data/wbm/compile_wbm_test_set.py#L611

and stores them in df_summary[Key.wyckoff]. this is used to calculate the subset of unique prototypes that don't overlap with MP prototypes:

https://github.com/janosh/matbench-discovery/blob/93d1d5c50d99ae65f00cf255d8ffa74d2163d18c/data/wbm/compile_wbm_test_set.py#L673

df_summary[Key.uniq_proto] is a boolean Series that is used across the code base to query for the 215k subset of WBM structures that are expected to relax to a novel and unique structure.
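
so if you only want the material ids of that 215k subset without loading any model predictions, filtering the summary file directly works too, e.g. (untested sketch: "unique_prototype" is my guess for the column name that Key.uniq_proto maps to, check df_summary.columns):

import pandas as pd

from matbench_discovery.data import DataFiles

df_summary = pd.read_csv(DataFiles.wbm_summary.path)

# "unique_prototype" is an assumed column name: use whichever column
# df_summary[Key.uniq_proto] resolves to in your install
df_uniq = df_summary[df_summary["unique_prototype"]]
print(f"{len(df_summary):,} -> {len(df_uniq):,}")  # expect ~257k -> ~215.5k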

janosh · Dec 08 '24

closing this as completed. remaining issues in compile_wbm_test_set.py are tracked in https://github.com/janosh/matbench-discovery/issues/179

janosh · Dec 27 '24