CASMcode
How to efficiently enumerate configs with dilute concentration?
Dilute concentrations need large supercells. However, random enumeration seems to have to start from the smallest supercell, and duplicates count toward n_config.
Could you give a quick command to enumerate starting from a large supercell and skip the configurations that are already enumerated? Otherwise I have to set an abnormally large n_config, which slows the enumeration down drastically.
Thanks,
A couple ideas here:
-
Use --min and --max to specify minimum and maximum volume supercells to enumerate in directly
This overrides the default "min" and "max" parameters of the "supercells" option for the enumeration methods
-
Use the "supercells" / "scelnames" / "supercell_selection" options
See
casm enum --desc ScelEnum
for all the parameters of the "supercells" option. The "supercells" option also lets you specify the supercell directly through the transformation matrix from the primitive cell lattice. Alternatively, you can use the "scelnames" or "supercell_selection" options to directly specify which already enumerated supercells you want to enumerate configurations in.
-
Maybe cluster perturbation enumeration with ConfigEnumAllOccupations would be useful
If you use JSON input to ConfigEnumAllOccupations you can include "cluster_specs" (see
casm enum --desc ConfigEnumAllOccupations
), with format described here, specifying which clusters to enumerate occupations on. Typically this is done for all clusters of increasing number and site-to-site distance up to some cutoffs, but you can also specify particular clusters directly with "orbit_specs". This lets you enumerate symmetrically unique perturbations in a supercell with the default configuration as the background (as specified with the "supercells", "scelnames", or "supercell_selection" options), or within user-specified background configurations (as specified with the "confignames", "config_selection", or "config_list" options).
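As a rough illustration of the second idea, the sketch below assembles a JSON input that restricts random enumeration to specific, already enumerated supercells and prints the resulting command. The `-m` and `-i` flags and the "n_config" key appear elsewhere in this thread; that "scelnames" takes a list of supercell names at the same level as "n_config" is an assumption to check against the `casm enum --desc` output, and `build_enum_command` is a hypothetical helper, not part of CASM.

```python
import json
import shlex


def build_enum_command(scelnames, n_config):
    """Build a `casm enum` command line restricted to the named supercells.

    Assumption: "scelnames" is given as a JSON array next to "n_config".
    """
    settings = {"scelnames": list(scelnames), "n_config": n_config}
    # shlex.quote wraps the JSON in single quotes so the shell passes it intact.
    return ("casm enum -m ConfigEnumRandomOccupations -i "
            + shlex.quote(json.dumps(settings)))


if __name__ == "__main__":
    # Supercell names follow the SCEL... pattern used by CASM.
    print(build_enum_command(["SCEL10_1_1_10_0_0_0"], 1000))
```

Generating the command from a script makes it easy to loop over a handful of large supercells without retyping the JSON by hand.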
I tried:
casm enum -m ConfigEnumRandomOccupations -i '{"n_config":900000}' --filter 'eq(comp_n(Y),mult(2,comp_n(Va)))' --min 10 --max 16
but almost all configs are excluded by the filter. Why is that? Is there any way to enumerate the wanted configs more efficiently for large supercells?
For small supercells the enumeration seems quite efficient. Also, how large a supercell (and how many configurations) is generally sufficient for ECI training?
Write supercell database... DONE
Write configuration database... DONE
-- Begin: ConfigEnumRandomOccupations --
Input from JSON (--input or --setings):
{ "n_config" : 900000 }
Input from casm enum
options:
{
"filter" : "eq(comp_n(Y),mult(2,comp_n(Va)))",
"input" : "{"n_config":900000}",
"max" : 16,
"method" : "ConfigEnumRandomOccupations",
"min" : 10
}
Combined Input:
{ "filter" : "eq(comp_n(Y),mult(2,comp_n(Va)))", "n_config" : 900000, "supercells" : { "max" : 16, "min" : 10 } }
-- Checking input --
primitive_only: true
filter: true
filter expression: eq(comp_n(Y),mult(2,comp_n(Va)))
verbosity: 10
dry_run: false
output_configurations: false
of initial enumeration states: 228
-- Begin: ConfigEnumRandomOccupations enumeration --
configurations in this project: 92495
Begin enumeration
Enumerate configurations for: SCEL10_1_1_10_0_0_0 900000 configurations (2 new, 899998 excluded by filter).
Enumerate configurations for: SCEL10_1_10_1_1_0_0 900000 configurations (1 new, 899999 excluded by filter).
Enumerate configurations for: SCEL10_1_10_1_2_0_0 900000 configurations (6 new, 899994 excluded by filter).
Enumerate configurations for: SCEL10_1_10_1_3_0_0 900000 configurations (3 new, 899997 excluded by filter).
Enumerate configurations for: SCEL10_1_10_1_6_0_0 900000 configurations (0 new, 900000 excluded by filter).
Enumerate configurations for: SCEL10_1_2_5_1_0_0 900000 configurations (3 new, 899997 excluded by filter).
Enumerate configurations for: SCEL10_10_1_1_0_1_9 900000 configurations (2 new, 899998 excluded by filter).
Enumerate configurations for: SCEL10_10_1_1_0_8_1 900000 configurations (3 new, 899997 excluded by filter).
Enumerate configurations for: SCEL10_10_1_1_0_7_1 900000 configurations (5 new, 899995 excluded by filter).
Enumerate configurations for: SCEL10_10_1_1_0_1_6 900000 configurations (3 new, 899997 excluded by filter).
Enumerate configurations for: SCEL10_10_1_1_0_9_5 900000 configurations (2 new, 899998 excluded by filter). ......
The filter is applied after generating a configuration, so in a large supercell the likelihood of having a composition matching the filter is decreased.
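To get a feel for how quickly the acceptance rate drops, here is a minimal sketch under a simplifying assumption that is not taken from CASM: every site is occupied independently and picks the dilute species with probability `p_solute`. The function name and the binary-alloy setup are hypothetical. The chance of landing on exactly one solute atom is then a binomial probability.

```python
from math import comb


def prob_exact_count(n_sites, n_solute, p_solute=0.5):
    """Probability that independent random site occupations give exactly
    n_solute solute atoms on n_sites sites (binomial distribution)."""
    return (comb(n_sites, n_solute)
            * p_solute ** n_solute
            * (1 - p_solute) ** (n_sites - n_solute))


if __name__ == "__main__":
    # Chance of drawing exactly one solute atom as the supercell grows.
    for n_sites in (4, 8, 16, 32):
        print(n_sites, prob_exact_count(n_sites, 1))
```

Under these assumptions the probability of hitting a one-solute configuration falls from 0.25 at 4 sites to about 0.00024 at 16 sites, so most random draws are generated only to be discarded by the filter.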
CASM doesn't have a fixed-composition enumeration method, but it does allow storing configurations encountered during Monte Carlo. So perhaps a useful approach would be to "fit" a cluster expansion with just a constant term, run canonical Monte Carlo at various compositions, and use the "enumeration" Monte Carlo option to store the encountered configurations. The defaults assume that a user wants to store configurations that break the cluster-expansion predicted convex hull, but you could change the "metric" to just use "formation_energy" or some other quantity, such as composition, that would be the same for all configurations. You can also change the sample frequency to control how much the sampled configurations differ from each other.
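The idea behind this suggestion can be sketched outside of CASM. The snippet below is not CASM's Monte Carlo interface; it is a standalone toy, with hypothetical names, showing why canonical swap moves keep the composition fixed. With a constant energy every swap is accepted, and a configuration is stored every `sample_period` swaps.

```python
import random


def sample_fixed_composition(n_sites, n_solute, n_samples,
                             sample_period=10, seed=0):
    """Collect occupations (1 = solute, 0 = host) at fixed composition.

    Toy model: constant energy, so every canonical swap is accepted.
    """
    rng = random.Random(seed)
    occ = [1] * n_solute + [0] * (n_sites - n_solute)
    rng.shuffle(occ)
    samples = []
    while len(samples) < n_samples:
        for _ in range(sample_period):
            # Swapping two sites never changes the number of solute atoms.
            i, j = rng.randrange(n_sites), rng.randrange(n_sites)
            occ[i], occ[j] = occ[j], occ[i]
        samples.append(tuple(occ))
    return samples


if __name__ == "__main__":
    for config in sample_fixed_composition(16, 2, 5):
        print(config)
```

Every stored configuration has the target composition by construction, so nothing is wasted on a composition filter. A larger `sample_period` makes consecutive samples differ more. This toy does not remove symmetrically equivalent duplicates.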
BTW, what does it mean to specify which clusters to enumerate occupations on, as in:
If you use JSON input to ConfigEnumAllOccupations you can include "cluster_specs" (see casm enum --desc ConfigEnumAllOccupations), with format described here, specifying which clusters to enumerate occupations on.
The "cluster_specs" option allows enumerating configurations that are perturbations of a "background" configuration. The perturbed configurations have 1, 2, 3, etc. sites different from the background configuration. All such perturbations can be generated by finding symmetrically unique clusters of sites, taking into account the background configuration's occupation. The "cluster_specs" option specifies the range of such clusters.
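A simplified picture of perturbation enumeration, written as a standalone sketch rather than with CASM's API: start from a background occupation and change 1, 2, ... sites at a time. Unlike CASM, this hypothetical `enumerate_perturbations` applies no symmetry reduction and no distance cutoffs, so it lists every subset of sites up to the chosen size.

```python
from itertools import combinations


def enumerate_perturbations(background, new_occ, max_sites):
    """Yield (sites, config) for every way of switching up to max_sites
    sites of the background to new_occ. No symmetry reduction is applied."""
    for n_changed in range(1, max_sites + 1):
        for sites in combinations(range(len(background)), n_changed):
            config = list(background)
            for site in sites:
                config[site] = new_occ
            yield sites, tuple(config)


if __name__ == "__main__":
    background = ("A",) * 4
    for sites, config in enumerate_perturbations(background, "B", 2):
        print(sites, config)
```

Because the number of changed sites stays small, the perturbed configurations remain dilute relative to the background, which is what makes this approach attractive in large supercells.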