CASMcode
ECI fitting with RFE, LASSO and GA
Dear CASM developers,
I have three sets of ECIs fitted with three different algorithms. Is there any reason to prefer one of them over the others?
GA: 40-something terms, relatively large values (~20). RFE: 20-something terms, relatively large values (~20). LASSO: 20-something terms, small values (~5).
Should I pick LASSO, since it gives the smallest number of ECIs and the smallest ECI magnitudes?
All three sets have similar CV and RMSE (GA: 0.04; LASSO and RFE: 0.06).
GA "eci": [ [0, -23.300125174783126], [1, -0.6966258560535161], [2, -23.184027600349502], [3, 3.651553644212952], [4, 50.627058774444336], [5, 1.11661024720773], [6, 5.447026016337316], [7, 7.135666930647176], [8, 8.47478320561154], [11, 2.8045261093909355], [12, -2.281993199775078], [13, -1.6357081145432537], [15, -7.952378604551303], [17, 0.4597682449938426], [18, 0.9451936937278295], [19, 3.132628944681573], [20, 0.08584825548742027], [21, 0.5395808165304457], [22, 0.34189501739540956], [23, 0.3097658956369951], [24, 0.6024220472094524], [26, 0.5444069958936473], [28, -10.882856643588957], [29, 0.31049304024860014], [30, -2.8253982213833213], [31, -8.70530032511232], [33, 1.793871517110546], [34, -0.5594012330669671], [35, -1.1868627824327007], [36, -0.9199939241918982], [37, -2.310680651467844], [39, -0.015488132872005234], [41, -2.9811056160284677], [45, 5.454701228980583], [46, -2.8318324920336155], [48, -1.281717787163789], [50, -0.9407270530392909], [52, 0.9118318762765529], [53, 0.7626269821961669], [54, -0.43175986948360845], [58, -0.9455640425811835], [60, -0.6978579548178488], [62, 0.8351617492422243], [63, 0.9853161449725789], [66, -0.40655766146782457], [67, 1.2050345956368056], [69, -0.32435443970798994], [71, -0.0204117107958266], [72, 0.6152002261176135], [79, 8.16888648111602], [81, 0.3630649489459543], [82, -1.16620221659248], [84, -1.7252509449579607], [87, 0.22454659506649133], [88, 4.0225934311816145], [93, -2.292149879563076], [95, 1.1178882336408615]
RFE: "eci": [ [0, -20.456135176005482], [2, -27.13358854045982], [3, 7.5720045934089315], [4, 53.94882683633311], [6, 6.852033352758312], [7, 8.826709142888996], [8, 8.641018451922879], [12, -4.023590226367106], [15, -7.810186590633954], [19, 3.11882765224048], [28, -12.080911886751952], [30, -2.6364735509293267], [31, -9.046543965031864], [33, 1.0358494417849053], [35, -3.114854204342694], [37, -2.115602437666141], [41, -4.113143183579206], [43, -1.229357252360113], [45, 4.681607737676124], [46, -2.181593493438527], [53, 1.3846288365993207], [79, 8.552053348885764], [84, -1.5817638610079263], [88, 5.2044808931278155], [93, -2.7286133899015983]
LASSO: "eci": [ [0, -5.325432888994544], [1, -0.5820041170949678], [6, 1.273292451907467], [16, -0.022881717946859805], [20, 0.1225649412971051], [21, 0.47008788934878937], [22, 0.11901468724973925], [27, 0.823398229643995], [29, 0.3071945887665172], [31, 2.2149572124258263], [38, 0.6920388539224142], [39, 0.06813644969201155], [45, 1.5109072420137946], [50, 0.13782043932927016], [51, 0.10476058476077763], [52, 0.0665648836507395], [55, 0.44617848182189146], [60, 0.1820211308869986], [66, 0.13786548648569144], [70, 0.32886209255022764], [72, 0.5306210415480956], [77, 0.04885332239097982], [88, 0.012697742053733705], [89, 0.4645779830100614], [92, -0.23492472538722786], [94, -0.37779118011858165]
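To compare the sets side by side, a minimal sketch, assuming each "eci" list is a list of [basis function index, value] pairs as pasted above. The helper name `summarize_eci` is hypothetical and not part of CASM, and only the first five RFE entries are used here to keep the example short:

```python
def summarize_eci(pairs):
    """Return simple size statistics for a list of [index, value] ECI pairs."""
    values = [value for _, value in pairs]
    return {
        "n_terms": len(values),                     # number of nonzero ECIs
        "max_abs": max(abs(v) for v in values),     # largest magnitude
        "l1_norm": sum(abs(v) for v in values),     # sum of magnitudes
    }


# First five entries of the RFE list above, truncated for illustration only.
rfe_head = [
    [0, -20.456135176005482],
    [2, -27.13358854045982],
    [3, 7.5720045934089315],
    [4, 53.94882683633311],
    [6, 6.852033352758312],
]

summary = summarize_eci(rfe_head)
print(summary)
```

Running the same helper on the full GA, RFE and LASSO lists gives the term count and magnitude of each fit in one place.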
It's unusual to have ECI values this large, which may be a sign of a poor fit or overfitting, although that can depend on your choice of reference energies. Note that the formation energies reported by casm query are per unit cell, so the CV and RMSE reported by casm-learn are per unit cell as well. A good fit typically has errors on the order of meV per atom, so divide the reported values by the number of atoms per unit cell before comparing. Ultimately you have to judge what is necessary based on the predicted results.
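The per-atom conversion can be sketched as follows. This assumes the reported CV values are in eV per unit cell (the thread does not state the units), and `ATOMS_PER_CELL = 4` is a hypothetical placeholder to be replaced with the number of atoms in your primitive cell:

```python
def per_atom_mev(error_per_cell_ev, atoms_per_cell):
    """Convert an error in eV per unit cell to meV per atom."""
    return 1000.0 * error_per_cell_ev / atoms_per_cell


# CV scores quoted in the question; units assumed to be eV per unit cell.
cv_per_cell = {"GA": 0.04, "LASSO": 0.06, "RFE": 0.06}

ATOMS_PER_CELL = 4  # hypothetical value, not taken from the thread

cv_per_atom = {
    name: per_atom_mev(cv, ATOMS_PER_CELL) for name, cv in cv_per_cell.items()
}

for name, cv in cv_per_atom.items():
    print(f"{name}: {cv:.1f} meV/atom")
```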
I think the error is on the order of meV per atom, but on the larger side.
Can you be more specific? How large does an ECI value have to be before it counts as large? How can I avoid overfitting? The fit is actually not bad, but it may use too many ECI terms.