
Several minor issues

Open · andliszmmu opened this issue on Dec 26, 2021 · 11 comments

Dear Jamie, I work with SDMs in a big project based on real data from Russia. First of all, thank you very much for the excellent tool you made. It makes our life better! ))) I adapted to the new version of ENMeval quite recently and found some issues that look like bugs to me. Maybe they are features ))) The first concerns the excellent doClamp = T option. It is really a very useful thing, especially in my case, where huge territories are sparsely covered with data. However, I cannot get ENMeval to work with rasters containing any NAs when doClamp = T. I masked and cropped everything against everything, so I am sure that all rasters have NAs in the same places and no points fall outside the data. Is it the case that clamping does not work with rasters containing NAs, or is the problem in my hands?

mod <- ENMevaluate(csv.spp@coords, env.pc, bg = bg, occs.testing = test.spp@coords, tune.args = list(fc = Pred.fs, rm = Rs), partitions = "testing", algorithm = meth, doClamp = T, other.settings = list(abs.auc.diff = T, pred.type = result_type, validation.bg = 'full'), parallel = T, numCores = NCores)

*** Running initial checks... ***

  • Clamping predictor variable rasters...
  • Model evaluations with testing data...
  • *** Running ENMeval v2.0.2 with maxent.jar v3.4.3 from dismo package v1.3.5 ***
  • | | 0%
  • Of 4 total cores using 2...
  • Running in parallel using doSNOW... |===============================================================================| 100%
  • Error in { : task 1 failed - "There are one or more NAs in the orig.vals table. Please remove them and rerun."

The second question concerns AICc again )) Now it works perfectly! It is quick and the memory use is brilliant. Nevertheless, I found that AICc is calculated on the basis of the training sample, even if a test sample is given. For sure, I can calculate AICc based on the test sample myself, but it will add some additional time to my day-long calculations. Indeed, it is a good idea to calculate any testing statistics on the basis of the independent sample, am I correct? I compared 'train' and 'test' AICc values in a big number of experiments; the 'train' one tends to 'select' the simplest model.

My code, just in case; maybe I made some mistake or overlooked some option:

mod <- ENMevaluate(csv.spp@coords, env.pc, bg = bg, occs.testing = test.spp@coords, tune.args = list(fc = Pred.fs, rm = Rs), partitions = "testing", algorithm = meth, doClamp = F, other.settings = list(abs.auc.diff = T, pred.type = "raw", validation.bg = 'full'), parallel = T, numCores = NCores)
sss <- as.data.frame(extract(mod@predictions, csv.spp))
aic.maxent(sss, mod@results$ncoef, mod@predictions)

So, 'testing' is selected, but AICc is 'training' ) By the way, I spent an hour looking for the appropriate syntax with the partitions = "testing" option. There was no alert that validation.bg must be 'full' and cannot be empty. Maybe adding such a handler will save another hour for somebody else.

andliszmmu · Dec 26 '21

@andliszmmu, thanks for the message. I will look into the NAs and clamping issue.

As for AICc, it is always calculated on the full training dataset, which includes validation data when doing cross-validation. The term "training" can be confusing because it could refer to training data used for a fold, but I am referring to all the data entered when running the function. When you select partitions = "testing", this data is not part of the full training dataset, in that it is fully withheld from any model-building, and thus is not incorporated in the AICc calculation. Information criteria are not meant to be calculated on withheld data from my understanding.

Regarding the error when selecting validation.bg = "partition" and partitions = "testing", I programmed that error in so that the function stops when a user tries to do validation background evaluation with anything but a spatial partition, as this technique is meant only for spatial ones. The point is that the validation data should be environmentally different from the training data for each fold if the partitions are spatial, and thus using the background of the validation data only to calculate performance metrics is a stricter test of performance (though there has been little research into whether this results in "better" models via model selection -- it's kind of experimental, but some people wanted the option). I will try to improve the documentation for the next version so this is less confusing.

jamiemkass · Dec 27 '21

@jamiemkass thank you very much for the answer!

Regarding the error when selecting validation.bg = "partition" and partitions = "testing", I programmed that error in so that the function stops when a user tries to do validation background evaluation with anything but a spatial partition, as this technique

Thanks, probably adding a rule to the initial checks would be even more useful, since in general this information exists in the documentation, but it is hard to process big portions of information ))) Something like if (partitions == "testing" && validation.bg != "full") stop("something").
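Written out as a standalone function, the check I have in mind might look roughly like this (just a sketch, not actual ENMeval code; the argument names follow the ENMevaluate() interface, and the message text is only a placeholder):

# Sketch of an initial check: fail early when partitions = "testing" is combined
# with anything other than validation.bg = "full".
check.partition.settings <- function(partitions, other.settings) {
  validation.bg <- other.settings$validation.bg
  if (partitions == "testing" && !identical(validation.bg, "full")) {
    stop('With partitions = "testing", other.settings must include validation.bg = "full".')
  }
  invisible(TRUE)
}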

As for AICc, it is always calculated on the full training dataset, which includes validation data when doing cross-validation. The term "training" can be confusing because it could refer to training data used for a fold, but I am referring to all the data entered when running the function. When you select partitions = "testing", this data is not part of the full training dataset, in that it is fully withheld from any model-building, and thus is not incorporated in the AICc calculation. Information criteria are not meant to be calculated on withheld data from my understanding.

Thank you. I will add one round of calculations to my code; it is not a problem. Nevertheless, since the words 'training' and 'test' each have at least two meanings, it is easy to mix things up. If data are withheld from the analysis, they are withheld ))) But from a different point of view, it is a normal testing dataset. Am I correct that this is the only way to use my own testing dataset? Maybe I am wrong, but I see the following. ENMeval gives us a number of ways to make training and test data spatially independent. That is useful for the AUC calculations (probably for OR and CBI too), but not for AICc. If I want to use AICc in model selection, I always get a set of values based on the full training sample, right? That is not the best way to evaluate, since in this case we do not avoid spatial correlation between the training and testing datasets (they are equal). Since the AICc penalty grows with ncoef, the simplest model tends to be selected; I checked, and it is so. Well, it is a kind of discussion )
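To be concrete, the extra round I plan to add is roughly the following (a sketch built from my code above, assuming raw-output prediction rasters and the aic.maxent() signature from ENMeval 2.x; the only change is that the raw values are extracted at the withheld testing coordinates instead of the training ones):

library(ENMeval)
library(raster)

# Raw predicted values at the withheld testing occurrences, one column per tuned model.
test.pred <- as.data.frame(raster::extract(mod@predictions, test.spp@coords))

# AICc recomputed from the testing occurrences rather than the full training sample.
test.aicc <- aic.maxent(test.pred, mod@results$ncoef, mod@predictions)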

andliszmmu · Dec 27 '21

One addition: it looks like the AUC and AICc values are not calculated within the same frame. AUC is calculated on the basis of the training (sensu stricto) and testing (here, withheld) data, while AICc is calculated on the training dataset only, although AICc itself would also be worth evaluating on independent data. Well, probably that is just my own opinion, but maybe it is worth noting in the documentation.

andliszmmu · Dec 27 '21

@andliszmmu Strictly speaking, AICc is calculated on the full dataset, which is referred to as "training data" when the model is built with all the data (i.e., nothing withheld). "Training AUC" is calculated the same way, with the full occurrence dataset and the full background. When you have a fully withheld dataset, sometimes called an "independent testing dataset", this is not used by either of these "training" performance metrics. However, regardless of the partitioning you choose for withheld data, all of these data are used for these metrics. Hope this makes it more clear.

jamiemkass · Jan 04 '22

Jamie, it works! I mean the doClamp option. I am really sorry. The problem was not in the option but in my spatial points, some of which fell outside the rasters. I was confused because at first I saw something like:

  • Removed 2 occurrence points with NA predictor variable values.
  • Removed 18 background points with NA predictor variable values.

and only after that did execution fail. I thought that clamping was executed on the dataset AFTER such points were removed )) But it was not )) Thank you very much for your excellent tool!
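In case it helps somebody else, a quick way to find such points before calling ENMevaluate() is something like this (a sketch using the object names from my code above; raster::extract() returns NA for points that fall outside the rasters or on NA cells):

library(raster)

# Predictor values at each occurrence point; any NA in a row means the point
# falls outside the rasters or on an NA cell.
occ.vals <- raster::extract(env.pc, csv.spp@coords)
bad <- apply(is.na(occ.vals), 1, any)
sum(bad)                    # how many points would be dropped
csv.spp <- csv.spp[!bad, ]  # keep only points with complete predictor values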

andliszmmu · Feb 02 '22

Great! Thanks for using it~

jamiemkass · Feb 03 '22

Jamie, sorry, it is me again ) Everything works now, but the doClamp = T option consumes a fantastic amount of memory for Java. I never needed more than 4 Gb for it before, but now something has changed. The same raster set works well with options(java.parameters = "-Xmx5120m") and doClamp = F, but gives "java.lang.OutOfMemoryError: Java heap space" with options(java.parameters = "-Xmx12192m") and doClamp = T. 12 Gb is a lot of memory...
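For completeness, this is how the heap size is set in my scripts (as far as I understand, the option must be set before rJava/dismo is loaded, otherwise the JVM ignores it):

# Set the Java heap size before any package that starts the JVM (rJava, dismo) is loaded.
options(java.parameters = "-Xmx12192m")
library(dismo)
library(ENMeval)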

andliszmmu · Feb 04 '22

maxent.jar itself, run standalone (not through R), works on the same dataset with -mx512m.

andliszmmu · Feb 04 '22

Andrey,

Sorry, this would be a dismo issue, because dismo is calling maxent.jar through R, which makes it slower. Not sure why it is so much slower though. However, there's nothing ENMeval can do about that. Have you tried using maxnet? Not sure, but you might see an improvement?

Otherwise, I'd recommend just using the Java software for your data, which is a pain because it's manual, but you may not have much choice. Sorry, wish there was something I could do.

jamiemkass · Feb 07 '22

Dear Jamie,

I ran some time-consuming tests )) I took a set of rasters and executed the same sequence of commands in R. The first attempt finished with dismo::maxent() and doclamp=false; the second with dismo::maxent() and doclamp=true; the third with ENMevaluate() and doClamp = F; and the last with ENMevaluate() and doClamp = T. All parameters were equal: "LQ" and rm = 1. I repeated this sequence with some step in the memory size value in options(java.parameters = "-Xmx180m"), on Windows, and I closed R after each cycle. The maximum Java memory value was recorded whenever a run was successful.

Both runs with maxent were successful starting from 10.5M, and I could not find a difference in memory consumption between doclamp=false and doclamp=true. ENMevaluate() with doClamp = F was successful starting from 180M, and with doClamp = T starting from 200M. So... what could be the reason? Hardly dismo alone? I could understand memory consumption growing twice (two cores), but about 20 times... Just in case, I copy the commands here; maybe some of the arguments consume memory:

modm <- maxent(env.pc, csv.spp, a=bg, args=c('linear=true', 'quadratic=true', 'product=false', 'hinge=false', 'threshold=false', 'betamultiplier=1.0', 'doclamp=true'))
mod <- ENMevaluate(csv.spp@coords, env.pc, bg = bg, occs.testing = test.spp@coords, tune.args = list(fc = c("LQ"), rm = c(1)), partitions = "testing", algorithm = meth, doClamp = T, other.settings = list(abs.auc.diff = T, pred.type = 'raw', validation.bg = 'full'), parallel = T, numCores = NCores, overlap = F)

I tried to use maxnet from the moment it appeared. However, two points killed my enthusiasm. The first is the endless number of bugs in the earlier versions; since I vary many parameters, I catch some bug anyway. The second is that the result object has a different structure, with some outputs that are important for me missing. So I dropped the idea of continuing the torture with this function ))

andliszmmu · Feb 11 '22

Glad to know all the runs work, but it is strange that there is such a discrepancy between ENMeval and dismo::maxent when they are doing similar operations. As you know, ENMeval is a big wrapper for dismo::maxent (or other algorithm), so it is doing more operations. I will try to look into this to see if I can shed some light. Thanks for your efforts!

Jamie

jamiemkass · Feb 13 '22

Not sure if this still applies now. Closing for now.

jamiemkass · Jun 14 '24