
Help with using a mix of static and dynamic covariates

Open Alessandra-Gentile opened this issue 4 years ago • 7 comments

I am trying to run a multispecies VAST with a mix of static (year = NA) and dynamic (year = year) covariates and I am having some difficulty setting up my covariate_data.

I see on your website (https://github.com/James-Thorson-NOAA/VAST/wiki/Specifying-covariates) that it says:

"If using a mix of static and dynamic covariates, then duplicate rows for static covariates for every value of Year"

This fills the duplicate rows for static covariates when you have dynamic covariates, but it does not fill the duplicate rows for dynamic covariates when you have static covariates.

Therefore, I have instead tried to fill duplicate rows with NA for both static and dynamic duplicated rows.

Is this correct?

VAST runs; however, I then run into difficulty incorporating formula=... into make_data() due to the NAs in the duplicated rows for both static and dynamic covariates.

When using poly, I get the error "missing values are not allowed in 'poly' ".

And when using direct quadratic and bs, I get "Error in X[which(Year_Set[tI] == covariate_df[, "Year"]), , drop = FALSE] : subscript out of bounds."

To summarize: when using a mix of static and dynamic covariates, do I fill duplicate rows with NA to set up covariate_data? And how do I address the NAs in covariate_data when incorporating formula=... into make_data()?

Alessandra-Gentile avatar Jan 11 '21 22:01 Alessandra-Gentile

Sorry, I obviously don't have a good example to show here. I'll work on getting a data set with both static and dynamic covariates for the Wiki.

It seems to me that you could do some data-prep work externally to VAST to duplicate static covariates for each year, and then use spatial classes or nearest neighbors to input a spatially balanced set of covariates (no NAs) when combining rows for those static and dynamic covariates. Have you explored that option a bit?
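For example, something along these lines (the object names static_df and dynamic_df are placeholders, and the nearest-neighbor lookup uses the RANN package):

# Placeholder sketch: attach the static covariate to every dynamic-covariate row
# by nearest neighbor, so no covariate column contains NAs.
library(RANN)

# static_df: one row per location with columns Lat, Lon, Habitat
# dynamic_df: one row per tow with columns Lat, Lon, Year, Temp
nn <- RANN::nn2(data  = static_df[, c("Lat", "Lon")],
                query = dynamic_df[, c("Lat", "Lon")],
                k = 1)
dynamic_df$Habitat <- static_df$Habitat[nn$nn.idx[, 1]]

# One row per location-year, with both covariates filled in
covariate_data <- dynamic_df[, c("Lat", "Lon", "Year", "Temp", "Habitat")]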

James-Thorson-NOAA avatar Jan 12 '21 00:01 James-Thorson-NOAA

Thanks for the reply,

I am able to externally duplicate the static variables at each knot over all years (since knot locations are the same in each year) by taking the value associated with each knot in every year and then averaging over the years, so that each knot ends up with a single value applied across all years.
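For example, something like this sketch (knot_values is a placeholder data frame with columns knot, Year, and Habitat):

# Average the static covariate over years so each knot gets a single value
knot_means <- aggregate(Habitat ~ knot, data = knot_values, FUN = mean)

# Then replicate that single value for every year in the survey
knot_static <- merge(knot_means, data.frame(Year = sort(unique(knot_values$Year))))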

However, I am having difficulty externally duplicating the static variables for each tow (since tow locations vary by year), as required by covariate_data.

I will keep working on it too.

Alessandra-Gentile avatar Jan 12 '21 15:01 Alessandra-Gentile

As a note to myself: in thinking about this further, I could add a new block of code to make_covariates around line 47 here that fills in NA values in covariate_data by row using nearest neighbors. However, this would have some annoying consequences, i.e., subsequent nearest-neighbors usage wouldn't result in a Voronoi tessellation for each individual covariate. I'll have to think about whether it's worth that trade-off.
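Roughly the kind of fill-in I have in mind (illustration only, not code from make_covariates; covariate_data is assumed to have Lat and Lon columns):

library(RANN)

fill_na_by_nn <- function(covariate_data, cov_name) {
  # Replace NA entries in one covariate column with the value at the nearest
  # non-NA location (Year is ignored, which only makes sense for static covariates)
  miss <- is.na(covariate_data[[cov_name]])
  if (!any(miss)) return(covariate_data)
  nn <- RANN::nn2(data  = covariate_data[!miss, c("Lat", "Lon")],
                  query = covariate_data[miss, c("Lat", "Lon")],
                  k = 1)
  covariate_data[[cov_name]][miss] <- covariate_data[[cov_name]][!miss][nn$nn.idx[, 1]]
  covariate_data
}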

James-Thorson-NOAA avatar Jan 12 '21 18:01 James-Thorson-NOAA

Hello @James-Thorson-NOAA and @Alessandra-Gentile, I am wondering whether there has been any update on this as I am also encountering an issue incorporating both static and dynamic covariates.

To quickly summarize my problem, I first create a data table that has a row for each sampling location, with columns for Lat, Lon, Year, Temp (dynamic), and Habitat (static). I then copy the habitat value (static) from each row into each year, but in doing so I need to include a value for Temp (dynamic). If I copy the single observed temp into every year it implies that temp is constant over time at that location, and if I set the temp to NA I receive the error "Error in poly(Temp, degree = 2) : missing values are not allowed in 'poly'".

I am working with simulated data so I will try to input the actual temp value at these locations in each year to see if it will run, but I think there may be issues with that as well. My worries include the possibility that this may lead to overfitting the model and the fact that this approach would not be possible for data collected in the real world.

Blevy2 avatar Sep 15 '22 16:09 Blevy2

Thanks for revisiting this! But I'm also not sure I'm following. I think it should work if you have a single row for each unique combination of Lat-Lon-Year, with columns listing the Temperature and Habitat for that Lat-Lon-Year, where obviously the Habitat will be the same for every Year at a given Lat-Lon but Temperature will likely differ for each Lat-Lon-Year. Is that what you've tried including? If this isn't working, do you mind sending me a minimal example to check out?
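For example, something like this (habitat_df with one row per Lat-Lon and temp_df with one row per Lat-Lon-Year are placeholders):

# Expand the static covariate across all years, then join on the dynamic one,
# giving one row per Lat-Lon-Year with both Habitat and Temp filled in
years <- sort(unique(temp_df$Year))
habitat_by_year <- merge(habitat_df, data.frame(Year = years))   # cross join over years
covariate_data  <- merge(habitat_by_year, temp_df, by = c("Lat", "Lon", "Year"))
head(covariate_data)   # columns: Lat, Lon, Year, Habitat, Temp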

James-Thorson-NOAA avatar Sep 19 '22 22:09 James-Thorson-NOAA

Hi @James-Thorson-NOAA , thanks for the response. So I did try what you suggested but ran into some issues and also have a few concerns. For reference, I am running VAST models on simulated data from a model that I created where movement is dependent on a combination of static habitat preferences and dynamic temperature preferences. I therefore know for certain which covariates impact presence/absence as well as density and am trying to incorporate them into my VAST predictions.

One thing I am wondering is whether, while including a row for every year for the static covariate (habitat) makes sense, repeating the same information for the dynamic covariate (temperature) adds a problematic amount of additional data. For example, I have seasonal sampling data for 20 years in 14 strata, which produces a data table with 1,140 rows per season. When I repeat rows as described, it creates a data table with 22,800 rows per season. Do you think this might cause issues related to overfitting and/or long model run times?

Additionally, when I repeat rows for the static habitat covariate I have no problem populating the temperature column, because I know the spatial temperatures used in my simulations, but I wonder what someone using real data would do. I guess I was assuming that if someone was using real tow data they would only know the temperature recorded during the tow and not in all other years at the same location. However, in that case would someone just rely on estimated temperature information from something like ROMS or FVCOM?

Finally, I did run the model as you described by repeating rows for each static covariate and populating the temperature column as well. It ran for over 14 hours before providing an error that the upper bounds for both ln_H_input and logkappa1 were reached. I am not sure what to make of the ln_H_input error, but my understanding of the logkappa1 issue is that it means the convergence is very slow. I think this would then imply the surface of the objective function being optimized is fairly flat, which is confusing because I know the covariates that influence both presence/absence and density.

I am very much interested in hearing anyone's thoughts about this because I am able to get VAST models to converge with either static or dynamic covariates, but not the two combined. I will also keep working at it and report back if I figure it out.

thanks!

Ben

Blevy2 avatar Sep 22 '22 13:09 Blevy2

As a quick follow-up, I just ran the same model except with different X1_formula and X2_formula covariate formulas, and was able to get the model to converge very quickly (see the two formulations below). Formulation A did not converge after 14 hours of run time, but Formulation B converged very quickly.

Formulation A
X1_formula = ~ poly(Habitat, degree = 2)
X2_formula = ~ poly(Temp, degree = 2) + poly(Habitat, degree = 2)

Formulation B
X1_formula = ~ poly(Temp, degree = 2)
X2_formula = ~ poly(Temp, degree = 2) + poly(Habitat, degree = 2)
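For reference, I am supplying these roughly like this to fit_model (Formulation B shown; sample_data and settings are placeholders):

# Sketch: covariate formulas plus covariate_data passed to fit_model
fit <- fit_model(settings       = settings,
                 Lat_i          = sample_data$Lat,
                 Lon_i          = sample_data$Lon,
                 t_i            = sample_data$Year,
                 b_i            = sample_data$Catch_KG,
                 a_i            = sample_data$AreaSwept_km2,
                 X1_formula     = ~ poly(Temp, degree = 2),
                 X2_formula     = ~ poly(Temp, degree = 2) + poly(Habitat, degree = 2),
                 covariate_data = covariate_data)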

In my model the static habitat value mostly drives presence/absence while temperature preferences contribute to density, so it is not clear to me how or why the formulas I chose initially would not converge.

In general I have noticed that the X1_ and X2_ formulas significantly influence convergence of a given model with covariates. I would like to determine the best way to incorporate covariates, so I am wondering whether you have any insight about how to choose the correct formulation? We have a discussion going in our VAST user Slack channel where @aallyn has provided some interesting insight about the relationship between these formulas and the specific model structure being used (linked vs. independent).

Thanks for the discussion so far. I would be happy to keep it going!

Blevy2 avatar Sep 22 '22 16:09 Blevy2