Finish Parametric UMAP model + add datasets
I added an initial Parametric UMAP model family + one example model in https://github.com/vocalpy/vak/pull/688, fixing #631. I went ahead and merged that PR so we could work on other things, since a lot of changes were needed to be able to add that model.
There's still additional work to be done, though:
- [ ] fix / better test the prep step: I get a train split that is 0.99 seconds even when I set the target duration to 0.2 seconds; likewise, I got a val split that was 0.97 seconds when I set the target duration to 0.1 (see the first sketch after this list)
  - is this because we are using entire files somehow?
- [ ] figure out whether we need to shuffle the data for training; it's not clear to me that this is needed
- [ ] make sure we have access to labels for training and eval when needed
- [ ] do we need a labelmap.json for this? We're not predicting labels, so there's no reason to map labels <-> consecutive integers
- [ ] finish predict function
- [ ] test that `vak.predict_.predict` calls this function appropriately
- [ ] add learncurve function
- [ ] add metrics from this paper to the val step / evaluation: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.13754 (see the metrics sketch after this list)
- [ ] test whether embedding the entire dataset on a graph has an impact on validation / test set performance
  - [ ] i.e., is it ok to just embed the val / test splits separately? how does this affect estimates of loss and other metrics?
  - [ ] if so, we should warn when people don't make val / test sets
- [ ] add documentation with example tutorials
- [ ] add some version of the prepared datasets from Sainburg et al. 2020, and models trained on those datasets
- [ ] add / test the ability to continue training already-trained models (see the resume-from-checkpoint sketch after this list)
- [ ] add back and use the labelmap for eval, e.g. for metrics from https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.13754
- [ ] modify the training dataset so that training doesn't always take forever; could we write a custom sampler that uses the probabilities to weight which samples it grabs for each batch? (see the sampler sketch after this list)
- [ ] evaluate the effect of hyperparameters / architecture on the model. To speed up tests I made the default number of filters in each layer of ConvEncoderUMAP much smaller (in 454f159bbca890131f4ffcbb4f3f376f07c1e138), which dropped the checkpoint size from ~1.7 GB to ~25 MB (see the parameter-count sketch after this list)
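
Some rough sketches for the items above.

On the prep-step durations: my guess is that splits are assembled from whole files, so a split's total duration can only land on a sum of whole-file durations, and it will overshoot a short target. A minimal illustration of that failure mode (hypothetical logic, not vak's actual prep code):

```python
# Hypothetical sketch of how building splits from whole files overshoots
# a short target duration; NOT the actual vak prep implementation.
def make_split(file_durations: list[float], target_dur: float) -> list[int]:
    """Greedily add whole files until total duration reaches ``target_dur``."""
    split, total = [], 0.0
    for idx, dur in enumerate(file_durations):
        split.append(idx)
        total += dur
        if total >= target_dur:
            break
    return split

# With ~1-second files and a 0.2-second target, the split is ~1 second long,
# matching the 0.99-second train split observed above.
print(make_split([0.99, 1.01, 0.95], target_dur=0.2))  # -> [0]
```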
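
On evaluation metrics: I still need to check exactly which metrics the paper uses, but one common way to score embedding quality when ground-truth labels are available is a cluster-separation measure like silhouette score (this also motivates the "access to labels for eval" item). A sketch with scikit-learn, where random arrays stand in for real embeddings and labels:

```python
# Sketch of one candidate embedding-quality metric; random data stands in
# for Parametric UMAP embeddings and ground-truth labels.
import numpy as np
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(seed=42)
embeddings = rng.normal(size=(100, 2))          # stand-in for model output
labels = rng.integers(low=0, high=5, size=100)  # stand-in for ground truth

score = silhouette_score(embeddings, labels)  # in [-1, 1]; higher = better-separated clusters
print(f"silhouette score: {score:.3f}")
```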
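
On continuing training (the resume-from-checkpoint item): since models train with Lightning, resuming might be as simple as passing `ckpt_path` to `Trainer.fit`, which restores weights, optimizer state, and the epoch counter. A self-contained sketch (the tiny model below is a stand-in, not ConvEncoderUMAP):

```python
# Sketch: resume training from a saved Lightning checkpoint via ``ckpt_path``.
import torch
import pytorch_lightning as pl

class TinyModel(pl.LightningModule):  # stand-in for a real vak model
    def __init__(self):
        super().__init__()
        self.layer = torch.nn.Linear(8, 2)

    def training_step(self, batch, batch_idx):
        (x,) = batch
        return self.layer(x).pow(2).mean()  # dummy loss

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())

loader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(torch.randn(64, 8)), batch_size=16
)

# first run: train briefly, then save a checkpoint
trainer = pl.Trainer(max_epochs=2, logger=False, enable_checkpointing=False)
trainer.fit(TinyModel(), loader)
trainer.save_checkpoint("example.ckpt")

# later: continue from epoch 2 to epoch 4; weights, optimizer state,
# and epoch counter are all restored from the checkpoint
trainer = pl.Trainer(max_epochs=4, logger=False, enable_checkpointing=False)
trainer.fit(TinyModel(), loader, ckpt_path="example.ckpt")
```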
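
On the custom sampler: PyTorch's built-in `torch.utils.data.WeightedRandomSampler` does essentially what's described, and its `num_samples` argument also lets us cap how many draws count as one epoch, so training doesn't have to iterate over every repeated edge. A sketch, where `edge_weights` is a stand-in for the probabilities we'd pull from the UMAP graph:

```python
# Sketch: draw batches weighted by (stand-in) UMAP edge probabilities,
# with a fixed number of samples per epoch; not vak's actual dataset code.
import torch
from torch.utils.data import DataLoader, TensorDataset, WeightedRandomSampler

n_edges = 1000
data = TensorDataset(torch.randn(n_edges, 32))  # stand-in for per-edge samples
edge_weights = torch.rand(n_edges)              # stand-in for graph probabilities

# high-probability edges are drawn more often; an "epoch" is just 200 draws
sampler = WeightedRandomSampler(weights=edge_weights, num_samples=200, replacement=True)
loader = DataLoader(data, batch_size=64, sampler=sampler)

for (batch,) in loader:
    print(batch.shape)  # torch.Size([64, 32]); the last batch is smaller
    break
```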
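
On checkpoint size (the parameter-count sketch): parameter count times 4 bytes for float32, plus optimizer state, predicts checkpoint size, and in a conv encoder that flattens into a fully-connected layer, that linear layer usually dominates, which would explain why shrinking the number of filters had such a large effect. A generic illustration (these layer sizes are made up, not ConvEncoderUMAP's actual architecture):

```python
# Rough illustration: the Linear layer after Flatten dominates parameter
# count, so fewer conv filters => drastically smaller checkpoints.
# NOTE: generic made-up layer sizes, not ConvEncoderUMAP's architecture.
import torch

def n_params(module: torch.nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

def encoder(n_filters: int) -> torch.nn.Module:
    return torch.nn.Sequential(
        torch.nn.Conv2d(1, n_filters, kernel_size=3, padding=1),
        torch.nn.Conv2d(n_filters, n_filters, kernel_size=3, padding=1),
        torch.nn.Flatten(),
        # assumes 64x64 inputs, so the flattened size is n_filters * 64 * 64
        torch.nn.Linear(n_filters * 64 * 64, 256),
    )

for n_filters in (256, 8):
    params = n_params(encoder(n_filters))
    print(f"{n_filters:3d} filters: {params:,} params ~ {params * 4 / 1e6:,.0f} MB as float32")
```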