
Finish Parametric UMAP model + add datasets

Open · NickleDave opened this issue 2 years ago · 0 comments

I added an initial Parametric UMAP model family + one example model in https://github.com/vocalpy/vak/pull/688, fixing #631. I went ahead and merged it so we could work on other things; there were a lot of changes needed to be able to add that model.

There's still additional work to be done though:

  • [ ] fix / better test the prep step -- I get a train split of 0.99 seconds even when I set the target duration to 0.2 seconds; likewise I got a val split of 0.97 seconds when I set the target duration to 0.1
    • is this because we are using entire files somehow?
  • [ ] figure out whether we need to shuffle for training -- it's not clear to me that this is needed
  • [ ] make sure we have access to labels for training and eval when needed
    • [ ] do we need a labelmap.json for this? We're not predicting labels, so there's no need to map labels <-> consecutive integers
  • [ ] finish predict function
    • [ ] test that vak.predict_.predict calls this function appropriately
  • [ ] add learncurve function
  • [ ] add metrics from this paper to val step / evaluation: https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.13754
  • [ ] test whether embedding the entire dataset on a graph has an impact on validation / test set performance
    • [ ] i.e., is it ok to just embed the val / test splits separately? how does this affect estimates of loss and other metrics?
    • [ ] if so we should warn when people don't make val / test sets
  • [ ] add documentation with example tutorials
  • [ ] add some version of prepared datasets from Sainburg et al 2020 and models trained on those datasets
  • [ ] add / test ability to continue training of already trained models
  • [ ] add back and use labelmap for eval -- e.g. for metrics from https://besjournals.onlinelibrary.wiley.com/doi/full/10.1111/1365-2656.13754
  • [ ] modify the training dataset so that training doesn't always take forever; could we write a custom sampler that uses the probabilities to weight which samples it draws for each batch?
  • [ ] evaluate the effect of hyperparameters / architecture on the model. To speed up tests I made the default number of filters in each layer of ConvEncoderUMAP much smaller (in 454f159bbca890131f4ffcbb4f3f376f07c1e138), which dropped the checkpoint size from ~1.7 GB to ~25 MB
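To illustrate the suspected cause of the prep bug in the first item: if a split is built by adding whole files until the target duration is reached, the split can overshoot the target by up to one file's duration. A minimal sketch of that failure mode (the `make_split` helper and the durations here are hypothetical, not vak's actual prep code):

```python
def make_split(durations, target):
    """Greedily add whole files until the total duration meets ``target``.

    Hypothetical sketch of the suspected prep behavior: because files
    are never subdivided, the split overshoots ``target`` by up to one
    file's duration.
    """
    total, split = 0.0, []
    for dur in durations:
        if total >= target:
            break
        split.append(dur)
        total += dur
    return split, total

# one 0.99 s file already exceeds a 0.2 s target, reproducing the
# observed train-split duration
split, total = make_split([0.99, 0.5], target=0.2)
assert total == 0.99
```

If this is what's happening, the fix would need to either subdivide files or pick files whose durations sum close to the target.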
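Re: the labelmap question -- if we do end up needing labels for eval metrics, a minimal labelmap is just a stable mapping from labels to consecutive integers. A sketch (the `make_labelmap` helper is hypothetical; vak's actual labelmap format may differ):

```python
import json

def make_labelmap(labels):
    """Map a collection of labels to consecutive integers.

    Hypothetical sketch: sorting the unique labels makes the mapping
    deterministic across runs, so a saved labelmap.json stays valid.
    """
    return {lbl: i for i, lbl in enumerate(sorted(set(labels)))}

labelmap = make_labelmap(["b", "a", "b", "c"])
assert labelmap == {"a": 0, "b": 1, "c": 2}

# serialize for use at eval time
as_json = json.dumps(labelmap)
```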
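Re: the custom-sampler idea -- `torch.utils.data.WeightedRandomSampler` already draws dataset indices in proportion to per-sample weights, so it might get us most of the way there. A sketch (the probabilities below are made up for illustration; in practice they'd come from the UMAP graph):

```python
import torch
from torch.utils.data import WeightedRandomSampler

torch.manual_seed(0)  # reproducible sketch

# made-up per-sample probabilities standing in for the graph's
# edge probabilities
probs = torch.tensor([0.05, 0.90, 0.05])

# draw indices in proportion to probs; replacement=True lets
# high-probability samples appear more than once per "epoch",
# so an epoch can be far shorter than iterating every edge
sampler = WeightedRandomSampler(weights=probs, num_samples=1000, replacement=True)
indices = list(sampler)
```

A `DataLoader` accepts this via its `sampler` argument, so no custom batching logic would be needed.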

NickleDave · Aug 14 '23 01:08