keepsake
Version control for machine learning
# Problem It's very long, which makes sense because it's a comprehensive inspection of all the data about an experiment/checkpoint. But the most useful information is at the top, which...
# Why? It is possible to record params when creating an experiment, and it is possible to record metrics when creating a checkpoint, but sometimes you need to record the...
[We are seeing failures where the heartbeat is invalid JSON.](https://github.com/replicate/replicate/runs/1622823932) This implies writes are incomplete when they are read. Any writes to disk storage should be atomic, e.g. using https://github.com/google/renameio. Blocked on...
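renameio is a Go library, but the underlying fix is the same in any language: write to a temporary file in the target directory, then atomically rename it over the destination. A minimal Python sketch of that pattern (the function name is illustrative, not keepsake's API):

```python
import json
import os
import tempfile

def atomic_write_json(path, obj):
    """Write obj as JSON to path atomically: a concurrent reader sees
    either the old complete file or the new complete file, never a
    partially written one."""
    dir_name = os.path.dirname(os.path.abspath(path))
    # The temp file must live in the same directory (same filesystem)
    # as the destination, or the rename is not atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dir_name, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(obj, f)
            f.flush()
            os.fsync(f.fileno())  # data reaches disk before the rename
        os.replace(tmp_path, path)  # atomic on POSIX and Windows
    except BaseException:
        os.unlink(tmp_path)
        raise
```

With this pattern, a reader polling the heartbeat file can never observe truncated JSON; a crash mid-write leaves only an orphaned `.tmp` file behind.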
# Why Experiments in Replicate are currently just a "bundle" of experiments, not related to each other. More often than not when running an experiment, you are building off a...
# Why You can add params to experiments when starting them, but you might also want to add metadata after the fact to annotate them. For example: - "bad"...
Currently, running `make develop` and `make test` installs Python packages via the default system Python (if a virtual env is not set up). Ideally, there should be an optional step to...
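One way the Makefile could guard against polluting the system Python is to check for an active virtual environment before installing. A minimal sketch of that check, assuming nothing about keepsake's build scripts:

```python
import sys

def in_virtualenv():
    """Return True when running inside a virtual environment.

    The stdlib venv module sets sys.prefix != sys.base_prefix;
    older virtualenv versions set sys.real_prefix instead.
    """
    return (
        sys.prefix != getattr(sys, "base_prefix", sys.prefix)
        or hasattr(sys, "real_prefix")
    )
```

A make target could run this and abort (or just warn) when it returns False, so `pip install` never touches the system site-packages by accident.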
# Why Since ML models are often slow and expensive to train, we tend to spend a lot of time fine tuning computational performance. If we run our own servers...
I thought we were tracking Python version but we're not. Some things off the top of my head: - Python version - Operating system - Architecture - CUDA version -...
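Most of these can be collected from the standard library; CUDA version would need a separate probe (e.g. parsing `nvcc --version` output), which is omitted here. A sketch, with a function name chosen for illustration:

```python
import platform
import sys

def environment_metadata():
    """Collect basic environment info to record alongside an experiment."""
    return {
        "python_version": platform.python_version(),  # e.g. "3.9.1"
        "operating_system": platform.system(),        # "Linux", "Darwin", "Windows"
        "architecture": platform.machine(),           # e.g. "x86_64", "arm64"
        "python_executable": sys.executable,          # which interpreter ran this
    }
```

Recording this dict with each experiment would make "it worked on my machine" differences (interpreter version, OS, architecture) visible when comparing runs.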
# Why Currently we just output simple matplotlib charts. It would be nice to have some interactive plots for: - Viewing data when hovering - Updating output without editing code...
# Why We record files, but they don't show up in `replicate diff`. This would be useful to understand what changes in code/files caused changes in your metrics. # How...