System metrics
Why
Since ML models are often slow and expensive to train, we tend to spend a lot of time fine-tuning computational performance. If we run our own servers we stare at nvidia-smi, htop, iotop, iftop, etc., which is far from ideal. If we're using Colab we're mostly left guessing.
Additionally, for reproducibility it's important to know how much CPU, GPU, and memory was consumed when deciding what type of machine is required to replicate a result.
How
replicate.checkpoint() automatically attaches system_metrics to the checkpoint data, which includes:
- CPU usage per CPU (pegged CPUs in data loaders are a common bottleneck)
- GPU usage per GPU
- GPU memory usage (since TF allocates all the GPU memory by default, we might have to think of something smart here)
- System memory usage
- Disk bytes read/written
- Network bytes read/written
- etc.
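For illustration, here is a minimal sketch of how such a snapshot could be gathered with psutil and pynvml. The helper name and the dict keys are made up for this example; this is not the actual keepsake implementation.

```python
import psutil
import pynvml  # NVIDIA Management Library bindings


def collect_system_metrics():
    """Hypothetical helper: one snapshot of the metrics listed above."""
    pynvml.nvmlInit()
    gpus = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpus.append({
            "gpu_percent": util.gpu,
            "memory_used_bytes": mem.used,
            "memory_total_bytes": mem.total,
        })
    pynvml.nvmlShutdown()

    disk = psutil.disk_io_counters()
    net = psutil.net_io_counters()
    return {
        "cpu_percent_per_cpu": psutil.cpu_percent(percpu=True),
        "memory_percent": psutil.virtual_memory().percent,
        "gpus": gpus,
        "disk_read_bytes": disk.read_bytes,
        "disk_write_bytes": disk.write_bytes,
        "net_bytes_sent": net.bytes_sent,
        "net_bytes_recv": net.bytes_recv,
    }
```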
User data
One user asked for this because a change in CUDA version made their results irreproducible, in a horribly hard-to-find way.
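Recording the versions that commonly break reproducibility is cheap to do alongside the metrics. A rough sketch, assuming PyTorch is installed and nvidia-smi is on the PATH (the function name and dict keys are arbitrary):

```python
import subprocess
import torch


def collect_environment_info():
    """Illustrative only: capture versions that can silently change results."""
    driver = subprocess.run(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        capture_output=True, text=True,
    ).stdout.strip().splitlines()
    return {
        "torch": torch.__version__,
        "cuda": torch.version.cuda,               # CUDA version PyTorch was built with
        "cudnn": torch.backends.cudnn.version(),  # None if cuDNN is unavailable
        "nvidia_driver": driver[0] if driver else None,
    }
```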
What operating systems are you targeting? You could get away with something like eBPF here if limiting it to Linux. It will be very lightweight and won't detract from any ML processing resources.
You could get all of this in a fairly straightforward way. The difficult question will be figuring out the sampling rate you want to aggregate at (15 seconds?).
If you're a team using AWS/GCP, these metrics may not matter that much to you compared with tracking the instance types you're using. That gives you better signals on the kinds of resource/budget limitations you may have had.
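For what it's worth, picking up the instance type automatically is a small amount of code against the standard cloud metadata endpoints; a rough sketch (error handling kept minimal):

```python
import urllib.request


def detect_instance_type(timeout=1):
    """Best-effort lookup of the cloud instance/machine type; None when off-cloud."""
    endpoints = [
        # AWS instance metadata service (IMDSv1 shown for brevity)
        ("http://169.254.169.254/latest/meta-data/instance-type", {}),
        # GCP metadata server requires this header
        ("http://metadata.google.internal/computeMetadata/v1/instance/machine-type",
         {"Metadata-Flavor": "Google"}),
    ]
    for url, headers in endpoints:
        try:
            req = urllib.request.Request(url, headers=headers)
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                return resp.read().decode().strip()
        except OSError:
            continue
    return None
```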
Hi @andreasjansson. I had an idea along the same lines: adding some of the training metadata to the checkpoints, like basic GPU specifications (GPU name, memory, and driver version), time taken for each epoch, etc. The motive I had in mind was that this helps in benchmarking models and hardware.
Basic GPU specifications can be obtained from pynvml, a Python wrapper for NVIDIA's NVML. I'm not sure how to implement the time-taken-per-epoch part.
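A rough sketch of both ideas, assuming pynvml is installed; the function names here are illustrative, not an existing keepsake API:

```python
import time
from contextlib import contextmanager

import pynvml


def gpu_specs():
    """Basic specs for every visible GPU, queried through NVML."""
    pynvml.nvmlInit()
    specs = []
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        name = pynvml.nvmlDeviceGetName(handle)
        specs.append({
            "name": name.decode() if isinstance(name, bytes) else name,
            "memory_total_bytes": pynvml.nvmlDeviceGetMemoryInfo(handle).total,
            "driver_version": pynvml.nvmlSystemGetDriverVersion(),
        })
    pynvml.nvmlShutdown()
    return specs


@contextmanager
def epoch_timer(times):
    """Append the wall-clock duration of the wrapped block to `times`."""
    start = time.monotonic()
    yield
    times.append(time.monotonic() - start)
```

Usage would be something like `with epoch_timer(epoch_times): ...` around each epoch's training loop, and both `gpu_specs()` and `epoch_times` could then be attached to the checkpoint metadata.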
https://github.com/qiuqiangkong/audioset_tagging_cnn/blob/750c318c0fcf089bd430f4d58e69451eec55f0a9/pytorch/pytorch_utils.py#L144 has some code for counting the number of MFLOPS, which is a useful, non-machine-specific way of profiling different neural net architectures.
pytorch_memlab has some good profiling tools for PyTorch, which could also be useful.
@kvthr GPU metrics would be fantastic; I've found memory and per-GPU utilization to be very helpful when debugging bottlenecks.
FLOPS / MACs would be really good too. Thanks for that link @turian! There's also https://github.com/sovrasov/flops-counter.pytorch and https://github.com/Lyken17/pytorch-OpCounter, I haven't looked into them in detail so I don't know how they compare.
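For reference, here is a quick sketch of how pytorch-OpCounter (imported as thop) is typically used, based on its README; the model and input shape are arbitrary examples:

```python
import torch
import torchvision.models as models
from thop import profile  # pytorch-OpCounter

model = models.resnet18()
dummy_input = torch.randn(1, 3, 224, 224)

# Returns multiply-accumulates (MACs) and parameter count for one forward pass
macs, params = profile(model, inputs=(dummy_input,))
print(f"MACs: {macs / 1e9:.2f} G, params: {params / 1e6:.2f} M")
```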
@andreasjansson just to follow up with you about what else I want that doesn't exist, or that I'm not aware of yet.
Here are two really serious questions I have about my current project:
- Apparently, I have a lot of GPU memory access. This makes no sense to me because I am not loading anything onto the GPU; everything should be pre-loaded. Nonetheless a lot of time is spent moving things to and from the GPU. Here are more details.
- When I increase the batch size, I get GPU OOM errors. I have no idea why, because the data seems quite small. I tried pytorch_memlab but it didn't help yet. Related issue: https://github.com/Stonesjtu/pytorch_memlab/issues/28
So here are one (or two) tools that I think would have broad adoption. As an added benefit, if they hooked into replicate.ai by default (perhaps with an option to disable), it would increase adoption of your tool:
- A dead simple thing that shows me, for PyTorch (or Python GPU stuff in general), exactly what gets moved to and from the GPU, so I can very quickly spot memory-transfer bottlenecks.
- Improved GPU profiling that, in a fine-grained but easy-to-read way, demonstrates what is causing high GPU memory usage and OOMs. This could be a pytorch_memlab extension.
These are things I would adopt today.
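As a rough illustration of the first idea, one stopgap today is to wrap `torch.Tensor.to` and log whenever a call actually changes device. This is a debugging hack, not a proposed keepsake feature, and it misses transfers made via `.cuda()` or `device=` constructor arguments:

```python
import torch

_original_to = torch.Tensor.to


def _logging_to(self, *args, **kwargs):
    result = _original_to(self, *args, **kwargs)
    # Only log calls that actually moved data between devices
    if result.device != self.device:
        nbytes = self.numel() * self.element_size()
        print(f"moved {nbytes} bytes {self.device} -> {result.device}, "
              f"shape={tuple(self.shape)}")
    return result


torch.Tensor.to = _logging_to  # crude, but enough to spot transfer hot spots
```

For the OOM question, `torch.cuda.max_memory_allocated()` and `torch.cuda.memory_summary()` are built-in counters that can help until a finer-grained tool exists.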