Matt Camp issues

Results: 6 issues by Matt Camp

Pod goes OutOfnvidia.com/gpu before k8s-device-plugin is ready

I have an issue when using cluster autoscaling for GPU nodes. I am using Karpenter as the cluster autoscaler and I'm trying to deploy NVIDIA Riva. The pod deployment spec...

lifecycle/stale

Add Telegraf training and eval metrics

This PR adds the ability to push training and evaluation metrics to InfluxDB (via Telegraf). When combined with https://github.com/aws-deepracer-community/deepracer-for-cloud/pull/159 it should allow for some nice interactive Grafana dashboards. As the...

Add Telegraf/InfluxDB/Grafana compose stack for recording InfluxDB metrics

This PR adds a docker-compose stack which launches three additional services:
- Telegraf to accept UDP push metrics and pass them to InfluxDB
- InfluxDB to store time-series metrics
- Grafana...

Add telegraf sagemaker metrics

Tidy eval metrics, make optional.

feat: Add feature to store custom_files and config in a subdir for each experiment

The feature enables each training session to have its config and custom_files stored within a subdir under `experiments/`. This makes it easier to locate the config and files that were...