Austin Welch
Hi, I'm considering refactoring this repo to support distributed training with DistributedDataParallel (or maybe Horovod). Do you foresee any major issues with that approach?
Hi, I think it would help many people with debugging if some training image results (or maybe even periodic validation results) could be written to the TensorBoard event files :)
Hi, I have an NVIDIA Triton server that's running a TensorRT engine file of YOLOX that I generated using MMDeploy. I'm wondering: is it possible to add ByteTrack or another multi-object...
**Describe the feature**
StrongSORT MOT algorithm

**Motivation**
This algorithm is ranked 1st on paperswithcode for IDF1 on multiple datasets, including MOT20.

**Related resources**
https://github.com/dyhBUPT/StrongSORT
https://paperswithcode.com/paper/strongsort-make-deepsort-great-again
I'm running JupyterLab through AWS SageMaker. I've downloaded the dash-sample-apps repository and modified a couple of the app.py files with:

```python
import jupyterlab_dash

viewer = jupyterlab_dash.AppViewer()
viewer.show(app)
```

When I...
Hi, I ran this with a very simple 10-layer CNN model I trained on MNIST using PyTorch Lightning.

```python
orig_model = pl_module.model
val_loader = trainer.datamodule.val_dataloader()
scaled_model = ModelWithTemperature(orig_model)
scaled_model.set_temperature(val_loader)
...
```
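For readers unfamiliar with the technique, here is a minimal NumPy sketch of what temperature scaling does, a single scalar `T` divides the logits, chosen to minimize negative log-likelihood on held-out validation data. The names `softmax`, `nll`, and `fit_temperature` are illustrative, and the grid search stands in for the optimizer-based fit that `ModelWithTemperature` performs internally:

```python
import numpy as np

def softmax(z):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    # Negative log-likelihood of the true labels after scaling logits by 1/T.
    probs = softmax(logits / T)
    return -np.mean(np.log(probs[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.linspace(0.5, 5.0, 91)):
    # Pick the scalar T that minimizes validation NLL (simple grid search).
    return min(grid, key=lambda T: nll(logits, labels, T))
```

An overconfident model typically ends up with `T > 1` (softening the probabilities), while an underconfident one gets `T < 1`.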
Hi, I'm confused by the "complete pipeline" example in the new README. Why does it do `dataset.batched(16)`, then `wds.WebLoader(..., batch_size=8)`, then `.unbatched()`, then `.batched(12)`? It says, "batch in the dataset and then...
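For what it's worth, the pattern that example describes can be sketched in pure Python, batch early (e.g. in dataset workers, so samples are collated cheaply), then unbatch and re-batch at the final desired size. Note `batched` and `unbatched` here are plain generators written for illustration, not the actual webdataset API:

```python
from itertools import islice

def batched(items, n):
    # Group an iterable into lists of up to n items.
    it = iter(items)
    while batch := list(islice(it, n)):
        yield batch

def unbatched(batches):
    # Flatten batches back into a stream of individual items.
    for batch in batches:
        yield from batch

# Batch early with one size ...
stage1 = batched(range(10), 4)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
# ... then unbatch and re-batch at the final desired size.
stage2 = list(batched(unbatched(stage1), 3))
print(stage2)  # [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9]]
```

The intermediate batch size only affects how work is grouped along the way; the final `.batched(...)` call determines the batch size the training loop actually sees.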
(Apologies for creating multiple recent GitHub issues, this is the last one, I promise!) I took the DataFrame from my experiment results and used Plotly's `plotly.express.parallel_categories` plot to visualize hyperparameter...
Hi, I have a limit of 8 `ml.g5.12xlarge` instances, and although I set `Tuner.n_workers = 5`, I still got a `ResourceLimitExceeded` error. Is there a way to make sure that...
Hi, I'm using SageMaker as a backend and remote launcher. I noticed that if a job errors out during training, the latest performance logs will not be captured. For example...