
Trouble reproducing numbers

Open • lengstrom opened this issue 4 years ago • 1 comment

Hi, thanks for the great work! We've used your architecture extensively and it works really well. I ran the training code in this repository for the first time today and am having trouble reproducing the results. My machine has 9 A100s, 96 CPUs, and 504 GB of RAM, and I'm using Python 3.8.10 with torch 1.9.0. Running the following script, I get only 71% accuracy in about a minute:

#!/bin/bash

set -e
OUT_PATH=$1
dired=$(mktemp -d)
echo "Logging torchskeleton in $dired"

cd $dired
git clone https://github.com/wbaek/torchskeleton.git torchskeleton-benchmark
cd torchskeleton-benchmark
git submodule init
git submodule update
pip install -r requirements.txt

ulimit -n 8192
python bin/dawnbench/cifar10.py --seed 0xC0FFEE --download | tee log_dawnbench_cifar10.tsv
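
One thing worth ruling out with a script like the one above: `set -e` only checks the exit status of the last command in a pipeline, so a crash in the Python step would be masked by `tee` exiting successfully. A minimal sketch of a wrapper that keeps the failure visible, assuming bash; `run_logged` is a hypothetical helper and not part of torchskeleton:

```shell
#!/bin/bash
# Sketch only: run_logged is a hypothetical helper, not part of torchskeleton.
# With pipefail, a pipeline fails if any command in it fails, not just the last.
set -o pipefail

# Usage: run_logged LOGFILE COMMAND [ARGS...]
run_logged() {
    log_file=$1
    shift
    # Send stdout and stderr through tee, which writes LOGFILE and echoes to the terminal.
    "$@" 2>&1 | tee "$log_file"
}
```

With this in place, the last line of the script could be written as `run_logged log_dawnbench_cifar10.tsv python bin/dawnbench/cifar10.py --seed 0xC0FFEE --download`, and a non-zero exit from the training run would stop the script under `set -e`.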

The training loss decreases consistently, while the test loss is much less stable. Do you have any suggestions on how to reproduce the results in the README? Using a V100 does not seem to fix matters either. My full logs from the run are in the following gist: https://gist.github.com/lengstrom/f079e99a872b89aad2fcf9302d894dee

Let me know if I can provide any extra details that would be helpful.

lengstrom • Oct 21 '21 03:10

I have the same issue. I'm using a single 3090 and can't reproduce the 94% accuracy either. The best accuracy I can get is 88% in 25 epochs.

dercaft • May 19 '23 02:05