Enrico Shippole comments

Results: 155 comments of Enrico Shippole

GPU Benchmarks

Hi @lucidrains, I ran the benchmarks on an instance with 4 A100 (40GB) GPUs. Here are the results: `python3 benchmark.py --only-forwards` float32 batch: 4 heads: 8 dim 64 ------------------------------------------------------------ seq_len:...

GPU Benchmarks

@lucidrains I have also run the benchmarks on 2 A100 (80GB) GPUs: `python3 benchmark.py --only-forwards` float32 batch: 4 heads: 8 dim 64 ------------------------------------------------------------ seq_len: 128 slower: 1.12x kernel: 0.28ms baseline:...

GPU Benchmarks

@lucidrains Very close and still incredibly impressive! I will also test the gpt-2 model with the two A100 (80 GB) GPUs on wikitext-103 / enwiki8 and document the training results...

GPU Benchmarks

@lucidrains Of course. I will rerun each of the benchmarks on an RTX 3090, A100 (40 GB), and A100 (80 GB) and document the results here. I am still running...

GPU Benchmarks

@lucidrains Here are the results for the new RTX 3090 benchmark run: `python3 benchmark.py --only-forwards` float32 batch: 4 heads: 8 dim 64 ------------------------------------------------------------ seq_len: 128 slower: 0.97x kernel: 0.23ms baseline:...

GPU Benchmarks

@lucidrains Here are the results for the new A100 (80 GB) benchmark run: `python benchmark.py --only-forwards` float32 batch: 4 heads: 8 dim 64 ------------------------------------------------------------ seq_len: 128 slower: 2.75x kernel: 0.67ms...

GPU Benchmarks

> this line in the benchmark `seq_len: 1024 slower: 7.83x kernel: 3.82ms baseline: 0.49ms` looks really strange
>
> were the benchmarks done on GPUs that are idle?

@lucidrains I...
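To make the "looks really strange" check concrete, here is a minimal sketch that parses one benchmark line in the format quoted above and recomputes the slowdown from the printed kernel and baseline times. The helper names `parse_line` and `is_suspicious` and the `threshold` of 3.0 are assumptions made for illustration; they are not part of `benchmark.py`.

```python
import re

# Matches one result line in the format quoted in this thread.
LINE_RE = re.compile(
    r"seq_len:\s*(?P<seq_len>\d+)\s+"
    r"slower:\s*(?P<slower>[\d.]+)x\s+"
    r"kernel:\s*(?P<kernel>[\d.]+)ms\s+"
    r"baseline:\s*(?P<baseline>[\d.]+)ms"
)


def parse_line(line):
    """Return (seq_len, reported_ratio, recomputed_ratio), or None if no match."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    kernel = float(m["kernel"])
    baseline = float(m["baseline"])
    # The recomputed ratio uses the rounded times printed in the output,
    # so it can differ slightly from the reported "slower" value.
    return int(m["seq_len"]), float(m["slower"]), kernel / baseline


def is_suspicious(line, threshold=3.0):
    """Flag a line whose reported slowdown exceeds a hypothetical threshold."""
    parsed = parse_line(line)
    return parsed is not None and parsed[1] > threshold


quoted = "seq_len: 1024 slower: 7.83x kernel: 3.82ms baseline: 0.49ms"
print(parse_line(quoted))
print(is_suspicious(quoted))
```

For the quoted line, the recomputed ratio is 3.82 / 0.49, roughly 7.8, which is consistent with the reported 7.83x once rounding of the printed times is taken into account. The outlier therefore appears to be in the measured kernel time itself rather than in the arithmetic.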

GPU Benchmarks

@lucidrains Hi Phil, I am testing on 8 different A100 (80 GB) devices. I will show the benchmarks for each device. Also, I forgot that I need extended permissions for...
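One way to run the per-device benchmarks is a small driver that sets `CUDA_VISIBLE_DEVICES` for each run, matching the `CUDA_VISIBLE_DEVICES=0 python benchmark.py --only-forwards` style of invocation used in this thread. This is a minimal sketch: `build_run` and `run_all` are hypothetical helper names, and the default of 8 devices is taken from the comment above.

```python
import os
import subprocess
import sys


def build_run(device_index, script="benchmark.py", flags=("--only-forwards",)):
    """Return (command, environment) for benchmarking a single GPU."""
    env = dict(os.environ)
    # Restrict the process to one GPU; inside it, that GPU appears as device 0.
    env["CUDA_VISIBLE_DEVICES"] = str(device_index)
    command = [sys.executable, script, *flags]
    return command, env


def run_all(num_devices=8, dry_run=True):
    """Benchmark devices one at a time so runs do not compete with each other."""
    results = []
    for index in range(num_devices):
        command, env = build_run(index)
        if dry_run:
            # Only describe what would be run; nothing is executed.
            results.append((index, " ".join(command[1:])))
        else:
            done = subprocess.run(command, env=env, capture_output=True, text=True)
            results.append((index, done.stdout))
    return results


if __name__ == "__main__":
    for index, summary in run_all(dry_run=True):
        print(f"CUDA DEVICE {index}: {summary}")
```

Running the devices sequentially rather than in parallel avoids the runs competing for host resources, which matters when the question is whether a GPU was idle during measurement.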

GPU Benchmarks

CUDA DEVICE 0 `CUDA_VISIBLE_DEVICES=0 python benchmark.py --only-forwards` float32 batch: 4 heads: 8 dim 64 ------------------------------------------------------------ seq_len: 128 slower: 2.14x kernel: 0.52ms baseline: 0.24ms seq_len: 256 slower: 1.52x kernel: 0.45ms baseline:...

GPU Benchmarks

CUDA DEVICE 1 `CUDA_VISIBLE_DEVICES=1 python benchmark.py --only-forwards` float32 batch: 4 heads: 8 dim 64 ------------------------------------------------------------ seq_len: 128 slower: 1.11x kernel: 0.27ms baseline: 0.25ms seq_len: 256 slower: 1.22x kernel: 0.36ms baseline:...