fastai example no longer runs on 7900XTX using ROCm 6.1
🐛 Describe the bug
The quick start example from the fastai course no longer runs. I get the error below when running quickstart.py. It ran fine when I tried it a few weeks ago. To run the code, just install PyTorch 2.3 and fastai:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
pip3 install fastai
Then run the code:
(pt) root@rocm:~/tmp# python quickstart.py
/root/pt/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
warnings.warn(
/root/pt/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ConvNeXt_Small_Weights.IMAGENET1K_V1`. You can also use `weights=ConvNeXt_Small_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/convnext_small-0c510722.pth" to /root/.cache/torch/hub/checkpoints/convnext_small-0c510722.pth
100%|████████████████████████████████████████████████████████████████████████████████| 192M/192M [00:04<00:00, 43.0MB/s]
epoch train_loss valid_loss error_rate time
0 0.112615 0.000854 0.000677 00:38
epoch train_loss valid_loss error_rate time
0 0.010809 0.000541 0.000000 00:47
Training text processing model
epoch train_loss valid_loss accuracy time ██████████| 100.00% [105070592/105067061 00:03<00:00]
0 0.468745 0.412001 0.821000 01:09
epoch train_loss valid_loss accuracy time
Traceback (most recent call last):---------------------------------------| 0.00% [0/390 00:00<?]
File "/root/tmp/quickstart.py", line 20, in <module>
learn.fine_tune(2, 1e-2)
File "/root/fastai/fastai/callback/schedule.py", line 168, in fine_tune
self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
File "/root/fastai/fastai/callback/schedule.py", line 119, in fit_one_cycle
self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)
File "/root/fastai/fastai/learner.py", line 264, in fit
self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
File "/root/fastai/fastai/learner.py", line 199, in _with_events
try: self(f'before_{event_type}'); f()
File "/root/fastai/fastai/learner.py", line 253, in _do_fit
self._with_events(self._do_epoch, 'epoch', CancelEpochException)
File "/root/fastai/fastai/learner.py", line 199, in _with_events
try: self(f'before_{event_type}'); f()
File "/root/fastai/fastai/learner.py", line 247, in _do_epoch
self._do_epoch_train()
File "/root/fastai/fastai/learner.py", line 239, in _do_epoch_train
self._with_events(self.all_batches, 'train', CancelTrainException)
File "/root/fastai/fastai/learner.py", line 199, in _with_events
try: self(f'before_{event_type}'); f()
File "/root/fastai/fastai/learner.py", line 205, in all_batches
for o in enumerate(self.dl): self.one_batch(*o)
File "/root/fastai/fastai/learner.py", line 235, in one_batch
self._with_events(self._do_one_batch, 'batch', CancelBatchException)
File "/root/fastai/fastai/learner.py", line 199, in _with_events
try: self(f'before_{event_type}'); f()
File "/root/fastai/fastai/learner.py", line 223, in _do_one_batch
self._do_grad_opt()
File "/root/fastai/fastai/learner.py", line 211, in _do_grad_opt
self._with_events(self._backward, 'backward', CancelBackwardException)
File "/root/fastai/fastai/learner.py", line 199, in _with_events
try: self(f'before_{event_type}'); f()
File "/root/fastai/fastai/learner.py", line 207, in _backward
def _backward(self): self.loss_grad.backward()
File "/root/pt/lib/python3.10/site-packages/torch/_tensor.py", line 516, in backward
return handle_torch_function(
File "/root/pt/lib/python3.10/site-packages/torch/overrides.py", line 1636, in handle_torch_function
result = torch_func_method(public_api, types, args, kwargs)
File "/root/fastai/fastai/torch_core.py", line 382, in __torch_function__
res = super().__torch_function__(func, types, args, ifnone(kwargs, {}))
File "/root/pt/lib/python3.10/site-packages/torch/_tensor.py", line 1443, in __torch_function__
ret = func(*args, **kwargs)
File "/root/pt/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
torch.autograd.backward(
File "/root/pt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
_engine_run_backward(
File "/root/pt/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key failed on 2nd step: hipErrorSharedObjectInitFailed: shared object initialization failed
(pt) root@rocm:~/tmp#
Versions
(pt) root@rocm:~/tmp# wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
--2024-04-30 22:15:46-- https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22068 (22K) [text/plain]
Saving to: 'collect_env.py'
collect_env.py 100%[=================================================>] 21.55K --.-KB/s in 0.002s
2024-04-30 22:15:47 (9.44 MB/s) - 'collect_env.py' saved [22068/22068]
Collecting environment information...
PyTorch version: 2.3.0+rocm6.0
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.0.32830-d62f6a171
OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35
Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Radeon RX 7900 XTX (gfx1100)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.0.32830
MIOpen runtime version: 3.0.0
Is XNNPACK available: True
CPU:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Address sizes: 48 bits physical, 48 bits virtual
Byte Order: Little Endian
CPU(s): 24
On-line CPU(s) list: 0-23
Vendor ID: AuthenticAMD
Model name: AMD Ryzen 9 7900X 12-Core Processor
CPU family: 25
Model: 97
Thread(s) per core: 2
Core(s) per socket: 12
Socket(s): 1
Stepping: 2
Frequency boost: enabled
CPU max MHz: 5732.7139
CPU min MHz: 3000.0000
BogoMIPS: 9382.28
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization: AMD-V
L1d cache: 384 KiB (12 instances)
L1i cache: 384 KiB (12 instances)
L2 cache: 12 MiB (12 instances)
L3 cache: 64 MiB (2 instances)
NUMA node(s): 1
NUMA node0 CPU(s): 0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit: Not affected
Vulnerability L1tf: Not affected
Vulnerability Mds: Not affected
Vulnerability Meltdown: Not affected
Vulnerability Mmio stale data: Not affected
Vulnerability Retbleed: Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Not affected
Vulnerability Tsx async abort: Not affected
Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-ignite==0.5.0.post2
[pip3] pytorch-lightning==2.2.3
[pip3] pytorch-triton-rocm==2.3.0
[pip3] torch==2.3.0+rocm6.0
[pip3] torchaudio==2.3.0+rocm6.0
[pip3] torchmetrics==1.3.2
[pip3] torchvision==0.18.0+rocm6.0
[conda] Could not collect
(pt) root@rocm:~/tmp#
I just tried the PyTorch nightly build for ROCm 6.1 and it worked. The nightly build for ROCm 6.0 still fails. So I suppose the problem is caused by an incompatibility between ROCm 6.0 and 6.1? Here is the output from running the same code with PyTorch 2.3 (official release) and the 2.4 nightly, both built for ROCm 6.0, and with the 2.4 nightly built for ROCm 6.1:
https://gist.github.com/briansp2020/3e5a83cc5b25bd0c69c30174d3f8e696
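For anyone wanting to check for the same kind of mismatch, here is a minimal sketch that reads the ROCm version a wheel was built for out of the torch version string (for example the 2.3.0+rocm6.0 shown by collect_env.py above) and compares it with the ROCm version on the host. The function names are hypothetical, and the host version (6, 1) is an assumption taken from the issue title. It is not detected automatically. A match or mismatch here is only a hint, not proof of the root cause.

```python
import re
from typing import Optional, Tuple


def rocm_version_from_wheel(torch_version: str) -> Optional[Tuple[int, int]]:
    """Return the (major, minor) ROCm version encoded in a torch version
    string such as '2.3.0+rocm6.0', or None if it is not a ROCm wheel."""
    match = re.search(r"\+rocm(\d+)\.(\d+)", torch_version)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def describe_mismatch(torch_version: str, system_rocm: Tuple[int, int]) -> str:
    """Summarize whether the wheel's ROCm version equals the host's.

    system_rocm is supplied by the caller (assumption: you already know
    which ROCm release is installed on the machine)."""
    wheel = rocm_version_from_wheel(torch_version)
    host = f"{system_rocm[0]}.{system_rocm[1]}"
    if wheel is None:
        return f"{torch_version}: not a ROCm wheel"
    if wheel == system_rocm:
        return f"{torch_version}: wheel matches host ROCm {host}"
    return (
        f"{torch_version}: wheel built for ROCm {wheel[0]}.{wheel[1]} "
        f"differs from host ROCm {host}"
    )


if __name__ == "__main__":
    # Version string copied from the collect_env.py output in this report;
    # the host version (6, 1) is assumed from the issue title.
    print(describe_mismatch("2.3.0+rocm6.0", (6, 1)))
```

With torch installed, the string to pass in is `torch.__version__`.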