
fastai example no longer runs on 7900XTX using ROCm 6.1

Open · briansp2020 opened this issue 9 months ago · 1 comment

🐛 Describe the bug

The quick start example from the fastai course no longer runs; I get the following error when running quickstart.py. It ran fine when I tried it a few weeks ago. To reproduce, just install PyTorch 2.3 and fastai:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/rocm6.0
pip3 install fastai
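
To confirm the ROCm wheel installed correctly and the GPU is visible before running anything, a one-liner like this helps (just a sanity check, not part of the repro):

python3 -c "import torch; print(torch.__version__, torch.version.hip, torch.cuda.is_available(), torch.cuda.get_device_name(0))"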

Then run the code:
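
For reference, quickstart.py follows the fastai quick start tutorial. The exact script isn't attached here, so treat the following as a rough reconstruction from the warnings and traceback below; the architecture and data choices are assumptions:

from torchvision.models import convnext_small
from fastai.vision.all import *
from fastai.text.all import *

# Vision part (completes fine): a pretrained convnext_small classifier,
# matching the torchvision deprecation warnings in the log. fine_tune(1)
# produces the two epoch tables seen below (one frozen, one unfrozen).
def is_cat(x): return x[0].isupper()
path = untar_data(URLs.PETS)/'images'
dls = ImageDataLoaders.from_name_func(
    path, get_image_files(path), valid_pct=0.2, seed=42,
    label_func=is_cat, item_tfms=Resize(224))
vision_learner(dls, convnext_small, metrics=error_rate).fine_tune(1)

# Text part: the learn.fine_tune(2, 1e-2) call at quickstart.py line 20 is
# where backward() raises hipErrorSharedObjectInitFailed.
print("Training text processing model")
dls = TextDataLoaders.from_folder(untar_data(URLs.IMDB), valid='test')
learn = text_classifier_learner(dls, AWD_LSTM, drop_mult=0.5, metrics=accuracy)
learn.fine_tune(2, 1e-2)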

(pt) root@rocm:~/tmp# python quickstart.py
/root/pt/lib/python3.10/site-packages/torchvision/models/_utils.py:208: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and may be removed in the future, please use 'weights' instead.
  warnings.warn(
/root/pt/lib/python3.10/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and may be removed in the future. The current behavior is equivalent to passing `weights=ConvNeXt_Small_Weights.IMAGENET1K_V1`. You can also use `weights=ConvNeXt_Small_Weights.DEFAULT` to get the most up-to-date weights.
  warnings.warn(msg)
Downloading: "https://download.pytorch.org/models/convnext_small-0c510722.pth" to /root/.cache/torch/hub/checkpoints/convnext_small-0c510722.pth
100%|████████████████████████████████████████████████████████████████████████████████| 192M/192M [00:04<00:00, 43.0MB/s]
epoch     train_loss  valid_loss  error_rate  time
0         0.112615    0.000854    0.000677    00:38
epoch     train_loss  valid_loss  error_rate  time
0         0.010809    0.000541    0.000000    00:47
Training text processing model
epoch     train_loss  valid_loss  accuracy  time
0         0.468745    0.412001    0.821000  01:09
epoch     train_loss  valid_loss  accuracy  time
Traceback (most recent call last):
  File "/root/tmp/quickstart.py", line 20, in <module>
    learn.fine_tune(2, 1e-2)
  File "/root/fastai/fastai/callback/schedule.py", line 168, in fine_tune
    self.fit_one_cycle(epochs, slice(base_lr/lr_mult, base_lr), pct_start=pct_start, div=div, **kwargs)
  File "/root/fastai/fastai/callback/schedule.py", line 119, in fit_one_cycle
    self.fit(n_epoch, cbs=ParamScheduler(scheds)+L(cbs), reset_opt=reset_opt, wd=wd, start_epoch=start_epoch)
  File "/root/fastai/fastai/learner.py", line 264, in fit
    self._with_events(self._do_fit, 'fit', CancelFitException, self._end_cleanup)
  File "/root/fastai/fastai/learner.py", line 199, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/root/fastai/fastai/learner.py", line 253, in _do_fit
    self._with_events(self._do_epoch, 'epoch', CancelEpochException)
  File "/root/fastai/fastai/learner.py", line 199, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/root/fastai/fastai/learner.py", line 247, in _do_epoch
    self._do_epoch_train()
  File "/root/fastai/fastai/learner.py", line 239, in _do_epoch_train
    self._with_events(self.all_batches, 'train', CancelTrainException)
  File "/root/fastai/fastai/learner.py", line 199, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/root/fastai/fastai/learner.py", line 205, in all_batches
    for o in enumerate(self.dl): self.one_batch(*o)
  File "/root/fastai/fastai/learner.py", line 235, in one_batch
    self._with_events(self._do_one_batch, 'batch', CancelBatchException)
  File "/root/fastai/fastai/learner.py", line 199, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/root/fastai/fastai/learner.py", line 223, in _do_one_batch
    self._do_grad_opt()
  File "/root/fastai/fastai/learner.py", line 211, in _do_grad_opt
    self._with_events(self._backward, 'backward', CancelBackwardException)
  File "/root/fastai/fastai/learner.py", line 199, in _with_events
    try: self(f'before_{event_type}');  f()
  File "/root/fastai/fastai/learner.py", line 207, in _backward
    def _backward(self): self.loss_grad.backward()
  File "/root/pt/lib/python3.10/site-packages/torch/_tensor.py", line 516, in backward
    return handle_torch_function(
  File "/root/pt/lib/python3.10/site-packages/torch/overrides.py", line 1636, in handle_torch_function
    result = torch_func_method(public_api, types, args, kwargs)
  File "/root/fastai/fastai/torch_core.py", line 382, in __torch_function__
    res = super().__torch_function__(func, types, args, ifnone(kwargs, {}))
  File "/root/pt/lib/python3.10/site-packages/torch/_tensor.py", line 1443, in __torch_function__
    ret = func(*args, **kwargs)
  File "/root/pt/lib/python3.10/site-packages/torch/_tensor.py", line 525, in backward
    torch.autograd.backward(
  File "/root/pt/lib/python3.10/site-packages/torch/autograd/__init__.py", line 267, in backward
    _engine_run_backward(
  File "/root/pt/lib/python3.10/site-packages/torch/autograd/graph.py", line 744, in _engine_run_backward
    return Variable._execution_engine.run_backward(  # Calls into the C++ engine to run the backward pass
RuntimeError: unique_by_key failed on 2nd step: hipErrorSharedObjectInitFailed: shared object initialization failed
(pt) root@rocm:~/tmp#
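
The "unique_by_key failed on 2nd step" message comes from Thrust/rocThrust, and hipErrorSharedObjectInitFailed generally means a compiled GPU code object could not be loaded for the installed runtime, i.e. a ROCm version or gfx target mismatch rather than anything fastai-specific. If that reading is right, a few lines of plain PyTorch that hit the same rocThrust unique_by_key path should reproduce it without fastai (an untested assumption on my part):

import torch

# torch.unique with return_inverse/return_counts dispatches to Thrust's
# unique_by_key kernel on the GPU; if the rocThrust code object cannot be
# loaded, this should fail with the same hipErrorSharedObjectInitFailed.
x = torch.randint(0, 100, (100_000,), device='cuda')  # 'cuda' maps to HIP on ROCm builds
vals, inv, counts = torch.unique(x, return_inverse=True, return_counts=True)
print(vals.numel(), int(counts.sum()))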

Versions

(pt) root@rocm:~/tmp# wget https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
# For security purposes, please check the contents of collect_env.py before running it.
python collect_env.py
--2024-04-30 22:15:46--  https://raw.githubusercontent.com/pytorch/pytorch/main/torch/utils/collect_env.py
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 22068 (22K) [text/plain]
Saving to: 'collect_env.py'

collect_env.py                100%[=================================================>]  21.55K  --.-KB/s    in 0.002s

2024-04-30 22:15:47 (9.44 MB/s) - 'collect_env.py' saved [22068/22068]

Collecting environment information...
PyTorch version: 2.3.0+rocm6.0
Is debug build: False
CUDA used to build PyTorch: N/A
ROCM used to build PyTorch: 6.0.32830-d62f6a171

OS: Ubuntu 22.04.4 LTS (x86_64)
GCC version: (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
Clang version: Could not collect
CMake version: version 3.22.1
Libc version: glibc-2.35

Python version: 3.10.12 (main, Nov 20 2023, 15:14:05) [GCC 11.4.0] (64-bit runtime)
Python platform: Linux-5.15.0-105-generic-x86_64-with-glibc2.35
Is CUDA available: True
CUDA runtime version: Could not collect
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: Radeon RX 7900 XTX (gfx1100)
Nvidia driver version: Could not collect
cuDNN version: Could not collect
HIP runtime version: 6.0.32830
MIOpen runtime version: 3.0.0
Is XNNPACK available: True

CPU:
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      48 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             24
On-line CPU(s) list:                0-23
Vendor ID:                          AuthenticAMD
Model name:                         AMD Ryzen 9 7900X 12-Core Processor
CPU family:                         25
Model:                              97
Thread(s) per core:                 2
Core(s) per socket:                 12
Socket(s):                          1
Stepping:                           2
Frequency boost:                    enabled
CPU max MHz:                        5732.7139
CPU min MHz:                        3000.0000
BogoMIPS:                           9382.28
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local avx512_bf16 clzero irperf xsaveerptr rdpru wbnoinvd cppc arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl avx512vbmi umip pku ospke avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg avx512_vpopcntdq rdpid overflow_recov succor smca fsrm flush_l1d
Virtualization:                     AMD-V
L1d cache:                          384 KiB (12 instances)
L1i cache:                          384 KiB (12 instances)
L2 cache:                           12 MiB (12 instances)
L3 cache:                           64 MiB (2 instances)
NUMA node(s):                       1
NUMA node0 CPU(s):                  0-23
Vulnerability Gather data sampling: Not affected
Vulnerability Itlb multihit:        Not affected
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Not affected
Vulnerability Retbleed:             Not affected
Vulnerability Spec rstack overflow: Mitigation; safe RET
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Retpolines, IBPB conditional, IBRS_FW, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Not affected
Vulnerability Tsx async abort:      Not affected

Versions of relevant libraries:
[pip3] numpy==1.26.3
[pip3] pytorch-ignite==0.5.0.post2
[pip3] pytorch-lightning==2.2.3
[pip3] pytorch-triton-rocm==2.3.0
[pip3] torch==2.3.0+rocm6.0
[pip3] torchaudio==2.3.0+rocm6.0
[pip3] torchmetrics==1.3.2
[pip3] torchvision==0.18.0+rocm6.0
[conda] Could not collect
(pt) root@rocm:~/tmp#

briansp2020 · Apr 30 '24 22:04

I just tried the PyTorch nightly built for ROCm 6.1 and it worked; the nightly build for ROCm 6.0 still fails. So I suppose the problem is an incompatibility between the ROCm 6.0 wheels and the ROCm 6.1 runtime installed on this machine. Here is the output from running the same code with PyTorch 2.3 (official release) and 2.4 nightly on ROCm 6.0, and with 2.4 nightly on ROCm 6.1:

https://gist.github.com/briansp2020/3e5a83cc5b25bd0c69c30174d3f8e696
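
For anyone else hitting this, the fix amounted to swapping the wheels for the ROCm 6.1 nightlies. The nightly index URL below follows the same pattern as the stable one in the repro, so double-check it against the official install selector before using it:

pip3 uninstall -y torch torchvision torchaudio pytorch-triton-rocm
pip3 install --pre torch torchvision torchaudio --index-url https://download.pytorch.org/whl/nightly/rocm6.1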

briansp2020 · May 03 '24 17:05