[BUG] AMD ROCm -- HIP out of memory. Tried to allocate...
Prerequisites
- [X] I have read the documentation.
- [X] I have checked other issues for similar problems.
Backend
Local
Interface Used
CLI
CLI Command
INFO | 2024-08-21 22:52:33 | autotrain.backends.local:create:8 - Starting local training...
INFO | 2024-08-21 22:52:33 | autotrain.commands:launch_command:478 - ['accelerate', 'launch', '--num_machines', '1', '--num_processes', '1', '--mixed_precision', 'fp16', '-m', 'autotrain.trainers.sent_transformers', '--training_config', 'gemma-2-2b-oh-devinator/training_params.json']
INFO | 2024-08-21 22:52:33 | autotrain.commands:launch_command:479 - {'data_path': 'skratos115/opendevin_DataDevinator', 'model': 'google/gemma-2-2b-it', 'lr': 5e-05, 'epochs': 6, 'max_seq_length': 128, 'batch_size': 12, 'warmup_ratio': 0.1, 'gradient_accumulation': 1, 'optimizer': 'adamw_torch', 'scheduler': 'linear', 'weight_decay': 0.0, 'max_grad_norm': 1.0, 'seed': 42, 'train_split': 'train', 'valid_split': None, 'logging_steps': -1, 'project_name': 'gemma-2-2b-oh-devinator', 'auto_find_batch_size': False, 'mixed_precision': 'fp16', 'save_total_limit': 1, 'token': '*****', 'push_to_hub': True, 'eval_strategy': 'epoch', 'username': 'unclemusclez', 'log': 'tensorboard', 'early_stopping_patience': 5, 'early_stopping_threshold': 0.01, 'trainer': 'pair_score', 'sentence1_column': 'prompt', 'sentence2_column': 'solution', 'sentence3_column': 'sentence3', 'target_column': 'grade'}
INFO | 2024-08-21 22:52:33 | autotrain.backends.local:create:13 - Training PID: 19611
INFO: 127.0.0.1:36080 - "POST /ui/create_project HTTP/1.1" 200 OK
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:641: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
INFO: 127.0.0.1:36080 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO: 127.0.0.1:36080 - "GET /ui/is_model_training HTTP/1.1" 200 OK
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:641: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
INFO: 127.0.0.1:36080 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO: 127.0.0.1:55534 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO: 127.0.0.1:55536 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO | 2024-08-21 22:52:41 | __main__:train:111 - Logging steps: 25
INFO | 2024-08-21 22:52:44 | __main__:train:114 - Train data: Dataset({
features: ['instruction', 'sentence1', 'sentence2', 'category', 'score', 'Generated with', 'Made By'],
num_rows: 4784
})
No sentence-transformers model found with name google/gemma-2-2b-it. Creating a new one with mean pooling.
UI Screenshots & Parameters
No response
Error Logs
Loading checkpoint shards: 100%|██████████| 2/2 [02:05<00:00, 52.90s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [02:05<00:00, 62.97s/it]
INFO: 127.0.0.1:37648 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO: 127.0.0.1:37648 - "GET /ui/is_model_training HTTP/1.1" 200 OK
INFO | 2024-08-21 22:55:59 | __main__:train:195 - Setting up training arguments...
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:641: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
INFO | 2024-08-21 22:55:59 | __main__:train:203 - Setting up trainer...
When using the Trainer, CodeCarbonCallback requires the `codecarbon` package, which is not compatible with AMD ROCm (https://github.com/mlco2/codecarbon/pull/490). Automatically disabling the codecarbon callback. Reference: https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.report_to.
The dataset `id` 'skratos115/opendevin_data_devinator' does not exist on the Hub. Setting the `id` to None.
INFO | 2024-08-21 22:56:02 | __main__:train:212 - Starting training...
[2024-08-21 22:56:02,704] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@autocast_custom_fwd
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@autocast_custom_bwd
INFO | 2024-08-21 22:56:03 | autotrain.trainers.common:on_train_begin:230 - Starting to train...
0%| | 0/2394 [00:00<?, ?it/s]INFO: 127.0.0.1:41416 - "GET /ui/is_model_training HTTP/1.1" 200 OK
ERROR | 2024-08-21 22:56:04 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/autotrain/trainers/common.py", line 117, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/autotrain/trainers/sent_transformers/__main__.py", line 213, in train
trainer.train()
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/trainer.py", line 1955, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/trainer.py", line 2296, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/trainer.py", line 3380, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/trainer.py", line 329, in compute_loss
loss = loss_fn(features, labels)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/losses/CoSENTLoss.py", line 79, in forward
embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/losses/CoSENTLoss.py", line 79, in <listcomp>
embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 819, in forward
return model_forward(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/accelerate/utils/operations.py", line 807, in __call__
return convert_to_fp32(self.model_forward(*args, **kwargs))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/amp/autocast_mode.py", line 43, in decorate_autocast
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/container.py", line 219, in forward
input = module(input)
^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/models/Transformer.py", line 118, in forward
output_states = self.auto_model(**trans_features, return_dict=False)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 862, in forward
layer_outputs = decoder_layer(
^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 583, in forward
hidden_states = self.input_layernorm(hidden_states)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 120, in forward
output = self._norm(x.float())
^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/models/gemma2/modeling_gemma2.py", line 117, in _norm
return x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)
^^^^^^^^
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 124.00 MiB. GPU 0 has a total capacity of 19.94 GiB of which 554.47 MiB is free. Of the allocated memory 15.87 GiB is allocated by PyTorch, and 165.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR | 2024-08-21 22:56:04 | autotrain.trainers.common:wrapper:121 - HIP out of memory. Tried to allocate 124.00 MiB. GPU 0 has a total capacity of 19.94 GiB of which 554.47 MiB is free. Of the allocated memory 15.87 GiB is allocated by PyTorch, and 165.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:641: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
INFO: 127.0.0.1:41416 - "GET /ui/accelerators HTTP/1.1" 200 OK
INFO | 2024-08-21 22:56:09 | autotrain.app.utils:get_running_jobs:26 - Killing PID: 19611
INFO | 2024-08-21 22:56:09 | autotrain.app.utils:kill_process_by_pid:52 - Sent SIGTERM to process with PID 19611
Additional Information
I am running Windows 11 with WSL2 (Ubuntu 22.04); amdsmi is currently unavailable within this distribution.
Trying `export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True` gives me the same issue:
torch.OutOfMemoryError: HIP out of memory. Tried to allocate 124.00 MiB. GPU 0 has a total capacity of 19.94 GiB of which 486.33 MiB is free. Of the allocated memory 15.87 GiB is allocated by PyTorch, and 165.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
ERROR | 2024-08-21 23:29:55 | autotrain.trainers.common:wrapper:121 - HIP out of memory. Tried to allocate 124.00 MiB. GPU 0 has a total capacity of 19.94 GiB of which 486.33 MiB is free. Of the allocated memory 15.87 GiB is allocated by PyTorch, and 165.29 MiB is reserved by PyTorch but unallocated. If reserved but unallocated memory is large try setting PYTORCH_HIP_ALLOC_CONF=expandable_segments:True to avoid fragmentation. See documentation for Memory Management (https://pytorch.org/docs/stable/notes/cuda.html#environment-variables)
This is a pruned version of the log, since I'm using the web UI and wanted to post this problem to Discord. Training does run on my CPU, using system RAM rather than VRAM.
The following values were not passed to `accelerate launch` and had defaults used instead:
`--num_processes` was set to a value of `0`
`--num_machines` was set to a value of `1`
`--mixed_precision` was set to a value of `'no'`
`--dynamo_backend` was set to a value of `'no'`
To avoid this warning pass in values for each of the problematic parameters or run `accelerate config`.
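For reference, the warning disappears when the flags are passed explicitly. A sketch of such a launch command, mirroring the flags AutoTrain logged for the GPU run earlier (module and config path taken from that log):

```shell
accelerate launch \
  --num_machines 1 \
  --num_processes 1 \
  --mixed_precision fp16 \
  --dynamo_backend no \
  -m autotrain.trainers.sent_transformers \
  --training_config gemma-2-2b-oh-devinator/training_params.json
```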
Loading checkpoint shards: 0%| | 0/2 [00:00<?, ?it/s]INFO: 127.0.0.1:49188 - "GET /ui/accelerators HTTP/1.1" 200 OK
Loading checkpoint shards: 50%|█████ | 1/2 [02:26<02:26, 146.02s/it]INFO: 127.0.0.1:42740 - "GET /ui/is_model_training HTTP/1.1" 200 OK
Loading checkpoint shards: 100%|██████████| 2/2 [02:31<00:00, 63.56s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [02:31<00:00, 75.93s/it]
When using the Trainer, CodeCarbonCallback requires the `codecarbon` package, which is not compatible with AMD ROCm (https://github.com/mlco2/codecarbon/pull/490). Automatically disabling the codecarbon callback. Reference: https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.report_to.
The dataset `id` 'skratos115/opendevin_data_devinator' does not exist on the Hub. Setting the `id` to None.
[2024-08-22 00:19:03,898] [WARNING] [real_accelerator.py:162:get_accelerator] Setting accelerator to CPU. If you have GPU or other accelerator, we were unable to detect it.
[2024-08-22 00:19:03,899] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cpu (auto detect)
INFO | 2024-08-22 00:19:04 | autotrain.trainers.common:on_train_begin:230 - Starting to train...
0%| | 0/28704 [00:00<?, ?it/s]INFO: 127.0.0.1:42740 - "GET /ui/is_model_training HTTP/1.1" 200 OK
0%| | 1/28704 [02:37<1254:19:31, 157.32s/it]INFO: 127.0.0.1:47208 - 200 OK
0%| | 2/28704 [05:34<1345:29:58, 168.76s/it]INFO: 127.0.0.1:33026 - 200 OK
0%| | 3/28704 [08:27<1361:17:05, 170.75s/it]INFO: 127.0.0.1:37782 - 200 OK
startscript.sh
#!/bin/bash
export HIP_VISIBLE_DEVICES=1
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
export HF_TOKEN=***
source ~/ComfyUI/.venv/bin/activate
autotrain app --host 127.0.0.1 --port 8200
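One thing worth noting about the script above: `HIP_VISIBLE_DEVICES` takes zero-based device indices, so `1` selects the *second* HIP device, not "one GPU". A hedged revision, assuming the 7900 XT is device 0:

```shell
#!/bin/bash
# Zero-based index: 0 selects the first HIP device.
export HIP_VISIBLE_DEVICES=0
# Allocator hint; a warning later in this log shows expandable_segments
# is not supported on this (WSL2) platform, so this may be a no-op here.
export PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
export HF_TOKEN=***
source ~/ComfyUI/.venv/bin/activate
autotrain app --host 127.0.0.1 --port 8200
```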
Progress with `export HIP_VISIBLE_DEVICES=0,1` (should be just `1`):
Loading checkpoint shards: 50%|█████ | 1/2 [01:59<01:59, 120.00s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [02:05<00:00, 52.90s/it]
Loading checkpoint shards: 100%|██████████| 2/2 [02:05<00:00, 62.97s/it]
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py:1160: UserWarning: expandable_segments not supported on this platform (Triggered internally at ../c10/hip/HIPAllocatorConfig.h:29.)
return t.to(
INFO | 2024-08-22 02:25:11 | __main__:train:195 - Setting up training arguments...
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/cuda/__init__.py:641: UserWarning: Can't initialize amdsmi - Error code: 34
warnings.warn(f"Can't initialize amdsmi - Error code: {e.err_code}")
INFO | 2024-08-22 02:25:11 | __main__:train:203 - Setting up trainer...
When using the Trainer, CodeCarbonCallback requires the `codecarbon` package, which is not compatible with AMD ROCm (https://github.com/mlco2/codecarbon/pull/490). Automatically disabling the codecarbon callback. Reference: https://huggingface.co/docs/transformers/v4.39.3/en/main_classes/trainer#transformers.TrainingArguments.report_to.
The dataset `id` 'skratos115/opendevin_data_devinator' does not exist on the Hub. Setting the `id` to None.
INFO | 2024-08-22 02:25:14 | __main__:train:212 - Starting training...
[2024-08-22 02:25:14,363] [INFO] [real_accelerator.py:203:get_accelerator] Setting ds_accelerator to cuda (auto detect)
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:47: FutureWarning: `torch.cuda.amp.custom_fwd(args...)` is deprecated. Please use `torch.amp.custom_fwd(args..., device_type='cuda')` instead.
@autocast_custom_fwd
/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/deepspeed/runtime/zero/linear.py:66: FutureWarning: `torch.cuda.amp.custom_bwd(args...)` is deprecated. Please use `torch.amp.custom_bwd(args..., device_type='cuda')` instead.
@autocast_custom_bwd
INFO | 2024-08-22 02:25:15 | autotrain.trainers.common:on_train_begin:230 - Starting to train...
0%| | 0/28704 [00:00<?, ?it/s]ERROR | 2024-08-22 02:25:32 | autotrain.trainers.common:wrapper:120 - train has failed due to an exception: Traceback (most recent call last):
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/autotrain/trainers/common.py", line 117, in wrapper
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/autotrain/trainers/sent_transformers/__main__.py", line 213, in train
trainer.train()
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/trainer.py", line 1955, in train
return inner_training_loop(
^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/trainer.py", line 2296, in _inner_training_loop
tr_loss_step = self.training_step(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/transformers/trainer.py", line 3380, in training_step
loss = self.compute_loss(model, inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/trainer.py", line 329, in compute_loss
loss = loss_fn(features, labels)
^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/losses/CoSENTLoss.py", line 79, in forward
embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/losses/CoSENTLoss.py", line 79, in <listcomp>
embeddings = [self.model(sentence_feature)["sentence_embedding"] for sentence_feature in sentence_features]
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/container.py", line 219, in forward
input = module(input)
^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1553, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/nn/modules/module.py", line 1562, in _call_impl
return forward_call(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/sentence_transformers/models/Pooling.py", line 153, in forward
sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: HIP error: no kernel image is available for execution on the device
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
ERROR | 2024-08-22 02:25:32 | autotrain.trainers.common:wrapper:121 - HIP error: no kernel image is available for execution on the device
HIP kernel errors might be asynchronously reported at some other API call, so the stacktrace below might be incorrect.
For debugging consider passing AMD_SERIALIZE_KERNEL=3
Compile with `TORCH_USE_HIP_DSA` to enable device-side assertions.
terminate called after throwing an instance of 'c10::Error'
what(): invalid device pointer: 0x5400000
Exception raised from free at ../c10/hip/HIPCachingAllocator.cpp:2994 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x96 (0x7f2293e29096 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #1: c10::detail::torchCheckFail(char const*, char const*, unsigned int, std::string const&) + 0x64 (0x7f2293dd7de0 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #2: <unknown function> + 0x1d15e (0x7f22d99fe15e in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10_hip.so)
frame #3: <unknown function> + 0x5de6d0 (0x7f2364e826d0 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x6bcef (0x7f2293e0ccef in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #5: c10::TensorImpl::~TensorImpl() + 0x21b (0x7f2293e05d4b in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #6: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f2293e05ef9 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #7: torch::autograd::SavedVariable::reset_data() + 0xec (0x7f23513b369c in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #8: <unknown function> + 0x46714d1 (0x7f23506ee4d1 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #9: <unknown function> + 0x52f7812 (0x7f2351374812 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #10: torch::autograd::deleteNode(torch::autograd::Node*) + 0x7f (0x7f23513748af in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #11: std::_Sp_counted_deleter<torch::autograd::generated::MulBackward0*, void (*)(torch::autograd::Node*), std::allocator<void>, (__gnu_cxx::_Lock_policy)2>::_M_dispose() + 0xe (0x7f235082585e in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #12: <unknown function> + 0x52d7db0 (0x7f2351354db0 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_cpu.so)
frame #13: c10::TensorImpl::~TensorImpl() + 0x212 (0x7f2293e05d42 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #14: c10::TensorImpl::~TensorImpl() + 0x9 (0x7f2293e05ef9 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libc10.so)
frame #15: <unknown function> + 0x89a5f8 (0x7f236513e5f8 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #16: THPVariable_subclass_dealloc(_object*) + 0x2c6 (0x7f236513e946 in /home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/torch/lib/libtorch_python.so)
frame #17: /home/musclez/ComfyUI/.venv/bin/python() [0x4e4efe]
frame #18: /home/musclez/ComfyUI/.venv/bin/python() [0x4d394d]
frame #19: /home/musclez/ComfyUI/.venv/bin/python() [0x53dd0e]
frame #20: /home/musclez/ComfyUI/.venv/bin/python() [0x53bc79]
frame #21: /home/musclez/ComfyUI/.venv/bin/python() [0x53bc5e]
frame #22: /home/musclez/ComfyUI/.venv/bin/python() [0x53bc5e]
frame #23: /home/musclez/ComfyUI/.venv/bin/python() [0x53bc5e]
frame #24: /home/musclez/ComfyUI/.venv/bin/python() [0x53bc5e]
frame #25: /home/musclez/ComfyUI/.venv/bin/python() [0x53bc5e]
frame #26: /home/musclez/ComfyUI/.venv/bin/python() [0x5508da]
frame #27: _PyEval_EvalFrameDefault + 0x7df9 (0x502659 in /home/musclez/ComfyUI/.venv/bin/python)
frame #28: /home/musclez/ComfyUI/.venv/bin/python() [0x62e1b4]
frame #29: PyEval_EvalCode + 0x97 (0x4f3a67 in /home/musclez/ComfyUI/.venv/bin/python)
frame #30: /home/musclez/ComfyUI/.venv/bin/python() [0x569033]
frame #31: /home/musclez/ComfyUI/.venv/bin/python() [0x50d977]
frame #32: PyObject_Vectorcall + 0x35 (0x50d745 in /home/musclez/ComfyUI/.venv/bin/python)
frame #33: _PyEval_EvalFrameDefault + 0x8f2 (0x4fb152 in /home/musclez/ComfyUI/.venv/bin/python)
frame #34: _PyFunction_Vectorcall + 0x173 (0x531823 in /home/musclez/ComfyUI/.venv/bin/python)
frame #35: /home/musclez/ComfyUI/.venv/bin/python() [0x64fd94]
frame #36: Py_RunMain + 0x142 (0x64f5a2 in /home/musclez/ComfyUI/.venv/bin/python)
frame #37: Py_BytesMain + 0x2d (0x61ee0d in /home/musclez/ComfyUI/.venv/bin/python)
frame #38: <unknown function> + 0x29d90 (0x7f2366e17d90 in /lib/x86_64-linux-gnu/libc.so.6)
frame #39: __libc_start_main + 0x80 (0x7f2366e17e40 in /lib/x86_64-linux-gnu/libc.so.6)
frame #40: _start + 0x25 (0x61ec95 in /home/musclez/ComfyUI/.venv/bin/python)
Traceback (most recent call last):
File "/home/musclez/ComfyUI/.venv/bin/accelerate", line 8, in <module>
sys.exit(main())
^^^^^^
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/accelerate/commands/accelerate_cli.py", line 48, in main
args.func(args)
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 1106, in launch_command
simple_launcher(args)
File "/home/musclez/ComfyUI/.venv/lib/python3.11/site-packages/accelerate/commands/launch.py", line 704, in simple_launcher
raise subprocess.CalledProcessError(returncode=process.returncode, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/musclez/ComfyUI/.venv/bin/python', '-m', 'autotrain.trainers.sent_transformers', '--training_config', 'gemma-2-2b-it-oh-bf16-8192-hip/training_params.json']' died with <Signals.SIGABRT: 6>.
This error perhaps pertains to WSL Linux; there is no AMD ROCm kernel driver for WSL2 Linux.
Hi. Unfortunately, we don't support AMD devices yet.
TBH, you might. I think this is a PyTorch-specific issue.
I just wanted to document all of this so that other people attempting similar functionality can have a reference. If there is a more appropriate place to contribute, I'd love to participate.
tbh you might.
Taking a look. I don't have an AMD device myself, so I'll have to count on you (and the community) for testing purposes :)
I think there is a separate PyTorch-specific issue, which I came across here: https://github.com/pytorch/pytorch/issues/134208
The out-of-memory error might literally be because I was out of memory. I was able to perform this training on CPU. When utilizing the GPU for Gemma 2-2b-it, it runs for about 3 seconds and then crashes with the error above.
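A rough back-of-the-envelope check supports that reading: full fine-tuning with mixed-precision AdamW holds roughly 16 bytes per parameter (fp16 weights and gradients, fp32 master weights, and two fp32 optimizer moments) before counting activations. A sketch of the arithmetic, taking ~2.6B parameters for gemma-2-2b as an assumption:

```python
# Estimate the VRAM footprint of full fine-tuning with mixed-precision AdamW.
# Per parameter: 2 B fp16 weights + 2 B fp16 grads
# + 4 B fp32 master weights + (4 B + 4 B) fp32 Adam moments = 16 B.
def training_gib(num_params: int, bytes_per_param: int = 16) -> float:
    """Return the weight/gradient/optimizer footprint in GiB."""
    return num_params * bytes_per_param / 2**30

params = 2_600_000_000  # approximate parameter count for gemma-2-2b
print(f"~{training_gib(params):.1f} GiB before activations")  # well over 20 GiB
```

So even before activations, the state alone exceeds the 7900 XT's 19.94 GiB, which is consistent with the HIP OOM at step 0.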
After a couple of discussions, it seems AMD GPUs should work out of the box for AutoTrain. Did you install the ROCm PyTorch wheels?
No, I am using WSL2 with Ubuntu 22.04. This limits me to ROCm 6.1.3, and technically I'm supposed to use an older version of PyTorch, I think 2.1: https://rocm.docs.amd.com/projects/radeon/en/docs-6.1.3/docs/compatibility/wsl/wsl_compatibility.html. Also, their wheels are for Python 3.10 and I am on 3.11. I could always downgrade, but last I checked the wheels had not been updated recently and had some deeper issues regarding bitsandbytes, which were resolved in ROCm 6.2, which is not available for WSL2.
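For completeness, the ROCm-built PyTorch wheels come from PyTorch's ROCm wheel index; the exact ROCm version below is an assumption and should match whatever the WSL2 compatibility matrix allows:

```shell
# Install ROCm-built PyTorch wheels into the active venv.
# Adjust rocm6.1 to the ROCm version your setup supports.
pip install --index-url https://download.pytorch.org/whl/rocm6.1 \
    torch torchvision torchaudio
```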
Now that I know it works with wheels, does it work on Navi 31, or is this specific to MI300/CDNA? I am on a 7900 XT, which is neither a professional nor an enterprise card, but it is RDNA3.
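The "no kernel image is available for execution on the device" error earlier usually means the installed wheel was not compiled for the GPU's gfx target (Navi 31 is gfx1100). A quick diagnostic sketch, assuming `rocminfo` is on the PATH:

```shell
# What the hardware reports (a 7900 XT should show gfx1100):
rocminfo | grep -m1 gfx
# Which architectures the installed torch wheel was compiled for:
python -c "import torch; print(torch.cuda.get_arch_list())"
```

If `gfx1100` is missing from the second list, the wheel simply has no kernels for this card.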
This issue is stale because it has been open for 30 days with no activity.
🤘🚀