
Problems during training

MacieJayDaaaaa opened this issue on Dec 05, 2023 · 0 comments

I encountered some issues when training on SA-1B.

1. Duplicate labels are removed:

```
train: WARNING D:\code\cocostyle\train\images\sa_99946.jpg: 1 duplicate labels removed
train: WARNING D:\code\cocostyle\train\images\sa_99948.jpg: 1 duplicate labels removed
train: WARNING D:\code\cocostyle\train\images\sa_99977.jpg: 1 duplicate labels removed
train: WARNING D:\code\cocostyle\train\images\sa_99980.jpg: 2 duplicate labels removed
```
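
As far as I can tell, these warnings come from the dataset verification step dropping rows that are exact duplicates inside a label `.txt` file. A quick way to see which converted label files contain duplicate rows is a small standalone check like the sketch below (the `labels` directory path is an assumption inferred from the image paths above, not confirmed):

```python
# Minimal sketch: list YOLO-format label files that contain exact duplicate rows.
# Assumes the usual layout where every image in ...\train\images has a matching
# .txt file in ...\train\labels; the path below is inferred, not confirmed.
from pathlib import Path

label_dir = Path(r"D:\code\cocostyle\train\labels")  # assumed location

for txt in sorted(label_dir.glob("*.txt")):
    rows = txt.read_text().splitlines()
    dupes = len(rows) - len(set(rows))
    if dupes:
        print(f"{txt.name}: {dupes} duplicate row(s)")
```

From what I can tell the warning itself is non-fatal: the duplicate rows are simply dropped and training continues with the remaining labels.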

2. An error occurred while saving labels.cache:

```
Traceback (most recent call last):
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\data\dataset.py", line 108, in get_labels
    cache, exists = np.load(str(cache_path), allow_pickle=True).item(), True  # load dict
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\numpy\lib\npyio.py", line 427, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: 'D:\\code\\cocostyle\\train\\labels.cache'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\admin\AppData\Roaming\Ultralytics\DDP\_temp_isnrh7go2032266927360.py", line 9, in <module>
    trainer.train()
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\engine\trainer.py", line 192, in train
    self._do_train(world_size)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\engine\trainer.py", line 275, in _do_train
    self._setup_train(world_size)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\engine\trainer.py", line 239, in _setup_train
    self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode='train')
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\v8\detect\train.py", line 54, in get_dataloader
    dataset = self.build_dataset(dataset_path, mode, batch_size)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\v8\detect\train.py", line 28, in build_dataset
    return build_yolo_dataset(self.args, img_path, batch, self.data, mode=mode, rect=mode == 'val', stride=gs)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\data\build.py", line 74, in build_yolo_dataset
    return YOLODataset(
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\data\dataset.py", line 39, in __init__
    super().__init__(*args, **kwargs)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\data\base.py", line 72, in __init__
    self.labels = self.get_labels()
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\data\dataset.py", line 113, in get_labels
    cache, exists = self.cache_labels(cache_path), False  # run cache ops
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\data\dataset.py", line 94, in cache_labels
    np.save(str(path), x)  # save cache for next time
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\numpy\lib\npyio.py", line 546, in save
    format.write_array(fid, arr, allow_pickle=allow_pickle,
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\numpy\lib\format.py", line 719, in write_array
    pickle.dump(array, fp, protocol=3, **pickle_kwargs)
MemoryError
Traceback (most recent call last):
  File "C:\Users\admin\AppData\Roaming\Ultralytics\DDP\_temp_isnrh7go2032266927360.py", line 9, in <module>
    trainer.train()
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\engine\trainer.py", line 192, in train
    self._do_train(world_size)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\engine\trainer.py", line 275, in _do_train
    self._setup_train(world_size)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\engine\trainer.py", line 239, in _setup_train
    self.train_loader = self.get_dataloader(self.trainset, batch_size=batch_size, rank=RANK, mode='train')
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\v8\detect\train.py", line 53, in get_dataloader
    with torch_distributed_zero_first(rank):  # init dataset *.cache only once if DDP
  File "C:\Users\admin\.conda\envs\fastsam\lib\contextlib.py", line 119, in __enter__
    return next(self.gen)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\ultralytics\yolo\utils\torch_utils.py", line 40, in torch_distributed_zero_first
    dist.barrier(device_ids=[local_rank])
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\torch\distributed\c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\admin\.conda\envs\fastsam\lib\site-packages\torch\distributed\distributed_c10d.py", line 3703, in barrier
    work.wait()
RuntimeError: [C:\actions-runner\_work\pytorch\pytorch\builder\windows\pytorch\third_party\gloo\gloo\transport\uv\unbound_buffer.cc:67] Timed out waiting 3600000ms for recv operation to complete
[2023-12-05 15:38:13,105] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 11676 closing signal CTRL_C_EVENT
[2023-12-05 15:38:43,140] torch.distributed.elastic.multiprocessing.api: [WARNING] Unable to shutdown process 11676 via Signals.CTRL_C_EVENT, forcefully exiting via Signals.CTRL_C_EVENT
```
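
For context on the second problem, the traceback shows the failure chain: rank 0 runs out of memory inside `np.save` / `pickle.dump` while writing labels.cache (`dataset.py`, `cache_labels`), so the cache file is never created, the `dist.barrier()` in `torch_distributed_zero_first` is never reached, and the other DDP process times out after 3600000 ms. The cache itself is just a pickled Python dict written with `np.save` and read back with `np.load(...).item()`, so the entire label set has to be held in memory while it is pickled. A minimal sketch of that round trip (illustrative field names only, not the exact Ultralytics schema):

```python
# Minimal sketch of the labels.cache round trip seen in the traceback: a plain
# dict is pickled to disk via np.save and read back via np.load(...).item().
# The real cache holds one entry per SA-1B image with all polygon segments,
# which is why building it can exhaust RAM. Field names here are illustrative.
import numpy as np

cache = {
    "labels": [{"im_file": "sa_1.jpg", "cls": [0], "segments": [[0.1, 0.2, 0.3, 0.4]]}],
    "hash": "dummy",
}

np.save("labels_demo.cache", cache)  # numpy appends ".npy" -> labels_demo.cache.npy
loaded = np.load("labels_demo.cache.npy", allow_pickle=True).item()
print(loaded["labels"][0]["im_file"])  # -> sa_1.jpg
```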

How can I solve these two problems?

MacieJayDaaaaa · Dec 05 '23 07:12