mmpretrain [Bug] Getting exit code 137 in validation step while training

Open idonahum1 opened this issue 5 months ago • 0 comments

Branch

main branch (mmpretrain version)

Describe the bug

Hi,

I'm running tests to train different architectures on a specific dataset. Training goes fine, but during the validation step, at the last validation iteration, the process is killed with exit code 137 (no error is raised). I watched the RAM usage, and it looks like the machine runs out of RAM, which is why I get the 137 exit. I can't find the reason why RAM usage increases over time. It only happens in the validation step; in the training step everything goes smoothly. It happens with different architectures, not a specific one. If I disable the validation step, training works perfectly, but then I can't watch my model's performance over time.
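For readers unfamiliar with the number: by shell convention, an exit status above 128 means the process was terminated by signal number (status - 128), so 137 corresponds to signal 9, SIGKILL, which is what the Linux out-of-memory killer sends. A minimal sketch of that decoding (the helper name `describe_exit_code` is made up for illustration):

```python
import signal


def describe_exit_code(code: int) -> str:
    """Interpret a shell-style exit status.

    Statuses above 128 conventionally mean "killed by signal (code - 128)".
    """
    if code > 128:
        sig = signal.Signals(code - 128)  # e.g. 137 - 128 = 9
        return f"terminated by {sig.name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

SIGKILL cannot be caught by the process, which is consistent with no Python error being raised before the exit.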

Things I tried in order to solve it:

Changing the batch size.
Changing the number of workers.
Disabling pin_memory.
Switching to a different machine with much more RAM.
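To narrow down which process is growing (the main process or a dataloader worker), it can help to log resident memory per process at regular intervals during validation. The following is a minimal, Linux-only sketch that reads `/proc`; the helper name `rss_mib` is hypothetical and is not part of mmpretrain or MMEngine:

```python
import os


def rss_mib(pid="self"):
    """Return the resident set size of a process in MiB (Linux only).

    Reads the VmRSS field, which /proc reports in kB.
    """
    with open(f"/proc/{pid}/status") as status:
        for line in status:
            if line.startswith("VmRSS:"):
                return int(line.split()[1]) / 1024
    return None  # field missing, e.g. for kernel threads


print(f"pid {os.getpid()}: {rss_mib():.1f} MiB resident")
```

Calling this every few hundred validation iterations, for the main process and for each worker PID, would show whether the growth is steady and where it occurs.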

Nothing seems to solve the issue. One thing I should mention is that the dataset is huge: around 8 million crops and around 800 GB. I used symlinks to split it into train and test, so the files in the train and test folders are actually symlinks to a different location.
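As a rough check of whether dataset size alone could explain the growth, here is a back-of-the-envelope estimate. It rests on two assumptions that the report does not confirm: that about 10% of the crops are in the test split (guessed from the "90-10-split" directory name), and that the evaluation keeps one float32 score per class for every validation sample until the metric is computed at the end.

```python
def score_buffer_gib(num_samples: int, num_classes: int, bytes_per_value: int = 4) -> float:
    """Memory needed to hold one score per class for every sample, in GiB."""
    return num_samples * num_classes * bytes_per_value / 2**30


total_crops = 8_000_000          # "around 8 million crops" from the report
val_samples = total_crops // 10  # ASSUMPTION: 10% test split
num_classes = 36723              # from the config

estimate = score_buffer_gib(val_samples, num_classes)
print(f"{estimate:.1f} GiB of accumulated scores")
```

Under these assumptions the accumulated scores would be on the order of 100 GiB, so memory that grows with the number of validation samples and is independent of batch size, worker count, and pin_memory would at least be consistent with the symptoms described.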

Any ideas?

Thank you.

Environment

{'sys.platform': 'linux',
 'Python': '3.8.19 | packaged by conda-forge | (default, Mar 20 2024, 12:47:35) [GCC 12.3.0]',
 'CUDA available': True,
 'MUSA available': False,
 'numpy_random_seed': 2147483648,
 'GPU 0,1,2,3': 'NVIDIA L4',
 'CUDA_HOME': '/usr',
 'NVCC': 'Cuda compilation tools, release 10.1, V10.1.24',
 'GCC': 'gcc (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0',
 'PyTorch': '1.9.0+cu111',
 'TorchVision': '0.10.0+cu111',
 'OpenCV': '4.10.0',
 'MMEngine': '0.10.4',
 'MMCV': '2.1.0',
 'MMPreTrain': '1.2.0+17a886c'}

Other information

Config -


# preprocessing settings
train_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(scale=256, type='Resize'),
    dict(type='PackInputs'),
]

test_pipeline = [
    dict(type='LoadImageFromFile'),
    dict(scale=256, type='Resize'),
    dict(type='PackInputs'),
]

# datasets
data_root = '/home/ubuntu/engineCache/atlantis-90-10-split'
dataset_type = 'CustomDataset'
num_classes = 36723

train_dataset = dict(
        data_root=data_root,
        ann_file='meta/train.txt',
        pipeline=train_pipeline,
        data_prefix='train',
        type=dataset_type)

test_dataset = dict(
        data_root=data_root,
        pipeline=test_pipeline,
        ann_file='meta/test.txt',
        data_prefix='test',
        type=dataset_type)

# dataloaders settings
batch_size = 128

train_dataloader = dict(
    batch_size=batch_size,
    collate_fn=dict(type='default_collate'),
    dataset=train_dataset,
    num_workers=8,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=True, type='DefaultSampler'))

val_dataloader = dict(
    batch_size=128,
    collate_fn=dict(type='default_collate'),
    dataset=test_dataset,
    num_workers=2,
    persistent_workers=True,
    pin_memory=True,
    sampler=dict(shuffle=False, type='DefaultSampler'))

test_dataloader = val_dataloader

# model settings
model = dict(
    backbone=dict(
        depth=18,
        num_stages=4,
        out_indices=(3, ),
        style='pytorch',
        type='ResNet'),
    head=dict(
        in_channels=512,
        loss=dict(loss_weight=1.0, type='CrossEntropyLoss'),
        num_classes=num_classes,
        hidden_dim=128,
        topk=(
            1,
            5,
        ),
        type='ElectraLinearClsHead'),
    neck=dict(type='GlobalAveragePooling'),
    type='ImageClassifier')


auto_scale_lr = dict(base_batch_size=512)

data_preprocessor = dict(
    mean=[
        115.875383,
        102.297249,
        91.7643419,
    ],
    num_classes=36723,
    std=[
        71.497372,
        66.883428,
        65.108552,
    ],
    to_rgb=True)


# Hooks settings

default_hooks = dict(
    checkpoint=dict(interval=1, type='CheckpointHook'),
    logger=dict(interval=500, type='LoggerHook'),
    param_scheduler=dict(type='ParamSchedulerHook'),
    sampler_seed=dict(type='DistSamplerSeedHook'),
    timer=dict(type='IterTimerHook'),
    visualization=dict(enable=True, type='VisualizationHook'))

# Environment settings
default_scope = 'mmpretrain'
env_cfg = dict(
    cudnn_benchmark=False,
    dist_cfg=dict(backend='nccl'),
    mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
launcher = 'none'
load_from = None
log_level = 'INFO'

# Optimizer and learning rate settings
optim_wrapper = dict(
    optimizer=dict(lr=0.1, momentum=0.9, type='SGD', weight_decay=0.0001, nesterov=True))
param_scheduler = dict(
    by_epoch=True, gamma=0.1, milestones=[
        10,
        20
    ], type='MultiStepLR')

# Training settings
randomness = dict(deterministic=False, seed=None)
resume = False
test_cfg = dict()
train_cfg = dict(by_epoch=True, max_epochs=30, val_interval=30)
val_cfg = dict()

# Evaluation settings
val_evaluator = dict(
    topk=(
        1,
        5,
    ), type='Accuracy')

test_evaluator = val_evaluator


# Visualizer and result settings
vis_backends = [
    dict(type='LocalVisBackend'),
]
visualizer = dict(
    type='UniversalVisualizer', vis_backends=[
        dict(type='LocalVisBackend'),
    ])
work_dir = '/home/ubuntu/dev/mmpretrain/train_runs'

Aug 28 '24 09:08 idonahum1
