
Can I conduct distributed experimental training on eight GPUs on two servers?

Open · EricZavier opened this issue 1 year ago · 1 comment

Thanks for your great work.

I have been running the model recently, but I found that training on a single server with four GPUs is a bit slow. Therefore, I would like to ask:

Can I conduct distributed training on eight GPUs across two servers?

EricZavier · Dec 05 '23 07:12

Yes, you can do so; however, you may need to follow the MPI settings. Here is the job template I am using for job submission (a sketch of an equivalent direct `mpirun` launch follows the template):

  job_template:
    name: train_seem_v1_focalt_enc6_fpn_dec10_lang_bs{batch_size_train}_ep{epoch}_scw{spatial_class_weight}_sdw{spatial_dice_weight}_smw{spatial_mask_weight}_nsm{num_spatial_memories}_lr{lr}_ts{top_spatial_layers}_fb{fix_backbone}_fl{fix_lang}_fe{fix_encoder}_spa{spatial_on}_grd{grounding_on}_iterbase_pn_v2_maxi{max_iter}_qsn{spatial_query_num}_mc{max_candidate}
    sku: 4x32G8-V100
    mpi: True
    process_count_per_node: 8
    command:
    - ifconfig
    - export GLOO_SOCKET_IFNAME=eth0
    - export DETECTRON2_DATASETS=/mnt/data
    - export DATASET=/mnt/data
    - export DATASET2=/mnt/data2
    - export VLDATASET=/mnt/xxxx
    - export WANDB_KEY=xxxxx
    - export PATH=$$PATH:/mnt/output/xueyanz/coco_caption/jre1.8.0_321/bin
    - export PYTHONPATH=$$PYTHONPATH:/mnt/output/xueyanz/coco_caption
    - python entry.py train --conf_files configs/seem/focalt_unicl_lang_v1.yaml
      --overrides
      SAVE_DIR /mnt/output/xueyanz/mainzvision/seem_v1_focalt_enc6_fpn_dec10_lang_bs{batch_size_train}_ep{epoch}_scw{spatial_class_weight}_sdw{spatial_dice_weight}_smw{spatial_mask_weight}_nsm{num_spatial_memories}_lr{lr}_ts{top_spatial_layers}_fb{fix_backbone}_fl{fix_lang}_fe{fix_encoder}_spa{spatial_on}_grd{grounding_on}_iterbase_pn_v2_maxi{max_iter}_qsn{spatial_query_num}_mc{max_candidate}
      COCO.INPUT.IMAGE_SIZE 1024
      MODEL.DECODER.HIDDEN_DIM 512
      MODEL.ENCODER.CONVS_DIM 512
      MODEL.ENCODER.MASK_DIM 512
      MODEL.DECODER.NUM_OBJECT_QUERIES 101
      FP16 True
      WANDB True
      SOLVER.MAX_NUM_EPOCHS {epoch}
      SOLVER.BASE_LR {lr}
      SOLVER.FIX_PARAM.backbone {fix_backbone}
      SOLVER.FIX_PARAM.lang_encoder {fix_lang}
      SOLVER.FIX_PARAM.pixel_decoder {fix_encoder}
      MODEL.DECODER.COST_SPATIAL.CLASS_WEIGHT {spatial_class_weight}
      MODEL.DECODER.COST_SPATIAL.MASK_WEIGHT {spatial_mask_weight}
      MODEL.DECODER.COST_SPATIAL.DICE_WEIGHT {spatial_dice_weight}
      MODEL.DECODER.TOP_SPATIAL_LAYERS {top_spatial_layers}
      MODEL.DECODER.SPATIAL.ENABLED {spatial_on}
      MODEL.DECODER.GROUNDING.ENABLED {grounding_on}
      TRAIN.BATCH_SIZE_TOTAL {batch_size_train}
      TRAIN.BATCH_SIZE_PER_GPU {batch_per_gpu_train}
      TEST.BATCH_SIZE_TOTAL {batch_size_test}
      VOC.TEST.BATCH_SIZE_TOTAL {batch_size_test}
      SBD.TEST.BATCH_SIZE_TOTAL {batch_size_test}
      REF.TEST.BATCH_SIZE_TOTAL {batch_size_test}
      WEIGHT {pre_unicl}
      RESUME_FROM {resume_from}
      FIND_UNUSED_PARAMETERS True
      ATTENTION_ARCH.SPATIAL_MEMORIES {num_spatial_memories}
      MODEL.DECODER.SPATIAL.MAX_ITER {max_iter}
      ATTENTION_ARCH.QUERY_NUMBER {spatial_query_num}
      STROKE_SAMPLER.MAX_CANDIDATE {max_candidate}
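
For reference, here is a minimal sketch of how the same entry point might be launched directly with Open MPI across two 4-GPU servers (eight processes total). The hostnames `server1`/`server2` and the batch-size overrides are placeholders, and this assumes `entry.py` derives its distributed rank from the MPI environment, as the `mpi: True` setting above suggests:

  # Hypothetical direct launch on two 4-GPU servers (hostnames are placeholders).
  # Assumes entry.py reads its rank/world size from the MPI environment,
  # as the `mpi: True` setting in the template suggests.
  mpirun -np 8 -H server1:4,server2:4 \
    -x GLOO_SOCKET_IFNAME=eth0 \
    -x DETECTRON2_DATASETS=/mnt/data \
    python entry.py train --conf_files configs/seem/focalt_unicl_lang_v1.yaml \
      --overrides \
      TRAIN.BATCH_SIZE_TOTAL 8 \
      TRAIN.BATCH_SIZE_PER_GPU 1 \
      FIND_UNUSED_PARAMETERS True

As in the template above, `GLOO_SOCKET_IFNAME` should name the network interface that actually connects the two machines (verify with `ifconfig`), and the total batch size should presumably equal the per-GPU batch size times the total number of GPUs (here 1 × 8).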

MaureenZOU · Dec 24 '23 14:12