Segment-Everything-Everywhere-All-At-Once
Can I conduct distributed training across eight GPUs on two servers?
Thanks for your great work.
I have been running the model recently, but I found that training on a server with four GPUs is a bit slow. Therefore, I would like to ask:
can I conduct distributed training across eight GPUs on two servers?
Yes, you can. However, you may need to configure the MPI settings; here is the job template I use for job submission:
```yaml
job_template:
  name: train_seem_v1_focalt_enc6_fpn_dec10_lang_bs{batch_size_train}_ep{epoch}_scw{spatial_class_weight}_sdw{spatial_dice_weight}_smw{spatial_mask_weight}_nsm{num_spatial_memories}_lr{lr}_ts{top_spatial_layers}_fb{fix_backbone}_fl{fix_lang}_fe{fix_encoder}_spa{spatial_on}_grd{grounding_on}_iterbase_pn_v2_maxi{max_iter}_qsn{spatial_query_num}_mc{max_candidate}
  sku: 4x32G8-V100
  mpi: True
  process_count_per_node: 8
  command:
    - ifconfig
    - export GLOO_SOCKET_IFNAME=eth0
    - export DETECTRON2_DATASETS=/mnt/data
    - export DATASET=/mnt/data
    - export DATASET2=/mnt/data2
    - export VLDATASET=/mnt/xxxx
    - export WANDB_KEY=xxxxx
    - export PATH=$$PATH:/mnt/output/xueyanz/coco_caption/jre1.8.0_321/bin
    - export PYTHONPATH=$$PYTHONPATH:/mnt/output/xueyanz/coco_caption
    - python entry.py train --conf_files configs/seem/focalt_unicl_lang_v1.yaml
      --overrides
      SAVE_DIR /mnt/output/xueyanz/mainzvision/seem_v1_focalt_enc6_fpn_dec10_lang_bs{batch_size_train}_ep{epoch}_scw{spatial_class_weight}_sdw{spatial_dice_weight}_smw{spatial_mask_weight}_nsm{num_spatial_memories}_lr{lr}_ts{top_spatial_layers}_fb{fix_backbone}_fl{fix_lang}_fe{fix_encoder}_spa{spatial_on}_grd{grounding_on}_iterbase_pn_v2_maxi{max_iter}_qsn{spatial_query_num}_mc{max_candidate}
      COCO.INPUT.IMAGE_SIZE 1024
      MODEL.DECODER.HIDDEN_DIM 512
      MODEL.ENCODER.CONVS_DIM 512
      MODEL.ENCODER.MASK_DIM 512
      MODEL.DECODER.NUM_OBJECT_QUERIES 101
      FP16 True
      WANDB True
      SOLVER.MAX_NUM_EPOCHS {epoch}
      SOLVER.BASE_LR {lr}
      SOLVER.FIX_PARAM.backbone {fix_backbone}
      SOLVER.FIX_PARAM.lang_encoder {fix_lang}
      SOLVER.FIX_PARAM.pixel_decoder {fix_encoder}
      MODEL.DECODER.COST_SPATIAL.CLASS_WEIGHT {spatial_class_weight}
      MODEL.DECODER.COST_SPATIAL.MASK_WEIGHT {spatial_mask_weight}
      MODEL.DECODER.COST_SPATIAL.DICE_WEIGHT {spatial_dice_weight}
      MODEL.DECODER.TOP_SPATIAL_LAYERS {top_spatial_layers}
      MODEL.DECODER.SPATIAL.ENABLED {spatial_on}
      MODEL.DECODER.GROUNDING.ENABLED {grounding_on}
      TRAIN.BATCH_SIZE_TOTAL {batch_size_train}
      TRAIN.BATCH_SIZE_PER_GPU {batch_per_gpu_train}
      TEST.BATCH_SIZE_TOTAL {batch_size_test}
      VOC.TEST.BATCH_SIZE_TOTAL {batch_size_test}
      SBD.TEST.BATCH_SIZE_TOTAL {batch_size_test}
      REF.TEST.BATCH_SIZE_TOTAL {batch_size_test}
      WEIGHT {pre_unicl}
      RESUME_FROM {resume_from}
      FIND_UNUSED_PARAMETERS True
      ATTENTION_ARCH.SPATIAL_MEMORIES {num_spatial_memories}
      MODEL.DECODER.SPATIAL.MAX_ITER {max_iter}
      ATTENTION_ARCH.QUERY_NUMBER {spatial_query_num}
      STROKE_SAMPLER.MAX_CANDIDATE {max_candidate}
```
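For reference, here is a minimal sketch of how the same entry point could be launched by hand across two nodes with Open MPI, assuming `entry.py` picks up its rank and world size from the MPI environment (as the `mpi: True` setting above implies). The hostnames, network interface, and batch-size values are placeholders to adapt to your own cluster, not values from the repo:

```sh
# Hypothetical manual launch: 2 nodes x 4 GPUs = 8 processes in total.
# node1/node2 and the batch sizes below are placeholders.
mpirun -np 8 -H node1:4,node2:4 \
  -x GLOO_SOCKET_IFNAME=eth0 \
  -x DETECTRON2_DATASETS=/mnt/data \
  python entry.py train --conf_files configs/seem/focalt_unicl_lang_v1.yaml \
    --overrides \
    TRAIN.BATCH_SIZE_TOTAL 8 \
    TRAIN.BATCH_SIZE_PER_GPU 1 \
    FIND_UNUSED_PARAMETERS True
```

The total batch size should typically equal the per-GPU batch size multiplied by the total number of processes, and `GLOO_SOCKET_IFNAME` should point at the interface the two servers use to reach each other (check it with `ifconfig`, as in the template).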