YOLO-World
Differences in validation logs during model pre-training.
Hello, I noticed some differences between my logs and yours during model pre-training. I pre-trained on the O365+GoldG datasets with 8× RTX 4090 GPUs, a per-GPU batch_size of 64 for training, and a batch_size of 1 for validation, matching the settings in your published training log. Training runs 2693 iterations per epoch, but your log reports 620 validation iterations while mine reports 4809. Is something wrong with my settings, or how can I align my validation data volume with yours? Below is the configuration printed in my training log:
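For reference, here is the back-of-the-envelope check I did on these iteration counts. This is only my own sketch: the helper function and the dataset sizes (≈1.38M images for O365+GoldG, 4809 images for LVIS minival) are assumptions for illustration, not values read from the YOLO-World code.

```python
import math

def iters_per_epoch(num_images: int, batch_size_per_gpu: int, num_gpus: int) -> int:
    """Iterations per epoch, assuming the sampler shards images evenly across GPUs."""
    return math.ceil(num_images / (batch_size_per_gpu * num_gpus))

# Training: 2693 iters/epoch at 64 images/GPU on 8 GPUs implies roughly
# 2693 * 64 * 8 ≈ 1.38M images, consistent with O365+GoldG (assumed size).
print(iters_per_epoch(1_378_500, 64, 8))  # -> 2693

# Validation: with batch_size=1 on a single process, iterations == images,
# which matches the 4809 steps I observe.
print(iters_per_epoch(4809, 1, 1))        # -> 4809

# Sharding validation across 8 GPUs would give ~602 steps per rank, still
# not the 620 in the published log, so I suspect a different setting there.
print(iters_per_epoch(4809, 1, 8))        # -> 602
```

So my 4809 looks like plain single-process validation over LVIS minival, and I cannot reproduce 620 from any obvious batch-size/GPU combination, which is why I am asking about the settings.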
System environment:
sys.platform: linux
Python: 3.9.19 (main, Mar 21 2024, 17:11:28) [GCC 11.2.0]
CUDA available: True
MUSA available: False
numpy_random_seed: 1125387415
GPU 0,1,2,3,4,5,6,7: NVIDIA GeForce RTX 4090
CUDA_HOME: /usr/local/cuda
NVCC: Cuda compilation tools, release 12.4, V12.4.131
GCC: gcc (Ubuntu 13.2.0-23ubuntu4) 13.2.0
PyTorch: 2.0.1+cu118
PyTorch compiling details: PyTorch built with:
- GCC 9.3
- C++ Version: 201703
- Intel(R) oneAPI Math Kernel Library Version 2022.2-Product Build 20220804 for Intel(R) 64 architecture applications
- Intel(R) MKL-DNN v2.7.3 (Git Hash 6dbeffbae1f23cbbeae17adb7b5b13f1f37c080e)
- OpenMP 201511 (a.k.a. OpenMP 4.5)
- LAPACK is enabled (usually provided by MKL)
- NNPACK is enabled
- CPU capability usage: AVX2
- CUDA Runtime 11.8
- NVCC architecture flags: -gencode;arch=compute_37,code=sm_37;-gencode;arch=compute_50,code=sm_50;-gencode;arch=compute_60,code=sm_60;-gencode;arch=compute_70,code=sm_70;-gencode;arch=compute_75,code=sm_75;-gencode;arch=compute_80,code=sm_80;-gencode;arch=compute_86,code=sm_86;-gencode;arch=compute_90,code=sm_90
- CuDNN 8.7
- Magma 2.6.1
- Build settings: BLAS_INFO=mkl, BUILD_TYPE=Release, CUDA_VERSION=11.8, CUDNN_VERSION=8.7.0, CXX_COMPILER=/opt/rh/devtoolset-9/root/usr/bin/c++, CXX_FLAGS= -D_GLIBCXX_USE_CXX11_ABI=0 -fabi-version=11 -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -DNDEBUG -DUSE_KINETO -DLIBKINETO_NOROCTRACER -DUSE_FBGEMM -DUSE_QNNPACK -DUSE_PYTORCH_QNNPACK -DUSE_XNNPACK -DSYMBOLICATE_MOBILE_DEBUG_HANDLE -O2 -fPIC -Wall -Wextra -Werror=return-type -Werror=non-virtual-dtor -Werror=bool-operation -Wnarrowing -Wno-missing-field-initializers -Wno-type-limits -Wno-array-bounds -Wno-unknown-pragmas -Wunused-local-typedefs -Wno-unused-parameter -Wno-unused-function -Wno-unused-result -Wno-strict-overflow -Wno-strict-aliasing -Wno-error=deprecated-declarations -Wno-stringop-overflow -Wno-psabi -Wno-error=pedantic -Wno-error=redundant-decls -Wno-error=old-style-cast -fdiagnostics-color=always -faligned-new -Wno-unused-but-set-variable -Wno-maybe-uninitialized -fno-math-errno -fno-trapping-math -Werror=format -Werror=cast-function-type -Wno-stringop-overflow, LAPACK_INFO=mkl, PERF_WITH_AVX=1, PERF_WITH_AVX2=1, PERF_WITH_AVX512=1, TORCH_DISABLE_GPU_ASSERTS=ON, TORCH_VERSION=2.0.1, USE_CUDA=ON, USE_CUDNN=ON, USE_EXCEPTION_PTR=1, USE_GFLAGS=OFF, USE_GLOG=OFF, USE_MKL=ON, USE_MKLDNN=ON, USE_MPI=OFF, USE_NCCL=1, USE_NNPACK=ON, USE_OPENMP=ON, USE_ROCM=OFF,
TorchVision: 0.15.2+cu118
OpenCV: 4.9.0
MMEngine: 0.10.3
Runtime environment:
cudnn_benchmark: True
mp_cfg: {'mp_start_method': 'fork', 'opencv_num_threads': 0}
dist_cfg: {'backend': 'nccl'}
seed: 1125387415
Distributed launcher: pytorch
Distributed training: True
GPU number: 8
------------------------------------------------------------
2024/06/19 07:02:53 - mmengine - INFO - Config:
_backend_args = None
_multiscale_resize_transforms = [
dict(
transforms=[
dict(scale=(
640,
640,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
640,
640,
),
type='LetterResize'),
],
type='Compose'),
dict(
transforms=[
dict(scale=(
320,
320,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
320,
320,
),
type='LetterResize'),
],
type='Compose'),
dict(
transforms=[
dict(scale=(
960,
960,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
960,
960,
),
type='LetterResize'),
],
type='Compose'),
]
affine_scale = 0.5
albu_train_transforms = [
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
]
backend_args = None
base_lr = 0.001
close_mosaic_epochs = 20
coco_val_dataset = dict(
class_text_path='data/texts/lvis_v1_class_texts.json',
dataset=dict(
ann_file='lvis_v1_minival_inserted_image_name.json',
batch_shapes_cfg=None,
data_prefix=dict(img=''),
data_root='/home/dq/datasets/coco/',
test_mode=True,
type='YOLOv5LVISV1Dataset'),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(scale=(
288,
288,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
288,
288,
),
type='LetterResize'),
dict(_scope_='mmdet', type='LoadAnnotations', with_bbox=True),
dict(type='LoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'scale_factor',
'pad_param',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='MultiModalDataset')
custom_hooks = [
dict(
ema_type='ExpMomentumEMA',
momentum=0.0001,
priority=49,
strict_load=False,
type='EMAHook',
update_buffers=True),
dict(
switch_epoch=80,
switch_pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(scale=(
288,
288,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=True,
pad_val=dict(img=114.0),
scale=(
288,
288,
),
type='LetterResize'),
dict(
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='mmdet.PipelineSwitchHook'),
]
custom_imports = dict(
allow_failed_imports=False, imports=[
'yolo_world',
])
data_root = '/home/dq/datasets/'
deepen_factor = 0.33
default_hooks = dict(
checkpoint=dict(
interval=5,
max_keep_ckpts=2,
rule='greater',
save_best='auto',
type='CheckpointHook'),
logger=dict(interval=50, type='LoggerHook'),
param_scheduler=dict(
lr_factor=0.01,
max_epochs=100,
scheduler_type='linear',
type='YOLOv5ParamSchedulerHook'),
sampler_seed=dict(type='DistSamplerSeedHook'),
timer=dict(type='IterTimerHook'),
visualization=dict(type='mmdet.DetVisualizationHook'))
default_scope = 'mmyolo'
env_cfg = dict(
cudnn_benchmark=True,
dist_cfg=dict(backend='nccl'),
mp_cfg=dict(mp_start_method='fork', opencv_num_threads=0))
flickr_train_dataset = dict(
ann_file='final_flickr_separateGT_train.json',
data_prefix=dict(img='images/'),
data_root='/home/dq/datasets/flickr/',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='YOLOv5MixedGroundingDataset')
img_scale = (
288,
288,
)
img_scales = [
(
640,
640,
),
(
320,
320,
),
(
960,
960,
),
]
last_stage_out_channels = 1024
last_transform = [
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
),
type='mmdet.PackDetInputs'),
]
launcher = 'pytorch'
load_from = None
log_level = 'INFO'
log_processor = dict(by_epoch=True, type='LogProcessor', window_size=50)
loss_bbox_weight = 7.5
loss_cls_weight = 0.5
loss_dfl_weight = 0.375
lr_factor = 0.01
max_aspect_ratio = 100
max_epochs = 100
max_keep_ckpts = 2
mg_train_dataset = dict(
ann_file='final_mixed_train_no_coco.json',
data_prefix=dict(img='images/'),
data_root='/home/dq/datasets/mixed_grounding/',
filter_cfg=dict(filter_empty_gt=False, min_size=32),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='YOLOv5MixedGroundingDataset')
min_area_ratio = 0.01
model = dict(
backbone=dict(
image_model=dict(
act_cfg=dict(inplace=True, type='SiLU'),
arch='P5',
deepen_factor=0.33,
last_stage_out_channels=1024,
norm_cfg=dict(eps=0.001, momentum=0.03, type='BN'),
type='YOLOv8CSPDarknet',
widen_factor=0.25),
text_model=dict(
frozen_modules=[
'all',
],
model_name='openai/clip-vit-base-patch32',
type='HuggingCLIPLanguageBackbone'),
type='MultiModalYOLOBackbone'),
bbox_head=dict(
bbox_coder=dict(type='DistancePointBBoxCoder'),
head_module=dict(
act_cfg=dict(inplace=True, type='SiLU'),
embed_dims=512,
featmap_strides=[
8,
16,
32,
],
in_channels=[
256,
512,
1024,
],
norm_cfg=dict(eps=0.001, momentum=0.03, type='BN'),
num_classes=1203,
reg_max=16,
type='YOLOWorldHeadModule',
use_bn_head=True,
use_einsum=False,
widen_factor=0.25),
loss_bbox=dict(
bbox_format='xyxy',
iou_mode='ciou',
loss_weight=7.5,
reduction='sum',
return_iou=False,
type='IoULoss'),
loss_cls=dict(
loss_weight=0.5,
reduction='none',
type='mmdet.CrossEntropyLoss',
use_sigmoid=True),
loss_dfl=dict(
loss_weight=0.375,
reduction='mean',
type='mmdet.DistributionFocalLoss'),
prior_generator=dict(
offset=0.5, strides=[
8,
16,
32,
], type='mmdet.MlvlPointGenerator'),
type='YOLOWorldHead'),
data_preprocessor=dict(
bgr_to_rgb=True,
mean=[
0.0,
0.0,
0.0,
],
std=[
255.0,
255.0,
255.0,
],
type='YOLOWDetDataPreprocessor'),
mm_neck=True,
neck=dict(
act_cfg=dict(inplace=True, type='SiLU'),
block_cfg=dict(type='MaxSigmoidCSPLayerWithTwoConv'),
deepen_factor=0.33,
embed_channels=[
64,
128,
256,
],
guide_channels=512,
in_channels=[
256,
512,
1024,
],
norm_cfg=dict(eps=0.001, momentum=0.03, type='BN'),
num_csp_blocks=3,
num_heads=[
8,
16,
32,
],
out_channels=[
256,
512,
1024,
],
type='YOLOWorldPAFPN',
widen_factor=0.25),
num_test_classes=1203,
num_train_classes=80,
test_cfg=dict(
max_per_img=300,
multi_label=True,
nms=dict(iou_threshold=0.7, type='nms'),
nms_pre=30000,
score_thr=0.001),
train_cfg=dict(
assigner=dict(
alpha=0.5,
beta=6.0,
eps=1e-09,
num_classes=80,
topk=10,
type='BatchTaskAlignedAssigner',
use_ciou=True)),
type='YOLOWorldDetector')
model_test_cfg = dict(
max_per_img=300,
multi_label=True,
nms=dict(iou_threshold=0.7, type='nms'),
nms_pre=30000,
score_thr=0.001)
neck_embed_channels = [
64,
128,
256,
]
neck_num_heads = [
8,
16,
32,
]
norm_cfg = dict(eps=0.001, momentum=0.03, type='BN')
num_classes = 1203
num_det_layers = 3
num_training_classes = 80
obj365v1_train_dataset = dict(
class_text_path='data/texts/obj365v1_class_texts.json',
dataset=dict(
ann_file='objects365_train.json',
data_prefix=dict(img='train/'),
data_root='/home/dq/datasets/objects365v1/',
filter_cfg=dict(filter_empty_gt=False, min_size=32),
type='YOLOv5Objects365V1Dataset'),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='MultiModalDataset')
optim_wrapper = dict(
clip_grad=dict(max_norm=10.0),
constructor='YOLOWv5OptimizerConstructor',
loss_scale='dynamic',
optimizer=dict(
batch_size_per_gpu=64, lr=0.001, type='AdamW', weight_decay=0.025),
paramwise_cfg=dict(
bias_decay_mult=0.0,
custom_keys=dict({
'backbone.text_model': dict(lr_mult=0.01),
'logit_scale': dict(weight_decay=0.0)
}),
norm_decay_mult=0.0),
type='AmpOptimWrapper')
persistent_workers = True
pre_transform = [
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
]
resume = 'work_dirs/yolo_world_v2_n_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival/epoch_75.pth'
save_epoch_intervals = 5
strides = [
8,
16,
32,
]
tal_alpha = 0.5
tal_beta = 6.0
tal_topk = 10
test_cfg = dict(type='TestLoop')
test_dataloader = dict(
batch_size=1,
dataset=dict(
class_text_path='data/texts/lvis_v1_class_texts.json',
dataset=dict(
ann_file='lvis_v1_minival_inserted_image_name.json',
batch_shapes_cfg=None,
data_prefix=dict(img=''),
data_root='/home/dq/datasets/coco/',
test_mode=True,
type='YOLOv5LVISV1Dataset'),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(scale=(
288,
288,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
288,
288,
),
type='LetterResize'),
dict(_scope_='mmdet', type='LoadAnnotations', with_bbox=True),
dict(type='LoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'scale_factor',
'pad_param',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='MultiModalDataset'),
drop_last=False,
num_workers=1,
persistent_workers=True,
pin_memory=True,
sampler=dict(shuffle=False, type='DefaultSampler'))
test_evaluator = dict(
ann_file='/home/dq/datasets/coco/lvis_v1_minival_inserted_image_name.json',
metric='bbox',
type='mmdet.LVISMetric')
test_pipeline = [
dict(backend_args=None, type='LoadImageFromFile'),
dict(scale=(
288,
288,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
288,
288,
),
type='LetterResize'),
dict(_scope_='mmdet', type='LoadAnnotations', with_bbox=True),
dict(type='LoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'scale_factor',
'pad_param',
'texts',
),
type='mmdet.PackDetInputs'),
]
text_channels = 512
text_transform = [
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
]
train_ann_file = 'annotations/instances_train2017.json'
train_batch_size_per_gpu = 64
train_cfg = dict(
dynamic_intervals=[
(
80,
2,
),
],
max_epochs=100,
type='EpochBasedTrainLoop',
val_interval=5)
train_data_prefix = 'train2017/'
train_dataloader = dict(
batch_size=64,
collate_fn=dict(type='yolow_collate'),
dataset=dict(
datasets=[
dict(
class_text_path='data/texts/obj365v1_class_texts.json',
dataset=dict(
ann_file='objects365_train.json',
data_prefix=dict(img='train/'),
data_root='/home/dq/datasets/objects365v1/',
filter_cfg=dict(filter_empty_gt=False, min_size=32),
type='YOLOv5Objects365V1Dataset'),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='MultiModalDataset'),
dict(
ann_file='final_flickr_separateGT_train.json',
data_prefix=dict(img='images/'),
data_root='/home/dq/datasets/flickr/',
filter_cfg=dict(filter_empty_gt=True, min_size=32),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='YOLOv5MixedGroundingDataset'),
dict(
ann_file='final_mixed_train_no_coco.json',
data_prefix=dict(img='images/'),
data_root='/home/dq/datasets/mixed_grounding/',
filter_cfg=dict(filter_empty_gt=False, min_size=32),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='YOLOv5MixedGroundingDataset'),
],
ignore_keys=[
'classes',
'palette',
],
type='ConcatDataset'),
num_workers=64,
persistent_workers=True,
pin_memory=True,
sampler=dict(shuffle=True, type='DefaultSampler'))
train_num_workers = 64
train_pipeline = [
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(
img_scale=(
288,
288,
),
pad_val=114.0,
pre_transform=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
],
type='MultiModalMosaic'),
dict(
border=(
-144,
-144,
),
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
]
train_pipeline_stage2 = [
dict(backend_args=None, type='LoadImageFromFile'),
dict(type='LoadAnnotations', with_bbox=True),
dict(scale=(
288,
288,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=True,
pad_val=dict(img=114.0),
scale=(
288,
288,
),
type='LetterResize'),
dict(
border_val=(
114,
114,
114,
),
max_aspect_ratio=100,
max_rotate_degree=0.0,
max_shear_degree=0.0,
scaling_ratio_range=(
0.5,
1.5,
),
type='YOLOv5RandomAffine'),
dict(
bbox_params=dict(
format='pascal_voc',
label_fields=[
'gt_bboxes_labels',
'gt_ignore_flags',
],
type='BboxParams'),
keymap=dict(gt_bboxes='bboxes', img='image'),
transforms=[
dict(p=0.01, type='Blur'),
dict(p=0.01, type='MedianBlur'),
dict(p=0.01, type='ToGray'),
dict(p=0.01, type='CLAHE'),
],
type='mmdet.Albu'),
dict(type='YOLOv5HSVRandomAug'),
dict(prob=0.5, type='mmdet.RandomFlip'),
dict(
max_num_samples=80,
num_neg_samples=(
1203,
1203,
),
padding_to_max=True,
padding_value='',
type='RandomLoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'flip',
'flip_direction',
'texts',
),
type='mmdet.PackDetInputs'),
]
tta_model = dict(
tta_cfg=dict(max_per_img=300, nms=dict(iou_threshold=0.65, type='nms')),
type='mmdet.DetTTAModel')
tta_pipeline = [
dict(backend_args=None, type='LoadImageFromFile'),
dict(
transforms=[
[
dict(
transforms=[
dict(scale=(
640,
640,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
640,
640,
),
type='LetterResize'),
],
type='Compose'),
dict(
transforms=[
dict(scale=(
320,
320,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
320,
320,
),
type='LetterResize'),
],
type='Compose'),
dict(
transforms=[
dict(scale=(
960,
960,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
960,
960,
),
type='LetterResize'),
],
type='Compose'),
],
[
dict(prob=1.0, type='mmdet.RandomFlip'),
dict(prob=0.0, type='mmdet.RandomFlip'),
],
[
dict(type='mmdet.LoadAnnotations', with_bbox=True),
],
[
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'scale_factor',
'pad_param',
'flip',
'flip_direction',
),
type='mmdet.PackDetInputs'),
],
],
type='TestTimeAug'),
]
use_mask2refine = True
val_ann_file = 'valid/_annotations.coco.json'
val_cfg = dict(type='ValLoop')
val_data_prefix = 'valid/'
val_dataloader = dict(
batch_size=1,
dataset=dict(
class_text_path='data/texts/lvis_v1_class_texts.json',
dataset=dict(
ann_file='lvis_v1_minival_inserted_image_name.json',
batch_shapes_cfg=None,
data_prefix=dict(img=''),
data_root='/home/dq/datasets/coco/',
test_mode=True,
type='YOLOv5LVISV1Dataset'),
pipeline=[
dict(backend_args=None, type='LoadImageFromFile'),
dict(scale=(
288,
288,
), type='YOLOv5KeepRatioResize'),
dict(
allow_scale_up=False,
pad_val=dict(img=114),
scale=(
288,
288,
),
type='LetterResize'),
dict(_scope_='mmdet', type='LoadAnnotations', with_bbox=True),
dict(type='LoadText'),
dict(
meta_keys=(
'img_id',
'img_path',
'ori_shape',
'img_shape',
'scale_factor',
'pad_param',
'texts',
),
type='mmdet.PackDetInputs'),
],
type='MultiModalDataset'),
drop_last=False,
num_workers=1,
persistent_workers=True,
pin_memory=True,
sampler=dict(shuffle=False, type='DefaultSampler'))
val_evaluator = dict(
ann_file='/home/dq/datasets/coco/lvis_v1_minival_inserted_image_name.json',
metric='bbox',
type='mmdet.LVISMetric')
val_interval_stage2 = 2
vis_backends = [
dict(type='LocalVisBackend'),
dict(type='TensorboardVisBackend'),
]
visualizer = dict(
name='visualizer',
type='mmdet.DetLocalVisualizer',
vis_backends=[
dict(type='LocalVisBackend'),
dict(type='TensorboardVisBackend'),
])
weight_decay = 0.025
widen_factor = 0.25
work_dir = './work_dirs/yolo_world_v2_n_vlpan_bn_2e-3_100e_4x8gpus_obj365v1_goldg_train_lvis_minival'
2024/06/19 07:03:49 - mmengine - INFO - Using SyncBatchNorm()
2024/06/19 07:03:49 - mmengine - INFO - Hooks will be executed in the following order: