terryLiu

Results 4 issues of terryLiu

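Hi, thank you for your pioneering work. I noticed that the steps in CogCoM include func operations such as `grounding`, `crop_and_zoomin`, `counting`, and `OCR`. For my project, I would like to add a func operation that can **detect keypoints**, for example a `pose` operation whose output has the form `[x1,y1,x2,y2,kpt1,kpt2]`, where `kpt1` and `kpt2` are the keypoint coordinates of the target. Besides the changes needed when generating the CoM data, does adding a keypoint func operation also require changes to the **finetuning** code? @erjanmx @Sleepychord @cenyk1230 @Btlmd @1049451037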
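To make the proposed output format concrete, here is a minimal sketch of how a `[x1,y1,x2,y2,kpt1,kpt2]` result could be parsed and serialized. It is not taken from the CogCoM code base: `PoseResult`, `parse_pose`, and `format_pose` are hypothetical names, and the sketch assumes each keypoint is flattened into an `(x, y)` pair after the four box coordinates.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass
class PoseResult:
    """Hypothetical container for one `pose` result: a box plus keypoints."""
    box: Tuple[float, float, float, float]
    keypoints: List[Tuple[float, float]]


def parse_pose(values: Sequence[float]) -> PoseResult:
    """Split a flat [x1, y1, x2, y2, k1x, k1y, k2x, k2y, ...] sequence.

    Assumption: keypoints follow the box as flattened (x, y) pairs.
    """
    if len(values) < 4 or (len(values) - 4) % 2 != 0:
        raise ValueError("expected 4 box values followed by (x, y) keypoint pairs")
    x1, y1, x2, y2 = values[:4]
    rest = values[4:]
    keypoints = [(rest[i], rest[i + 1]) for i in range(0, len(rest), 2)]
    return PoseResult(box=(x1, y1, x2, y2), keypoints=keypoints)


def format_pose(result: PoseResult) -> str:
    """Serialize back to the bracketed, comma-separated form used in the question."""
    flat = list(result.box) + [c for kpt in result.keypoints for c in kpt]
    return "[" + ",".join(str(v) for v in flat) + "]"
```

For example, `parse_pose([10, 20, 110, 220, 40, 60, 80, 150])` yields a box and two keypoints, and `format_pose` turns that result back into the flat bracketed string.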

Hardware: 4*A100 (80G). Fine-tuning on the official com_dataset dataset produces the following error:
> Traceback (most recent call last): File "/home/lyk/project/CogCoM/cogcom/finetune.py", line 324, in model = training_main(args, model_cls=model, ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ File "/home/lyk/anaconda3/envs/llm/lib/python3.11/site-packages/sat/training/deepspeed_training.py", line 150, in training_main iteration, skipped = train(model, optimizer,...

As shown above, training runs normally when the dataset contains no `crop_and_zoomin` operation. Once that operation is added, training hangs in the `broadcast_auto_com` function of `finetune.py`, at the `torch.distributed.broadcast` call under `mpu.broadcast_data`, and then returns the following:
> [rank6]:[E ProcessGroupNCCL.cpp:523] [Rank 2] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=3667, OpType=BROADCAST, NumelIn=2, NumelOut=2, Timeout(ms)=600000) ran for 600727 milliseconds before timing out.
> [rank7]:[E ProcessGroupNCCL.cpp:523] [Rank 3] Watchdog...

As the title says. Thanks in advance for any answers. @qijimrc