Daqiu Shi comments

Results: 34 comments of Daqiu Shi

DETR result alignment experiment log

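> Is this the inference result?

Yes. Today I traced it to my multi-head attention implementation, which is inconsistent with `torch.nn.MultiheadAttention` (the one the DETR source code uses). That may be the cause, and I am currently modifying the code.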

**Investigating the `loss.backward` failure "F20220602 14:17:25.050042 15603 shape.cpp:187] Check failed: !broadcast_axis_vec.empty() " triggered by certain input shapes**

The problem was traced to a OneFlow bug in min/max, hit in projects/DETR/utils/box_ops.py:

```
def generalized_box_iou(boxes1, boxes2):
    """
    Generalized IoU from https://giou.stanford.edu/
    The boxes should be in [x0, y0, x1,...
```
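For cross-checking a framework implementation of generalized IoU, a framework-free reference for a single pair of boxes can help. The sketch below is not the code from `box_ops.py`; the helper name `giou_pair` is hypothetical, and it assumes boxes are given as corner coordinates `(x0, y0, x1, y1)` with positive area. It also shows where element-wise min/max enter the computation.

```python
def giou_pair(a, b):
    """Generalized IoU of two boxes, each assumed to be (x0, y0, x1, y1)."""
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    if area_a <= 0 or area_b <= 0:
        raise ValueError("this sketch expects boxes with positive area")

    # Intersection: max of the top-left corners, min of the bottom-right corners.
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = area_a + area_b - inter

    # Smallest enclosing box: min of the top-left corners, max of the bottom-right corners.
    enclose = (max(a[2], b[2]) - min(a[0], b[0])) * (max(a[3], b[3]) - min(a[1], b[1]))

    return inter / union - (enclose - union) / enclose


print(giou_pair((0, 0, 1, 1), (0, 0, 1, 1)))  # identical boxes give 1.0
print(giou_pair((0, 0, 1, 1), (2, 0, 3, 1)))  # disjoint boxes give a negative value
```

Feeding the same box pairs to the framework version and comparing against this reference is one way to narrow a discrepancy down to the min/max step.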

DETR result alignment experiment log

In libai/utils/distributed.py:

```
def convert_to_distributed_default_setting(module):
    """
    Helper function to convert all eager local tensor in :attr:`nn.Module` in the model to global tensor with data parallelism as default.
    """
    for param...
```

DETR result alignment experiment log

> OK, I'll give it a try

DETR result alignment experiment log

With `global eager ddp`, 4-GPU data parallelism hits the following OOM error very quickly; with 2 GPUs the same error shows up somewhat later.

```
F20220713 07:57:17.976464 1348305 virtual_machine_engine.cpp:332]
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/vm/virtual_machine_engine.cpp", line 332, in DispatchInstruction
    ret
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 49, in Prepare
    AllocateOutputBlobsMemory(operand, device_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/eager/op_call_instruction_type.cpp", line 103, in AllocateOutputBlobsMemory...
```

DETR result alignment experiment log

> Could it be that some variables are not being released in time?

I'll look into it.

DETR result alignment experiment log

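The problem above has been located. When `hidden_state + position_embedding` is executed and the two operands have different sbp (hidden_state is split(0), position_embedding is broadcast), it leads to the OOM. If the two are kept consistent (both split(0)), the problem goes away. I'll put together a detailed minimal reproduction tomorrow. Could this be a latent bug?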
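To make the sbp mismatch concrete, here is a small framework-free sketch of the per-rank shapes implied by the two signatures. It does not use OneFlow and does not reproduce the OOM; it only models the usual meaning of the terms (split(axis) shards that axis across ranks, broadcast keeps a full copy on every rank). The helper name `local_shape`, the example shape and the evenly divisible split are assumptions made for illustration.

```python
from typing import Optional, Sequence, Tuple


def local_shape(global_shape: Sequence[int],
                split_axis: Optional[int],
                world_size: int) -> Tuple[int, ...]:
    """Per-rank shape under a simplified sbp model.

    split_axis=None models broadcast (every rank holds the full tensor);
    split_axis=k models split(k) (axis k is sharded across ranks).
    """
    shape = list(global_shape)
    if split_axis is None:
        return tuple(shape)
    if shape[split_axis] % world_size != 0:
        raise ValueError("this sketch only handles evenly divisible axes")
    shape[split_axis] //= world_size
    return tuple(shape)


# Hypothetical global shape (batch, sequence, channels) on 4 ranks.
global_shape = (8, 100, 256)
world_size = 4

hidden_local = local_shape(global_shape, 0, world_size)   # split(0)
pos_local = local_shape(global_shape, None, world_size)   # broadcast

print(hidden_local)  # (2, 100, 256)
print(pos_local)     # (8, 100, 256)
```

Under this model the two operands have different local shapes on each rank, so one of them has to be converted before the element-wise add can run; when both are split(0), the local shapes already match.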

DETR result alignment experiment log

Recording an issue left over from earlier. First, there is the following code, where the transformer returns two outputs and the second one is not used:

```
hs, _ = self.transformer(self.input_proj(src), mask, self.query_embed.weight, pos[-1])
```

Inside the transformer, the logic is as follows:

```
memory = self.encoder(src, src_key_padding_mask=mask, pos=pos_embed)
hs = self.decoder(tgt, memory, memory_key_padding_mask=mask, pos=pos_embed, query_pos=query_embed)
return hs.transpose(1, 2), memory.permute(1, 2,...
```

DETR result alignment experiment log

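**Bug still to be reproduced/investigated**

During training, `RuntimeError: Check failed: in_tensor_desc.is_dynamic() == false` is raised; the cause has not been located yet. After asking guo ran for help, I learned that "the system's handling of is_dynamic is not very complete, and many ops assume they are dealing with the static case". DETR involves a lot of padding and dynamically sized tensors, and it uses many ops such as reshape and permute, which may be the underlying cause.

**I will update here once it has been reproduced/located.**

```
File "/dataset/czq_home/projects/libai/libai/engine/default.py", line 472, in train
  super().train(self.start_iter, self.max_iter)
File "/dataset/czq_home/projects/libai/libai/engine/trainer.py", line 146, in train
  self.run_step()
File "/dataset/czq_home/projects/libai/libai/engine/default.py",...
```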

DETR result alignment experiment log

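> > @Ldpe2G @BBuf Please take a look. This also looks like an operator-side problem in the handling of dynamic shapes, and YOLO will probably run into it as well.
> >
> > It would help to have a minimal reproduction here; the error stack alone is messy and makes the problem hard to locate.

OK, I'm already looking into it, but I haven't figured it out yet. I'll update here once I have reproduction code.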