
Support libai DETR project

HiHippie opened this issue 2 years ago · 26 comments

TODO LIST:

  • [x] coco_dataset preprocessing
  • [x] modeling
  • [x] trainer
  • [x] torch weight loading test (aligned)
  • [x] eager global tensor parallel evaluation results aligned
  • [x] Make the transformer implementation more LiBai-native; the current version borrows heavily from torch.nn.MultiHeadAttention
  • [ ] Get training running

Record of oneflow bugs and unsupported operators

  1. oneflow min/max ops cannot run across different data types
  2. flow.cumsum ~~tensor.cumsum~~
  3. ~~nn.MultiHeadAttention~~
  4. ~~flow.cdist~~
  5. flow.as_tensor cannot explicitly specify the data type when converting from a numpy array
  6. ~~flow.full_like~~
  7. for m in tensor: m[0]=False does not change the tensor's values
  8. tensor.copy_() has no effect
  9. F.interpolate behaves inconsistently with torch
  10. tensor.split has a bug when split_size_or_sections=[x,0]
  11. ~~flow.ByteStorage~~
  12. tensor.unbind raises NotImplementedError on global tensors

HiHippie · Apr 12 '22 03:04

from ~~flowvision~~torchvision.models._utils import IntermediateLayerGetter

Not supported.

HiHippie · Apr 13 '22 02:04

> from ~~flowvision~~torchvision.models._utils import IntermediateLayerGetter
>
> Not supported.

Is this not supported in flowvision? Then I'll go update flowvision and cut a tagged release.

rentainhe · Apr 13 '22 02:04

> > from ~~flowvision~~torchvision.models._utils import IntermediateLayerGetter
> >
> > Not supported.
>
> Is this not supported in flowvision? Then I'll go update flowvision and cut a tagged release.

Yep, not supported~

Haha, sure~ I was just about to work around it for now.

HiHippie · Apr 13 '22 02:04

> > > from ~~flowvision~~torchvision.models._utils import IntermediateLayerGetter
> > >
> > > Not supported.
> >
> > Is this not supported in flowvision? Then I'll go update flowvision and cut a tagged release.
>
> Yep, not supported~
>
> Haha, sure~ I was just about to work around it for now.

Looks like it is actually supported here: https://github.com/Oneflow-Inc/vision/blob/main/flowvision/models/layer_getter.py, the file name just isn't aligned with torchvision's, lol

rentainhe · Apr 13 '22 02:04

> Looks like it is actually supported here: https://github.com/Oneflow-Inc/vision/blob/main/flowvision/models/layer_getter.py, the file name just isn't aligned with torchvision's, lol

Got it~

HiHippie · Apr 13 '22 02:04
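
For later readers, a minimal sketch of using it from that path. This assumes the linked flowvision file keeps torchvision's class name and API, and that flowvision.models.resnet50 exists; neither is verified here:

import flowvision
from flowvision.models.layer_getter import IntermediateLayerGetter

# DETR-style backbone taps: grab the conv4/conv5 feature maps by module name
backbone = flowvision.models.resnet50(pretrained=False)
getter = IntermediateLayerGetter(backbone, return_layers={"layer3": "0", "layer4": "1"})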

oneflow min/max ops cannot run across different data types

>>> flow.__version__
'0.8.0.dev20220411+cu102'
>>> torch.__version__
'1.11.0+cu102'

Minimal reproduction, using float64 vs. float32 as an example (other type combinations behave the same):

torch

>>> import torch
>>> x = torch.randn(5, dtype=torch.float32)
>>> y = torch.randn(5, dtype=torch.float64)
>>> torch.max(x,y)
tensor([ 1.1421,  1.2252,  0.3676,  1.0047, -0.0242], dtype=torch.float64)
>>> torch.min(x,y)
tensor([-0.4623, -0.1920, -0.8689, -0.4471, -0.2798], dtype=torch.float64)

oneflow

>>> x = flow.randn(5, dtype=flow.float32)
>>> y = flow.randn(5, dtype=flow.float64)
>>> flow.max(x,y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
oneflow._oneflow_internal.exception.Exception: 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 139, in Dispatch<oneflow::one::Tensor>
    Dispatch<TensorTuple>(op_expr, inputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 131, in Dispatch<oneflow::one::TensorTuple>
    Dispatch(op_expr, inputs, outputs.get(), ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 96, in Apply
    internal_->Apply(op_expr, inputs, outputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_mirrored_op_interpreter.cpp", line 139, in NaiveInterpret
    user_op_expr.InferPhysicalShapeAndDType( attrs, device_tag ... TensorMeta* { return output_tensor_metas->at(i); })
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_expr.cpp", line 445, in InferPhysicalShapeAndDType
    dtype_infer_fn_(&infer_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/infer_util.cpp", line 54, in UnchangedDataType
    Check failed: (tensor_desc.data_type()) == (first_tensor_desc->data_type()) (3 vs 2)

>>> flow.min(x,y)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
oneflow._oneflow_internal.exception.Exception: 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 139, in Dispatch<oneflow::one::Tensor>
    Dispatch<TensorTuple>(op_expr, inputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter_util.cpp", line 131, in Dispatch<oneflow::one::TensorTuple>
    Dispatch(op_expr, inputs, outputs.get(), ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/op_interpreter.cpp", line 96, in Apply
    internal_->Apply(op_expr, inputs, outputs, ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_interpreter/eager_mirrored_op_interpreter.cpp", line 139, in NaiveInterpret
    user_op_expr.InferPhysicalShapeAndDType( attrs, device_tag ... TensorMeta* { return output_tensor_metas->at(i); })
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/op_expr.cpp", line 445, in InferPhysicalShapeAndDType
    dtype_infer_fn_(&infer_ctx)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/framework/infer_util.cpp", line 54, in UnchangedDataType
    Check failed: (tensor_desc.data_type()) == (first_tensor_desc->data_type()) (3 vs 2)

Synced to https://github.com/Oneflow-Inc/OneTeam/issues/1207

HiHippie · Apr 14 '22 07:04
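
In the meantime, a workaround sketch: promote the lower-precision operand by hand before calling max/min (this assumes torch-style promotion to float64 is the desired result):

import oneflow as flow

x = flow.randn(5, dtype=flow.float32)
y = flow.randn(5, dtype=flow.float64)
z_max = flow.max(x.to(flow.float64), y)  # cast first so both dtypes match
z_min = flow.min(x.to(flow.float64), y)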

flow.cumsum is supported, but tensor.cumsum is not:

>>> flow.__version__
'0.8.0.dev20220411+cu102'
>>> torch.__version__
'1.11.0+cu102'
>>> x = flow.randn(10,10,10)
>>> y = flow.cumsum(x,1)
>>> y = x.cumsum(1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'oneflow._oneflow_internal.Tensor' object has no attribute 'cumsum'
>>> x = torch.randn(10,10,10)
>>> y = torch.cumsum(x,1)
>>> y = x.cumsum(1)

The dtype argument also cannot be specified:

>>> x = flow.randn(5,5)
>>> flow.cumsum(x,dim=0)
tensor([[ 0.0508,  1.0346, -0.7175, -0.2991,  0.7678],
        [ 0.4012,  2.2157, -1.1069,  0.7856,  2.3732],
        [-0.6691,  1.7376, -0.2673,  0.8270,  2.3241],
        [ 0.6488,  2.2601, -1.5217,  1.0009,  2.4177],
        [ 1.0917,  1.9483, -1.0218, -0.4837,  3.5062]], dtype=oneflow.float32)
>>> flow.cumsum(x,dim=0,dtype=flow.float32)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
oneflow._oneflow_internal.exception.Exception: 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/api/python/functional/py_function.cpp", line 40, in ReportKwargsError
    TypeError: cumsum(): got multiple values for argument 'dim'

HiHippie · Apr 20 '22 04:04
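
In the meantime, a workaround sketch: stick to the functional form with a positional dim, and cast afterwards instead of passing dtype:

import oneflow as flow

x = flow.randn(10, 10, 10)
y = flow.cumsum(x, 1)                      # instead of x.cumsum(1)
y64 = flow.cumsum(x, 1).to(flow.float64)   # instead of dtype=flow.float64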

flow.as_tensor cannot explicitly specify the data type when converting from a numpy array

Synced to: https://github.com/Oneflow-Inc/OneTeam/issues/1207#issuecomment-1073432125

>>> flow.__version__
'0.8.0.dev20220417+cu112'
>>> torch.__version__
'1.11.0+cu113'

Minimal reproduction:

flow

>>> x=np.random.randn(10)
>>> flow.as_tensor(x)
tensor([-0.3546, -0.6711, -1.3503,  0.7537,  0.4851,  0.4599,  1.4330,  0.2376,  0.3307, -0.1530], dtype=oneflow.float64)
>>> flow.as_tensor(x, dtype=flow.int64)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chiziqiu/.conda/envs/libai/lib/python3.7/site-packages/oneflow/nn/modules/as_tensor.py", line 51, in as_tensor
    raise TypeError("numpy-ndarray holds elements of unsupported datatype")
TypeError: numpy-ndarray holds elements of unsupported datatype
>>> flow.as_tensor(x, dtype=flow.float64)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/chiziqiu/.conda/envs/libai/lib/python3.7/site-packages/oneflow/nn/modules/as_tensor.py", line 51, in as_tensor
    raise TypeError("numpy-ndarray holds elements of unsupported datatype")
TypeError: numpy-ndarray holds elements of unsupported datatype

torch

>>> torch.as_tensor(x)
tensor([-0.3546, -0.6711, -1.3503,  0.7537,  0.4851,  0.4599,  1.4330,  0.2376,
         0.3307, -0.1530], dtype=torch.float64)
>>> torch.as_tensor(x, dtype=torch.int64)
tensor([ 0,  0, -1,  0,  0,  0,  1,  0,  0,  0])

HiHippie · Apr 24 '22 06:04
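
In the meantime, a workaround sketch (note flow.tensor copies, so as_tensor's zero-copy behavior is lost; that trade-off is assumed acceptable here):

import numpy as np
import oneflow as flow

x = np.random.randn(10)
t = flow.tensor(x, dtype=flow.int64)    # copies, but honors the dtype
t2 = flow.as_tensor(x).to(flow.int64)   # convert first, then cast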

for m in tensor: m[0]=False does not change the tensor's values

>>> flow.__version__
'0.8.0.dev20220417+cu112'
>>> torch.__version__
'1.11.0+cu113'

Minimal reproduction:

oneflow:

>>> mask = flow.ones(10,10)
>>> for m in mask:
...     m[0]=False
... 
>>> m
tensor([0., 1., 1., 1., 1., 1., 1., 1., 1., 1.], dtype=oneflow.float32)
>>> mask
tensor([[1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1., 1., 1., 1., 1., 1.]], dtype=oneflow.float32)

torch:

>>> mask = torch.ones(10,10)
>>> for m in mask:
...     m[0]=False
... 
>>> m
tensor([0., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
>>> mask
tensor([[0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.],
        [0., 1., 1., 1., 1., 1., 1., 1., 1., 1.]])

Synced to https://github.com/Oneflow-Inc/OneTeam/issues/1207#issuecomment-1073432125

HiHippie · Apr 25 '22 07:04
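
Until iteration returns views, a workaround sketch is to write through an index on the parent tensor (this assumes slice assignment behaves in place, as it does in torch):

import oneflow as flow

mask = flow.ones(10, 10)
mask[:, 0] = False   # writes into mask, unlike assigning through the loop variable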

> for m in tensor: m[0]=False does not change the tensor's values

This is probably because our tensor[i] returns a new tensor rather than a view. I believe Luyang is pushing on this?

@Flowingsun007

BBuf · Apr 25 '22 07:04

tensor.copy_() has no effect

>>> flow.__version__
'0.8.0.dev20220417+cu112'
>>> torch.__version__
'1.11.0+cu113'

Minimal reproduction:

flow:

>>> x = flow.ones(5,5)
>>> y = flow.zeros(3,3)
>>> x[:3,:3].copy_(y)
>>> x
tensor([[1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]], dtype=oneflow.float32)

torch:

>>> x = torch.ones(5,5)
>>> y = torch.zeros(3,3)
>>> x[:3,:3].copy_(y)
tensor([[0., 0., 0.],
        [0., 0., 0.],
        [0., 0., 0.]])
>>> x
tensor([[0., 0., 0., 1., 1.],
        [0., 0., 0., 1., 1.],
        [0., 0., 0., 1., 1.],
        [1., 1., 1., 1., 1.],
        [1., 1., 1., 1., 1.]])

HiHippie · Apr 25 '22 08:04
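
A workaround sketch: assign into the slice instead of calling copy_ on it (again assuming slice assignment writes in place):

import oneflow as flow

x = flow.ones(5, 5)
y = flow.zeros(3, 3)
x[:3, :3] = y   # writes into x, unlike x[:3, :3].copy_(y)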

oneflow.nn.functional.interpolate behaves differently from torch.

The output size differs. Some inputs are fine, but others are not; I haven't found the pattern yet, since I don't really understand interpolate's implementation.

>>> flow.__version__
'0.8.0.dev20220417+cu112'
>>> torch.__version__
'1.11.0+cu113'

Minimal reproduction:

flow

>>> x = flow.randn(1,2,1204,937)
>>> s = (38,30)
>>> F.interpolate(x,size=s).shape
oneflow.Size([1, 2, 38, 29])

torch:

>>> x = torch.randn(1, 2, 1204, 937)
>>> s = (38,30)
>>> F.interpolate(x, size=s).shape
torch.Size([1, 2, 38, 30])

Synced to: https://github.com/Oneflow-Inc/OneTeam/issues/1207

HiHippie · Apr 25 '22 10:04
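
To help find the pattern, a small sweep over requested sizes (a sketch; it only compares output shapes against the requested size):

import oneflow as flow
import oneflow.nn.functional as F

x = flow.randn(1, 2, 1204, 937)
for h in range(28, 48):
    for w in range(20, 40):
        out = F.interpolate(x, size=(h, w)).shape
        if (out[2], out[3]) != (h, w):
            print(f"mismatch: requested {(h, w)}, got {(out[2], out[3])}")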

> tensor.copy_() has no effect

The cause is probably similar to the first issue.

BBuf · Apr 25 '22 10:04

> oneflow.nn.functional.interpolate behaves differently from torch. The output size differs.

Let's record this one too; I'll go fix it. https://github.com/Oneflow-Inc/OneTeam/issues/1207#issuecomment-1073432125

BBuf · Apr 25 '22 10:04

> > oneflow.nn.functional.interpolate behaves differently from torch. The output size differs.
>
> Let's record this one too; I'll go fix it. Oneflow-Inc/OneTeam#1207 (comment)

Recorded.

HiHippie · Apr 25 '22 10:04

> flow.as_tensor cannot explicitly specify the data type when converting from a numpy array

Fixed by this PR: https://github.com/Oneflow-Inc/oneflow/pull/8097

BBuf · Apr 26 '22 06:04

Working through this network has still surfaced quite a few incompatibilities with pytorch.

yuanms2 · Apr 30 '22 00:04

> Working through this network has still surfaced quite a few incompatibilities with pytorch.

Yes, but basically all of them are already being worked on.

HiHippie · May 01 '22 01:05

tensor.split has a bug when split_size_or_sections=[x,0]. Summary: it breaks when the 0 is the last entry in the section list; other positions are fine.

Versions:

>>> torch.__version__
'1.11.0+cu113'
>>> flow.__version__
'0.8.0.dev20220511+cu112'

Minimal reproduction:

>>> x = torch.randn(2,100,7)
>>> x.split([7,0],-1)
(tensor([[[ 0.4736, -0.0404, -1.5499,  ...,  1.0757,  0.4028,  0.9903],
         [ 1.8894, -0.4257,  0.2570,  ..., -0.4669, -1.8332, -0.9168],
         [-0.2074,  0.6727, -0.9165,  ..., -1.3757,  1.0796, -1.4637],
         ...,
         [ 1.4639, -0.3440,  0.4957,  ..., -0.4425,  0.9832, -0.1773],
         [ 0.5572, -0.7418,  0.5709,  ..., -0.8357,  0.5164, -1.5137],
         [-0.1484,  0.5784,  0.3132,  ..., -1.7116, -2.4209, -0.6352]],

        [[-0.0512,  0.8071, -0.1806,  ..., -0.6507, -1.7163,  1.2081],
         [-2.1803, -0.2958,  1.4241,  ...,  0.7722, -0.2404, -2.6822],
         [ 0.5165, -0.9405, -0.0473,  ..., -1.7761, -2.6822, -0.2629],
         ...,
         [ 0.1908, -0.8162,  1.2067,  ...,  0.0719,  0.8505, -1.1541],
         [ 1.5042,  0.3226,  1.4068,  ...,  0.2107, -0.4780,  0.6526],
         [-0.4096,  0.9706,  0.6222,  ..., -1.5738,  0.3576, -0.3889]]]), tensor([], size=(2, 100, 0)))
>>> x = flow.randn(2,100,7)
>>> x.split([7,0],-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/dataset/chiziqiu/anaconda3/lib/python3.9/site-packages/oneflow/framework/tensor.py", line 709, in _split
    return flow._C.split(self, split_size_or_sections, dim)
RuntimeError: 
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 2376, in operator()
    Narrow(x, axis, start_idx, length)
  File "/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/core/functional/impl/array_functor.cpp", line 1251, in operator()
    Check failed: (-dim_length <= start) && (start <= dim_length - 1)  (Dimension out of range, expected to be in range of [-3, 2], but got:7)

But [0,7] works fine:

>>> x = flow.randn(2,100,7)
>>> y=x.split([0,7],-1)

Summary: it breaks when the 0 is the last entry in the section list; other positions are fine.

HiHippie · May 12 '22 09:05
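
A workaround sketch until it's fixed: split the nonzero sections with flow.narrow and splice in empty tensors for the zero-length ones. The helper name is made up, and flow.empty with a device keyword is assumed to mirror torch.empty:

import oneflow as flow

def split_allow_zero(x, sections, dim=-1):
    """Like tensor.split(sections, dim), but tolerates zero-length sections."""
    d = dim if dim >= 0 else x.dim() + dim
    outs, start = [], 0
    for s in sections:
        if s == 0:
            shape = list(x.shape)
            shape[d] = 0
            outs.append(flow.empty(*shape, dtype=x.dtype, device=x.device))
        else:
            outs.append(flow.narrow(x, d, start, s))
        start += s
    return tuple(outs)

x = flow.randn(2, 100, 7)
parts = split_allow_zero(x, [7, 0], -1)   # mirrors torch's x.split([7, 0], -1)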

> Narrow(x, axis, start_idx, length)

Give the nightly build a try; this problem should be gone now.

BBuf · May 16 '22 09:05

tensor.unbind does not support global tensors

>>> flow.__version__
'0.8.0.dev20220511+cu112'
>>> x = flow.randn(100,4).to_global(sbp=flow.sbp.broadcast, placement=flow.placement("cuda", ranks=[0]))
>>> x.unbind(-1)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/dataset/chiziqiu/anaconda3/lib/python3.9/site-packages/oneflow/framework/tensor.py", line 713, in _unbind
    return flow._C.unbind(self, dim)
NotImplementedError
>>> x = x.to_local()
>>> x.unbind(-1)
(tensor([ 0.6732, -0.3023,  1.4604,  1.3799, -1.1469, -1.4389,  1.6037,  0.9446, -0.5044,  0.3336,  0.6152,  0.3299,  1.2392,  0.5234,  0.0195, -1.8286, -1.0080,  1.9139, -0.7478,  0.1140,  0.6781, -0.0913, -1.3242, -0.3646, -0.2825, -1.4854, -0.9145,  1.0963,  0.7683, -1.1118,  0.4805, -0.1116, -0.4031,
        -1.1590, -1.8794, -0.2065, -0.3353, -0.2933,  0.0486,  0.7751,  1.0546, -0.9393, -1.3231, -0.9217,  1.1032,  0.3692,  0.2003,  0.5020, -0.1534,  0.0768, -0.4615, -1.1294,  0.5750,  0.5136,  0.9569, -0.4370, -1.2251,  0.8619,  0.7274,  1.2086,  1.2627, -1.4440, -0.0259,  1.0530,  0.3159,  1.6366,
        -0.7254,  1.9394,  1.3874, -1.4916,  0.9147,  1.7787,  0.4056, -0.4525,  1.2214, -1.3924,  0.9251, -0.7092,  0.2885, -1.3293, -1.6186,  0.6288,  1.2619,  1.8240, -1.0310,  0.9776, -0.7870, -0.7614,  0.5374, -0.9490, -0.9730,  0.8723,  2.1137,  0.7878,  1.2088,  2.2680,  2.1485, -0.1677,  0.6356,
        -0.8306], device='cuda:0', dtype=oneflow.float32), tensor([ 0.1140,  0.9664, -0.4128, -0.1795, -0.0924, -0.3642,  0.9534, -1.3775, -0.3605,  1.7204,  0.7058,  0.2884, -1.4371,  0.5512,  0.7113,  2.4053, -1.0669,  0.6128, -0.0496, -0.0412,  0.1771, -0.4215, -1.6474, -1.2480,  1.2520,  1.7337, -0.0214,  1.0931, -0.4014, -0.1295,  1.5144,  0.0573, -0.3434,
         0.5316,  1.2685,  0.0731, -0.6885, -0.4406,  1.1376,  1.1505,  2.0006,  0.8170, -0.6909, -1.9807, -0.4704,  1.4258, -1.0187, -0.1252, -0.8503, -1.2286, -0.1469,  0.6578, -1.6982,  0.9313, -0.0991,  0.7988,  1.0618,  0.3656, -0.5173,  0.1062, -0.3526,  0.1705,  0.6896,  0.1062,  1.6790,  0.8675,
        -0.9219,  1.0535,  2.5108,  0.4058,  1.5565, -0.1119, -0.5495,  0.8565,  1.7205,  0.7336,  1.0147,  0.7349, -0.4325, -0.2354,  0.3967, -1.1067, -0.0503,  0.5430, -0.5466,  1.5222, -0.1019, -0.8322,  0.1298, -0.5529,  0.0965,  0.3214, -0.0042,  1.3415, -1.0824,  0.8408, -1.6040, -0.1292,  0.4468,
         1.2888], device='cuda:0', dtype=oneflow.float32), tensor([ 8.2494e-01, -2.4225e+00,  4.4831e-01,  1.2205e+00, -5.5395e-01, -1.9908e+00, -3.6989e-01,  3.1025e-01, -3.3615e-01,  1.2510e+00, -7.6869e-01,  2.2640e-01,  8.9426e-01, -1.0443e-02,  8.0189e-01, -7.1788e-01, -2.0481e-01, -7.3685e-01, -1.0895e+00,  8.1574e-01, -6.5085e-01, -1.5621e+00,  2.4240e+00,
         9.5201e-01,  6.5284e-02, -2.7226e-01,  5.1478e-01,  9.1148e-01, -8.0543e-01,  3.2088e-01, -5.8496e-01, -7.3560e-01, -7.8625e-01,  3.5526e+00, -9.0569e-02, -6.4349e-01, -1.9497e+00,  9.3549e-02,  7.7196e-01,  1.2225e+00,  9.9349e-01, -1.4940e-01, -4.1041e-01,  3.3358e-01,  4.9947e-01, -8.1111e-01,
        -5.5953e-01, -5.3114e-01,  1.2270e-01,  1.1031e+00, -4.3998e-01, -6.5134e-01,  5.9907e-01, -1.5741e+00,  1.1121e+00,  1.7249e+00, -8.3078e-01,  1.8889e+00,  3.7167e-01,  1.3959e+00,  4.0451e-01, -8.9412e-01, -7.2207e-01, -3.1799e-01,  2.9070e-01, -1.2411e+00,  5.2207e-01,  2.1749e+00,  1.6886e+00,
         3.1185e-01,  7.1245e-04, -1.3975e+00, -5.5818e-02,  2.4448e+00, -7.0328e-01,  8.5506e-01,  4.1534e-01, -1.0993e+00, -1.2930e-01,  1.6159e+00,  5.8933e-01,  2.4079e-01,  2.4609e+00,  4.8775e-01,  3.1148e-01,  5.5383e-01, -4.3484e-01,  1.1865e+00,  5.4809e-01,  1.8185e+00, -1.0388e+00, -5.0242e-01,
        -1.1045e+00,  9.4867e-01, -1.3901e+00, -7.4365e-01,  3.7658e-01,  7.6355e-01, -6.8516e-02,  1.1892e+00], device='cuda:0', dtype=oneflow.float32), tensor([ 1.1310, -0.9968, -0.8175, -1.0691,  1.1561, -0.6521, -0.3950, -1.1697, -0.3019, -0.7170, -1.5917, -0.6279, -0.7104,  0.6003, -0.4562, -0.7400, -0.5367,  0.9139,  0.0510,  0.6054,  0.6953, -1.1960,  1.8443,  0.0790, -1.7794, -0.2629,  0.0626, -0.2257, -0.2238, -1.8894,  1.3829,  2.2447, -0.3194,
         0.1188, -1.1480,  0.1640, -2.7212, -0.0848, -0.1022,  2.7401,  0.3600, -0.6510,  1.4652,  1.4443, -1.0385, -0.8625, -0.3573, -1.0436, -0.5471, -0.3780,  0.2603, -2.6162,  0.0034, -2.3554,  0.9569, -1.1303, -2.0769, -1.4830,  0.2238, -0.3018,  0.6321,  1.0973, -0.7001,  0.0135, -1.1057, -0.1395,
        -0.1630, -1.0537,  0.6513,  0.6935,  1.3550, -0.5250, -1.4301, -1.2223,  0.2209,  0.1352, -0.8554, -0.3600,  2.4356, -0.2436,  0.6964, -0.7971, -0.3240,  1.0740, -0.1335, -0.1686, -0.2754,  1.5222,  0.6987,  0.2988, -0.4435, -0.5215, -0.2787, -1.3216, -1.4181, -1.1776,  0.4957,  1.4997, -0.0745,
        -0.0787], device='cuda:0', dtype=oneflow.float32))

HiHippie · Jun 13 '22 05:06
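
A workaround sketch in the meantime: index out each slice along the dimension instead of calling unbind (basic indexing is assumed to work on global tensors, unlike unbind):

import oneflow as flow

x = flow.randn(100, 4).to_global(
    sbp=flow.sbp.broadcast, placement=flow.placement("cuda", ranks=[0])
)
# One tensor per column, matching what x.unbind(-1) would return
cols = tuple(x[..., i] for i in range(x.shape[-1]))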

def get_default_optimizer_params(
    model,
    base_lr=None,
    weight_decay=None,
    weight_decay_norm=None,
    weight_decay_bias=None,
    clip_grad_max_norm=None,
    clip_grad_norm_type=None,
    overrides=None,
):
    """
    Get default param list for optimizer, with support for a few types of overrides.
    If no overrides are needed, it is equivalent to `model.parameters()`.

    Arguments:
        base_lr: lr for every group by default. Can be omitted to use the one in optimizer.
        weight_decay: weight decay for every group by default. Can be omitted to use the one
            in optimizer.
        weight_decay_norm: override weight decay for params in normalization layers
        weight_decay_bias: override weight decay for bias parameters
        overrides: if not `None`, provides values for optimizer hyperparameters
            (LR, weight decay) for module parameters with a given name; e.g.
            ``{"embedding": {"lr": 0.01, "weight_decay": 0.1}}`` will set the LR and
            weight decay values for all module parameters named `embedding`.

    For common transformer models, ``weight_decay_norm`` and ``weight_decay_bias``
    are usually set to 0.

    Example:
    ::

        flow.optim.AdamW(
            get_default_optimizer_params(model, weight_decay_norm=0, weight_decay_bias=0),
            lr=0.01,
            weight_decay=1e-4
        )
    """
    if overrides is None:
        overrides = {}
    defaults = {}
    if base_lr is not None:
        defaults["lr"] = base_lr
    if weight_decay is not None:
        defaults["weight_decay"] = weight_decay
    if clip_grad_max_norm is not None and clip_grad_norm_type is not None:
        defaults["clip_grad_max_norm"] = clip_grad_max_norm
        defaults["clip_grad_norm_type"] = clip_grad_norm_type
    bias_overrides = {}
    if weight_decay_bias is not None:
        bias_overrides["weight_decay"] = weight_decay_bias
    if len(bias_overrides):
        if "bias" in overrides:
            raise ValueError("Conflicting overrides for 'bias'")
        overrides["bias"] = bias_overrides

    norm_module_types = (
        LayerNorm,
        flow.nn.BatchNorm1d,
        flow.nn.BatchNorm2d,
        flow.nn.BatchNorm3d,
        flow.nn.GroupNorm,
        flow.nn.InstanceNorm1d,
        flow.nn.InstanceNorm2d,
        flow.nn.InstanceNorm3d,
        flow.nn.FusedBatchNorm1d,
        flow.nn.FusedBatchNorm2d,
        flow.nn.FusedBatchNorm3d,
    )
    params = []
    memo = set()
    for module in model.modules():
        for model_param_name, value in module.named_parameters(recurse=False):
            if not value.requires_grad:
                continue
            # Avoid duplicating parameters
            if value in memo:
                continue
            memo.add(value)

            hyperparams = copy.copy(defaults)
            if isinstance(module, norm_module_types) and weight_decay_norm is not None:
                hyperparams["weight_decay"] = weight_decay_norm
            hyperparams.update(overrides.get(model_param_name, {}))
            params.append({"params": [value], **hyperparams})
    return reduce_param_groups(params)

In this libai optimizer code, why does module.named_parameters(recurse=False) pass recurse=False? With True, model_param_name carries fuller information, e.g. you could match parameters with "transformer" in model_param_name. Is there any particular reason for setting it to False? @CPFLAME

HiHippie · Jun 21 '22 09:06

for module in model.modules() is already a recursive traversal, so the parameters inside each module are not recursed again; if you want the full names, you can build them from the module. @HiHippie

L1aoXingyu · Jun 21 '22 10:06
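
A sketch of what that looks like in practice: reconstruct fully qualified names from named_modules, matching what named_parameters(recurse=True) would yield (function name is illustrative):

import oneflow as flow

def full_param_names(model: flow.nn.Module):
    """Yield fully qualified parameter names, e.g. 'transformer.layers.0.weight'."""
    for module_name, module in model.named_modules():
        for param_name, _ in module.named_parameters(recurse=False):
            yield f"{module_name}.{param_name}" if module_name else param_name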

> for module in model.modules() is already a recursive traversal, so the parameters inside each module are not recursed again; if you want the full names, you can build them from the module. @HiHippie

Got it~

HiHippie · Jun 21 '22 11:06

Ziqiu, please keep track of whether the issues you reported from DETR have been fixed.

yuanms2 · Aug 24 '22 01:08

> Ziqiu, please keep track of whether the issues you reported from DETR have been fixed.

Will do, Prof. Yuan.

HiHippie · Aug 24 '22 02:08