Request: libai MAE should support data parallelism, pipeline parallelism, and model parallelism in graph mode

Open KellyZhang2020 opened this issue 2 years ago • 5 comments

KellyZhang2020 avatar Apr 12 '22 00:04 KellyZhang2020

OK, we will work on this step by step.

rentainhe avatar Apr 15 '22 06:04 rentainhe

Interfaces found missing while migrating MAE-pytorch to MAE-oneflow (tracked together with the operator compatibility plan)

  • [ ] torch.cuda.synchronize
  • [ ] torch.cuda.max_memory_allocated
  • [ ] torch.nn.parallel.DistributedDataParallel(): arguments are not aligned with PyTorch
  • [ ] the tensor.median() method is missing
  • [ ] oneflow.nn.utils.clip_grad_norm_ does not accept None
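A checklist like the one above can also be checked mechanically. The following is only a minimal sketch: `find_missing` is a hypothetical helper (not part of libai or OneFlow) that walks dotted attribute names on any module and reports the ones that cannot be resolved. It is demonstrated on the standard-library `math` module so that it runs without either framework installed.

```python
import math


def find_missing(root, dotted_names):
    """Return the dotted names that cannot be resolved as attributes of root."""
    missing = []
    for name in dotted_names:
        obj = root
        for part in name.split("."):
            if not hasattr(obj, part):
                missing.append(name)
                break
            obj = getattr(obj, part)
    return missing


# Stand-in demo on a stdlib module; "no_such_function" is deliberately bogus.
missing_in_math = find_missing(math, ["sqrt", "floor", "no_such_function"])
print(missing_in_math)  # ['no_such_function']
```

With OneFlow installed, one could presumably pass `flow` and names such as `"cuda.synchronize"` or `"Tensor.median"`; the result depends on the installed version. Submodules that are only available after an explicit import may be reported as missing by this simple attribute walk.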

rentainhe avatar Apr 15 '22 06:04 rentainhe

Could you describe these in a bit more detail? For example, paste an example of a mismatch or of the error you get.

@rentainhe

BBuf avatar Apr 18 '22 02:04 BBuf

Could you describe these in a bit more detail? For example, paste an example of a mismatch or of the error you get.

@rentainhe

OK, I will put this together with the user.

rentainhe avatar Apr 18 '22 02:04 rentainhe

Minimal reproduction examples

  • tensor.median()
import torch
x = torch.randn(1, 2, 4)
print(x.median())  # works in PyTorch

import oneflow as flow
y = flow.randn(1, 2, 4)
print(y.median())  # fails in OneFlow: the tensor.median() method is missing
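Until a native method is available, a workaround would need to match what `torch.median` returns without a `dim` argument: the median of all elements, and for an even element count the lower of the two middle values. The sketch below only illustrates that index rule in plain Python so it runs without either framework. `lower_median` is a hypothetical name, and whether a sort-based tensor version behaves identically in OneFlow is an assumption that would need checking.

```python
def lower_median(values):
    """Median of a flat sequence; for an even count, return the lower middle value."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of an empty sequence is undefined")
    # Index (n - 1) // 2 selects the middle element for odd n
    # and the lower of the two middle elements for even n.
    return ordered[(len(ordered) - 1) // 2]


odd_result = lower_median([5.0, 1.0, 3.0])
even_result = lower_median([3.0, 1.0, 4.0, 2.0])
print(odd_result, even_result)  # 3.0 2.0
```

A tensor-based version would flatten the input, sort it, and pick the element at index `(n - 1) // 2`.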
  • torch.cuda.synchronize
  • torch.cuda.max_memory_allocated

These two probably have no corresponding interface in OneFlow.
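When a call such as the two above has no counterpart, training code can guard it instead of failing outright. The sketch below is one possible pattern, not an existing libai or OneFlow API: `call_if_available` is a hypothetical helper, and `fake_framework` is a stand-in object used only so the example runs without a GPU framework installed.

```python
from types import SimpleNamespace


def call_if_available(root, dotted_name, *args, default=None, **kwargs):
    """Call root.<dotted_name>(*args, **kwargs) if it exists, else return default."""
    obj = root
    for part in dotted_name.split("."):
        obj = getattr(obj, part, None)
        if obj is None:
            return default
    return obj(*args, **kwargs)


# Hypothetical stand-in for a framework that provides cuda.synchronize
# but lacks cuda.max_memory_allocated.
fake_framework = SimpleNamespace(cuda=SimpleNamespace(synchronize=lambda: "synced"))

sync_result = call_if_available(fake_framework, "cuda.synchronize")
peak_bytes = call_if_available(fake_framework, "cuda.max_memory_allocated", default=0)
print(sync_result, peak_bytes)  # synced 0
```

Silently returning a default is only reasonable for calls used for logging or timing; a missing call that affects correctness should raise instead.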

  • torch.nn.parallel.DistributedDataParallel(): arguments are not aligned
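import torch
# Constructor signature of torch.nn.parallel.DistributedDataParallel, listed below.
# It is only referenced here, because calling it without `module` raises TypeError.
torch.nn.parallel.DistributedDataParallel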
"""
Args:
    module,
    device_ids=None,
    output_device=None,
    dim=0,
    broadcast_buffers=True,
    process_group=None,
    bucket_cap_mb=25,
    find_unused_parameters=False,
    check_reduction=False,
    gradient_as_bucket_view=False,
    static_graph=False,
"""

import oneflow.nn.parallel as parallel
# Signature of the OneFlow counterpart, listed below (`module` is required).
parallel.DistributedDataParallel
"""
Args:
    module: "flow.nn.Module"
    broadcast_buffers: bool = True,
    bucket_size: int = 10
"""
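Given the two signatures above, code ported from PyTorch may pass keyword arguments that the OneFlow wrapper does not accept. One way to bridge this during migration is to filter the keyword arguments against the target signature and report what was dropped. This is only a sketch: `split_supported_kwargs` is a hypothetical helper, and `oneflow_style_ddp` is a stand-in function that merely mirrors the OneFlow signature quoted above, so the example runs without either framework.

```python
import inspect


def split_supported_kwargs(func, kwargs):
    """Split kwargs into (accepted, dropped) according to func's signature."""
    params = inspect.signature(func).parameters
    # A **kwargs parameter means the function accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs), {}
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    allowed = {name for name, p in params.items() if p.kind in keyword_kinds}
    accepted = {k: v for k, v in kwargs.items() if k in allowed}
    dropped = {k: v for k, v in kwargs.items() if k not in allowed}
    return accepted, dropped


def oneflow_style_ddp(module, broadcast_buffers=True, bucket_size=10):
    # Hypothetical stand-in; it only mirrors the OneFlow signature quoted above.
    return (module, broadcast_buffers, bucket_size)


torch_style_kwargs = {
    "device_ids": [0],
    "broadcast_buffers": False,
    "find_unused_parameters": True,
}
accepted, dropped = split_supported_kwargs(oneflow_style_ddp, torch_style_kwargs)
wrapped = oneflow_style_ddp("model", **accepted)
print(accepted)         # {'broadcast_buffers': False}
print(sorted(dropped))  # ['device_ids', 'find_unused_parameters']
```

Dropping an argument changes behavior, so the `dropped` dictionary should at least be logged. Arguments that look related, such as `bucket_cap_mb` and `bucket_size`, should not be mapped onto each other without confirming that their meaning and units match.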

rentainhe avatar Apr 21 '22 02:04 rentainhe