libai: Support data parallelism, pipeline parallelism, and model parallelism in graph mode for libai MAE

Open KellyZhang2020 opened this issue 2 years ago • 5 comments

Apr 12 '22 00:04 KellyZhang2020

OK, we will work through this step by step.

Apr 15 '22 06:04 rentainhe

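Summary of APIs found missing while migrating MAE-pytorch to MAE-oneflow (kept in sync with the operator compatibility plan)

[ ] torch.cuda.synchronize
[ ] torch.cuda.max_memory_allocated
[ ] torch.nn.parallel.DistributedDataParallel(): arguments are not aligned
[ ] tensor.median() method is missing
[ ] oneflow.nn.utils.clip_grad_norm_ does not accept None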

Apr 15 '22 06:04 rentainhe

Could you write this up in a bit more detail? For example, post an example of the mismatch or of the error.

@rentainhe

Apr 18 '22 02:04 BBuf

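> Could you write this up in a bit more detail? For example, post an example of the mismatch or of the error.
>
> @rentainhe

OK, I will put this together with the user.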

Apr 18 '22 02:04 rentainhe

Minimal reproduction examples

tensor.median()

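import torch
x = torch.randn(1, 2, 4)
print(x.median())  # PyTorch: prints the median over all elements

import oneflow as flow
y = flow.randn(1, 2, 4)
print(y.median())  # OneFlow: reported missing, Tensor.median() is not available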
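
Until `Tensor.median()` is available, a sort-based fallback is one possible workaround. The sketch below is plain Python and only illustrates the selection rule; `lower_median` is a hypothetical helper, not a OneFlow or PyTorch API. It follows PyTorch's documented convention of returning the lower of the two middle values when the element count is even. Porting it to tensors would mean flattening and sorting with whatever sort operator the framework provides, which is an assumption not checked here.

```python
from typing import Iterable


def lower_median(values: Iterable[float]) -> float:
    """Return the median, taking the lower middle value for even counts."""
    ordered = sorted(values)
    if not ordered:
        raise ValueError("median of an empty sequence is undefined")
    # Index (n - 1) // 2 selects the middle element for odd n and the
    # lower of the two middle elements for even n.
    return ordered[(len(ordered) - 1) // 2]


if __name__ == "__main__":
    # Values of a flattened tensor, written out as a plain list.
    print(lower_median([3.0, 1.0, 2.0, 4.0]))
```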

torch.cuda.synchronize
torch.cuda.max_memory_allocated

These two do not appear to have a corresponding OneFlow interface.
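
While these interfaces are missing, a migrated training script could guard the call sites instead of failing with `AttributeError`. The following is a minimal sketch of that idea in plain Python: `call_if_available` is a hypothetical helper, and the `SimpleNamespace` objects are stand-ins for a framework's `cuda` namespace, not real OneFlow or PyTorch modules.

```python
from types import SimpleNamespace
from typing import Any


def call_if_available(namespace: Any, name: str, *args: Any,
                      default: Any = None, **kwargs: Any) -> Any:
    """Call namespace.<name>(*args, **kwargs) if it exists, else return default."""
    func = getattr(namespace, name, None)
    if callable(func):
        return func(*args, **kwargs)
    return default


if __name__ == "__main__":
    # Hypothetical stand-ins: one backend exposes the call, the other does not.
    full_backend = SimpleNamespace(max_memory_allocated=lambda: 1024)
    partial_backend = SimpleNamespace()
    print(call_if_available(full_backend, "max_memory_allocated", default=0))
    print(call_if_available(partial_backend, "max_memory_allocated", default=0))
```

Skipping `synchronize` this way keeps the script running but makes wall-clock timings less reliable, so it is a stopgap rather than a replacement for the missing interface.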

torch.nn.parallel.DistributedDataParallel(): arguments are not aligned

import torch
torch.nn.parallel.DistributedDataParallel()
"""
Args:
    module,
    device_ids=None,
    output_device=None,
    dim=0,
    broadcast_buffers=True,
    process_group=None,
    bucket_cap_mb=25,
    find_unused_parameters=False,
    check_reduction=False,
    gradient_as_bucket_view=False,
    static_graph=False,
"""

import oneflow.nn.parallel as parallel
parallel.DistributedDataParallel()
"""
Args:
    module: "flow.nn.Module",
    broadcast_buffers: bool = True,
    bucket_size: int = 10,
"""
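
Given the two argument lists above, one way to keep PyTorch-style call sites working during migration is to pass only the keyword arguments the target constructor accepts and report the rest. This is a sketch under stated assumptions: `split_supported_kwargs` and `reduced_ddp` are hypothetical names, and `reduced_ddp` merely mimics the shorter argument list quoted above rather than calling OneFlow.

```python
import inspect
from typing import Any, Callable, Dict, Tuple


def split_supported_kwargs(
    target: Callable[..., Any], kwargs: Dict[str, Any]
) -> Tuple[Dict[str, Any], Dict[str, Any]]:
    """Split kwargs into those `target` accepts by keyword and those it does not."""
    params = inspect.signature(target).parameters.values()
    # A **kwargs parameter means the target accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params):
        return dict(kwargs), {}
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    accepted = {p.name for p in params if p.kind in keyword_kinds}
    supported = {k: v for k, v in kwargs.items() if k in accepted}
    dropped = {k: v for k, v in kwargs.items() if k not in accepted}
    return supported, dropped


def reduced_ddp(module, broadcast_buffers=True, bucket_size=10):
    """Hypothetical stand-in with the shorter argument list quoted above."""
    return {"module": module, "broadcast_buffers": broadcast_buffers,
            "bucket_size": bucket_size}


if __name__ == "__main__":
    torch_style = {"device_ids": [0], "broadcast_buffers": False,
                   "find_unused_parameters": True}
    supported, dropped = split_supported_kwargs(reduced_ddp, torch_style)
    print(reduced_ddp("model", **supported))
    print(sorted(dropped))  # arguments the caller should review by hand
```

Dropped arguments such as `find_unused_parameters` can change training behavior, so they are returned for review rather than silently discarded. The sketch deliberately does not map `bucket_cap_mb` to `bucket_size`, since the thread does not say whether the two share units or semantics.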

Apr 21 '22 02:04 rentainhe
