huang wei
# Added dalle2; still a work in progress, recording the problems encountered along the way. Environment: python 3.6.9, oneflow 0.7.0+cu112. The code is largely based on [dalle2_pytorch](https://github.com/lucidrains/DALLE2-pytorch); there are still parts of it I haven't fully understood :( . The port is basically `import torch` -> `import oneflow as flow`, then replacing every 'torch' in the code with 'flow' and adjusting the argument format of a few interfaces; with that it mostly runs :) .
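The mechanical part of the port described above can be sketched as a plain textual substitution (a minimal illustration only; real porting still requires the per-interface argument fixes mentioned):

```python
# Hedged sketch: the bulk of the torch -> oneflow port is a text substitution
# of the import line plus every 'torch.' call prefix.
src = (
    "import torch\n"
    "x = torch.randn(2, 3)\n"
    "y = torch.matmul(x, x.t())\n"
)

ported = (
    src.replace("import torch", "import oneflow as flow")
       .replace("torch.", "flow.")
)
print(ported)
```

After the substitution, the snippet calls `flow.randn` and `flow.matmul` with the same signatures, which is why most of the code runs unchanged.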
When oneflow performs matrix multiplication on tensors containing a size-0 dimension, it raises an error:

```
>>> import torch
>>> import oneflow as flow
loaded library: /lib/x86_64-linux-gnu/libibverbs.so.1
>>> torch.__version__
'1.10.2'
>>> flow.__version__
'0.8.1.dev20220903+cu112'
>>> a = torch.randn(0, 5)
>>> b = torch.randn(5, 6)
...
```
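For reference, the expected behavior (as in torch) is that a size-0 dimension denotes a valid empty tensor and matmul simply produces an empty result. A NumPy analogue of that expectation:

```python
import numpy as np

# A matmul whose left operand has 0 rows is well-defined:
# the result is an empty (0, 6) array rather than an error.
a = np.zeros((0, 5))
b = np.zeros((5, 6))
c = a @ b
print(c.shape)  # (0, 6)
```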
fused kernels in alphafold
## Summary

The result dtype of fp16 division differs from torch.

## Code to reproduce bug

```python
>>> import oneflow as flow
>>> a = flow.randn(3, 3, dtype=flow.float16).cuda()
>>> b = flow.randn(3, 3, dtype=flow.float16).cuda()
>>> a/b
tensor([[-2.1495e-03,  1.5983e+00, -5.2973e-01],
        [-1.7968e-01, -4.0361e+00,  5.4459e-01],
...
```
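The torch behavior the issue expects is that dividing two float16 tensors stays in float16. NumPy follows the same type-promotion rule, which serves as a framework-free illustration:

```python
import numpy as np

# Elementwise division of two float16 arrays keeps the float16 dtype;
# this is the promotion behavior the issue expects oneflow to match.
a = np.random.randn(3, 3).astype(np.float16)
b = np.random.randn(3, 3).astype(np.float16)
c = a / b
print(c.dtype)  # float16
```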
Hello, I noticed that the sliding window size may differ between the prefill stage and the decode stage, since in the prefill stage the current token is visible along...
Optimize llama model-parallel inference by putting all the CUDA kernels of each LlamaDecoderLayer into one large op, minimizing the latency of issuing instructions from the Python layer.
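A hedged, framework-free sketch of the motivation: issuing many small ops from Python pays a per-call dispatch cost, while one fused call does the same work with a single Python-level dispatch. Plain Python functions stand in for CUDA kernel launches here; the names are illustrative only:

```python
import time

def small_op(x):
    # stands in for one small CUDA kernel launched from Python
    return x + 1

def fused_op(x, n):
    # stands in for one large fused op: same total work, one Python-level call
    for _ in range(n):
        x += 1
    return x

N = 100_000

t0 = time.perf_counter()
x = 0
for _ in range(N):
    x = small_op(x)          # N separate Python-level dispatches
t_many = time.perf_counter() - t0

t0 = time.perf_counter()
y = fused_op(0, N)           # one dispatch, work "fused" inside
t_fused = time.perf_counter() - t0

print(x == y)                # same result either way
print(f"many calls: {t_many:.4f}s  fused: {t_fused:.4f}s")
```

The fused path produces an identical result while avoiding the per-call overhead, which is the same trade the one-big-op-per-decoder-layer design makes.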
Hello, I wonder if the position id of the query is the same as that of the key, or is it the actual generated context length ([this comment is confusing me](https://github.com/mit-han-lab/streaming-llm/blob/d729b3ffc947caca63fc0f7644b7468ca2d50881/streaming_llm/pos_shift/modify_llama.py#L89))? For example, as mentioned...