binbinHan comments

Results: 7 comments by binbinHan

compared the output results of acceleration schemes from both the deepcache and onediff versions

@onefish51 DeepCache is a lossy algorithm. To stay closer to the output of the original algorithm, you can set cache_interval to a smaller value, or adjust cache_layer_id and cache_block_id to...
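
To see why a smaller cache_interval moves the result toward the original algorithm, here is a toy sketch of a caching schedule. It assumes the cache is refreshed by a full forward pass on every step whose index is a multiple of cache_interval and reused on the steps in between. The function name `full_compute_steps` is hypothetical, and this is not the onediff or DeepCache code.

```python
from typing import List


def full_compute_steps(num_steps: int, cache_interval: int) -> List[int]:
    """Return the denoising steps that would run the full network.

    Assumption (illustrative only): the cache is refreshed on every step
    whose index is a multiple of cache_interval and reused otherwise.
    """
    if cache_interval < 1:
        raise ValueError("cache_interval must be >= 1")
    return [step for step in range(num_steps) if step % cache_interval == 0]


if __name__ == "__main__":
    for interval in (1, 3, 5):
        steps = full_compute_steps(10, interval)
        print(f"cache_interval={interval}: full passes at steps {steps}")
```

In this simplified model, cache_interval=1 makes every step a full pass, so nothing is approximated; larger values reuse cached features on more steps and trade accuracy for speed.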

dynamic batch size failed

@HydrogenQAQ Sorry, I cannot reproduce the error with your script. Can you tell us the version of diffusers in your environment? Or maybe you can update oneflow and onediff, then...

Tutorial documentation: distributed topics

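2. 1D parallelism @clackhan [Global tensor](https://docs.oneflow.org/master/parallelism/03_consistent_tensor.html) makes it easy to support any kind of parallelism, including data parallelism and model parallelism, and can run across multiple machines. > **Note:** The code in this tutorial runs on a 2-GPU server, but it generalizes easily to other environments. - [ ] Data parallelism - Model construction: in data-parallel mode, every GPU holds a complete copy of the model parameters, the parameters are identical on all devices, and each rank receives different input data. Next we train a data-parallel network in Global mode. The first step is to create the model: the code below defines a network with two fully connected layers and extends it to two GPUs. > **Note:** When the single-device model is extended to two GPUs with to_global, the parameters on rank 0 are broadcast to the other ranks, so there is no need to worry about different processes starting from different initial parameter values. ```python import oneflow as flow import oneflow.nn as...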

nccl not support for float16?

I cannot reproduce your problem. Can you print the dtype of the input and the weight before the matmul to make sure they are the same? If the dtype of input and weight...
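
A minimal sketch of the check suggested above, assuming only that both tensors expose a `.dtype` attribute (as oneflow, PyTorch, and NumPy tensors do). The helper name `report_dtypes` is hypothetical.

```python
def report_dtypes(input_tensor, weight) -> bool:
    """Print both dtypes and return True when they match.

    Works with any object exposing a .dtype attribute; call it right
    before the matmul that fails.
    """
    same = bool(input_tensor.dtype == weight.dtype)
    print(
        f"input dtype: {input_tensor.dtype}, "
        f"weight dtype: {weight.dtype}, match: {same}"
    )
    return same
```

Calling `report_dtypes(x, w)` immediately before the failing operation shows whether a float16/float32 mismatch is involved.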

Fused llama kernel

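> Is this how fast transformer does it?

fast transformer is a pure C++ implementation and can be regarded as a special-purpose implementation. The code implements a `Llama` class and compiles into an executable binary. At run time a Llama instance is created; when the object is created, all memory needed for the computation is allocated at once, and it is released all at once on destruction. Because the computation is pure C++ and no memory is allocated anywhere in the process, launching operators is very fast. `Llama` is currently still a third-party PR and has no Python implementation. The more mature implementations in the fast transformer main repository, such as GPT, follow basically the same pattern: their PyTorch and TensorFlow implementations wrap the C++ `class GptOp` and export it to Python.

> Does the Python implementation of llama need manual changes, or is it done automatically through pattern matching?

Using the fused operator requires changing the code by hand.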

Fused llama kernel

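> > When this object is created, all memory needed for the computation is allocated at once and released all at once on destruction; because the computation is pure C++ and no memory is allocated anywhere in the process
>
> Earlier a dynamic-shape problem at inference time was mentioned. Does it take the max and allocate memory for that?

Yes, it allocates the maximum memory required.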
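
The allocate-once strategy discussed here can be sketched in a few lines. This is a toy Python illustration, not the fast transformer code: the class name `Workspace` and its methods are hypothetical, and a `bytearray` stands in for device memory.

```python
class Workspace:
    """Allocate one buffer sized for the largest input, then hand out views."""

    def __init__(self, max_batch: int, max_seq_len: int, bytes_per_token: int):
        self.bytes_per_token = bytes_per_token
        self.capacity = max_batch * max_seq_len * bytes_per_token
        # The only allocation: sized for the maximum shape up front.
        self._buffer = bytearray(self.capacity)

    def view(self, batch: int, seq_len: int) -> memoryview:
        """Return a view for the current shape without allocating new memory."""
        needed = batch * seq_len * self.bytes_per_token
        if needed > self.capacity:
            raise ValueError("request exceeds the preallocated maximum")
        return memoryview(self._buffer)[:needed]


if __name__ == "__main__":
    ws = Workspace(max_batch=4, max_seq_len=128, bytes_per_token=2)
    small = ws.view(batch=1, seq_len=16)
    print(len(small), ws.capacity)
```

Any shape up to the maximum is served from the same buffer, which is why a dynamic input shape does not trigger new allocations in this scheme; the cost is that the maximum is reserved even for small inputs.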

save_pipe and load_pipe not work

@forestlet This is caused by the VAE's force_upcast setting. You need to run the following code before load_pipe:

```python
import torch

# Upcast the VAE first when it is fp16 and its config forces upcasting.
if pipe.vae.dtype == torch.float16 and pipe.vae.config.force_upcast:
    pipe.upcast_vae()
```

And we will integrate...
