
Assistance Required with tvmjs Integration and webgpu.get_fmap Error


Firstly, I'd like to express my admiration for the remarkable work done on this project. The advancements and capabilities it offers are truly impressive.

I've been following the provided "walkthrough.ipynb" to familiarize myself with the pipeline. Unfortunately, the trace step fails, possibly due to updates in the diffusers library. To work around this, I opted for a simplified network module, as demonstrated below:

import torch
import torch.nn as nn

class Net(nn.Module):
    def __init__(self):
        super().__init__()

    def forward(self, x):
        return x + 1

# dtype/device/input as used in the trace call below (values per the exported graph)
net, dtype, device, input = Net(), torch.float32, "cpu", torch.rand(1, 3)

Following this, I proceeded to convert the network to ONNX format and subsequently to IR:

trace = torch.jit.trace(net, input.to(dtype).to(device))

torch.onnx.export(
    trace, input.to(dtype).to(device), "test/net.onnx", verbose=True, input_names=["input"], output_names=["input"],
)
# Exported graph: graph(%input.1 : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu)):
#   %/Constant_output_0 : Float(requires_grad=0, device=cpu) = onnx::Constant[value={1}, onnx_name="/Constant"](), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   %input : Float(1, 3, strides=[3, 1], requires_grad=0, device=cpu) = onnx::Add[onnx_name="/Add"](%input.1, %/Constant_output_0), scope: Net:: # /tmp/ipykernel_3004661/4051825751.py:6:0
#   return (%input)

# ============= Diagnostic Run torch.onnx.export version 2.0.0+cu117 =============
# verbose: False, log level: Level.ERROR
# ======================= 0 NONE 0 NOTE 0 WARNING 0 ERROR ========================
import onnx
from tvm.relax.frontend.onnx import from_onnx

onnx_model_path = "test/net.onnx"
model = onnx.load(onnx_model_path)
tvm_model = from_onnx(model, keep_params_in_input=True)
tvm_model

# # from tvm.script import ir as I
# # from tvm.script import relax as R

# @I.ir_module
# class Module:
#     @R.function
#     def main(input_1: R.Tensor((1, 3), dtype="float32")) -> R.Tensor((1, 3), dtype="float32"):
#         R.func_attr({"num_input": 1})
#         with R.dataflow():
#             gv: R.Tensor((1, 3), dtype="float32") = R.add(input_1, R.const(1, "float32"))
#             R.output(gv)
#         return gv
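If it helps, the module can also be exercised locally with the relax VM on plain llvm before going to wasm; a quick sketch of that check (standard relax VM API, not part of the original notebook):

import numpy as np
import tvm
from tvm import relax

# Local sanity check: build the imported module for native llvm and call main()
# with the relax VM on CPU; the toy network just computes x + 1.
ex_cpu = relax.build(tvm_model, target="llvm")
vm = relax.VirtualMachine(ex_cpu, tvm.cpu())
out = vm["main"](tvm.nd.array(np.ones((1, 3), dtype="float32")))
print(out.numpy())  # expect [[2. 2. 2.]]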

After that, I compiled it to wasm:

import tvm
from tvm import relax

tvm_model, model_params = relax.frontend.detach_params(tvm_model) # no params actually
target = tvm.target.Target(
    "webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm"
)
ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net.wasm")

Finally, I used the following JS to run it:

const tvmjs = require("./public/dist/tvmjs.bundle.js");
const EmccWASI = require("./public/dist/tvmjs_runtime.wasi.js");


window.tvmjs = tvmjs

async function asyncInitTVM() {


    const wasmSource = await (
        await fetch("./public/net.wasm")
    ).arrayBuffer();


    logger = function (message) {
        console.log(message);
    };

    const tvm = await tvmjs.instantiate(
        new Uint8Array(wasmSource),
        new EmccWASI(),
        logger
    );

    const output = await tvmjs.detectGPUDevice();
    if (output !== undefined) {
        var label = "WebGPU";
        if (output.adapterInfo.description.length != 0) {
            label += " - " + output.adapterInfo.description;
        } else {
            label += " - " + output.adapterInfo.vendor;
        }
        console.log("Initialize GPU device: " + label);
        tvm.initWebGPU(output.device);
    } else {
        console.log("This browser env do not support WebGPU");
    }



    let device, vm, net;
    tvm.withNewScope(() => {
        device = tvm.webgpu();
        // device = tvm.cpu();
        vm = tvm.detachFromCurrentScope(tvm.createVirtualMachine(device));
        net = tvm.detachFromCurrentScope(vm.getFunction("main"));
    })

    await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());

    const input_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([1, 1, 1])
        )
    });
    const input_gpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", device)
        )
    });

    input_gpu.copyFrom(input_cpu);
    await tvm.webgpu().sync();
    console.log("input_cpu", input_cpu.toArray());

    let output_gpu;
    tvm.withNewScope(() => {
        output_gpu = net(input_gpu);
        output_gpu = tvm.detachFromCurrentScope(output_gpu);
    });


    const output_cpu = tvm.withNewScope(() => {
        return tvm.detachFromCurrentScope(
            tvm.empty([1, 3], "float32", tvm.cpu()).copyFrom([2, 3, 4])
        )
    });

    output_cpu.copyFrom(output_gpu);
    await tvm.webgpu().sync();
    console.log("output_cpu", output_cpu.toArray());

}

asyncInitTVM()

However, I've hit a roadblock during the execution phase, particularly at await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule());, where the console outputs the following error:

tvmjs.bundle.js:1863  Uncaught (in promise) Error: Cannot find function webgpu.get_fmap
    at Module.getFunction (tvmjs.bundle.js:1863:23)
    at Instance.eval (tvmjs.bundle.js:2791:38)
    at Generator.next (<anonymous>)
    at eval (tvmjs.bundle.js:28:75)
    at new Promise (<anonymous>)
    at __awaiter (tvmjs.bundle.js:24:16)
    at Instance.asyncLoadWebGPUPipelines (tvmjs.bundle.js:2786:20)
    at asyncInitTVM (main.js:48:15)

In addition, I found that when I build with llvm as the target instead of webgpu, use tvm.cpu() as the device, and skip this line, the example works.
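Roughly, the working CPU variant of the build looks like this (a sketch; the target string is reconstructed from the webgpu one above, and the output path is illustrative):

import tvm
from tvm import relax

# CPU (wasm) build that works for me; on the JS side I then use tvm.cpu()
# as the device and skip asyncLoadWebGPUPipelines.
target_cpu = tvm.target.Target("llvm -mtriple=wasm32-unknown-unknown-wasm")
ex = relax.build(mod=tvm_model, target=target_cpu)
ex.export_library("test/net_cpu.wasm")  # illustrative output path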

Given the scarcity of detailed documentation and tutorials on integrating custom networks with tvmjs, especially regarding WebGPU support, I find myself in need of your expertise and guidance.

Could you please help me identify any potential missteps in my approach? I am particularly interested in ensuring that my network can be successfully operated using tvmjs and would greatly appreciate any insights or suggestions you might have.

Thank you very much for your time and assistance.

ZoneLikeWonderland avatar Feb 28 '24 18:02 ZoneLikeWonderland

Same error... did you solve it?

senlyu163 avatar Apr 17 '24 11:04 senlyu163

Same error... did you solve it?

not yet👀

ZoneLikeWonderland avatar Apr 17 '24 11:04 ZoneLikeWonderland

@tqchen Sorry to bother you. If it's convenient, can you give some suggestions?

senlyu163 avatar Apr 18 '24 03:04 senlyu163

@ZoneLikeWonderland The error is still reported on Ubuntu, but it works normally on an Apple M2 Pro. I am new to TVM and don't know why...

senlyu163 avatar Apr 18 '24 11:04 senlyu163

@senlyu163 Sorry to bother you. Following the getting-started documentation, I've successfully run build.py and deploy.py, and web deployment itself goes smoothly. However, I encountered the following issue at generation time:

Generate error, GPUPipelineError: Entry-point uses workgroup_size(1024, 1, 1) that exceeds the maximum allowed (256, 256, 64).
    at ValidateComputeStageWorkgroupSize (../../third_party/dawn/src/dawn/native/ShaderModule.cpp:1004)


My environment: Python 3.11, macOS 14.5, Apple M2. Do you have any suggestions? Thanks.

abacaaaaaa avatar May 13 '24 03:05 abacaaaaaa

@senlyu163 Sorry to bother you. Following the getting-started documentation, I've successfully run build.py and deploy.py, and web deployment itself goes smoothly. However, I encountered the following issue at generation time:

Generate error, GPUPipelineError: Entry-point uses workgroup_size(1024, 1, 1) that exceeds the maximum allowed (256, 256, 64).
    at ValidateComputeStageWorkgroupSize (../../third_party/dawn/src/dawn/native/ShaderModule.cpp:1004)


My environment: Python 3.11, macOS 14.5, Apple M2. Do you have any suggestions? Thanks.

I also encountered this problem when deploying on the web, probably because of WebGPU limitations. You can tune with the configuration in the following code:

    ms.relax_integration.tune_relax(
        mod=mod,
        target=tvm.target.Target("apple/m1-gpu-restricted"),
        ......
    )

I tested several configurations and succeeded under the “apple/m1-gpu-restricted” configuration.
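As far as I understand, the restricted tag mainly lowers the thread cap baked into the target, which is what keeps the tuned kernels under WebGPU's workgroup limit. A quick way to inspect it (a sketch; the attribute name may differ across TVM versions):

import tvm

# The restricted tag caps max_num_threads, so tuned kernels stay under the
# workgroup_size limit from the error above (1024 threads was too many).
target = tvm.target.Target("apple/m1-gpu-restricted")
print(target.max_num_threads)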

senlyu163 avatar May 13 '24 03:05 senlyu163

@senlyu163 Thank you for your response. I’ve found multiple instances of ms.relax_integration.tune_relax in the TVM source code and made modifications to all the files accordingly. However, the issue still persists. Could you please provide detailed information on the files that require modification? Thank you very much.

abacaaaaaa avatar May 13 '24 03:05 abacaaaaaa

@abacaaaaaa

  1. You only need to specify the target as "apple/m1-gpu-restricted" when generating the log_db in the tune phase.
  2. Just use the repo's default code during the lib compilation phase (roughly what that code does is sketched below).

I think the problem you are hitting is that the GPU thread count configured during the tune phase is too large.

It is recommended that you refer to guoyaol's repo: https://github.com/guoyaol/web-real-esrgan
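For context, the lib compilation phase in those repos just loads the tuned log_db and applies it to the module before relax.build, roughly like this (a sketch from memory; the variable names and path are illustrative, and the repo's build.py is the reference):

import tvm
from tvm import meta_schedule as ms
from tvm import relax

# Sketch: load the tuned records and let them rewrite the TIR functions before build.
# "log_db", mod, and target here are illustrative placeholders.
db = ms.database.create(work_dir="log_db")
with target, db, tvm.transform.PassContext(opt_level=3):
    mod = relax.transform.MetaScheduleApplyDatabase()(mod)
ex = relax.build(mod, target)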

senlyu163 avatar May 13 '24 04:05 senlyu163

@senlyu163 Sorry to bother you. I tried specifying apple/m1-gpu-restricted when generating the log_db, referring to https://github.com/guoyaol/web-real-esrgan and https://github.com/happyme531/RK3588-stable-diffusion-GPU, but the issue still persists.

My code is as follows:

import tvm

def tune(mod: tvm.IRModule) -> None:
    from tvm import meta_schedule as ms

    ms.relax_integration.tune_relax(
        mod=mod,
        target=tvm.target.Target("apple/m1-gpu-restricted"),
        params={},
        builder=ms.builder.LocalBuilder(
            max_workers=2,
        ),
        runner=ms.runner.LocalRunner(),
        work_dir="log_db_tuning_1000_small",
        max_trials_global=5000,
        max_trials_per_task=200,
        strategy=ms.search_strategy.EvolutionarySearch(init_min_unmeasured=10, max_fail_count=15),
    )

I’m just tinkering with the parameters — max_trials_global, max_trials_per_task, init_min_unmeasured, and max_fail_count — and they might not be entirely reasonable. However, even after this tentative tuning, the current log_db still throws errors on the web. Could you kindly provide some suggestions? If it’s convenient for you, could you please share the tune file or log_db files? Thank you very much.

abacaaaaaa avatar May 16 '24 10:05 abacaaaaaa

@abacaaaaaa The tune code looks correct.

I verified WebGPU with the web-real-esrgan project but haven't tried stable-diffusion yet, so I cannot provide a log_db for SD. You can reproduce the web-real-esrgan work first and then migrate it to SD.

I am new to TVM, so I cannot offer more professional advice.

senlyu163 avatar May 17 '24 11:05 senlyu163

Hi, is there anyone who made progress with the webgpu.get_fmap error?

grf53 avatar Jul 08 '24 06:07 grf53

I found a way to get past await tvm.asyncLoadWebGPUPipelines(vm.getInternalModule()); without the webgpu.get_fmap error.

I realized that models built with the https://github.com/mlc-ai/mlc-llm project don't have that problem. There are quite a few differences, but I was able to narrow down the key one.

There is a build 'pipeline' when you build an Executable from the relax module: https://github.com/apache/tvm/blob/9e88018c3a56ab378dd11410a662ed5c3da1f4df/python/tvm/relax/vm_build.py#L259. Without any manipulation of the pipeline, your only options are 'zero' or 'default'. Meanwhile, the pipeline used by the mlc-llm build includes quite a few more steps (compiler passes): https://github.com/mlc-ai/mlc-llm/blob/551f3fee7eaa09e6f024e2866510625842111c7d/python/mlc_llm/compiler_pass/pipeline.py#L77. Among those extra passes, the key one is the step that applies the schedules from tvm.dlight: https://github.com/mlc-ai/mlc-llm/blob/551f3fee7eaa09e6f024e2866510625842111c7d/python/mlc_llm/compiler_pass/pipeline.py#L142-L148

After adding that step to the pipeline, the built wasm files no longer hit the webgpu.get_fmap problem.
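Concretely, on the small repro from the top of the thread, the missing step amounts to something like this (a rough sketch, not the exact mlc-llm pipeline; I actually put it inside the build pipeline rather than applying it to the module up front, but the idea is the same, and tvm_model / target reuse the names from the first post purely for illustration):

import tvm
from tvm import dlight as dl
from tvm import relax

# Sketch: lower the relax ops to TIR, then let dlight attach GPU schedules
# (thread bindings); without this step the wasm I built was missing the
# webgpu function map.
with tvm.target.Target("webgpu"):
    seq = tvm.transform.Sequential(
        [
            relax.transform.LegalizeOps(),
            dl.ApplyDefaultSchedule(
                dl.gpu.Matmul(),
                dl.gpu.GEMV(),
                dl.gpu.Reduction(),
                dl.gpu.GeneralReduction(),
                dl.gpu.Fallback(),
            ),
        ]
    )
    tvm_model = seq(tvm_model)

ex = relax.build(mod=tvm_model, target=target)
ex.export_library("test/net.wasm")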

But I still have trouble using tvm with WebGPU: copying TVM NDArrays between devices (cpu <-> gpu) does not seem to work properly.

I hope this helps someone with the same issue and can be part of a constructive discussion to help me resolve my remaining issues.

grf53 avatar Jul 30 '24 16:07 grf53

@grf53 Thank you for the solution to the webgpu.get_fmap error. I attempted to incorporate the tvm.dlight schedules into the build pipeline as suggested. However, instead of resolving the initial error, I encountered a new issue after adding these schedules: the build process now fails for convolution operations with a kernel size of 1. This behavior exactly matches the problem described in https://github.com/mlc-ai/mlc-llm/issues/2276.

I would be immensely grateful if @grf53 could provide some additional guidance:

  1. Specific Implementation Details: Could you kindly provide more details about how and where you precisely integrated the tvm.dlight Schedules into the build pipeline?
  2. Additional Adaptations: In addition to applying the tvm.dlight Schedules, were there any other modifications or adjustments you made to the code or the build process?
  3. Build Environment: If possible, could you please share details about your build environment? I'm currently working with Ubuntu 20.04 and TVM v0.16.0.

Any further insights you can offer would be incredibly valuable in helping me overcome this hurdle. Thank you so much for your time and assistance!

Yumin-gd avatar Aug 22 '24 07:08 Yumin-gd

@Yumin-gd Hey, I can't help with that particular issue, but let me answer your questions.

Specific Implementation Details: Could you kindly provide more details about how and where you precisely integrated the tvm.dlight Schedules into the build pipeline?

I wrote my own function that returns a 'module pass' (pipeline) by mimicking https://github.com/mlc-ai/relax/blob/mlc/python/tvm/relax/pipeline.py. (I used mlc-ai/relax as the tvm repo.)

def my_build_pipeline(
    ...
):
    from tvm import dlight as dl

    @tvm.transform.module_pass(opt_level=0)
    def _pipeline(mod: tvm.ir.IRModule, _ctx: tvm.transform.PassContext) -> tvm.ir.IRModule:
        seq = tvm.transform.Sequential(
            [
                ...
            ]
        )
        mod = seq(mod)
        return mod

    return _pipeline

In that pipeline, the following step is included alongside the existing steps of the default pipeline in https://github.com/mlc-ai/relax/blob/mlc/python/tvm/relax/pipeline.py.

[
    ...
    dl.ApplyDefaultSchedule(
        dl.gpu.Matmul(),
        dl.gpu.GEMV(),
        dl.gpu.Reduction(),
        dl.gpu.GeneralReduction(),
        dl.gpu.Fallback(),
    ),
    ...
]

Alternatively, I also tried just adding that step into the existing default_build_pipeline() in pipeline.py, and that had the same effect.
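Either way, the resulting pipeline is what gets handed to relax.build, roughly like this (illustrative; my_build_pipeline is the helper sketched above, mod is the relax module to compile, and the keyword was pipeline in the mlc-ai/relax version I used, so it may be named differently elsewhere):

import tvm
from tvm import relax

# Illustrative: pass the custom module pass instead of the built-in "zero"/"default".
target = tvm.target.Target("webgpu", host="llvm -mtriple=wasm32-unknown-unknown-wasm")
ex = relax.build(mod, target, pipeline=my_build_pipeline())
ex.export_library("net.wasm")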

Additional Adaptations: In addition to applying the tvm.dlight Schedules, were there any other modifications or adjustments you made to the code or the build process?

I checked my file changes, and there are no other modifications inside the tvm (mlc-ai/relax) repo itself; I only modified my own code that uses the tvm library, and only the pipeline part.

Build Environment: If possible, could you please share details about your build environment? I'm currently working with Ubuntu 20.04 and TVM v0.16.0.

I am using a MacBook Pro with an M3 Pro chip. The OS version is currently Sonoma 14.6.1 (not significantly different from back then). I used the mlc-ai/relax repo as tvm, on the default branch named 'mlc'. I have just checked the exact commit hash: f5f048bbd71513f087799f987019e3931f68a6d9, which no longer appears to be on the 'mlc' branch.

Any further insights you can offer would be incredibly valuable in helping me overcome this hurdle. Thank you so much for your time and assistance!

I'm not sure what else would help you. One thought that helped me while using tvm is that the project is still moving quickly, so existing code easily goes out of date. As far as I can see, progress and maintenance are heavily focused on the 'currently attractive topics' like generative AI, LLMs, etc., so maybe we can find the clue more quickly by looking into that kind of work.

grf53 avatar Aug 25 '24 06:08 grf53