
MNIST Inference Web example not working

thekevinscott opened this issue Jan 30 '24 · 11 comments

Describe the bug The MNIST Inference web example is not working. It appears to be trying to load pkg/mnist_inference_web.js but no such file exists.

To Reproduce

cd examples/mnist-inference-web/
./build-for-web.sh wgpu
./run-server.sh

Expected behavior I expect the example to work.

Screenshots: [screenshot attached: 2024-01-30 at 6:38 AM]

Desktop (please complete the following information):

  • OS: MacOS
  • Browser Chrome
  • Version 120.0.6099.216 (Official Build) (arm64)

thekevinscott commented Jan 30 '24

./build-for-web.sh ndarray

appears to work, but introduces new errors into the console:

[screenshot attached: 2024-01-30 at 6:40 AM]
mnist_inference_web.js:315 panicked at /Users/thekevinscott/code/burn/burn-core/src/record/memory.rs:39:85:
called `Result::unwrap()` on an `Err` value: Utf8 { inner: Utf8Error { valid_up_to: 38, error_len: Some(1) } }

Stack:

Error
    at imports.wbg.__wbg_new_abda76e883ba8a5f (http://localhost:8000/pkg/mnist_inference_web.js:299:21)
    at http://localhost:8000/pkg/mnist_inference_web_bg.wasm:wasm-function[187]:0x4c6cb
    at http://localhost:8000/pkg/mnist_inference_web_bg.wasm:wasm-function[237]:0x51717
    at http://localhost:8000/pkg/mnist_inference_web_bg.wasm:wasm-function[206]:0x4d414
    at http://localhost:8000/pkg/mnist_inference_web_bg.wasm:wasm-function[107]:0x39ce6
    at http://localhost:8000/pkg/mnist_inference_web_bg.wasm:wasm-function[125]:0x477a5
    at http://localhost:8000/pkg/mnist_inference_web_bg.wasm:wasm-function[265]:0x51bbb
    at __wbg_adapter_18 (http://localhost:8000/pkg/mnist_inference_web.js:75:10)
    at real (http://localhost:8000/pkg/mnist_inference_web.js:60:20)

Uncaught RuntimeError: unreachable
    at mnist_inference_web_bg.wasm:0x4c7de
    at mnist_inference_web_bg.wasm:0x51717
    at mnist_inference_web_bg.wasm:0x4d414
    at mnist_inference_web_bg.wasm:0x39ce6
    at mnist_inference_web_bg.wasm:0x477a5
    at mnist_inference_web_bg.wasm:0x51bbb
    at __wbg_adapter_18 (mnist_inference_web.js:75:10)
    at real (mnist_inference_web.js:60:20)

Uncaught Error: recursive use of an object detected which would lead to unsafe aliasing in rust
    at imports.wbg.__wbindgen_throw (mnist_inference_web.js:321:15)
    at mnist_inference_web_bg.wasm:0x5229a
    at mnist_inference_web_bg.wasm:0x522b6
    at mnist_inference_web_bg.wasm:0x4516a
    at mnist_inference_web_bg.wasm:0x477a5
    at mnist_inference_web_bg.wasm:0x51bbb
    at __wbg_adapter_18 (mnist_inference_web.js:75:10)
    at real (mnist_inference_web.js:60:20)

thekevinscott commented Jan 30 '24

I saw the same thing with ndarray. I was able to fix it by re-creating the model in the mnist example and copying model.bin into this example. Maybe it has something to do with the different CPU architectures the models are created on?

I cannot get wgpu to work either; I see a different error, and different errors in different browsers. I enabled the WebGPU and WebAssembly feature flags in Chrome and Brave, but I am wondering if there are some non-obvious options that also need to be enabled.

error in Chrome: panicked at 'called `Option::unwrap()` on a `None` value', /Users/eric/Downloads/burn/burn-wgpu/src/compute/base.rs:120

error in Brave: panicked at 'An home directory should exist', burn-compute/src/tune/tune_cache.rs:25

The home directory definitely exists, so I don't know what is going on there, especially since the ndarray example works. (Presumably the wasm module running in the browser sandbox cannot see the host filesystem's home directory at all.) I tried enabling the Shared GPUImageDecodeCache browser flag, along with rasterization, but I still see the same error.

ericcarmi commented Feb 01 '24

@ericcarmi Maybe we need to update the record. With the bin recorder, versions can be problematic. Testing with wgpu isn't trivial since not all platforms support WebGPU.

nathanielsimard commented Feb 02 '24

We need to review whether the model changed. If the model does not match the record, you'll get a mismatch. I propose we store a string representation of the record as part of the metadata.
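
Roughly what I have in mind, as a hedged sketch (every name here is hypothetical, not an existing Burn API): serialize a small metadata header alongside the record and validate it before deserializing the weights, so a mismatch produces a readable error instead of a Utf8 panic.

// Hypothetical record header; not part of Burn today.
#[derive(serde::Serialize, serde::Deserialize)]
struct RecordMeta {
    /// Burn version that produced the record.
    burn_version: String,
    /// String representation of the record's structure,
    /// e.g. module names and tensor shapes.
    layout: String,
}

/// Compare the stored layout against what the current model expects.
fn check_record_meta(meta: &RecordMeta, expected_layout: &str) -> Result<(), String> {
    if meta.layout != expected_layout {
        return Err(format!(
            "record layout mismatch: model expects `{expected_layout}`, \
             file was saved as `{}` (burn {})",
            meta.layout, meta.burn_version
        ));
    }
    Ok(())
}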

antimora commented Feb 02 '24

Yeah, that was it. Retraining the model worked with the ndarray backend. [screenshot attached]

Unable to get the wgpu backend working on Chrome following these steps: https://github.com/tensorflow/tfjs/issues/8065#issuecomment-1808785524

[screenshot attached]

jacobdineen commented Mar 06 '24

@ericcarmi Maybe we need to update the record. With the bin recorder, versions can be problematic. Testing with wgpu isn't trivial since not all platforms support WebGPU.

Isn't headless Chrome + Puppeteer a solution? https://developer.chrome.com/blog/supercharge-web-ai-testing seems like a good start.

iSuslov commented Apr 20 '24

The example doesn't compile currently, since the import location and signature of init_async (for the Wgpu backend) have changed.
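
For reference, the initialization now looks roughly like this (a sketch only; the exact module path and arguments have moved between Burn releases, so check the current burn::backend::wgpu docs):

use burn::backend::wgpu::{init_async, AutoGraphicsApi, WgpuDevice};

// WebGPU adapter/device acquisition is asynchronous in the browser, so
// this must run inside an async context such as
// wasm_bindgen_futures::spawn_local.
async fn setup_wgpu() {
    let device = WgpuDevice::default();
    // The second argument is the runtime options; Default::default()
    // stands in for whatever the current release expects.
    init_async::<AutoGraphicsApi>(&device, Default::default()).await;
}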

ActuallyHappening commented May 26 '24

I created a PR to fix some issues with the wasm examples: #1824

nathanielsimard commented May 27 '24

Hi, @nathanielsimard @antimora

Thank you for fixing things in the PR. When I tried mnist-inference-web, I ran into the following issues with both wgpu and ndarray, so I am reporting them here along with a suggestion. I tried re-creating model.bin by running the original mnist example and copying it over, but that did not fix the bug.

Desktop (please complete the following information):

  • OS: MacOS
  • Browser Chrome
  • Version 127.0.6533.100 (Official Build) (arm64)

Summary and Suggestion

I suggest using the ONNX model from https://github.com/tracel-ai/burn/tree/main/examples/onnx-inference, instead of model.bin.

  1. The key point of this example seems to be combining Burn and WASM to run inference in the browser. Whether the model comes from a binary record or an ONNX file looks unimportant.
  2. According to the README, model.bin is created by https://github.com/tracel-ai/burn/tree/main/examples/mnist. It looks hard and tedious to keep model.rs in sync with changes to the original mnist code and its binary model.
  3. If we adopt the ONNX model, we no longer need to worry about the issue above, and it is likely to be easier to prevent regressions, as sketched below.
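
For concreteness, the onnx-inference example generates the Rust model code from an ONNX file in a build script. Adopting that pattern here would look roughly like the sketch below (based on that example's build.rs; the mnist.onnx path is my assumption):

// build.rs: generate Rust source for the model from an ONNX file,
// following the pattern used by examples/onnx-inference.
use burn_import::onnx::ModelGen;

fn main() {
    ModelGen::new()
        .input("src/model/mnist.onnx") // assumed location of the ONNX file
        .out_dir("model/") // generated code is written under OUT_DIR/model/
        .run_from_script();
}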

If the above content looks good to you, I will be happy to be in charge of the task.

Case webgpu

commands to reproduce

% cd examples/mnist-inference-web 
% ./build-for-web.sh wgpu
% ./run-server.sh

[screenshot attached: webgpu error]

Case ndarray

commands to reproduce

% cd examples/mnist-inference-web
% ./build-for-web.sh ndarray             
% ./run-server.sh           

[screenshot attached: ndarray error]

Differences between the original mnist model and the mnist-inference-web model

Sorry, I could not check the entire diff. At the very least, the comment // Originally copied from the burn/examples/mnist package may be incorrect and cause some confusion, even excluding the training and validation sections.

% diff -c examples/mnist/src/model.rs examples/mnist-inference-web/src/model.rs
*** examples/mnist/src/model.rs	Wed Jul 10 10:50:54 2024
--- examples/mnist-inference-web/src/model.rs	Mon Aug 12 17:44:35 2024
***************
*** 1,9 ****
! use crate::data::MnistBatch;
  use burn::{
!     nn::{loss::CrossEntropyLossConfig, BatchNorm, PaddingConfig2d},
      prelude::*,
-     tensor::backend::AutodiffBackend,
-     train::{ClassificationOutput, TrainOutput, TrainStep, ValidStep},
  };
  
  #[derive(Module, Debug)]
--- 1,10 ----
! #![allow(clippy::new_without_default)]
! 
! // Originally copied from the burn/examples/mnist package
! 
  use burn::{
!     nn::{BatchNorm, PaddingConfig2d},
      prelude::*,
  };
  
  #[derive(Module, Debug)]
***************
*** 17,29 ****
      activation: nn::Gelu,
  }
  
- impl<B: Backend> Default for Model<B> {
-     fn default() -> Self {
-         let device = B::Device::default();
-         Self::new(&device)
-     }
- }
- 
  const NUM_CLASSES: usize = 10;
  
  impl<B: Backend> Model<B> {
--- 18,23 ----
***************
*** 45,53 ****
              conv1,
              conv2,
              conv3,
-             dropout,
              fc1,
              fc2,
              activation: nn::Gelu::new(),
          }
      }
--- 39,47 ----
              conv1,
              conv2,
              conv3,
              fc1,
              fc2,
+             dropout,
              activation: nn::Gelu::new(),
          }
      }
***************
*** 69,88 ****
  
          self.fc2.forward(x)
      }
- 
-     pub fn forward_classification(&self, item: MnistBatch<B>) -> ClassificationOutput<B> {
-         let targets = item.targets;
-         let output = self.forward(item.images);
-         let loss = CrossEntropyLossConfig::new()
-             .init(&output.device())
-             .forward(output.clone(), targets.clone());
- 
-         ClassificationOutput {
-             loss,
-             output,
-             targets,
-         }
-     }
  }
  
  #[derive(Module, Debug)]
--- 63,68 ----
***************
*** 111,129 ****
          let x = self.norm.forward(x);
  
          self.activation.forward(x)
...


Thank you.

tiruka commented Aug 13 '24

@tiruka, we could use the ONNX model since it's more stable, but I feel like it changes the nature of the example. We may have tons of existing references and documentation pointing at this web example. Also, I feel it would confuse others by making them think one needs ONNX to build for the web. We already have image-classification-web, which uses an ONNX file.

I think the proper way forward is to add a test that loads model.bin (panicking if it fails) and hook it up to our CI. We should catch breakage early.
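
A minimal sketch of such a test (assuming the ndarray backend and the same BinBytesRecorder loading path the example uses; ModelRecord is the record type that #[derive(Module)] generates):

#[cfg(test)]
mod tests {
    use burn::backend::NdArray;
    use burn::record::{BinBytesRecorder, FullPrecisionSettings, Recorder};

    use crate::model::{Model, ModelRecord};

    /// Fails in CI as soon as model.bin stops matching the current
    /// model definition or record format.
    #[test]
    fn model_bin_loads() {
        type B = NdArray<f32>;
        let device = Default::default();
        let bytes = include_bytes!("../model.bin").to_vec();
        let record: ModelRecord<B> =
            BinBytesRecorder::<FullPrecisionSettings>::default()
                .load(bytes, &device) // signature varies across Burn releases
                .expect("model.bin must deserialize against the current model");
        let _model = Model::<B>::new(&device).load_record(record);
    }
}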

antimora commented Aug 13 '24

@antimora Thank you for your reply.

I think the proper way forward is to add a test that loads model.bin (panicking if it fails) and hook it up to our CI. We should catch breakage early.

I understand your reasoning and withdraw my proposal. I will be happy to cooperate on fixing the bug if necessary; just let me know.

tiruka commented Aug 13 '24