[js/webgpu] Float16Array polyfill for uniform
To use this feature, first create an f16 model from an f32 model:
```python
import onnx
from onnxconverter_common import float16

model = onnx.load("pad_constant_f32_opset8.onnx")
model_fp16 = float16.convert_float_to_float16(model)
onnx.save(model_fp16, "pad_constant_f16_opset8.onnx")
```
Then test it like this:
<script src="./web/dist/ort.all.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@petamoriken/float16/browser/float16.min.js"></script>
<script>
const { Float16Array } = float16;
async function main() {
try {
const session = await ort.InferenceSession.create('./pad_constant_f16_opset8.onnx', { executionProviders: ['webgpu'] });
let dataA = Float16Array.from([1.2, 2.5, 3.5, 4, 5, 6.1, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12,
1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]);
const tensorA = new ort.Tensor('float16', dataA, [1, 3, 4, 5]);
// prepare feeds. use model input names as keys.
const feeds = { x: tensorA };
const results = await session.run(feeds);
// read from results
const dataC = results.y.cpuData;
console.log(`data of result tensor 'c': ${dataC}`);
} catch (e) {
console.error(`failed to inference ONNX model: ${e}.`);
}
}
main();
</script>
It seems that for operator Pad, if the type parameter T is float16, the operator's 3rd input (input[2]) is float16 as well, and our code reads that scalar value on the CPU. This is why the Float16Array polyfill was introduced into the code base.
We need to think carefully about this situation: do we really need to use the value on the CPU? At least for this case, the answer is no.

In Pad's usage, what the code actually does is load the 2-byte value onto the CPU and then copy it to the GPU via a uniform. The CPU code (in JavaScript) never actually interprets that value.
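For illustration, here is a minimal sketch of what that "copy to GPU via uniform" amounts to in raw WebGPU terms. This is not the actual kernel code; `device` and `padValueBits` are assumed inputs, and using f16 inside WGSL uniforms additionally requires the `shader-f16` feature.

```ts
// Illustration only: upload a single f16 scalar as a uniform without ever
// interpreting it as a float on the CPU.
declare const device: GPUDevice; // assumed: an acquired WebGPU device
declare const padValueBits: number; // assumed: raw binary16 bits (e.g. 0x3c00 for 1.0)

const uniformBuffer = device.createBuffer({
  size: 16, // padded up; uniform data is 16-byte aligned
  usage: GPUBufferUsage.UNIFORM | GPUBufferUsage.COPY_DST,
});

// Carry the 2-byte value in a Uint16Array. writeBuffer requires the write
// size to be a multiple of 4 bytes, which the padding satisfies.
const staging = new Uint16Array(8);
staging[0] = padValueBits;
device.queue.writeBuffer(uniformBuffer, 0, staging);
```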
So, we can think of an imperfect solution: use a Uint16Array to "carry" the data of a float16 value. Since every uint16 value round-trips losslessly through a JavaScript number (a float64), we can always use Uint16Array/number to represent the CPU-side data of an f16 in a safe manner.
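For illustration, the bit extraction could look like the following sketch. `float32ToFloat16Bits` is a hypothetical helper written for this note, not code from this PR, and it truncates instead of rounding to nearest-even, which a real implementation would want.

```ts
// Hypothetical helper: extract the IEEE-754 binary16 bit pattern of a JS
// number so the bits can be carried in a Uint16Array.
function float32ToFloat16Bits(value: number): number {
  const f32 = new Float32Array(1);
  const u32 = new Uint32Array(f32.buffer);
  f32[0] = value; // narrow float64 -> float32 first
  const x = u32[0];
  const sign = (x >>> 16) & 0x8000;
  const exp = (x >>> 23) & 0xff;
  let mant = x & 0x7fffff;
  if (exp === 0xff) return sign | 0x7c00 | (mant ? 0x200 : 0); // Inf / NaN
  const e = exp - 127 + 15; // re-bias: f32 bias 127 -> f16 bias 15
  if (e >= 0x1f) return sign | 0x7c00; // overflow -> Inf
  if (e <= 0) {
    if (e < -10) return sign; // underflow -> signed zero
    mant = (mant | 0x800000) >>> (1 - e); // f16 subnormal
    return sign | (mant >>> 13); // truncate (no round-to-nearest-even)
  }
  return sign | (e << 10) | (mant >>> 13); // truncate mantissa to 10 bits
}

// The uint16 bits round-trip exactly through a JS number (float64):
const bits = float32ToFloat16Bits(1.0); // 0x3c00
const carrier = Uint16Array.of(bits); // safe CPU-side representation of the f16
```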
The full change goes beyond the scope of this PR. There are actually 3 problems that we are aiming to resolve:

1. support users passing in any Float16Array polyfill from the API, if one is available;
2. support f16 uniform passing from the CPU (yes, the value has to be on the CPU anyway to be used as a uniform variable); and
3. support f16 in unit tests (consume a real f16 polyfill via devDependencies).

They can be worked on in 3 different PRs.
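As a purely hypothetical sketch of what problem (1) could look like from the user's side (the registration point below does not exist in onnxruntime-web today; the property name is invented for illustration):

```ts
// Hypothetical illustration only -- this registration point does not exist;
// the property name is made up for this sketch.
import { Float16Array } from '@petamoriken/float16';
import * as ort from 'onnxruntime-web';

// Imagined API: tell the runtime which Float16Array implementation to use
// when constructing or reading tensors of type 'float16'.
(ort.env as any).float16ArrayConstructor = Float16Array;
```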