candle some token duplicated in candle-examples trocr

Open artavash opened this issue 7 months ago • 0 comments

Hello, there appears to be a bug in the trocr example, possibly in the way images are tokenized.

For example, this image produces 754754.7 instead of 754.7

I added some debug print statements and noticed that token 39382, which decodes to 754, appears twice in a row: in the output below it is emitted at both start_pos 0 and start_pos 1.

cargo run --example trocr --release -- --which base-printed --image candle-examples/examples/trocr/assets/bug.png
    Finished release [optimized] target(s) in 3.11s
     Running `target/release/examples/trocr --which base-printed --image candle-examples/examples/trocr/assets/bug.png`
Running on CPU, to run on GPU, build this example with `--features cuda`
model: "/home/xxx/.cache/huggingface/hub/models--microsoft--trocr-base-printed/snapshots/24216f24cd78fe1a9c8b4e6e4565aec5c9220e63/model.safetensors"
context_size - 1 -
start_pos - 0 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 39382 -
t: 754
------------
context_size - 1 -
start_pos - 1 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 39382 -
t: 754
------------
context_size - 1 -
start_pos - 2 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 4 -
------------
context_size - 1 -
start_pos - 3 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 406 -
t: .7
------------
context_size - 1 -
start_pos - 4 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 821 -
t:  g
------------
context_size - 1 -
start_pos - 5 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 2 -
------------

On the other hand, an almost identical image that begins and ends with a 2 instead of a 7 is OK.

cargo run --example trocr --release -- --which base-printed --image candle-examples/examples/trocr/assets/nobug.png
    Finished release [optimized] target(s) in 3.32s
     Running `target/release/examples/trocr --which base-printed --image candle-examples/examples/trocr/assets/nobug.png`
Running on CPU, to run on GPU, build this example with `--features cuda`
model: "/home/xxx/.cache/huggingface/hub/models--microsoft--trocr-base-printed/snapshots/24216f24cd78fe1a9c8b4e6e4565aec5c9220e63/model.safetensors"
context_size - 1 -
start_pos - 0 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 30959 -
t: 254
------------
context_size - 1 -
start_pos - 1 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 4 -
------------
context_size - 1 -
start_pos - 2 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 176 -
t: .2
------------
context_size - 1 -
start_pos - 3 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 2 -
------------
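
For comparing the two runs above, a small check over the emitted token ids can flag immediate repeats. This is only a minimal sketch: the ids are copied by hand from the debug output above, and `adjacent_repeats` is a hypothetical helper name, not part of candle or the trocr example.

```rust
/// Returns (position, token_id) for every token that equals its predecessor.
/// Hypothetical debugging helper; not part of candle.
fn adjacent_repeats(token_ids: &[u32]) -> Vec<(usize, u32)> {
    token_ids
        .windows(2)
        .enumerate()
        // Keep only the pairs where the same id was emitted twice in a row.
        .filter(|(_, pair)| pair[0] == pair[1])
        // Report the position of the second (repeated) token.
        .map(|(i, pair)| (i + 1, pair[1]))
        .collect()
}

fn main() {
    // Token ids copied from the bug.png debug output.
    let bug = [39382u32, 39382, 4, 406, 821, 2];
    // Token ids copied from the nobug.png debug output.
    let nobug = [30959u32, 4, 176, 2];

    println!("bug.png repeats: {:?}", adjacent_repeats(&bug));
    println!("nobug.png repeats: {:?}", adjacent_repeats(&nobug));
}
```

On these two sequences it reports a single repeat of 39382 at position 1 for bug.png and none for nobug.png, which matches what the logs show by eye.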

The model itself appears to be fine. I checked it using the huggingface.co API:

In case the uncertainty about whether the "g" is a "g" or an "8" was throwing it off, I also tried an image without the "g". Here is the result (no debug prints); again the 754 segment appears twice.

cargo run --example trocr --release -- --which base-printed --image candle-examples/examples/trocr/assets/nog.png
   Compiling candle-examples v0.6.0 (/mnt/c/Users/xxx/RUST/candle/candle-examples)
    Finished release [optimized] target(s) in 26.00s
     Running `target/release/examples/trocr --which base-printed --image candle-examples/examples/trocr/assets/nog.png`
Running on CPU, to run on GPU, build this example with `--features cuda`
model: "/home/xxx/.cache/huggingface/hub/models--microsoft--trocr-base-printed/snapshots/24216f24cd78fe1a9c8b4e6e4565aec5c9220e63/model.safetensors"
encoder_xs: Tensor[dims 1, 577, 768; f32]
754754.7

I'm not sure whether an extra token is being passed to the model, or whether the token returned by the model is being read twice. Maybe it has something to do with differences between the Rust image library and whatever the huggingface.co API uses for image parsing. But what makes the digit 7 special anyway? Any hints on where to look next?

Thanks!

Jun 30 '24 20:06 artavash
