some token duplicated in candle-examples trocr
Hello, there appears to be a bug in the trocr example, possibly in the way images are tokenized.
For example, this image produces 754754.7 instead of 754.7.
I added some debug print statements and noticed that token 39382, which decodes to 754, appears twice in a row (at start_pos 0 and start_pos 1 in the output below).
cargo run --example trocr --release -- --which base-printed --image candle-examples/examples/trocr/assets/bug.png
Finished release [optimized] target(s) in 3.11s
Running `target/release/examples/trocr --which base-printed --image candle-examples/examples/trocr/assets/bug.png`
Running on CPU, to run on GPU, build this example with `--features cuda`
model: "/home/xxx/.cache/huggingface/hub/models--microsoft--trocr-base-printed/snapshots/24216f24cd78fe1a9c8b4e6e4565aec5c9220e63/model.safetensors"
context_size - 1 -
start_pos - 0 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 39382 -
t: 754
------------
context_size - 1 -
start_pos - 1 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 39382 -
t: 754
------------
context_size - 1 -
start_pos - 2 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 4 -
------------
context_size - 1 -
start_pos - 3 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 406 -
t: .7
------------
context_size - 1 -
start_pos - 4 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 821 -
t: g
------------
context_size - 1 -
start_pos - 5 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 2 -
------------
On the other hand, an almost identical image that begins and ends with a 2 instead of a 7 is decoded correctly.
cargo run --example trocr --release -- --which base-printed --image candle-examples/examples/trocr/assets/nobug.png
Finished release [optimized] target(s) in 3.32s
Running `target/release/examples/trocr --which base-printed --image candle-examples/examples/trocr/assets/nobug.png`
Running on CPU, to run on GPU, build this example with `--features cuda`
model: "/home/xxx/.cache/huggingface/hub/models--microsoft--trocr-base-printed/snapshots/24216f24cd78fe1a9c8b4e6e4565aec5c9220e63/model.safetensors"
context_size - 1 -
start_pos - 0 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 30959 -
t: 254
------------
context_size - 1 -
start_pos - 1 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 4 -
------------
context_size - 1 -
start_pos - 2 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 176 -
t: .2
------------
context_size - 1 -
start_pos - 3 -
input_ids: - Tensor[dims 1, 1; u32] -
token - 2 -
------------
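To make the difference between the two runs explicit, here is a minimal sketch of a helper that flags the first token id that repeats the one right before it. The name `first_adjacent_repeat` is hypothetical (it is not part of candle), and the token lists are copied by hand from the debug output above.

```rust
/// Returns the position and value of the first token that is identical
/// to the token immediately before it, or `None` if there is no such repeat.
fn first_adjacent_repeat(tokens: &[u32]) -> Option<(usize, u32)> {
    tokens
        .windows(2)
        .position(|pair| pair[0] == pair[1])
        .map(|i| (i + 1, tokens[i + 1]))
}

fn main() {
    // Token ids copied from the debug output of the two runs above.
    let bug = [39382u32, 39382, 4, 406, 821, 2];
    let nobug = [30959u32, 4, 176, 2];

    println!("bug.png:   {:?}", first_adjacent_repeat(&bug));
    println!("nobug.png: {:?}", first_adjacent_repeat(&nobug));
}
```

A legitimate decode can of course contain repeated tokens (for example "77"), so a hit is only a hint about where to look, not proof of a bug.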
The model itself appears to be fine; I checked it using the huggingface.co API.
In case uncertainty about whether the last character is a "g" or an "8" was throwing it off, I also tried an image without the "g". Here is the output (without the debug prints); the 754 segment again appears twice:
cargo run --example trocr --release -- --which base-printed --image candle-examples/examples/trocr/assets/nog.png
Compiling candle-examples v0.6.0 (/mnt/c/Users/xxx/RUST/candle/candle-examples)
Finished release [optimized] target(s) in 26.00s
Running `target/release/examples/trocr --which base-printed --image candle-examples/examples/trocr/assets/nog.png`
Running on CPU, to run on GPU, build this example with `--features cuda`
model: "/home/xxx/.cache/huggingface/hub/models--microsoft--trocr-base-printed/snapshots/24216f24cd78fe1a9c8b4e6e4565aec5c9220e63/model.safetensors"
encoder_xs: Tensor[dims 1, 577, 768; f32]
754754.7
I'm not sure whether an extra token is being passed to the model or whether the token returned by the model is being read twice. Maybe it has something to do with differences between the Rust image library and whatever the huggingface.co API uses for image parsing. But what makes the digit 7 special anyway? Any hints on where to look next?
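To help separate those two hypotheses, here is a rough sketch of the windowing step of an incremental decoding loop. It assumes the example follows the usual pattern of feeding the whole prefix on the first step and only the newest token afterwards, with earlier tokens held in a key/value cache. The function `decoder_window` and the `START` value are hypothetical, and the model call is replaced by replaying the tokens from the bug.png log, so this only checks the bookkeeping, not the model.

```rust
/// Returns (context_size, start_pos, window) for one decoding step.
/// After the first step only the newest token is fed to the decoder;
/// earlier tokens are assumed to live in its key/value cache.
fn decoder_window(token_ids: &[u32], index: usize) -> (usize, usize, &[u32]) {
    let context_size = if index >= 1 { 1 } else { token_ids.len() };
    let start_pos = token_ids.len().saturating_sub(context_size);
    (context_size, start_pos, &token_ids[start_pos..])
}

fn main() {
    // Hypothetical start token id; the real value comes from the model config.
    const START: u32 = 2;
    // Output tokens copied from the bug.png debug log above, replayed
    // here in place of an actual model call.
    let produced = [39382u32, 39382, 4, 406, 821, 2];

    let mut token_ids = vec![START];
    for (index, &next) in produced.iter().enumerate() {
        let (context_size, start_pos, window) = decoder_window(&token_ids, index);
        println!("context_size={context_size} start_pos={start_pos} input={window:?} -> {next}");
        token_ids.push(next);
    }
}
```

Under these assumptions each step feeds exactly one token and start_pos advances by one per step, which matches the context_size and start_pos values in the log. If the real loop behaves the same way, the repeated 754 would be produced by the model given the first 754 as input, rather than by a token being fed or read twice.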
Thanks!