
bc3ec39d breaks the compilation (as noted in #1355)

Open baptisterajaut opened this issue 9 months ago • 12 comments

As stated, this commit breaks building tokenizers on modern toolchains, even stable ones.

error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
         --> tokenizers-lib/src/models/bpe/trainer.rs:526:47
          |
      522 |                     let w = &words[*i] as *const _ as *mut _;
          |                             -------------------------------- casting happend here
      ...
      526 |                         let word: &mut Word = &mut (*w);
          |                                               ^^^^^^^^^
          |

% rustc -V
rustc 1.73.0 (cc66ad468 2023-10-03)
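
For readers unfamiliar with the lint, here is a minimal sketch of the pattern the compiler rejects and one sound alternative. `Word`, its `len` field, and `bump_all` are hypothetical stand-ins, not the tokenizers code, and this is not necessarily how the maintainers fixed it. The idea is only that a `*mut` pointer must be derived from something that already grants write access, such as `as_mut_ptr()` on a mutable slice, rather than from a shared reference.

```rust
// Hypothetical stand-in for the real `Word` type; the field is illustrative only.
#[derive(Debug, PartialEq)]
struct Word {
    len: usize,
}

// Rejected pattern (kept as a comment because it no longer compiles):
//     let w = &words[i] as *const _ as *mut _;   // shared reference cast to *mut
//     let word: &mut Word = &mut (*w);           // `&T` to `&mut T`

/// Increments `len` for each word at the given positions.
/// The raw pointer comes from `as_mut_ptr()` on a `&mut` slice, so it
/// legitimately carries write permission.
fn bump_all(words: &mut [Word], positions: &[usize]) {
    let n = words.len();
    let base: *mut Word = words.as_mut_ptr();
    for &i in positions {
        assert!(i < n, "index out of bounds");
        // SAFETY: `i` is in bounds, `base` was derived from a unique borrow,
        // and only one `&mut Word` is alive at a time in this sequential loop.
        let word: &mut Word = unsafe { &mut *base.add(i) };
        word.len += 1;
    }
}

fn main() {
    let mut words = vec![Word { len: 1 }, Word { len: 2 }, Word { len: 3 }];
    bump_all(&mut words, &[0, 2, 2]);
    println!("{:?}", words);
}
```

In a purely sequential loop like this one, `&mut words[i]` would do the same job without `unsafe`. The raw pointer is only worth considering when a plain `&mut` borrow is not available at the use site.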

baptisterajaut avatar Oct 08 '23 17:10 baptisterajaut

Tokenizers cannot be installed for me either. It is being installed as part of the Allen-NLP package, and the new version of the Rust compiler breaks it.

Installing Rust via the shell script from the Rust site installs 1.73.0, I presume, which breaks the Tokenizers compilation. Installing it via Homebrew installs 1.72.1, which works.

adwaraki avatar Oct 28 '23 01:10 adwaraki

Which version are you using?

This was already fixed on main and in 0.14.1.

https://github.com/huggingface/tokenizers/blob/main/tokenizers/src/models/bpe/trainer.rs#L541-L546

Narsil avatar Oct 30 '23 10:10 Narsil

To work around this error, I installed transformers with conda, using the command 'conda install -c huggingface transformers'. Then it works.

Songcheng-Xie avatar Nov 06 '23 05:11 Songcheng-Xie

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Dec 07 '23 01:12 github-actions[bot]

I have the same problem with Python 3.11. Do you need more information about this issue?

DavidAdamczyk avatar Dec 09 '23 17:12 DavidAdamczyk

@DavidAdamczyk Use a more recent tokenizers version, or an older Rust compiler version.

Narsil avatar Dec 09 '23 22:12 Narsil

I use the latest version of tokenizers and the most recent stable version of the Rust compiler. Additionally, I follow the installation instructions available here. Could someone update the installation instructions and include information about the supported versions of all dependencies?

DavidAdamczyk avatar Dec 10 '23 10:12 DavidAdamczyk

Hi, I hit the same error while trying to install transformers v4.6.1 on a PYNQ-Z2 board (v2.5 {arm7l}) with Rust v1.74.1.

Edit: The strategy that solved this error for me was to use an older Rust version. What I did:

  1. Install Rust v1.72.1 and make it the default: rustup default 1.72.1
  2. Remove the stable toolchain, or set an environment variable so that compilation does not use it: rustup toolchain remove stable, or export RUSTUP_TOOLCHAIN=1.72.1

After this, it should work properly.

Mr-AniP avatar Dec 23 '23 12:12 Mr-AniP

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Jan 25 '24 01:01 github-actions[bot]

pip3 install transformers==4.15.0 timm==0.4.12 fairscale==0.4.4

  error: casting `&T` to `&mut T` is undefined behavior, even if the reference is unused, consider instead using an `UnsafeCell`
     --> tokenizers-lib\src\models\bpe\trainer.rs:517:47
      |
  513 |                     let w = &words[*i] as *const _ as *mut _;
      |                             -------------------------------- casting happend here
  ...
  517 |                         let word: &mut Word = &mut (*w);
      |                                               ^^^^^^^^^
      |
      = note: for more information, visit <https://doc.rust-lang.org/book/ch15-05-interior-mutability.html>
      = note: `#[deny(invalid_reference_casting)]` on by default

Running into this tonight too.

Requirement already satisfied: requests in c:\users\dhorner\anaconda3\envs\hotz\lib\site-packages (from transformers==4.15.0->-r requirements.txt (line 2)) (2.31.0)
Collecting sacremoses (from transformers==4.15.0->-r requirements.txt (line 2))
  Using cached sacremoses-0.1.1-py3-none-any.whl.metadata (8.3 kB)
Collecting tokenizers<0.11,>=0.10.1 (from transformers==4.15.0->-r requirements.txt (line 2))
  Using cached tokenizers-0.10.3.tar.gz (212 kB)

The solution for me was to set RUSTFLAGS=-A invalid_reference_casting, which worked with Rust 1.75.0.

davehorner avatar Jan 25 '24 03:01 davehorner

I also ran into this issue last week while installing transformers==4.22.1, which is pinned by a different project. tokenizers resolved to v0.12.1. The platform was macOS Sonoma on an M2 chip.

I also worked around by running:

export RUSTFLAGS="-A invalid_reference_casting"

...before installing, but it'd be great if the problem could be tackled at source!

athewsey avatar Feb 13 '24 10:02 athewsey

I would love to be the one to help resolve this with something better than an environment flag.

tokenizers-lib/src/models/bpe/trainer.rs:526

I do not see tokenizers-lib in tree. rg "let w = &words[*i] as *const _ as *mut _;" finds nothing

The error guidance is not clear. GPT says: This error message indicates that you're attempting to cast a shared reference (&T) into a mutable reference (&mut T), which is considered undefined behavior in Rust, even if the mutable reference is not actually used. Rust's safety guarantees rely on preventing such unsound operations.

To resolve this issue, you should use appropriate safe patterns for mutable access, such as Cell, RefCell, or UnsafeCell for interior mutability, depending on your specific use case.

In your case, since you're dealing with mutable access to data through raw pointers, you should consider using UnsafeCell. Here's how you can adjust your code:

use std::cell::UnsafeCell;

// Assuming Word is some struct or type you're working with
struct Word {
    len: usize, // placeholder field
}

fn main() {
    // Assuming words is some collection of Word; each element is stored in an
    // UnsafeCell so that mutation through a shared reference is permitted
    let words: Vec<UnsafeCell<Word>> = vec![UnsafeCell::new(Word { len: 0 })];

    // Assuming i is some index into the words vector
    let i = 0;

    // Accessing the word at index i in a mutable way
    let w: *mut Word = words[i].get();
    // SAFETY: no other reference to words[i] is alive while word_mut is used
    let word_mut: &mut Word = unsafe { &mut *w };
    word_mut.len += 1;
}

However, using UnsafeCell requires careful handling as it bypasses Rust's safety checks. Make sure you understand the implications of using UnsafeCell and ensure that your code is correct and safe.

Alternatively, consider restructuring your code to avoid mutable raw pointer access if possible, as raw pointer manipulation can be error-prone and harder to reason about compared to safe Rust constructs.
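
As an illustration of the safe route mentioned above, here is a minimal sketch using `RefCell`, which moves the borrow check to run time instead of bypassing it. `Word`, its `len` field, and `bump` are hypothetical names for illustration only; nothing here is taken from the tokenizers source.

```rust
use std::cell::RefCell;

// Hypothetical stand-in for the real `Word` type.
#[derive(Debug, PartialEq)]
struct Word {
    len: usize,
}

/// Mutates one word through a shared slice.
/// `RefCell` enforces exclusive access at run time.
fn bump(words: &[RefCell<Word>], i: usize) {
    // `borrow_mut` panics if the cell is already borrowed elsewhere.
    let mut word = words[i].borrow_mut();
    word.len += 1;
}

fn main() {
    let words = vec![RefCell::new(Word { len: 1 }), RefCell::new(Word { len: 2 })];
    bump(&words, 1);
    println!("{:?}", words);
}
```

Note that `RefCell` is not `Sync`, so this only applies to single-threaded code. Code that shares the collection across threads would need something like a `Mutex` per element or a different partitioning of the work.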

so Rustonomicon.

Could someone point me to where this code is? I don't know where it lives.

davehorner avatar Feb 17 '24 14:02 davehorner

I'll close this, as I believe the latest releases don't have this issue anymore.

ArthurZucker avatar Feb 19 '24 02:02 ArthurZucker