
Cannot download test data: 'make test' and direct links fail with "Repository not found" / 404

Open · 8ria opened this issue 5 months ago · 1 comment

Describe the bug

I am unable to download the test data required by added_tokens.rs and the other integration tests. Running cargo test --test added_tokens fails with a "Files not found" error:

Files not found, run make test to download these files: Os { code: 2, kind: NotFound, message: "The system cannot find the file specified." }

Steps to Reproduce

  1. Clone the tokenizers repository (or pull to main if already cloned).
  2. Attempt to run the integration tests: cargo test --test added_tokens (fails).
  3. Attempt to use the previously documented method to download test data: make test (fails with 'command not found' when make is not installed; when it is installed, it still fails because the underlying download script appears to be missing or inaccessible).
  4. Attempt to manually run the download script: bash scripts/download-test-data.sh (fails with "No such file or directory" because the scripts/ folder is no longer present on main).
  5. Attempt to download the test-data.zip directly via browser from the previously provided Hugging Face dataset URL: https://huggingface.co/datasets/huggingface/tokenizers-test-data/resolve/main/test-data.zip (fails with "Repository not found" or 404).
  6. Attempt to access the scripts/ folder on GitHub's main branch: https://github.com/huggingface/tokenizers/tree/main/scripts (results in a 404 error, indicating the folder is gone).
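
Before rerunning step 2, it can help to confirm which fixture files are actually absent, since the error from cargo test does not name them. The following is a minimal sketch using only the Rust standard library. The helper names missing_fixtures and report are hypothetical, and the directory and file names passed to them are assumptions to be replaced with whatever paths the integration tests expect; none of this is taken from the tokenizers test code.

```rust
use std::path::{Path, PathBuf};

/// Return every expected fixture path under `dir` that does not exist as a
/// regular file. `names` is supplied by the caller; this sketch does not know
/// which files the real integration tests require.
fn missing_fixtures(dir: &Path, names: &[&str]) -> Vec<PathBuf> {
    names
        .iter()
        .map(|name| dir.join(name))
        .filter(|path| !path.is_file())
        .collect()
}

/// Build a one-line summary of the missing fixtures for printing.
fn report(missing: &[PathBuf]) -> String {
    if missing.is_empty() {
        "all fixtures present".to_string()
    } else {
        let listed: Vec<String> = missing
            .iter()
            .map(|path| path.display().to_string())
            .collect();
        format!("{} fixture(s) missing: {}", missing.len(), listed.join(", "))
    }
}
```

Calling report(&missing_fixtures(Path::new("data"), &["some-vocab.json"])) from a small binary or a scratch test would list the absent files, where "data" and "some-vocab.json" are placeholder names only. The check uses is_file rather than exists so that a directory with a matching name is not mistaken for a downloaded fixture.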

Expected behavior

added_tokens.rs and other integration tests should pass after successfully downloading the required test data. The test data should be accessible via make test or a clear, public download link.

Screenshots/Error Messages

(You can copy-paste the specific error messages you've shown me, like the Files not found... and the Repository not found messages for the URL.)

Environment:

  • OS: Windows 10/11 (or your specific version)
  • Shell: Git Bash / PowerShell
  • Rust version: (e.g., rustc --version)

Additional context

This issue significantly impacts local development and testing, making it difficult for new contributors to verify changes against the full test suite. My primary change (an optimized whitespace pre-tokenizer) is ready, and 192 core unit tests pass, but these integration tests are blocked by this data accessibility issue.

8ria · Jul 07 '25 03:07

Thanks for reporting! Will try to fix.

ArthurZucker · Jul 29 '25 13:07