
Wrong truncation of training examples in alpaca dataset

Open YosiMass opened this issue 1 year ago • 2 comments

System Info

PyTorch version: 2.0.1+cu117 Is debug build: False CUDA used to build PyTorch: 11.7 ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64) GCC version: (GCC) 10.1.0 Clang version: Could not collect CMake version: version 3.27.4 Libc version: glibc-2.28

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime) Python platform: Linux-4.18.0-477.15.1.el8_8.x86_64-x86_64-with-glibc2.28 Is CUDA available: True CUDA runtime version: 11.6.55 CUDA_MODULE_LOADING set to: LAZY GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB Nvidia driver version: 535.54.03 cuDNN version: Could not collect HIP runtime version: N/A MIOpen runtime version: N/A Is XNNPACK available: True

Information

  • [X] The official example scripts
  • [ ] My own modified scripts

🐛 Describe the bug

In `datasets/alpaca_dataset.py`, in `__getitem__`:

The code builds the training example by concatenating all three parts (instruction, input, response). If the result is longer than `max_words`, the code simply drops the trailing tokens. As a result, the response may be partially or entirely removed.
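A simplified paraphrase of the behavior described above (the actual code in `alpaca_dataset.py` works on torch tensors produced by the Llama tokenizer; plain integer lists are used here for illustration): because the example is cut from the right, the response tokens are the first to be lost.

```python
# Hedged sketch of the reported behavior, not the exact repo code:
# the full example (prompt + response) is truncated from the right,
# so response tokens are dropped first.

def current_truncation(prompt_ids, response_ids, max_words):
    example = prompt_ids + response_ids
    return example[:max_words]  # trailing (response) tokens removed

# With a 6-token prompt, a 4-token response, and max_words=8,
# half of the response is silently cut off.
truncated = current_truncation([1, 2, 3, 4, 5, 6], [7, 8, 9, 10], 8)
```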

Error logs

No error message, just incorrect behavior.

Expected behavior

The fix should be to truncate the (instruction + input) part and keep the full response, so that the overall example fits into `max_words`.
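The proposed fix could be sketched as follows (function name and plain-list token ids are illustrative; the repo would apply this to tokenizer output tensors): reserve room for the full response first, then trim the prompt to whatever budget remains.

```python
# Hedged sketch of the proposed fix: keep the response intact and
# truncate only the prompt (instruction + input) so the combined
# example fits within max_words.

def build_example(prompt_ids, response_ids, max_words):
    """Trim the prompt from the right so that
    len(prompt) + len(response) <= max_words."""
    budget = max_words - len(response_ids)
    if budget < 0:
        # Edge case: the response alone exceeds max_words; truncate it
        # as a last resort so the example still fits.
        return response_ids[:max_words]
    return prompt_ids[:budget] + response_ids

# Usage: a 6-token prompt and 4-token response with max_words=8
# drops the last two prompt tokens but keeps the whole response.
example = build_example([1, 2, 3, 4, 5, 6], [7, 8, 9, 10], 8)
```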

YosiMass avatar Sep 19 '23 15:09 YosiMass