llama-recipes
Wrong truncation of training examples in alpaca dataset
System Info
PyTorch version: 2.0.1+cu117
Is debug build: False
CUDA used to build PyTorch: 11.7
ROCM used to build PyTorch: N/A

OS: Red Hat Enterprise Linux release 8.8 (Ootpa) (x86_64)
GCC version: (GCC) 10.1.0
Clang version: Could not collect
CMake version: version 3.27.4
Libc version: glibc-2.28

Python version: 3.10.13 (main, Sep 11 2023, 13:44:35) [GCC 11.2.0] (64-bit runtime)
Python platform: Linux-4.18.0-477.15.1.el8_8.x86_64-x86_64-with-glibc2.28
Is CUDA available: True
CUDA runtime version: 11.6.55
CUDA_MODULE_LOADING set to: LAZY
GPU models and configuration: GPU 0: NVIDIA A100-SXM4-80GB
Nvidia driver version: 535.54.03
cuDNN version: Could not collect
HIP runtime version: N/A
MIOpen runtime version: N/A
Is XNNPACK available: True
Information
- [X] The official example scripts
- [ ] My own modified scripts
🐛 Describe the bug
In `datasets/alpaca_dataset`, in `__getitem__`:
The code builds the training example from all three parts (instruction, input, response) concatenated together. If the result is longer than `max_words`, the code simply drops the trailing tokens. As a result, the response (the training target) can be partially or entirely truncated away, while the full prompt is kept.
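A minimal sketch of the truncation pattern described above (not the exact repo code; `build_example`, the toy token lists, and `max_words=8` are illustrative assumptions). Because the prompt and response are concatenated first and then the tail is cut, the response tokens are the first to be lost:

```python
def build_example(prompt_tokens, response_tokens, max_words):
    # Concatenate (instruction + input) tokens with response tokens,
    # then naively cut the tail to max_words -- the reported bug.
    example = prompt_tokens + response_tokens
    if len(example) > max_words:
        example = example[:max_words]  # tail truncation drops the response first
    return example

prompt = list(range(10))     # 10 "instruction + input" tokens (toy IDs)
response = [100, 101, 102]   # 3 "response" tokens (toy IDs)
out = build_example(prompt, response, max_words=8)
# The response tokens are gone entirely; the example contains prompt only.
assert 100 not in out
```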
Error logs
There is no error message, just incorrect behavior: examples longer than `max_words` are trained with a truncated or missing response.
Expected behavior
The fix should be to truncate the (instruction + input) part and keep the full response, so that the combined example still fits within `max_words`.
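The proposed fix could be sketched as follows (hypothetical helper, not the repo's actual code): reserve room for the full response and trim only the prompt.

```python
def build_example_fixed(prompt_tokens, response_tokens, max_words):
    # Keep the full response; trim the prompt so the total fits max_words.
    budget = max_words - len(response_tokens)
    if budget < 0:
        # The response alone exceeds max_words; no prompt budget remains.
        raise ValueError("max_words is too small to hold the full response")
    return prompt_tokens[:budget] + response_tokens

prompt = list(range(10))     # 10 "instruction + input" tokens (toy IDs)
response = [100, 101, 102]   # 3 "response" tokens (toy IDs)
out = build_example_fixed(prompt, response, max_words=8)
assert out[-3:] == response  # the response survives truncation intact
assert len(out) == 8
```

A production version would also need to shift the loss-mask boundary, since the prompt length changes after truncation.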