LMFlow

Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases

Open Yuncong-Cao opened this issue 9 months ago • 2 comments

call stack diagram for dataset

Yuncong-Cao avatar Apr 06 '25 21:04 Yuncong-Cao

call stack diagram for dataset and tokenization

Yuncong-Cao avatar Apr 07 '25 02:04 Yuncong-Cao

Code changes based on tokenization tests

  1. I updated ConversationTemplate.encode_conversation to handle two edge cases. When the message count is odd, it now drops the unpaired final message and returns only the complete paired turns. When the first message isn’t from the user, it skips encoding and returns an empty list.
  2. I also changed both hf_decoder_model.py and hf_text_regression_model.py: if a ConversationTemplate has no system_formatter, I set system=None before calling encode_conversation. This avoids the ValueError that was raised when a system prompt could not be formatted.
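
As a rough illustration of the two changes above (not the actual LMFlow code), the behavior could be sketched as follows. The helper names `pair_turns` and `resolve_system`, the `role`/`content` message shape, and the `has_system_formatter` flag are assumptions made for this example, and the sketch pairs raw text where the real method returns encoded token pairs.

```python
from typing import Dict, List, Optional, Tuple

# Assumed message shape for this sketch: {"role": ..., "content": ...}
Message = Dict[str, str]


def pair_turns(messages: List[Message]) -> List[Tuple[str, str]]:
    """Hypothetical helper: return (user, assistant) content pairs."""
    # A conversation that is empty or does not start with the user is skipped.
    if not messages or messages[0]["role"] != "user":
        return []
    # With an odd count, ignore the unpaired final message.
    usable = len(messages) - (len(messages) % 2)
    return [
        (messages[i]["content"], messages[i + 1]["content"])
        for i in range(0, usable, 2)
    ]


def resolve_system(system: Optional[str], has_system_formatter: bool) -> Optional[str]:
    """Hypothetical helper: drop the system prompt when it cannot be formatted."""
    return system if has_system_formatter else None
```

With three messages (user, assistant, user), `pair_turns` keeps only the first pair, and a conversation that opens with an assistant message yields an empty list.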

Yuncong-Cao avatar Apr 21 '25 00:04 Yuncong-Cao