Add comprehensive tokenization tests, update diagram, and adjust code to handle edge cases

Open Yuncong-Cao opened this issue 9 months ago • 2 comments

call stack diagram for dataset

Apr 06 '25 21:04 Yuncong-Cao

call stack diagram for dataset

and tokenization

Apr 07 '25 02:04 Yuncong-Cao

Code changes based on tokenization tests

I updated ConversationTemplate.encode_conversation to drop any unpaired final message when there’s an odd count and return only the paired turns; if the first message isn’t from the user, I skip encoding and return an empty list.
I also tweaked both hf_decoder_model.py and hf_text_regression_model.py so that if a ConversationTemplate lacks a system_formatter, I set system=None before calling encode_conversation, avoiding ValueError on unformatted system prompts.

Apr 21 '25 00:04 Yuncong-Cao