h2o-llmstudio
[CODE IMPROVEMENT] If the user specifies a "system" column that doesn't exist, it should error out instead of continuing to run silently
🔧 Proposed code refactoring
If the system column is not in the train dataframe.columns or in the valid columns, then error out.
Motivation
Otherwise the user might erroneously believe they are using a system column.
Conversation_chain_handler.py L140
Change this from a simple log message to a raised error? So much is printed in the log that the average person would miss the warning.
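A minimal sketch of the proposed behavior, assuming the configured column name and the dataframe's column names are available as plain values. The helper `require_system_column` and its parameters are hypothetical and not part of H2O LLM Studio:

```python
from typing import Iterable, Optional


def require_system_column(system_column: Optional[str], columns: Iterable[str]) -> None:
    """Raise if a system column is configured but absent from the data.

    Hypothetical helper: the name and signature are illustrative only.
    """
    # No system column configured, so there is nothing to validate.
    if not system_column:
        return

    available = list(columns)
    if system_column not in available:
        # Fail loudly instead of logging, so the problem cannot be missed.
        raise ValueError(
            f"System column {system_column!r} not found. "
            f"Available columns: {available}"
        )
```

With a pandas DataFrame, the second argument would be something like `df.columns`.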
How exactly is it possible to specify a column that does not exist?
I guess the issue refers to the case where the training DataFrame contains a system column but the validation DataFrame does not.
Conversation_chain_handler.py L140 change from a simple log to a raise error?
To keep the pipeline flexible, one should not raise an error here. One may use a common evaluation dataset across different experiments (mt-bench, a company-specific evaluation dataset, ...) that does not contain any system column.
As a low-priority issue, one could think about adding DataFrame checks before running an experiment (alongside the cfg checks). For now, logging a warning is sufficient IMO.
No, it doesn't have to do with train vs. valid. Just use any CSV file, and in your config.yaml for training, type system="column_that_doesnt_exist". The code will still run; it will only log a small error saying that the system column was not found. I'm suggesting that instead of logging that, you should just raise an AssertionError.
Thanks for the clarification!
As mentioned, the reason to not raise an AssertionError but rather a warning for a missing system prompt is intentional.
I'd go in the direction of adding DataFrame checks to check_config_for_errors and making them runnable via the command line.
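A rough sketch of what such a pre-run check could look like, using only the standard library. Everything here is an assumption for illustration: `read_header`, `check_system_column`, `main`, and the command-line flags are hypothetical names, and treating a column missing from the training data as an error while a column missing from the validation data is only a warning is just one possible policy:

```python
import argparse
import csv
from typing import List, Optional, Sequence, Tuple


def read_header(path: str) -> List[str]:
    """Return the column names from the first row of a CSV file."""
    with open(path, newline="", encoding="utf-8") as f:
        return next(csv.reader(f), [])


def check_system_column(
    system_column: Optional[str],
    train_columns: Sequence[str],
    valid_columns: Optional[Sequence[str]] = None,
) -> Tuple[List[str], List[str]]:
    """Return (errors, warnings) for the configured system column."""
    errors: List[str] = []
    warnings: List[str] = []
    if not system_column:
        return errors, warnings
    if system_column not in train_columns:
        # Assumed policy: a column missing from the training data is an error.
        errors.append(f"System column {system_column!r} not found in training data.")
    if valid_columns is not None and system_column not in valid_columns:
        # Assumed policy: shared evaluation sets may lack it, so only warn.
        warnings.append(f"System column {system_column!r} not found in validation data.")
    return errors, warnings


def main(argv: Optional[Sequence[str]] = None) -> int:
    """Hypothetical command-line entry point; returns a process exit code."""
    parser = argparse.ArgumentParser(description="Check dataset columns before a run.")
    parser.add_argument("--train", required=True, help="path to the training CSV")
    parser.add_argument("--valid", help="optional path to the validation CSV")
    parser.add_argument("--system-column", help="configured system column name")
    args = parser.parse_args(argv)

    valid_columns = read_header(args.valid) if args.valid else None
    errors, warnings = check_system_column(
        args.system_column, read_header(args.train), valid_columns
    )
    for message in warnings:
        print(f"WARNING: {message}")
    for message in errors:
        print(f"ERROR: {message}")
    # A non-zero exit code lets scripts stop before the experiment starts.
    return 1 if errors else 0
```

Returning the messages instead of raising right away keeps the check usable both from the command line and from an existing list of config checks.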