h2o-llmstudio icon indicating copy to clipboard operation
h2o-llmstudio copied to clipboard

[CODE IMPROVEMENT] if user specifies a “system” column but it doesnt exist, it should error out instead of continue running silently

Open Quetzalcohuatl opened this issue 1 year ago • 5 comments

🔧 Proposed code refactoring

if system column not in train dataframe.coljmns or in valid columns, then error out

Motivation

Otherwise user might erroneously believe they are using a system column

Quetzalcohuatl avatar Jan 31 '24 15:01 Quetzalcohuatl

Conversation_chain_handler.py L140

change from a simple log to a raise error? There is so much stuff being printed in the log that the average person would miss the warning

Quetzalcohuatl avatar Jan 31 '24 15:01 Quetzalcohuatl

How exactly is it possible to specify a column that does not exist?

psinger avatar Feb 05 '24 11:02 psinger

How exactly is it possible to specify a column that does not exist?

I guess the issue is referring to the case if the training Dataframe contains a system column, but validation does not.

Conversation_chain_handler.py L140 change from a simple log to a raise error?

To keep the pipeline flexible, one should not raise an issue here. One may use a common evaluation datasets across different experiments (mt-bench, company specific evaluation dataset, ...) that does not contain any system column.

As a low-priority issue, one could think about adding Dataframe checks before running an experiment (alongside cfg checks). For now, logging a warning is sufficient IMO.

maxjeblick avatar Feb 05 '24 11:02 maxjeblick

No, it doesn’t have to do with train vs valid. Just use any csv file, and in your config.yaml for training, type system=“column_that_doesnt_exist”. The code will still run, it will log a small error saying that the System column was not found. I’m suggesting that instead of logging that, you should just raise an AssertionError

Quetzalcohuatl avatar Feb 05 '24 11:02 Quetzalcohuatl

Thanks for the clarification! As mentioned, the reason to not raise an AssertionError but rather a warning for system prompt missing is intentional.

I'd go into the direction of adding DataFrame checks to check_config_for_errors and making them runnable via the command line.

maxjeblick avatar Feb 05 '24 11:02 maxjeblick