rsmtool
Bug in rsmxval with id column composed of integers
I was trying to run an rsmxval experiment with a handful of features and ran into the following error, which I did not get when I ran a single rsmtool experiment on the same data with one train/test split:
$ rsmxval RandomForest.json RandomForest
Output directory: RandomForest
Saving configuration file.
Generating 5 folds after shuffling
Running RSMTool on each fold in parallel
[cut for brevity...]
Creating fold summary
Traceback (most recent call last):
  File "/Users/mmulholland/Documents/edusoft/model_generic/aggregation/rsmtool_experiment/../../../env/bin/rsmxval", line 10, in <module>
    sys.exit(main())
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/rsmtool/rsmxval.py", line 286, in main
    run_cross_validation(abspath(args.config_file),
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/rsmtool/rsmxval.py", line 164, in run_cross_validation
    df_predictions = df_predictions.merge(df_to_add,
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/frame.py", line 9354, in merge
    return merge(
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
    self._maybe_coerce_merge_keys()
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1261, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat
When I inspected my data more closely, I saw that the ID column was composed entirely of int-able values. I then tried to reproduce the failure on a small set of data (100 rows), and again on the same data with the string "id" prepended to each int-able value (e.g. 0 -> id0). The failure occurred with the former but not with the latter.
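For anyone who wants to see the underlying pandas behavior without running rsmxval, here is a minimal sketch that triggers the same kind of error. The two data frames and their column values are hypothetical stand-ins, and it assumes (I have not confirmed this in the rsmtool source) that the predictions frame ends up with string IDs in "spkitemid" while the training frame keeps the integer IDs that pandas inferred from the CSV.

```python
import pandas as pd

# Hypothetical stand-in for the predictions frame: IDs stored as strings.
df_predictions = pd.DataFrame({
    "spkitemid": pd.Series(["0", "1", "2"], dtype=object),
    "raw": [1.0, 0.5, 2.0],
})

# Hypothetical stand-in for the training frame: pandas inferred int64 IDs.
df_train = pd.DataFrame({"id": [0, 1, 2], "prompt": ["a", "b", "a"]})

try:
    df_predictions.merge(df_train, left_on="spkitemid", right_on="id")
    error_message = None
except ValueError as exc:
    # pandas refuses to merge a string-like key with an integer key.
    error_message = str(exc)

print(error_message)
```

If the assumption holds, this would also explain why prepending "id" to each value avoids the problem: both key columns are then string-like.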
This is the offending block of code in rsmxval.py:
if len(columns_to_use) > 1:
    df_to_add = df_train[columns_to_use]
    df_predictions = df_predictions.merge(df_to_add,
                                          left_on="spkitemid",
                                          right_on=id_column)
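One possible direction for a fix, sketched below, is to cast both key columns to strings before merging. This is only an illustration under the assumption that the IDs should be compared as strings; `merge_on_string_ids` is a hypothetical helper, not an existing rsmtool function, and the sample frames are made up.

```python
import pandas as pd


def merge_on_string_ids(df_predictions, df_to_add, id_column):
    """Hypothetical helper (not part of rsmtool): merge after casting both keys to str."""
    left = df_predictions.copy()
    right = df_to_add.copy()
    # Cast both key columns so that pandas compares like with like.
    left["spkitemid"] = left["spkitemid"].astype(str)
    right[id_column] = right[id_column].astype(str)
    return left.merge(right, left_on="spkitemid", right_on=id_column)


# Made-up data mirroring the mismatch: string IDs on the left, int64 on the right.
df_predictions = pd.DataFrame({
    "spkitemid": pd.Series(["0", "1", "2"], dtype=object),
    "raw": [1.0, 0.5, 2.0],
})
df_train = pd.DataFrame({"id": [0, 1, 2], "prompt": ["a", "b", "a"]})

merged = merge_on_string_ids(df_predictions, df_train[["id", "prompt"]], "id")
print(merged)
```

Note that a plain string cast would not line up IDs that were read as floats (1.0 becomes "1.0", not "1"), so a real fix may need to normalize the IDs when the files are read instead.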
I have attached the feature file and the config used.
RandomForest.json:
{
    "experiment_id": "RandomForestRegressor",
    "train_file": "./all_models_features.csv",
    "model": "RandomForestRegressor",
    "select_transformations": true,
    "description": "Use model predictions, etc., as features in a RandomForestRegressor model.",
    "test_label_column": "score",
    "train_label_column": "score",
    "use_scaled_predictions": true,
    "skll_objective": "quadratic_weighted_kappa",
    "subgroups": ["prompt"],
    "use_thumbnails": true,
    "trim_min": 0,
    "trim_max": 2,
    "id_column": "id",
    "exclude_zero_scores": false
}