Bug in rsmxval with id column composed of integers

Open mulhod opened this issue 3 years ago • 0 comments

I was trying to run an rsmxval experiment with a handful of features and ran into the following error, which I did not get when I ran a single rsmtool experiment on the same data with one train/test split:

$ rsmxval RandomForest.json RandomForest
Output directory: RandomForest
Saving configuration file.
Generating 5 folds after shuffling
Running RSMTool on each fold in parallel

[cut for brevity...]

Creating fold summary
Traceback (most recent call last):
  File "/Users/mmulholland/Documents/edusoft/model_generic/aggregation/rsmtool_experiment/../../../env/bin/rsmxval", line 10, in <module>
    sys.exit(main())
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/rsmtool/rsmxval.py", line 286, in main
    run_cross_validation(abspath(args.config_file),
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/rsmtool/rsmxval.py", line 164, in run_cross_validation
    df_predictions = df_predictions.merge(df_to_add,
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/frame.py", line 9354, in merge
    return merge(
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 107, in merge
    op = _MergeOperation(
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 704, in __init__
    self._maybe_coerce_merge_keys()
  File "/Users/mmulholland/Documents/edusoft/env/lib/python3.10/site-packages/pandas/core/reshape/merge.py", line 1261, in _maybe_coerce_merge_keys
    raise ValueError(msg)
ValueError: You are trying to merge on object and int64 columns. If you wish to proceed you should use pd.concat

When I inspected my data more closely, I saw that the ID column was composed entirely of int-able values. So I tried to reproduce the failure on a small set of data (100 rows), and then on the same data with the string "id" prepended to each int-able value (e.g. 0 -> id0). The failure reproduced with the integer IDs but not with the prefixed IDs.
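The dtype mismatch itself can be reproduced outside rsmxval. The following is a minimal sketch, assuming the predictions frame holds the IDs as strings (object dtype) while the training frame holds them as int64, as the error message suggests. The frames, the `raw` and `prompt` values, and the `try_merge` helper are made up for illustration and are not taken from the attached data or from rsmtool.

```python
import pandas as pd

# Hypothetical stand-in for the predictions frame: IDs stored as strings (object dtype).
df_predictions = pd.DataFrame({
    "spkitemid": pd.Series(["0", "1", "2"], dtype=object),
    "raw": [1.2, 0.4, 1.9],
})

# Hypothetical stand-in for the training frame: the same IDs parsed as int64.
df_train = pd.DataFrame({"id": [0, 1, 2], "prompt": ["a", "b", "a"]})


def try_merge(left, right):
    """Return the merged frame, or the error message if pandas refuses the merge."""
    try:
        return left.merge(right, left_on="spkitemid", right_on="id")
    except ValueError as exc:
        return str(exc)


result = try_merge(df_predictions, df_train)
print(result)
```

If the assumption holds, `result` is the error message rather than a merged frame, which matches the observation that prefixing the IDs with "id" (forcing both sides to be strings) avoids the failure.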

This is the offending block of code in rsmxval.py:

if len(columns_to_use) > 1:
    df_to_add = df_train[columns_to_use]
    df_predictions = df_predictions.merge(df_to_add,
                                          left_on="spkitemid",
                                          right_on=id_column)
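One possible workaround, sketched here under the assumption that the two key columns differ only in dtype, is to cast both to strings before merging. `merge_on_string_ids` is a hypothetical helper written for this illustration, not an existing rsmtool function, and the sample frames are made up.

```python
import pandas as pd


def merge_on_string_ids(df_predictions, df_to_add, id_column):
    """Merge after casting both key columns to str (hypothetical helper, not part of rsmtool)."""
    # assign() returns copies, so the callers' frames are left unmodified.
    left = df_predictions.assign(spkitemid=df_predictions["spkitemid"].astype(str))
    right = df_to_add.assign(**{id_column: df_to_add[id_column].astype(str)})
    return left.merge(right, left_on="spkitemid", right_on=id_column)


# Made-up data: string IDs on the predictions side, integer IDs on the training side.
df_predictions = pd.DataFrame({
    "spkitemid": pd.Series(["0", "1", "2"], dtype=object),
    "raw": [1.2, 0.4, 1.9],
})
df_train = pd.DataFrame({"id": [0, 1, 2], "prompt": ["a", "b", "a"]})

merged = merge_on_string_ids(df_predictions, df_train[["id", "prompt"]], "id")
print(merged)
```

This would not help if the IDs had been parsed as floats (e.g. `0.0`), since the string forms would no longer match; reading the ID column as a string at load time is likely the more robust fix.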

I have attached the feature file and the config used.

all_models_features.csv

RandomForest.json:

{
    "experiment_id": "RandomForestRegressor",
    "train_file": "./all_models_features.csv",
    "model": "RandomForestRegressor",
    "select_transformations": true,
    "description": "Use model predictions, etc., as features in a RandomForestRegressor model.",
    "test_label_column": "score",
    "train_label_column": "score",
    "use_scaled_predictions": true,
    "skll_objective": "quadratic_weighted_kappa",
    "subgroups": ["prompt"],
    "use_thumbnails": true,
    "trim_min": 0,
    "trim_max": 2,
    "id_column": "id",
    "exclude_zero_scores": false
}

Sep 14 '22 00:09 mulhod