
Change `_schema_is_equal` to allow nullable and non-nullable types to be used interchangeably between train and test data

Open tamargrey opened this issue 2 years ago • 0 comments

As a user, I wish I could train a pipeline on data that might not have NaNs and has non-nullable types, and then call `predict`, `transform_all_but_final`, or `score` on data that has NaNs and therefore has nullable types.

Currently, having nullable types at train time and non-nullable types at test time (or vice versa) causes `ComponentGraph._transform_features` to error with "Input X data types are different from the input types the pipeline was fitted on". However, other than whether or not they may contain null values, nullable types and their non-nullable counterparts contain the same type of data.

Once the nullability epic is in place, we may see increased usage of nullable types, which could result in more instances of the above situation popping up.

I propose we change `_schema_is_equal` to treat the following nullable types interchangeably with their non-nullable counterparts:

  • Integer - IntegerNullable
  • Boolean - BooleanNullable
  • Age - AgeNullable

There are several things to take into account when implementing this:

  • Overall, think about the impact of allowing these types to be used interchangeably
  • Consider requiring that we validate the existence of NaNs before treating nullable and non-nullable types as equivalent. In general, I don't want us to shy away from keeping IntegerNullable columns as such even if no NaNs are present (whether we impute them ourselves or users input them). Those types aren't meant to imply the presence of NaNs, just that the type can support null values; otherwise they're the same as non-nullable integers. For example, in Featuretools, we might output types as IntegerNullable from a Primitive so that users could pass NaNs in and not have it break.
  • Increase test coverage of the ways data could differ between train and test for the different problem types and for the `score`, `predict`, and `transform_all_but_final` methods on pipelines
  • Confirm this doesn't change automl results before or after nullable type handling changes
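
The test coverage item above could be approached by enumerating the combinations to cover. The following is a rough sketch, assuming three illustrative problem types (not an exhaustive list) and the type pairs from this issue. `build_cases` is a hypothetical helper, and the resulting tuples would still need to be wired into real pipeline tests.

```python
from itertools import product

# Illustrative values only; the problem type list is not exhaustive.
PROBLEM_TYPES = ["binary", "multiclass", "regression"]
METHODS = ["score", "predict", "transform_all_but_final"]
TYPE_PAIRS = [
    ("Integer", "IntegerNullable"),
    ("Boolean", "BooleanNullable"),
    ("Age", "AgeNullable"),
]


def build_cases():
    """Return (problem_type, method, train_type, test_type) tuples."""
    cases = []
    for problem_type, method, (non_nullable, nullable) in product(
        PROBLEM_TYPES, METHODS, TYPE_PAIRS
    ):
        # Cover both directions: non-nullable at train and nullable at test,
        # and the reverse.
        cases.append((problem_type, method, non_nullable, nullable))
        cases.append((problem_type, method, nullable, non_nullable))
    return cases


CASES = build_cases()
print(len(CASES))
```

A list like `CASES` could be fed to a parametrized test so that each combination of problem type, method, and type direction gets its own test case.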

tamargrey, Mar 14 '23 20:03