Move dtype/physical type property off of the LogicalType class

Open tamargrey opened this issue 3 years ago • 1 comments

As a user, I would like to be able to use the same Woodwork LogicalType independent of whether or not my data may contain null values.

Currently, a column's physical type (and, correspondingly, the Series' dtype) is determined by the LogicalType being used for that column. This means that an Integer logical type column will always have the non-nullable int64 dtype, and an IntegerNullable column will always have Int64 as its dtype, independent of whether or not there are null values in the column.

This is done through the primary_dtype attribute on the LogicalType class. (Note: a backup dtype is used in instances where a library does not support the logical type's primary dtype.) The actual dtype that's used for a column in a DataFrame is determined by LogicalType._get_valid_dtype, which needs a Series to make that determination.
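The coupling described above can be sketched with simplified stand-in classes. This is not Woodwork's code: the names primary_dtype and _get_valid_dtype and the int64/Int64 dtypes come from the description above, while the backup_dtype attribute name and the library_supports_primary flag are assumptions made for illustration (the real method inspects a Series rather than a flag).

```python
class LogicalType:
    """Simplified stand-in for a logical type (not Woodwork's implementation)."""

    primary_dtype = None
    backup_dtype = None  # assumed attribute name for the fallback dtype

    @classmethod
    def _get_valid_dtype(cls, library_supports_primary=True):
        # Hypothetical signature: a flag replaces the Series inspection.
        if not library_supports_primary and cls.backup_dtype is not None:
            return cls.backup_dtype
        return cls.primary_dtype


class Integer(LogicalType):
    primary_dtype = "int64"  # non-nullable


class IntegerNullable(LogicalType):
    primary_dtype = "Int64"  # nullable


# The dtype follows from the logical type alone, never from the column's contents.
for values in ([1, 2, 3], [1, None, 3]):
    print(values, Integer._get_valid_dtype(), IntegerNullable._get_valid_dtype())
```

In this sketch nothing about the data reaches the dtype decision, which is the limitation the issue describes.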

As logical types represent additional meaning beyond what a physical type can provide, the fact that they're tied to a single physical type may limit their usability. We should consider making a column's physical type (and therefore its dtype) a more flexible attribute of a Woodwork object by no longer tying it to a LogicalType object. One solution may be to add physical_type as a property of a ColumnSchema or a WoodworkColumnAccessor object.
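One possible shape for that proposal, as a minimal sketch only: the ColumnSchema below is a hypothetical dataclass, not Woodwork's class, and the physical_type field and its None default are assumptions about how the attribute might be stored.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ColumnSchema:
    """Hypothetical schema that stores the physical type next to the logical type."""

    logical_type: str
    physical_type: Optional[str] = None  # None means "not chosen yet"


# The same logical type can now be paired with different physical types.
ages = ColumnSchema(logical_type="Integer", physical_type="int64")
ages_with_gaps = ColumnSchema(logical_type="Integer", physical_type="Int64")
print(ages, ages_with_gaps)
```

Here a single Integer logical type covers both the nullable and non-nullable cases, so a separate IntegerNullable type would no longer be needed to express nullability.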

Here are some questions that would need to be answered to implement this:

  • Should a ColumnSchema object, which has no knowledge of the actual data, store the physical type, a type that is entirely reliant on how the data is physically stored?
  • Can any dtype be specified for any column independent of its LogicalType, or is there still some set of possible dtypes defined for each logical type?
  • How are users specifying which physical types they want? Is it at init by logical type? By column?
  • If nullable and non-nullable dtypes are possible for the same logical type, which is used during type inference?
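To make the second and fourth questions concrete, here is one possible answer sketched in plain Python. Everything in it is an assumption for illustration: the ALLOWED_DTYPES table, the choose_physical_type helper, and the rule "prefer the non-nullable dtype unless nulls are present" are hypothetical and not a decided design.

```python
# Hypothetical table of permitted dtypes per logical type: (non-nullable, nullable).
ALLOWED_DTYPES = {"Integer": ("int64", "Int64")}


def choose_physical_type(logical_type, values, requested=None):
    """Pick a dtype for a column, honoring an explicit request when it is valid."""
    non_nullable, nullable = ALLOWED_DTYPES[logical_type]
    has_nulls = any(v is None for v in values)
    if requested is not None:
        if requested not in (non_nullable, nullable):
            raise ValueError(f"{requested} is not allowed for {logical_type}")
        if has_nulls and requested == non_nullable:
            raise ValueError(f"{requested} cannot hold null values")
        return requested
    # Inference: use the nullable dtype only when the data requires it.
    return nullable if has_nulls else non_nullable


print(choose_physical_type("Integer", [1, 2, 3]))
print(choose_physical_type("Integer", [1, None, 3]))
```

Under these assumptions a restricted set of dtypes remains tied to each logical type, and a user could override the inferred choice per column through the requested argument.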

tamargrey avatar Aug 30 '21 19:08 tamargrey

Iceboxing for now, until the investigation and design doc are completed here:

  • https://github.com/alteryx/featuretools/issues/1686

gsheni avatar Nov 18 '21 22:11 gsheni