woodwork
Move dtype/physical type property off of the LogicalType class
As a user, I would like to be able to use the same Woodwork LogicalType independent of whether or not my data may contain null values.
Currently, a column's physical type (and, correspondingly, the Series' dtype) is determined by the LogicalType used for that column. This means that an `Integer` logical type column will always have the non-nullable `int64` dtype, and an `IntegerNullable` column will always have `Int64` as its dtype, regardless of whether there are null values in the column.
This is done through the `primary_dtype` attribute on the LogicalType class. (Note: a `backup_dtype` is used in instances where a library does not support the logical type's primary dtype.) The actual dtype used for a column in a DataFrame is determined by `LogicalType._get_valid_dtype`, which requires a Series to make that determination.
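To make the current coupling concrete, here is a minimal sketch of a logical type that carries its own primary and backup dtype. It is a simplified illustration, not Woodwork's implementation: `LogicalTypeSketch`, `get_valid_dtype`, and the `supported_dtypes` argument are hypothetical names, and the `"string"`/`"object"` pair is only an illustrative choice of dtypes.

```python
from dataclasses import dataclass
from typing import Optional, Set


@dataclass(frozen=True)
class LogicalTypeSketch:
    """Hypothetical stand-in for a logical type tied to fixed dtypes."""

    name: str
    primary_dtype: str
    backup_dtype: Optional[str] = None

    def get_valid_dtype(self, supported_dtypes: Set[str]) -> str:
        """Return the primary dtype if supported, else fall back to the backup."""
        if self.primary_dtype in supported_dtypes:
            return self.primary_dtype
        if self.backup_dtype is not None and self.backup_dtype in supported_dtypes:
            return self.backup_dtype
        raise ValueError(f"No supported dtype for logical type {self.name}")


# The dtype is fixed by the logical type, not by the data in the column.
integer = LogicalTypeSketch("Integer", primary_dtype="int64")
integer_nullable = LogicalTypeSketch("IntegerNullable", primary_dtype="Int64")

# Illustrative only: a type whose primary dtype is missing in some library.
text = LogicalTypeSketch("Text", primary_dtype="string", backup_dtype="object")

int_dtype = integer.get_valid_dtype({"int64", "Int64"})
nullable_dtype = integer_nullable.get_valid_dtype({"int64", "Int64"})
fallback_dtype = text.get_valid_dtype({"object"})
```

Note that nothing in this model looks at whether the column actually contains nulls, which is the limitation this issue describes.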
Because logical types represent additional meaning beyond what a physical type can provide, tying each one to a single physical type may limit their usability. We should consider making a column's physical type (and therefore its dtype) a more flexible attribute of a Woodwork object by no longer tying it to a LogicalType object. One solution may be to add `physical_type` as a property of a `ColumnSchema` or a `WoodworkColumnAccessor` object.
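One possible shape for this proposal, sketched under assumptions: the physical type lives on the column-level schema rather than on the logical type. `ColumnSchemaSketch` and its fields are hypothetical and do not reflect the real `ColumnSchema` class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class ColumnSchemaSketch:
    """Hypothetical column schema that stores the physical type separately."""

    logical_type: str   # semantic meaning, e.g. "Integer"
    physical_type: str  # storage dtype, e.g. "int64" or "Int64"


# The same logical type can describe a column with or without nulls.
without_nulls = ColumnSchemaSketch(logical_type="Integer", physical_type="int64")
with_nulls = replace(without_nulls, physical_type="Int64")
```

With this separation, a user could keep `Integer` for both columns and vary only the dtype.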
Here are some questions that would need to be answered to implement this:
- Should a `ColumnSchema` object, which has no knowledge of the actual data, store the physical type, a type that is entirely reliant on how the data is physically stored?
- Can any dtype be specified for any column independent of its LogicalType, or is there still some set of possible dtypes defined for each logical type?
- How are users specifying which physical types they want? Is it at init by logical type? By column?
- If nullable and non-nullable dtypes are both possible for the same logical type, which is used during type inference?
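The last two design questions can be explored with a small sketch. It assumes one possible answer: each logical type declares a restricted set of allowed dtypes, and inference picks the nullable one only when the data contains nulls. `ALLOWED_DTYPES`, `validate_physical_type`, and `infer_physical_type` are hypothetical names, and plain Python lists with `None` stand in for a Series.

```python
from typing import Dict, Optional, Sequence

# Assumed mapping: each logical type allows one nullable and one
# non-nullable dtype. A real design might allow more, or fewer.
ALLOWED_DTYPES: Dict[str, Dict[str, str]] = {
    "Integer": {"non_nullable": "int64", "nullable": "Int64"},
}


def validate_physical_type(logical_type: str, physical_type: str) -> bool:
    """Check that a user-requested dtype is allowed for the logical type."""
    return physical_type in ALLOWED_DTYPES[logical_type].values()


def infer_physical_type(logical_type: str, values: Sequence[Optional[int]]) -> str:
    """Pick the nullable dtype only if the data actually contains nulls."""
    options = ALLOWED_DTYPES[logical_type]
    has_nulls = any(value is None for value in values)
    return options["nullable"] if has_nulls else options["non_nullable"]


dense = infer_physical_type("Integer", [1, 2, 3])
sparse = infer_physical_type("Integer", [1, None, 3])
```

An alternative policy would be to always infer the nullable dtype so that a column's dtype does not change when nulls appear later; the sketch only shows the data-driven option.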
Iceboxing for now, until investigation and design doc is completed here:
- https://github.com/alteryx/featuretools/issues/1686