PDEP-13: The pandas Logical Type System
@pandas-dev/pandas-core
Some big picture thoughts before I call it a day:
- Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.
- The "users shouldn't care what backend they have" theme really only works when behaviors are identical, which is not the case in most of the relevant cases (xref #58214). Are there cases other than series.dt.date where this comes up? If not, drop that example to avoid getting bogged down.
- It would help to clarify the relationship between what you have in mind and ExtensionDtype.
- The physical_type and missing_value_marker keywords/attributes seem more distracting than useful at this stage.
- IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64).
  a) The easy part of this would be mapping the pd.int64(...) to an existing dtype.
  b) The hard part would be everything else:
     i) Does obj.dtype give an "old" dtype or something new? What about series.array? Does this entail another abstraction layer?
     ii) Efficient/idiomatic checks for e.g. pd.int64?
     iii) Depending on how this relates to ExtensionDtype, does construct_array_type need to become a non-classmethod?
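For concreteness, the existing spellings being compared look like this today (the proposed pd.int64(...) constructor does not exist; everything below uses current public APIs):

import numpy as np
import pandas as pd
import pyarrow as pa

# Three ways a user can spell "64-bit integer" in pandas today, which the
# proposal wants to hide behind a single logical type.
numpy_dtype = np.dtype(np.int64)
masked_dtype = pd.Int64Dtype()
arrow_dtype = pd.ArrowDtype(pa.int64())

# One way to ask "is this an integer dtype?" across all three spellings;
# in recent pandas versions this returns True for each of them.
for dtype in (numpy_dtype, masked_dtype, arrow_dtype):
    print(repr(dtype), type(dtype).__name__, pd.api.types.is_integer_dtype(dtype))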
Cool thanks again for the thorough feedback. Responding to each question...
- Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.
Noted, and yeah, I was thinking this as well while I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see https://github.com/pandas-dev/pandas/issues/58312
2. The "users shouldn't care what backend they have" theme really only works when behaviors are identical, which is not the case in most of the relevant cases
This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value-added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types, but I am not aware of one where you as an end user would be responsible for picking the implementation.
3. It would help to clarify the relationship between what you have in mind and ExtensionDtype.
Definitely - noted for next revision. But at a high level I am of the opinion that logical types have nothing to do with data buffers, null masks, or algorithm implementations. Continuing with the database analogy, a logical query plan would say things like "Aggregate these types" or "Join these types". A physical query plan by contrast would say "use this algorithm for summation" or "use this algorithm for equality". An extension type acts more like the latter case
4. The physical_type and missing_value_marker keywords/attributes seem more distracting than useful at this stage.
I certainly dislike these too, but I'm not sure how our current type system could be ported to a logical type system without them. With our many string implementations being the main motivating case, I think this metadata is required for a logical type instance to expose so other internal methods know what they are dealing with. I would definitely at least love for these to be private, but it comes down to how much control we want to give users to pick "string" versus "string[pyarrow]" versus "string[pyarrow_python]" and whatever other iterations they can use today
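For reference, these are some of the string spellings a user can reach for today, all of which mean "string" at the logical level:

import pandas as pd
import pyarrow as pa

# Several distinct dtype objects that all mean "string" to an end user.
python_backed = pd.StringDtype("python")
pyarrow_backed = pd.StringDtype("pyarrow")
arrow_native = pd.ArrowDtype(pa.string())

# String aliases work too, e.g. when passed to astype or the Series constructor.
ser = pd.Series(["a", "b"], dtype="string[pyarrow]")
print(ser.dtype, python_backed, pyarrow_backed, arrow_native)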
5. IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64)
To be clear I would sincerely hope a user specifying the backend is an exceptional case; that is only allowed because our current type system allows / quasi-expects it. The logical type is a stepping stone to untangle user requirements from implementation details
This is a point we disagree on. I think users necessarily have to get bogged down in the details of the different backends today because of how prevalent they are in our API, but that is not a value-added activity for end users. I'm thinking of a database where I go in and just declare columns as INTEGER or DOUBLE PRECISION. I imagine that databases over time have managed different implementations of those types, but I am not aware of one where you as an end user would be responsible for picking the implementation.
In the database example, do the different implementations have different behaviors? It strikes me as silly to say users don't need to know/think/care about things that affect behavior.
- Trim out as much as you can. e.g. I think the Bridging section is more or less orthogonal to everything else and could be suggested separately.
Noted, and yeah, I was thinking this as well while I was drafting this. Let's see what other team members think though - it is useful to clarify how the different "backends" should interact, and we have gotten that as a user question before - see #58312
I agree there are a few parts of the Bridging section that are probably not needed and can be trimmed (e.g. the fact about following the C standard rules on casting / conversion), but on the other hand the text will also have to be expanded significantly in other areas. For example, all the questions Brock listed in the last item of his comment above (https://github.com/pandas-dev/pandas/pull/58455#issuecomment-2081685191) will need answering. The same goes for capturing a summary of some of the discussion points we have been talking about (we already generated a lot of comments in two days; we can't expect everyone to read all that to get a good idea of the discussion).
In the database example, do the different implementations have different behaviors? It strikes me as silly to say users don't need to know/think/care about things that affect behavior.
Setting the different null handling aside for a moment (because we know that is a big difference right now, but one we might want to do something about), I think that most reports about different behaviour should be considered as bug reports / limitations of the current implementation, and not inherent differences.
Some examples that have been referenced in the discussion (https://github.com/pandas-dev/pandas/issues/58321, https://github.com/pandas-dev/pandas/issues/58307, https://github.com/pandas-dev/pandas/issues/58315, https://github.com/pandas-dev/pandas/issues/53154) all seem to fit in that bucket to me.
(of course there will always be some differences, like small differences because of numeric stability in floating point algorithms, or different capitalization of the German ß (https://github.com/apache/arrow/issues/34599), and it will always be a somewhat subjective line between implementation details versus clear and intentional behavioural differences. But the list of items linked above are all clear things that could be fixed on our side to ensure we provide a consistent behaviour, not inherent differences. It's our choice whether we want to provide a consistent interface or not)
I recognize this is more about the underlying primitive field types than collection types; however, I just feel like the collections are arguably also primitive, and this issue has bugged me forever:
To avoid needing to store critical reminders about array axis names and bounds in comments instead of the type hints, how could we include array shape hints (int bounds), and optional array axis names (str) or ideally axis types (convenient newtype int bounds, great for language server to catch shape bugs)?
I've found it quite helpful to separate the axis names from the bounds, but in the case of typed axes, the axis name lives in the type of the bound, so the type checker can also do broadcast alignment. It's a bit tricky in some languages but probably a breeze in Python, and the only reason we're not doing this is that we don't know what we're missing.
IMHO we want the n-dimensional array shapes to be heterogeneous lists of axis bound pointer newtypes, because then the type system can calculate the output shapes of our n-dimensional array operations better. The world is moving rapidly into higher-dimensional data engineering, so it just seems reasonable to ensure pandas has great n-dimensional array shape types, that is all.
Hey @bionicles thanks for providing some feedback. I don't quite follow what you are looking for though - is there maybe a small example you can provide to clarify?
Absolutely, great idea, here is some code
A: demonstrates how array shapes are defined by the data passed in, and bugs can be unclear, and the axes have no names
Then we could hack out a basic data structure to hold the names and bounds for the axes in the shapes of the arrays.
from typing import NewType, Type, Tuple, Protocol, Optional
from dataclasses import dataclass, field
import pytest
@dataclass
class Axis:
"one step along the path of indices to an element within a tensor"
name: str = "B"
bound: Optional[int] = 16
def __post_init__(self):
if self.bound is not None and self.bound < 0:
raise ValueError("Axis size must be non-negative")
def test_axis_nonneg():
with pytest.raises(ValueError):
Axis(bound=-1)
# We want to write the shapes of the arrays in the type system
Shape = Tuple[Axis, ...]
# Example:
# Rows = Axis("R")
# Columns = Axis("C") # oops, this is a value
# MatrixShape = Tuple[Rows, Columns] # TypeError: Tuple[t0, t1, ...]: each t must be a type. Got Axis(name='R', bound=16).
rows = Axis("R")
columns = Axis("C")
matrix_shape = (rows, columns)
def str_from_shape(shape: Shape) -> str:
if not shape:
return "()"
names = []
bounds = []
for a in shape:
names.append(a.name)
bounds.append(a.bound)
return f"Shape({names=}, {bounds=})"
def shape_from_characters_and_bounds(characters: str, bounds: Tuple[Optional[int], ...]) -> Shape:
"build a shape quickly"
characters = tuple(characters)
characters_and_bounds = zip(characters, bounds)
return tuple(Axis(name=c, bound=b) for c, b in characters_and_bounds)
def shape_from_shapes(a: Shape, b: Shape) -> Shape:
c = {}
for axis_in_a in a:
c[axis_in_a.name] = axis_in_a.bound
for axis_in_b in b:
# remember the linear algebra and numpy broadcasting rules
# if either is 1 or none, any value is OK, take largest
# however if they are both scalars greater than 1, (m, n)
# then m == n must hold for compatibility
if axis_in_b.name not in c:
c[axis_in_b.name] = axis_in_b.bound
else:
lhs_bound = c[axis_in_b.name]
rhs_bound = axis_in_b.bound
if (
(lhs_bound is not None and lhs_bound > 1) and
(rhs_bound is not None and rhs_bound > 1) and
(lhs_bound != rhs_bound)
):
message = f"Axis '{axis_in_b.name}' Bounds ({lhs_bound}, {rhs_bound}) Incompatible: \n- lhs={str_from_shape(a)}\n- rhs={str_from_shape(b)}\n"
raise ValueError(message)
else:
if lhs_bound is None and rhs_bound is None:
c[axis_in_b.name] = None
elif lhs_bound is None and rhs_bound is not None:
c[axis_in_b.name] = rhs_bound
elif lhs_bound is not None and rhs_bound is None:
c[axis_in_b.name] = lhs_bound
else:
# both are scalars, take largest
c[axis_in_b.name] = max(lhs_bound, rhs_bound)
c_shape = tuple(Axis(name=k, bound=v) for k, v in c.items())
return c_shape
def compatibility_from_shapes(a: Shape, b: Shape) -> bool:
try:
_c = shape_from_shapes(a, b)
return True
except ValueError:
return False
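For illustration, a quick pytest-style check of the broadcasting rules implemented above (same style as test_axis_nonneg; this test is not part of the original snippet):

def test_broadcast_compatibility():
    "bounds of 1 broadcast; mismatched bounds greater than 1 are rejected"
    image_batch = shape_from_characters_and_bounds("BHW", (16, 28, 28))
    channel = shape_from_characters_and_bounds("BHWC", (1, 28, 28, 3))
    assert compatibility_from_shapes(image_batch, channel)

    mismatched = shape_from_characters_and_bounds("BHW", (16, 32, 32))
    assert not compatibility_from_shapes(image_batch, mismatched)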
B: demonstrates how clarifying the shape of the arrays can make similar bugs easier to understand
https://colab.research.google.com/drive/1OVkRtwv747jXPI11kfKCGGFHWWtRs3ki?usp=sharing
You could also transpose by axis names instead of their order, which is way more flexible because you could swap axes into desired positions by names regardless of their current order
This is along the lines of what I'm looking for; however, it requires features of newer versions of Python and would screw up pandas backwards compatibility: https://taoa.io/posts/Shape-typing-numpy-with-pyright-and-variadic-generics
Anyway, then we could have a column that's type hinted to be a specific shape of array, with axes labeled; I bet it looks really cool.
"Array[u8, (B, H, W, C), (16, 28, 28, 1)]" is wayyy more clear than "array[int]" and that clarity will help type checkers help us avoid / understand / fix shape bugs which are prevalent in python AI code
"Array[u8, (B, H, W, C), (16, 28, 28, 1)]" is wayyy more clear than "array[int]" and that clarity will help type checkers help us avoid / understand / fix shape bugs which are prevalent in python AI code
From a static type checking perspective, this is really difficult (and maybe impossible) to do, even to track the dtypes within a DataFrame. In pandas-stubs, we do a best effort to track the dtype of a Series in a static way, but a DataFrame can consist of multiple Series, of very different dtype. So you'd need support from the python typing system to allow a variadic type, and even then, as computations are done on a DataFrame, it's hard to track the resulting changes in dtype that can occur.
You're asking for doing something similar with tracking the shape of a DataFrame. I don't think that would be possible. Consider a merge operation that merged 2 DataFrames using an inner join. There's no way to know, from a static typing perspective, the resulting number of rows that come from the merge, because the number of rows is dynamic - it depends on the underlying data.
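For example, the same dtypes and the same inner merge can produce different result shapes depending only on the data (a small runtime illustration):

import pandas as pd

left = pd.DataFrame({"key": [1, 2, 3], "a": [10, 20, 30]})
right_one_match = pd.DataFrame({"key": [3, 4], "b": [1.0, 2.0]})
right_many_matches = pd.DataFrame({"key": [1, 1, 2], "b": [1.0, 2.0, 3.0]})

# Identical dtypes and identical merge call, but the row count of the result
# is determined entirely by how the key values line up at runtime.
print(left.merge(right_one_match, on="key", how="inner").shape)     # (1, 3)
print(left.merge(right_many_matches, on="key", how="inner").shape)  # (3, 3)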
So I don't think that what you are asking for is possible from a static typing perspective.
Outside of @Dr-Irv comments I also think what you are asking for is not quite in the scope of this PDEP. I would suggest opening a separate issue for discussion about supporting that though
Ah, exciting, I didn't know about pandas-stubs and I'll check it out. However, that's not what I'm talking about - I have a way to make Df/Ser schemas (not to mention I could try pandera). I'm talking about array column specifications: the difference between a scalar and a field.
Are arrays key-value stores, where the keys are N-tuples of indices within the bounds? If so, could it naturally fall within the logical spirit of n-dimensional arrays to carry n-tuples of information about the n dimensions? Otherwise it's vague about what's in there.
Sometimes I think the shape details are good; other times we don't know a bound, as you mentioned, so we could write None for unbounded or just leave out the shapes entirely. That's fine, but if we do have axis names and None bounds, then we can use the axis names to look up the runtime bounds later if they change.
i.e. Arrays are like highly efficient versions of these with keys implied by the memory layout:
FloatMatrix = Dict[(int, int), float]
(B, T, H, W, D, C) = create_axis_newtypes("BTHWDC") # these are all int
BoolSpatialVideoBatch = Dict[(B, T, H, W, D, C), bool]
We could think of arrays as multi-index dataframes. If somebody didn't know about (B, T, H, W, D, C) and just used int, it still works the same, but I just think the semantic naming of axes makes it easier to avoid breaking changes if shapes get transposed here and there, and to control the inter-dimensional mixing operations - we definitely don't want to screw those up! This is useful for einsum, for example.
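Here's a rough runnable sketch of the create_axis_newtypes helper used above (the helper is hypothetical; these are just NewType aliases over int, so this is about readability and static checking rather than runtime enforcement):

from typing import Dict, NewType, Tuple

def create_axis_newtypes(characters: str):
    "one nominal int type per axis letter, so a checker can tell B from W"
    return tuple(NewType(c, int) for c in characters)

B, T, H, W, D, C = create_axis_newtypes("BTHWDC")

# The "array as key-value store" analogy: keys are index tuples, values are entries.
# Note: a static checker would need these declared one by one (B = NewType('B', int))
# rather than built dynamically to follow along.
BoolSpatialVideoBatch = Dict[Tuple[B, T, H, W, D, C], bool]
FloatMatrix = Dict[Tuple[B, W], float]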
I guess I'm asking for a glorified tuple of tuples to define abstract array features
Not sure what to write about this, but here's how I'd tackle your challenges.
You're asking for doing something similar with tracking the shape of a DataFrame. I don't think that would be possible. Consider a merge operation that merged 2 DataFrames using an inner join. There's no way to know, from a static typing perspective, the resulting number of rows that come from the merge, because the number of rows is dynamic - it depends on the underlying data.
If you concatenate two dataframes, then you can build a lazy expression of their runtime bounds and evaluate it to get the number of rows after the concatenation of two dataframes, but only if the dataframes have shapes on them. Luckily, dataframes already have shapes on them, so we don't have to do anything to get that data, just build the expression. e.g.
type NumRowsAfterConcat =
Add<<A as Bounded<Rows>>::Bound, <B as Bounded<Rows>>::Bound>::Output
impossible
It is possible - here are static array types passing compile-failure tests in Rust with typenum. It's quite unreadable, sadly; in the future, when const generics stabilize, we can use Arabic numerals instead of an HList of binary digits.
Take one look at that screenshot and you'll know why I'm here asking you folks to dumb it down for us!
out of scope
That might mean the pandas type system suffers from bad array types for a long time, so I argue it's definitely in scope.
If you concatenate two dataframes, then you can build a lazy expression of their runtime bounds and evaluate it to get the number of rows after the concatenation of two dataframes, but only if the dataframes have shapes on them.
Let's suppose we did know the shapes in a static typing context. Then I don't see how you'd figure out the shape of a merge operation in a static context, because the shape would be dependent on the underlying data and how well the merge keys matched up.
I'm happy with a runtime solution - that's going to be much easier - but don't overthink it: it's a loop over two known shapes that adds axes to a new shape, applying merge rules to each pair of axes with the same name. It's turtles all the way down in that area, and everything is an HList, so there is only one straightforward option to merge shapes, and that is a "type-level fold over a chain of HLists."
If you know all the bounds statically (they're all const integers) then you just have to check some ifs and !=s. To get around the gotchas you mention, you use a functional interface and build expressions; if a bound is None, you can just wait to merge the shapes until runtime.
Instead of operating at the level of the shape values, with the proposed axis names in the type system we could operate at a symbolic level of function nodes that calculate the bounds / shapes whenever we want, at compile time or runtime.
Here's the key dataclass, collection type, and function refactored with better names and some extra comments. The static merge is the compile-time version of this; you could code it various ways, and I don't want to keep writing too much.
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass
class Axis:
    name: str
    bound: Optional[int]
    # __post_init__ validation omitted for brevity (same as the earlier version)

Shape = Tuple[Axis, ...]  # glorified tuple
def str_from_shape(shape: Shape) -> str:
"format for readability"
if not shape:
return "()"
names = []
bounds = []
for axis in shape:
names.append(axis.name)
bounds.append(axis.bound)
return f"Shape({names=}, {bounds=})"
def shape_from_shapes(lhs: Shape, rhs: Shape) -> Shape:
"a runtime `merge` for Shape"
new_shape = {} # NEW SHAPE
for axis_in_lhs in lhs:
        new_shape[axis_in_lhs.name] = axis_in_lhs.bound  # ADD AXIS TO NEW SHAPE
for axis_in_rhs in rhs:
# remember the linear algebra and numpy broadcasting rules
# if either is 1 or none, any value is OK, take largest
# however if they are both scalars greater than 1, (m, n)
# then m == n must hold for compatibility
if axis_in_rhs.name not in new_shape:
new_shape[axis_in_rhs.name] = axis_in_rhs.bound # ADD NOVEL AXES TO NEW SHAPE
else:
lhs_bound = new_shape[axis_in_rhs.name]
rhs_bound = axis_in_rhs.bound
if ( # a billion bugs for want of a simple check, not saying this is performant, but it is easy
(lhs_bound is not None and lhs_bound > 1) and
(rhs_bound is not None and rhs_bound > 1) and
(lhs_bound != rhs_bound)
):
message = f"Axis '{axis_in_rhs.name}' Bounds ({lhs_bound}, {rhs_bound}) Incompatible: \n- lhs={str_from_shape(lhs)}\n- rhs={str_from_shape(rhs)}\n"
raise ValueError(message)
else: # the bounds for this axis name are compatible on this branch
# therefore, ADD AXIS TO NEW SHAPE
if lhs_bound is None and rhs_bound is None:
new_shape[axis_in_rhs.name] = None
elif lhs_bound is None and rhs_bound is not None:
new_shape[axis_in_rhs.name] = rhs_bound
elif lhs_bound is not None and rhs_bound is None:
new_shape[axis_in_rhs.name] = lhs_bound
else:
# both are scalars, take largest
new_shape[axis_in_rhs.name] = max(lhs_bound, rhs_bound)
new_shape = tuple(Axis(name=k, bound=v) for k, v in new_shape.items())
return new_shape
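For example, merging two shapes at runtime and hitting the incompatible case (shape_from_characters_and_bounds is the helper from the earlier snippet):

row = shape_from_characters_and_bounds("RC", (1, 8))
column = shape_from_characters_and_bounds("RC", (4, 1))
print(str_from_shape(shape_from_shapes(row, column)))
# Shape(names=['R', 'C'], bounds=[4, 8])

bad = shape_from_characters_and_bounds("RC", (4, 3))
try:
    shape_from_shapes(row, bad)
except ValueError as err:
    print(err)  # Axis 'C' Bounds (8, 3) Incompatible: ...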
Totally works at runtime. Better than the compile-time version, imho, because the compile-time version needs to be too fancy and weird just to work in a compile-time context; it looks way different.
Here's a little diff adding a runtime call to a method to get the number of rows in the merged dataframes, so you can see how that looks.
/// A type representing either the type-level number of rows, if the row counts of the first and second arrays are known at compile time, or otherwise a function to get the number of rows at runtime.
type NumRowsAfterConcat =
// fix unclear names A, B
- Add<<A as Bounded<Rows>>::Bound, <B as Bounded<Rows>>::Bound>::Output
+ Add<<FirstShape as Bounded<Rows>>::Bound, <SecondShape as Bounded<Rows>>::Bound>::Output;
+ let num_rows_after_concat = NumRowsAfterConcat::bound();
// ^^^^ invoke lazy getter on expression output type
In python you could try to do this with protocols, but runtime shapes are fine
out of scope
That might mean the pandas type system suffers from bad array types for a long time, so I argue it's definitely in scope.
To be clear, we inherit our types from a mix of NumPy and Arrow today, neither of which encodes the information you are asking for into their array specifications. Trying to come up with a new standard beyond either of those is not what this PDEP is about; rather, it is about how we reasonably wrap pandas functionality on top of them.
What you are proposing definitely has merits, but pandas is the wrong library to try to build that into, at least at any meaningful level of detail that works within our ecosystem. The Extension Types offered by Arrow might offer something for you, and there is a future where we could reasonably exchange/consume those.
hey great find!
tensor (multidimensional array) stored as Binary values and having serialized metadata indicating the data type and shape of each value. This could be JSON like {'type': 'int8', 'shape': [4, 5]} for a 4x5 cell tensor.
Arrow extensions can do exactly what we'd need. I'd just suggest, from hard experience debugging, that we could benefit from revising their example to separate "shape" into "axis_names" and "axis_bounds", because the axis names are keys (more likely known and stable as shapes change through programs) and the bounds are values (which might not be known, as Dr. Irv mentioned):
shape_info = {'type': 'int8', 'names': ['X', 'Y'], 'bounds': [2, 3]}
A JSON schema like this could enable an Arrow extension for named arrays in pandas:
{
"$schema": "http://json-schema.org/draft-07/schema#",
"type": "object",
"properties": {
"entry": {
"type": "string",
"enum": ["int8"]
},
"rank": {
"type": "integer",
"minimum": 0
},
"names": {
"type": "array",
"items": {
"type": "string"
},
"minItems": {
"$data": "/properties/rank"
},
"maxItems": {
"$data": "/properties/rank"
}
},
"bounds": {
"type": "array",
"items": {
"type": "number"
},
"minItems": {
"$data": "/properties/rank"
},
"maxItems": {
"$data": "/properties/rank"
}
}
},
"required": ["entry", "rank", "names", "bounds"]
}
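A minimal sketch of how that metadata could ride along with the data using pyarrow's existing ExtensionType hooks (the class name and extension name here are hypothetical, and registration/error handling are omitted):

import json
import pyarrow as pa

class NamedTensorType(pa.ExtensionType):
    "hypothetical extension type carrying axis names and bounds as metadata"

    def __init__(self, entry: pa.DataType, names, bounds):
        self._names = list(names)
        self._bounds = list(bounds)
        # Entries are stored flattened; the shape metadata travels in the
        # serialized form produced below.
        super().__init__(pa.list_(entry), "example.named_tensor")

    def __arrow_ext_serialize__(self) -> bytes:
        return json.dumps({"names": self._names, "bounds": self._bounds}).encode()

    @classmethod
    def __arrow_ext_deserialize__(cls, storage_type, serialized):
        meta = json.loads(serialized.decode())
        return cls(storage_type.value_type, meta["names"], meta["bounds"])

# e.g. a 2x3 int8 tensor per cell, with named axes X and Y
xy_int8 = NamedTensorType(pa.int8(), names=["X", "Y"], bounds=[2, 3])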
My blocker with the pandas type system always comes down to collections (tuples, tensors, and jsonb). While SQL-esque table normalization is one approach, if we're thinking in terms of what's possible, scalars (base dtypes) are already arrays of unit shape, so we could gain easy benefits by treating everything like arrays: you can broadcast all the stuff you have right now over whatever shape you want. Those type hints / axis names are keys to unlock myriad awesome built-in array features in pandas.
The difference between "int" and "[int, (T, H, W, C)]" is the difference between a column of one color of one pixel and a column of entire movies. Please don't take my word for it, go ask around the pandas user base and see who puts / intends to put / is forced to deal with collections (arrays and jsonb) in dataframes and would appreciate type hints for that built into pandas.
wrong library
You may be right, and I'm happy to contribute this code elsewhere, like the Python typing module, but pandas is all about arrays plus axis information, so this PDEP felt like a natural place.
For one use case of many: just imagine we could all easily promote a cell with one thing in it to a vector of things - that's a good way to represent an audit trail / version history.
My blocker with the pandas type system always comes down to collections
You may also be interested in the entire Arrow type system, which has better support for these out of the box:
https://arrow.apache.org/docs/format/Columnar.html
The Fixed Size List Layout may have much of what you are after
In the current PDEP I am just proposing a generic ListDtype, but I'm starting to wonder now if we shouldn't offer a FixedSizeListDtype alongside that
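For reference, a fixed-size list column can already be spelled today by wrapping the Arrow type, roughly like this:

import pandas as pd
import pyarrow as pa

# A column where every cell is a length-3 list of int64 values.
fixed = pd.ArrowDtype(pa.list_(pa.int64(), 3))
ser = pd.Series([[1, 2, 3], [4, 5, 6]], dtype=fixed)
print(ser.dtype)  # fixed_size_list<item: int64>[3][pyarrow] (exact repr may vary by version)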
Worth a look:
- Julia's native arrays https://docs.julialang.org/en/v1/manual/arrays/#man-multi-dim-arrays
- Pandera paper https://www.researchgate.net/publication/343231859_pandera_Statistical_Data_Validation_of_Pandas_Dataframes
- RL Gym Box Space https://gymnasium.farama.org/api/spaces/fundamental/#gymnasium.spaces.Box
- All lists/vectors/matrices/cubes/etc are special cases of this key value store concept, just with different FixedSizeList for the keys, so that could be a good way to think about this for sure
Just keep in mind that the word "list" to pythonistas implies rank-1 keys (like single integers), while numpy arrays have a multi-axis slicing/indexing syntax, even with strides, dang.
I think with a functional API like this, all the different ranks of tensor, including scalars, can implement similar methods, enabling us all to work with a single number / vector / cube of a million numbers the same way: the single number's key is () and the cube's keys are (I, J, K)-typed (i, j, k) values. The underlying implementation details of how this key-value mapping is achieved don't matter to callers (ideally) as long as they adhere to the function signatures. The performance magic of good array libraries is to pretend the i, j, k keys exist even though they're just implied by the data memory layout.
We could rewrite Lensable as a Python Protocol (https://docs.python.org/3/library/typing.html#typing.Protocol) and ensure all the collections are lensable; then anything which is lensable in this way can work inside the type system here (thus making it easy for library users to add new struct, json, and array dtypes by subclassing the Lensable Protocol and implementing these methods for their data structure).
Then another Field Protocol could define the methods you want / need on the base primitive dtypes, and the "leaf types" like bool, int, and float would implement both: Lensable<(), Self> to quack like collections, and Field, which would encompass the arithmetic operations and track the changes in dtypes at runtime, so they act like values to be added, subtracted, multiplied, divided, etc.
Basing it on a two-level Lensable/Field Protocol system would mean anybody could add a new collection or primitive dtype to the system (open to extension), and the FixedSizeList could be perfect for holding a certain rank / number of axes (including zero for scalars). An example of when custom Fields might be useful would be defining quantities and units so you don't mix up pounds and kilograms, meters and feet, miles and km, money units, that kind of thing.
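A minimal sketch of what that two-level Protocol pair could look like (names and method signatures are illustrative, following the description above, not an existing API):

from typing import Protocol, Tuple, TypeVar

Key = TypeVar("Key")
Value = TypeVar("Value")

class Lensable(Protocol[Key, Value]):
    "a collection addressed by keys, e.g. an array addressed by index tuples"
    def get(self, key: Key) -> Value: ...
    def set(self, key: Key, value: Value) -> "Lensable[Key, Value]": ...
    def keys(self) -> Tuple[Key, ...]: ...

class Field(Protocol):
    "a leaf value supporting arithmetic; scalars would implement both protocols"
    def __add__(self, other: "Field") -> "Field": ...
    def __mul__(self, other: "Field") -> "Field": ...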
@bionicles what you’re describing is worth discussing separately. Please open a dedicated issue.
OK, made it: https://github.com/pandas-dev/pandas/issues/59006 - it mentions the Arrow Tensor (could make it easier to support that use case).
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
OK now that the landscape has changed post PDEP-14 and with PDEP-16 under discussion, I've gone ahead and updated the proposal. Let me know what you think
It would really help me out to get more depth on questions 3) and 5)i-iii here. Also:
- Is the result of ser.array considered a public API that would be expected to behave the same across backends? If so we'd need an additional layer hiding an underlying EA from the user.
- To the extent that a major goal is to match behaviors across backends, should we make a policy for what we do as new inconsistencies are discovered? e.g. when will "fixing" an inconsistency be considered a bugfix vs breaking change? Policy for determining that some inconsistencies are not worth fixing?
3. It would help to clarify the relationship between what you have in mind and ExtensionDtype.
In the original draft of this PDEP, I wasn't sure if we felt there would be a reason to keep the ExtensionDtype untouched and do this separately, or to have this basically take over the Extension type system and set expectations around how that works more clearly. We ended up on the latter approach
5. IIUC the main point of the proposal is that pd.int64(family="pyarrow|numpy|masked") is clearer than pd.ArrowDtype(pa.int64()) | pd.Int64Dtype() | np.dtype(np.int64). a) The easy part of this would be mapping the pd.int64(...) to an existing dtype. b) The hard part would be everything else: i) does obj.dtype give an "old" dtype or something new? what about series.array? Does this entail another abstraction layer? ii) efficient/idiomatic checks for e.g. pd.int64? iii) Depending on how this relates to ExtensionDtype, does construct_array_type need to become a non-classmethod?
In its current state, the PDEP suggests we just have pd.Int64Dtype(). We can reuse the same type that is there today, but change the underlying implementation of it as we see fit
6. Is the result of ser.array considered a public API that would be expected to behave the same across backends? If so we'd need an additional layer hiding an underlying EA from the user.
That's a good question. We can probably just do away with ser.array altogether?
7. To the extent that a major goal is to match behaviors across backends, should we make a policy for what we do as new inconsistencies are discovered? e.g. when will "fixing" an inconsistency be considered a bugfix vs breaking change? Policy for determining that some inconsistencies are not worth fixing?
I don't think we should put a lot of time into trying to make all the backends work the same; the goal of this PDEP is to really stop caring about all the subtle differences and just have one way of doing things. For most logical types I expect Arrow to be the main storage backing it, so I would suggest we defer to how Arrow handles things
I don't think we should put a lot of time into trying to make all the backends work the same; the goal of this PDEP is to really stop caring about all the subtle differences and just have one way of doing things.
These two sentences seem in direct contradiction to each other.
[...] For most logical types I expect Arrow to be the main storage backing it, so I would suggest we defer to how Arrow handles things
You're advocating API changes and in the previous sentence saying you don't want to think too much about what those changes would be. I object.
In its current state, the PDEP suggests we just have pd.Int64Dtype(). We can reuse the same type that is there today, but change the underlying implementation of it as we see fit
So would pd.ArrowDtype(pa.int64()) return an instance of pd.Int64Dtype? Or would pd.ArrowDtype go away completely?
We can probably just do away with ser.array altogether?
And .values? Would that be expected to be unchanged across backends or would you get rid of that too?
You're advocating API changes and in the previous sentence saying you don't want to think too much about what those changes would be. I object.
I'm not sure I understand what you are saying here - do you have an example in mind you are worried about?
So would pd.ArrowDtype(pa.int64()) return an instance of pd.Int64Dtype? Or would pd.ArrowDtype go away completely?
All of our existing constructors would be automatically mapped to the logical type, so in this case pd.ArrowDtype(pa.int64()) would give you back a pd.Int64Dtype
And .values? Would that be expected to be unchanged across backends or would you get rid of that too?
Yeah, that's a good callout. .values will be much less useful in the world of this PDEP. If a user specifically wanted a NumPy array back they should call .to_numpy(), and the logical type can do that conversion (or raise, if it doesn't make sense).
I'm not sure I understand what you are saying here - do you have an example in mind you are worried about?
I'm not inclined to spend more time on this. There's a good idea in here somewhere and I'd like to get to +1, but I'm -1 until you actually think through the implications of what you're asking for.
I don't think we should put a lot of time into trying to make all the backends work the same; the goal of this PDEP is to really stop caring about all the subtle differences and just have one way of doing things. For most logical types I expect Arrow to be the main storage backing it, so I would suggest we defer to how Arrow handles things
I think this is part of the "users generally don't care about the subtle differences" theme. Working with pandas in a production setting, I care a great deal. Our code at work has to deal in particular with a lot of NA values, and I imagine it would need significant and hard-to-find changes to switch over to Arrow NA semantics. And while I would guess that my uses are in the minority of overall users, I do think there are others who care.
Other subtle differences besides NA semantics are perhaps even more concerning, mostly because I don't know what they are.
This isn't (necessarily) an objection to moving pandas more towards Arrow behavior, only an objection to the idea that users don't care about subtle differences.
Thanks @rhshadrach for the feedback
Our code at work has to deal in particular with a lot of NA values, and I imagine it would need significant and hard-to-find changes to switch over to Arrow NA semantics
I'm assuming when you say you deal with a lot of NA values that you are not using pd.NA but still np.nan? Or if you are using pd.NA, can you clarify what you mean with the concern around Arrow NA semantics?
FWIW missing value semantics are more of a topic for PDEP-16 than this PDEP, although it can be hard to untangle that given how our types have historically been so intertwined with a missing value marker
I'm assuming when you say you deal with a lot of NA values that you are not using pd.NA but still np.nan? Or if you are using pd.NA, can you clarify what you mean with the concern around Arrow NA semantics?
np.nan as well as None in object dtypes.
FWIW missing value semantics are more of a topic for PDEP-16 than this PDEP, although it can be hard to untangle that given how our types have historically been so intertwined with a missing value marker
While I used the example of NA semantics, the point of my previous comment is to state that there are users that care a great deal about subtle differences. I do think that's relevant to this discussion.
If you were to ignore NA handling do you have an example of where you think the API for types should diverge based off of their implementation?
The example we've talked about a few times in this discussion is pd.Series.dt.date returning an object dtype for pd.DatetimeDtype but a pa.date32 when using a pa.timestamp for storage. Is your concern that if we were to change something like that to always just return a pd.Date data type there would have to be a transition, or do you just not think it's worth doing that in the first place?
If you were to ignore NA handling do you have an example of where you think the API for types should diverge based off of their implementation?
pyarrow durations can hold int64_min while np.timedelta64s can't. On the flip side, you can multiply numpy ones by floats, but not pyarrow ones. Particularly in the former case, it is a tiny corner case that would take enormous effort to get the behaviors to match. In the latter case the importance depends on PDEP-16, because if nan is distinct from NA then Timedelta(1)*np.nan is distinct from NA, and the pyarrow durations don't have a way to represent NaT.
pyarrow/numpy numeric dtypes handle overflows differently.
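For instance, the overflow difference looks roughly like this today (a sketch; the exact warning and exception behavior depends on the installed numpy/pyarrow versions):

import numpy as np
import pandas as pd

big = np.iinfo(np.int64).max

# NumPy-backed int64 wraps around (with at most a RuntimeWarning)...
print(pd.Series([big], dtype="int64") + 1)

# ...while the pyarrow-backed int64 raises on overflow.
try:
    pd.Series([big], dtype="int64[pyarrow]") + 1
except Exception as err:  # ArrowInvalid in current versions
    print(type(err).__name__, err)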
I assume that we will continue to find more small differences forever. We can either say "users don't need to know/care which backend they have" or "we won't worry about making the behaviors match across backends", but we cannot say both. (This is why I dislike the term "backend", xref #58214.)