pandas PDEP-14: Dedicated string data type for pandas 3.0

Following the discussion in https://github.com/pandas-dev/pandas/issues/57073, this proposes a possible solution to get a string dtype in pandas 3.0 (essentially writing out my compromise attempt at https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2080683080 as a formal proposal). This also covers the issue tracking the required work for the string dtype in https://github.com/pandas-dev/pandas/issues/54792.

Abstract

This PDEP proposes to introduce a dedicated string dtype that will be used by default in pandas 3.0:

In pandas 3.0, enable a string dtype ("str") by default, using PyArrow if available or otherwise the numpy object-dtype alternative.
The default string dtype will use missing value semantics using NaN consistent with the other default data types.

This will give users a long-awaited proper string dtype for 3.0, while 1) not (yet) making PyArrow a hard dependency, but still a dependency used by default, and 2) leaving room for future improvements (different missing value semantics, using NumPy 2.0 or nanoarrow, etc).
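
As a rough illustration of the backend selection described in the abstract, the choice could be sketched as below. This is a minimal sketch, not pandas' actual implementation: `default_string_storage` is a hypothetical helper, and the `"python"` label for the object-dtype fallback is borrowed from the naming used later in this discussion.

```python
import importlib.util


def default_string_storage(pyarrow_available=None):
    """Pick the storage for the default string dtype (hypothetical helper).

    If `pyarrow_available` is None, check whether PyArrow can be imported.
    """
    if pyarrow_available is None:
        pyarrow_available = importlib.util.find_spec("pyarrow") is not None
    # PyArrow is used by default when installed; otherwise fall back to the
    # NumPy object-dtype based implementation ("python" storage).
    return "pyarrow" if pyarrow_available else "python"


print(default_string_storage(pyarrow_available=True))   # pyarrow
print(default_string_storage(pyarrow_available=False))  # python
```

In both cases the missing value semantics would be the same (NaN); only the storage differs.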

Sub-discussions:

https://github.com/pandas-dev/pandas/issues/58613

cc @pandas-dev/pandas-core @pandas-dev/pandas-triage

May 03 '24 15:05 jorisvandenbossche

ValueError: Could not find PDEP number in 'PDEP: Dedicated string data type for pandas 3.0'. Please make sure to write the title as: 'PDEP-num: PDEP: Dedicated string data type for pandas 3.0'.

May 03 '24 21:05 jbrockmendel

@jorisvandenbossche - I've renamed this PDEP-14 to fix the doc build job. The docs build automatically picks up added PDEP PRs for the website, and they need a number for that to succeed.

May 04 '24 13:05 rhshadrach

One of the concrete discussion points is the API design of the StringDtype(..) constructor and the way to distinguish the various variants of the dtype (i.e. the current "pyarrow_numpy" naming we introduced in https://github.com/pandas-dev/pandas/pull/54533 / https://github.com/pandas-dev/pandas/issues/54792).
To keep that sub-discussion manageable, I opened a dedicated issue for that specific topic: https://github.com/pandas-dev/pandas/issues/58613

May 07 '24 11:05 jorisvandenbossche

I'm with Joris pretty much across the board on this. I'm pretty sure @phofl will be too.

May 07 '24 14:05 jbrockmendel

Thanks Brock. It would indeed be good to hear from others that previously seemed to be OK with the compromise and the NaN behaviour we currently have on main (or not OK, of course, in that case you are also allowed to speak up ;))

May 07 '24 19:05 jorisvandenbossche

@pandas-dev/pandas-core I pushed a set of updates based on the discussions from last week:

Expanded the section on "Missing value semantics" to more clearly contrast the behaviour differences between what is being proposed and the existing StringDtype (as it seems that keeps causing some confusion)
Updated the naming to use StringDtype() with a combo of storage and na_value keywords (so replacing the confusing storage="pyarrow_numpy" name with storage="pyarrow", na_value=np.nan). This is based on the proposal I made in https://github.com/pandas-dev/pandas/issues/58613#issuecomment-2100453677; we can keep discussing that aspect there.
Expanded the backwards compatibility section to more explicitly call out the valid backwards compatibility issues for existing users of dtype="string"/dtype=pd.StringDtype(). Although I know this is one of the contentious parts, I am still proposing to introduce breaking changes for those users (I don't think there is any way around this given the proposal of using NaN semantics for the default string dtype), but based on the feedback I added the proposal to add a deprecation warning for this in advance. So at least it's a breaking change for which we warn in advance and for which we provide an easy option to suppress the warning / preserve the current behaviour with minimal code edits required.

My understanding is that the main points of contention currently are:

The notion that the default string dtype should use NaN semantics to be consistent with the other default dtypes, as proposed by the PDEP (and thus also return default numeric/bool dtypes in string operations, in contrast to the nullable/masked Integer/Float/Boolean dtype being returned by the current StringDtype variant that uses pd.NA).
The fact that letting the implicit dtype="string" / dtype=pd.StringDtype() become an alias for the new default string dtype (NaN-variant), while it currently already gives the NA-variant, means this part is a breaking change for the existing users of StringDtype (although we can add a deprecation warning for it in advance, it's of course still a change in behaviour in 3.0 for those users)
Related to the above points, but the notion that we should introduce this change (a default string dtype) now for pandas 3.0, instead of waiting for pandas 4.0 when we (hopefully) have a better idea about using pd.NA more generally / about a general logical dtype system (instead of already doing something like that but just for the string dtype), which could avoid both points above.

While I firmly believe for the first bullet point that this is the only viable option at this point (IMO we really don't want to give users mixed NaN/NA semantics as their default user experience), I think those points are further mostly subjective judgement calls about whether the added complexity (having yet another variant of StringDtype to be able to make this a default dtype right now) and breaking changes (for existing users of StringDtype) are worth it compared to the added benefit of having a dedicated (and potentially much faster) string dtype by default sooner rather than later.
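
To make the first point of contention concrete, the difference in result dtypes can be seen with the dtypes that already exist. This is a small sketch assuming pandas 2.x, where dtype="string" gives the NA-variant; object dtype stands in for the NaN semantics that the proposed default string dtype would follow.

```python
import pandas as pd

values = ["pandas", None, "str"]

# Object dtype: missing values follow NaN semantics, so .str.len() returns a
# float64 result containing NaN.
nan_lengths = pd.Series(values, dtype=object).str.len()

# Existing opt-in StringDtype: missing values are pd.NA, and the result is a
# nullable (masked) Int64 Series.
na_lengths = pd.Series(values, dtype="string").str.len()

print(nan_lengths.dtype)  # float64
print(na_lengths.dtype)   # Int64
```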

May 13 '24 13:05 jorisvandenbossche

@Dr-Irv thanks for the detailed review! Merged most of the suggestions, and answering the remaining comments

May 13 '24 15:05 jorisvandenbossche

If one was to argue that the users that benefit most from a dedicated string dtype are already aware of the "experimental" string dtype that has been pyarrow backed for almost 3 years, then one could also argue that the benefits to users and the urgency are probably not what this PDEP initially proposed. (It would only be a breaking change for those that do already benefit.)

Also, given the evolution of this discussion and updates to this PDEP, I think that my comment in #57073 about not being ready has probably gained some traction since this proposal now suggests further releases before 3.0.

However, I was also under the impression that the intention of the solution as initially proposed was to help get the 3.0 release unblocked if the PyArrow dependency requirement was dropped. If we approve this PDEP with the modifications and that results in pandas 3.0 being released much later than planned, we steadily move closer to the point where the NumPy native string solution may become a usable solution (and can use the time to enhance the pandas I/O and 2d EA interface to better support it).

Assuming that I could safely say that nobody really likes any fallback solution for either performance, consistency, complexity, confusion or maintenance reasons then we should probably include in this PDEP the deprecation plan for the fallback. As I mentioned in https://github.com/pandas-dev/pandas/issues/57073#issuecomment-2092608825, which is expected to happen first, having PyArrow as a required dependency or having the minimum version of NumPy as 2.0?

May 13 '24 16:05 simonjayhawkins

I'm with Joris pretty much across the board on this. I'm pretty sure @phofl will be too.

Agreed with @jbrockmendel

This would be an acceptable compromise of not requiring PyArrow to me

May 13 '24 19:05 phofl

Thanks all for the feedback. Pushed another update with minor text updates addressing some comments, and specifically added the suggestion to add a capitalized "String" alias to make the change for users that want to keep using the NA-variant smaller (dtype="string" to dtype="String" instead of dtype=pd.StringDtype(na_value=pd.NA)), and that indeed makes it consistent with how we capitalize the string aliases for other nullable dtypes as well at the moment.

Happy to discuss nullable dtypes by default for the 2025 major release. Though in particular I'm curious about your (Joris') thoughts on whether you'd eventually be happy making PyArrow required if PyArrow dtypes were the default wherever possible

@MarcoGorelli Interesting question, but let's leave that for another thread ;) (this one is already long enough)

@simonjayhawkins summary of my response to your comment (https://github.com/pandas-dev/pandas/pull/58551#issuecomment-2108192398):

I don't think there is good reason to believe most users (that would benefit from it) already use the string dtype, especially not the pyarrow-backed version.
I don't think we are even considering making numpy 2.0 a requirement for pandas 3.0, so any more concrete discussion related to that is out of scope for this PDEP (see also my answer to Kevin's comment from 2 weeks ago: https://github.com/pandas-dev/pandas/pull/58551#discussion_r1589427303)

Will put my more detailed response I started to write up in a collapsed section, to reduce the wall of text a bit when scrolling through this PR.

If one was to argue that the users that benefit most from a dedicated string dtype are already aware of the "experimental" string dtype that has been pyarrow backed for almost 3 years

I don't think this would be true. I also don't have any concrete data to back this up, but I would argue that most likely a majority of the users that would benefit from a dedicated string dtype are not already using it (and especially not the pyarrow-backed version, as you need to additionally opt in to that beyond dtype="string").

So I personally do think that this proposal will give a significant benefit for many users not already using the dtype.

Some partial data points: on StackOverflow, searching for "StringDtype" with tag pandas, there is no explicit usage of the "pyarrow" storage, but only of "string" or StringDtype(). Searching explicitly for "string[pyarrow]" does not give that many relevant results. And when searching generally for "string dtype", the most relevant or most viewed questions again don't mention pyarrow. (it might be interesting to do a search on eg kaggle notebooks)

However, I was also under the impression that the intention of the solution as initially proposed was to help get the 3.0 release unblocked if the PyArrow dependency requirement was dropped.

That is still the intention.

If we approve this PDEP with the modifications and that results in pandas 3.0 being released much later than planned

Pandas 3.0 will be released later than planned regardless, as it was originally planned for last month. But I still think that with the current modifications (i.e. mostly adding a keyword to the StringDtype constructor), we can have an RC for 3.0 by the end of June, if there is agreement on the proposed changes.

move closer to the point where the NumPy native string solution may become a usable solution

Regardless of pandas 3.0 being released now or in a few months, I don't think we are ready (or should consider) to require numpy 2.0 for pandas 3.0. If we want to add another variant of the dtype using numpy's 2.0 string dtype under the hood, that is perfectly fine, and this PDEP does not preclude any of that. Neither does it say we should not, at some point after pandas 3.0, switch the object-dtype based fallback to the numpy-2.0 string dtype based fallback (when we are ready to require numpy 2.0 and feel the numpy string dtype is stable enough). See also my answer to Kevin from last week: https://github.com/pandas-dev/pandas/pull/58551#discussion_r1589427303

Assuming that I could safely say that nobody really likes any fallback solution for either performance, consistency, complexity, confusion or maintenance reasons then we should probably include in this PDEP the deprecation plan for the fallback.

Whether the fallback for the pyarrow-based string dtype uses numpy object dtype or numpy-2.0 string dtype, that does not change anything regarding consistency or confusion for users (it only helps for performance). And given the object-dtype based implementation has existed for many years, I would say that in terms of code complexity and short term maintenance for us, the object-dtype based one is far easier than a new dtype using numpy-2.0 strings under the hood.

which is expected to happen first, having PyArrow as a required dependency or having the minimum version of NumPy as 2.0?

This PDEP does not care about that. The only thing the PDEP describes is a proposal about if we want to have a default string dtype for pandas 3.0 on the relatively short term (like, this year) and do not (yet) want to require pyarrow as a hard dependency (for pandas 3.0). If we want to require pyarrow as a hard dependency or numpy 2.0 in a later pandas release, that is totally up for debate, but out of scope for this PDEP.

May 20 '24 13:05 jorisvandenbossche

For the backwards compatibility for existing users of dtype="string", what would people think about providing a specific option that those users can enable to keep using the NA-variant of the string dtype by default? (something that came up in the dev call today)

Assume this is something like pd.options.mode.use_string_dtype_with_na = True (exact name to bike-shed, but first want to see if this would be helpful). Users that are already using the current StringDtype with NA and that would like to keep using that instead of the future default, could enable this option without otherwise having to update their code to specifically choose this variant (i.e. don't have to update their code to change dtype="string" to dtype="String").

That would make updating to pandas 3.0 for those users potentially easier. Although, note that this option would also not preserve the current behaviour exactly. Right now, such users only get StringDtype where explicitly asking for it (by specifying dtype="string", or by calling df.convert_dtypes(), or specifying dtype_backend in an IO method), while in pandas 3.0 they would also get the NA-variant everywhere we infer a string dtype (and where you currently still get object dtype).

On the other hand, that is yet another (and very specific) option, while the code change required to go from dtype="string" to dtype="String" is also not that big.

This also came up before in this thread, see https://github.com/pandas-dev/pandas/pull/58551#discussion_r1600405956, but in the context of providing a general option to opt in to NA-dtypes (not specific to the string dtype). While we certainly will want such a global option at some point, the main problem is that we are not ready to provide a full implementation of such an option right now (e.g. the constructors are not yet set up to infer the nullable numeric/bool dtypes, and also not all dtypes would then follow this option), and it will probably be confusing to have a publicized option that is only partially implemented.
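
The effect of such an option, including the caveat that it would not reproduce 2.x behaviour exactly, can be sketched with a toy model. Nothing here is pandas code: `string_dtype_for` is a hypothetical helper, the option name is the placeholder from this comment, and the returned strings are just labels.

```python
def string_dtype_for(context, pandas_version, use_string_dtype_with_na=False):
    """Toy model of which dtype a user ends up with.

    context: "explicit" (the user wrote dtype="string") or "inferred"
    (pandas infers the dtype from string data).
    """
    if pandas_version == "2.x":
        # Today: the NA-variant only when asked for, object dtype otherwise.
        if context == "explicit":
            return "StringDtype(na_value=pd.NA)"
        return "object"
    # pandas 3.0 under the proposal as it stood at this point in the thread.
    if use_string_dtype_with_na:
        # Option on: the NA-variant everywhere, including inference.
        return "StringDtype(na_value=pd.NA)"
    return "StringDtype(na_value=np.nan)"


for context in ("explicit", "inferred"):
    print(
        context,
        string_dtype_for(context, "2.x"),
        string_dtype_for(context, "3.0", use_string_dtype_with_na=True),
    )
```

The "inferred" row is where the option would differ from 2.x: object dtype today, the NA-variant with the option enabled.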

May 22 '24 20:05 jorisvandenbossche

I don't think an option with partially implemented features like that is a good idea. Our type system in its current state is an endless array of aliases / types with diverging behavior; I don't think adding more aliases and options makes things any easier, just kind of shuffles the problems around.

I still generally don't understand why we feel the need to break backwards compatibility with dtype="string"; this PDEP can achieve its objective without doing that and introducing new aliases if it just repurposes the dtype=str constructor. We already have a difference between dtype=str and dtype="string" today, I don't see the value in adding yet another dtype="String" while changing the behavior of dtype="string".

May 23 '24 18:05 WillAyd

I don't think an option with partially implemented features like that is a good idea.

The idea behind the option is that if you have 2.x code that uses pd.StringDtype() or "string", then 3.0 code with that option turned on would work as it does today, i.e. it would be as "partially implemented" as it is today and you'd only need a one-line change to existing code to retain behavior.

May 24 '24 11:05 Dr-Irv

this PDEP can achieve its objective without doing that and introducing new aliases if it just repurposes the dtype=str constructor.

We can indeed document and use dtype=str as the way to specify the default string dtype (we should allow this anyway, also under the current proposal), and that would indeed reduce the backwards incompatible changes quite a bit. The reasons why using only dtype=str, and not "string", for the default dtype would not be my preferred solution:

If we use dtype=str | dtype="str" for the default NaN-variant and keep dtype="string" for the NA-variant, then I think also the string representation of the dtype should be "str" (e.g. what we show in the output of df.dtypes or the repr of a series). Because otherwise if users see string as the dtype description, they would rightfully expect they can do dtype="string" to get that dtype.
That means that users can see both "str" and "string" as dtype descriptions, and personally I think that explaining "string" vs "String" is easier than "str" vs "string" (because it is more consistent with "int64" vs "Int64")
For people that use the explicit constructor and not the string alias, i.e. pd.StringDtype(), this would still be backwards incompatible, because without any arguments it should IMO still give the default dtype. I suppose dtype="string" is used quite a bit more than dtype=pd.StringDtype(), so only keeping dtype="string" backwards compatible would already help a lot, but it is one more inconsistency to explain.

To be honest, I don't think these are necessarily very strong arguments that I am giving; they come down more to preferences (I think "string" is the better name, so would prefer to have that for the default dtype that most users will see), at which point the trade-off with the back compat issues is maybe harder to justify. So if others would also prefer or be on board with going for dtype="str" for the default dtype, I could certainly go along with that as well.

May 24 '24 14:05 jorisvandenbossche

If we use dtype=str | dtype="str" for the default NaN-variant and keep dtype="string" for the NA-variant, then I think also the string representation of the dtype should be "str" (e.g. what we show in the output of df.dtypes or the repr of a series). Because otherwise if users see string as the dtype description, they would rightfully expect they can do dtype="string" to get that dtype

Definitely agree on this point - in our current release I find it confusing that the repr shows dtype="string" but .dtype returns "string[pyarrow_numpy]".

That means that users can see both "str" and "string" as dtype descriptions, and personally I think that explaining "string" vs "String" is easier than "str" vs "string" (because it is more consistent with "int64" vs "Int64")

Definitely understand this argument, but in the current PDEP design there is an inconsistency anyway between dtype=int and dtype=float actually returning int/float types whereas dtype=str does not return a string, and then this PDEP also breaks the pd.IntDtype(), pd.FloatDtype(), pd.StringDtype() NA consistency.

With respect to capitalization, the semantics of that are not going to scale well over time so I'm hesitant to put more overloaded meaning into that. Particularly as we think about adding first class support for aggregate types - should List[string] work the same as List[String] or should the former not be allowed? Are we going to bother with a list[string] or list[String] at all?

For people that use the explicit constructor and not the string alias, i.e. pd.StringDtype(), this would still be backwards incompatible, because without any arguments it should IMO still give the default dtype. I suppose dtype="string" is used quite a bit more than dtype=pd.StringDtype(), so only keeping dtype="string" backwards compatible would already help a lot, but it is one more inconsistency to explain.

I was still hoping that we wouldn't change the pd.StringDtype() constructor either - is that a hard requirement?

May 24 '24 14:05 WillAyd

Definitely agree on this point - in our current release I find it confusing that the repr shows dtype="string" but .dtype returns "string[pyarrow_numpy]".

The original thinking here was to make code portable. A user would write out their string data (either with the PyArrow or NumPy object backend) with the dtype string, and it could be read in whether the data receiver had PyArrow installed or not.

So this was in keeping with the idea that the PyArrow backend was an implementation detail and that the API and behavior of the object-backed and PyArrow-backed string arrays should be identical and interchangeable.

However, with the advent of the ArrowExtensionArray using string[pyarrow] as the repr (to be consistent with the other Arrow types) this now adds to the confusion.

May 31 '24 12:05 simonjayhawkins

I was still hoping that we wouldn't change the pd.StringDtype() constructor either - is that a hard requirement?

My thinking here was that if we provide a StringDtype() constructor that is used for the default dtype, then the "default" call to it (without any arguments) should ideally give you the default dtype. Of course, we could just not document pd.StringDtype() at all for the default dtype (and only point users to dtype=str or dtype="str" for specifying the default string dtype), and keep pd.StringDtype() (as it is documented) mainly for opt-in NA-variant of the dtype. In practice we would still need pd.StringDtype(storage="python"|"pyarrow", na_value=np.nan) for testing, but if it is only for testing, it is maybe fine that those arguments are not the default (although not ideal, because it will leak into user code at some point I think).

What are other people's thoughts on using "str" and "string" instead of "string" and "String" as the string aliases for the dtype (for the NaN and NA variant, respectively) ?
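
For reference, the two naming schemes under discussion can be written down side by side. This is only a summary of the alternatives named in this thread, expressed as plain Python data; `ALIAS_SCHEMES` and `breaks_existing_string_users` are illustrative names, not pandas API.

```python
# scheme name -> which string alias maps to which variant of the dtype
ALIAS_SCHEMES = {
    "string/String": {"NaN-variant": "string", "NA-variant": "String"},
    "str/string": {"NaN-variant": "str", "NA-variant": "string"},
}


def breaks_existing_string_users(scheme):
    """True if dtype="string" stops meaning the NA-variant under the scheme.

    In pandas 2.x, dtype="string" gives the NA-variant.
    """
    return ALIAS_SCHEMES[scheme]["NA-variant"] != "string"


for name in ALIAS_SCHEMES:
    print(name, breaks_existing_string_users(name))
# string/String True
# str/string False
```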

Jun 04 '24 16:06 jorisvandenbossche

What are other people's thoughts on using "str" and "string" instead of "string" and "String" as the string aliases for the dtype (for the NaN and NA variant, respectively) ?

"str"/"string" seems much worse confusion-wise than "string"/"String".

Jun 04 '24 17:06 jbrockmendel

Just to clarify, I only ever suggested for dtype=str to map to the new type, since that is an existing valid construction that has np.nan nullability semantics. Changing dtype=str improves existing code without breaking dtype="string", and still, from an end user's perspective, signals intent that they want a string data type. Continuing to map that to object when we have a more proper string implementation doesn't make sense to me. I assume dtype="str" is a far less common construction so no strong opinion on that, but I would think that also signals intent that you don't want object.

Our type aliases are already a mess...the more we change the worse off we will be. Having to teach someone that "Int", "Float" and "string" provide NA before 3.x, but "Int", "Float" and "String" are required during the 3.x series, possibly reverting back to the old behavior in a future release is really confusing. Then to say IntDtype, FloatDtype, and StringDtype provided NA behavior up until 3.x but then StringDtype() changed back to np.nan doubles down on that.

Going through all this API churn is not value added to users, and is super confusing

Jun 04 '24 18:06 WillAyd

Going through all this API churn is not value added to users, and is super confusing

I agree with Will. Based on the above discussion, here's a proposal that I think is a compromise, and which probably has warts that people will shut down:

1. Keep dtype="string" the same as in 2.x, i.e., using pd.NA. Add in dtype="String" for symmetry with "int"/"Int", "float"/"Float", but it is equivalent to dtype="string" today. Announce that "string" and "String" will be deprecated in a future release (or we could just skip creating dtype="String")
2. Create dtype="str" and dtype=str to use np.nan semantics with pyarrow if installed, otherwise python strings if not.
3. Create dtype="Str" to use pd.NA, and uses pyarrow if installed, otherwise python strings if not.
4. Keep StringDtype() as it is today - no change in the API, but announce it will be deprecated in a future release.
5. Create StrDtype() that has all the controls for specifying pyarrow, python, np.nan and pd.NA

What this results in is an API that has any 2.x code that works as it does today - no changes needed on users for 3.0. BUT they are given a deprecation warning indicating that they have to use "str" or StrDtype() in a future release.

Anyone wanting the new behavior uses "str"/str or StrDtype().

Note that the naming of StrDtype() and using str matches IntDtype() and int, FloatDtype() and float, i.e., we are using the python name for the type.

Any "default" behavior (e.g., in I/O readers, inferring dtypes) would use "str" not "string"

Net result is that we remove the word "string" from the current vocabulary, and replace it with "str" because we deprecate the word "string"
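
The alias part of this proposal can be summarised as a small lookup table. This is a sketch of the proposal as read from the list above, not pandas behaviour; `PROPOSED_ALIASES` and `migrate` are hypothetical names, and the suggestion that "string" users would move to "Str" is an inference from the NA semantics being preserved.

```python
# alias -> (missing value sentinel, scheduled for deprecation?)
PROPOSED_ALIASES = {
    "string": ("pd.NA", True),
    "String": ("pd.NA", True),
    "str": ("np.nan", False),
    "Str": ("pd.NA", False),
}


def migrate(alias):
    """Return a non-deprecated alias with the same missing-value semantics."""
    na_value, deprecated = PROPOSED_ALIASES[alias]
    if not deprecated:
        return alias
    for candidate, (cand_na, cand_deprecated) in PROPOSED_ALIASES.items():
        if cand_na == na_value and not cand_deprecated:
            return candidate
    raise KeyError(alias)


print(migrate("string"))  # Str
print(migrate("str"))     # str
```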

Jun 04 '24 21:06 Dr-Irv

I don't think we need to move long term to use "str" instead of "string", or at least we don't have to decide that right now. So if we go for "str" as the string alias for the NaN-variant of the dtype right now, and keep "string" for the NA-variant, then at the point in the future where the NA-variant might become the default, we can still decide then whether we want to keep using "string" for the dtype repr (and make "str" just an alias of that) or the other way around (use "str" for the repr and make "string" an alias).

3. Create dtype="Str" to use pd.NA, and uses pyarrow if installed, otherwise python strings if not.

I don't think there is a need to already introduce another name. People have been using "string" and they can continue to do that for now if they want the NA variant? Why would we cause the code churn of using a different name?

4. Keep StringDtype() as it is today - no change in the API, but announce it will be deprecated in a future release.

Given you can't yet (or should not yet) act on that deprecation, and we are also not yet certain about what the transition to NA dtypes will look like exactly, I am not sure there is a good reason to already make announcements related to that.

I am starting to get convinced that using "str" instead of "string" for the new default dtype would be a good idea to help the backwards compatibility story, but then I would not go any further than that and just leave it at those two names (and not add other new aliases like "String" or "Str")

Jun 04 '24 22:06 jorisvandenbossche

Continuing to map (dtype=str) that to object when we have a more proper string implementation doesn't make sense to me.

To be clear, even if we would eventually go with dtype="string" for the default dtype anyways (i.e. the current state of the PDEP text in this PR), I think we should map dtype=str to mean the default string dtype, instead of object dtype. Because dtype=str currently indeed means "give me string data" (just using object dtype, because that's how it works), and we should keep that meaning but using the proper dtype when it is available. The same is probably true for any other alias we currently map to "ensure string data in object dtype"? So that also includes things like "str", "U", np.str_. This is essentially just the same as we map dtype=int to the default int64 dtype (and not to object dtype with python integers)

(this is not actually implemented right now like that when enabling the future behaviour with pd.options.future.infer_string = True, but I would consider that as a missing piece in the implementation and had been planning to open an issue/PR for it)
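
The mapping described here can be sketched as a tiny alias resolver. This is a toy model rather than pandas' dtype parsing: `resolve_dtype_alias` is a hypothetical function, the `infer_string` flag mimics pd.options.future.infer_string, and np.str_ is left out to keep the sketch dependency-free.

```python
def resolve_dtype_alias(dtype, infer_string=True):
    """Resolve a user-specified dtype to a label for the resulting dtype."""
    # Aliases that currently mean "ensure string data in object dtype".
    string_aliases = {str, "str", "U"}
    if dtype in string_aliases:
        # With the new default enabled these would give the default string
        # dtype; without it they keep giving object dtype.
        return "default string dtype" if infer_string else "object"
    if dtype in {int, "int"}:
        # Analogy from the comment: dtype=int maps to the default int64 dtype,
        # not to object dtype holding Python integers.
        return "int64"
    return dtype


print(resolve_dtype_alias(str))                      # default string dtype
print(resolve_dtype_alias(str, infer_string=False))  # object
print(resolve_dtype_alias(int))                      # int64
```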

Jun 04 '24 22:06 jorisvandenbossche

I am starting to get convinced that using "str" instead of "string" for the new default dtype would be a good idea to help the backwards compatibility story, but then I would not go any further than that and just leave it at those two names (and not add other new aliases like "String" or "Str")

So would dtype="string" keep the current 2.x behavior? If so, why not use pd.StrDtype() to represent the new 3.x behavior and let pd.StringDtype() represent the old 2.x behavior? That's basically what I'm suggesting, which would mean all 2.x code would still work, and we deprecate pd.StringDtype() to force people to change if they are using that class.

In the future we can make dtype="string" and dtype=str and dtype="str" mean the same thing (strings with pd.NA).

Jun 04 '24 22:06 Dr-Irv

I think pd.StrDtype() might end up in a no man's land. All pandas types that follow that construction pattern today use NA semantics, and I don't think we are going to introduce an equivalent constructor for the types that would be returned from any pd.StrDtype operations

Jun 04 '24 22:06 WillAyd

My read here is that Joris and Irv are trying really hard to find a compromise that Will can get on board with and hitting a complete brick wall. Is there anyone on the fence for whom this is helpful?

Jun 04 '24 23:06 jbrockmendel

I think pd.StrDtype() might end up in a no man's land. All pandas types that follow that construction pattern today use NA semantics, and I don't think we are going to introduce an equivalent constructor for the types that would be returned from any pd.StrDtype operations

In my proposal, this is temporary, at least for one release.

With pd.StrDtype(), you'd have the arguments storage and na_value that would allow you to get the equivalent of what pd.StringDtype() does today, i.e., pd.StrDtype(storage="python", na_value=pd.NA) is the same as pd.StringDtype(). But pd.StrDtype() would be equivalent to pd.StrDtype(storage = "python | pyarrow", na_value=np.nan)

So the default behavior of pd.StrDtype() would give you np.nan semantics, but you could still get the equivalent of what pd.StringDtype() does today to remove the deprecation warning by calling pd.StrDtype(na_value=pd.NA). If we then deprecate pd.StringDtype(), then in the future we can change the default behavior of pd.StrDtype() to be na_value=pd.NA whenever we're ready to make everything use pd.NA for missing values.
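
The shape of the constructor being proposed here can be sketched with a small stand-in class. `StrDtype` below is a toy model of the proposed pd.StrDtype, which does not exist in pandas; the missing value sentinels are represented by the strings "np.nan" and "pd.NA" so that the sketch has no dependencies.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StrDtype:
    """Toy stand-in for the proposed pd.StrDtype (not a real pandas class)."""

    # None means: pyarrow if installed, otherwise python strings.
    storage: Optional[str] = None
    # Proposed default for 3.0; passing "pd.NA" opts in to NA semantics.
    na_value: str = "np.nan"

    def uses_na_semantics(self) -> bool:
        return self.na_value == "pd.NA"


default = StrDtype()
like_2x_string_dtype = StrDtype(storage="python", na_value="pd.NA")

print(default.uses_na_semantics())               # False
print(like_2x_string_dtype.uses_na_semantics())  # True
```

Code that passes na_value explicitly would keep its behaviour if the default were changed later, which is the migration path this comment describes.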

Jun 05 '24 00:06 Dr-Irv

Do you mind expanding on why you think we would deprecate pd.StringDtype at some point? I am under the impression this PDEP would still offer pd.StringDtype(na_value=pd.NA|np.nan) but the default na_value would remain pd.NA. If we wanted pd.StrDtype I assumed that would just be an alias for pd.StringDtype(na_value=np.nan)

Jun 05 '24 04:06 WillAyd

So would dtype="string" keep the current behavior 2.x? If so, why not use pd.StrDtype() to represent the new 3.x behavior and let pd.StringDtype() represent the old 2.x behavior? That's basically what I'm suggesting

If we choose "str" for the new default dtype, then yes dtype="string" would keep the current behaviour. While that's maybe the core of what you were suggesting, you were also suggesting a lot of other things on top of that (adding dtype="Str" as an alias for dtype="string", deprecating StringDtype), and that's what I was responding to.

Jun 05 '24 13:06 jorisvandenbossche

Do you mind expanding on why you think we would deprecate pd.StringDtype at some point? I am under the impression this PDEP would still offer pd.StringDtype(na_value=pd.NA|np.nan) but the default na_value would remain pd.NA. If we wanted pd.StrDtype I assumed that would just be an alias for pd.StringDtype(na_value=np.nan)

I'm thinking of the future state. Let's assume that we go for pd.NA semantics across the board in pandas 4.0. We'd then have pd.IntDtype(), pd.FloatDtype() and pd.StrDtype(), all defaulting to using pd.NA for missing values. There would be no need for pd.StringDtype() because pd.StrDtype() would have arguments that do the same thing.

If in pandas 3.0, we tell people who are using pd.StringDtype() that it is being deprecated, they migrate their code to use pd.StrDtype(na_value=pd.NA). That code will have the same behavior in 3.0 as 4.0. The difference in 3.0 vs. 4.0 in pd.StrDtype() is the default value of na_value changing from np.nan to pd.NA.

So with 3.0, any code that uses pd.StringDtype() still works, with a deprecation warning, and there is a migration path to a future state that uses pd.NA everywhere. And if we decide not to make pd.NA the default everywhere, people who start using pd.StrDtype(na_value=pd.NA) will have working code as it works today in pandas 2.x.

In essence, pd.StrDtype() is the "new string type", and pd.StringDtype() is the "old string type", and there is a migration path from old to new that is pretty clean, IMHO.

Jun 05 '24 13:06 Dr-Irv

"str"/"string" seems much worse confusion-wise than "string"/"String".

@jbrockmendel on the other hand, we do have a similar naming situation with "bool" vs "boolean" (although in this case "bool" is an actual numpy dtype with no missing value support, but it's similar in "default dtype vs opt-in NA-variant")

I certainly prefer "string" as the dtype name (long term), but in the end I think a newcomer not aware of the differences can be confused about either of those options, while both are "explainable" (and we will need to do a good job doing that in the docs).

Jun 05 '24 13:06 jorisvandenbossche