pandas Initial Backport of string changes for 2.3 release

trafficstars

Aug 14 '24 16:08 WillAyd

I'm unsure about the failing tests. Seems like we might have behavior in our string types that works off of what is done in https://github.com/pandas-dev/pandas/pull/57361 but guessing we don't want to backport that PR to 2.3?

Aug 15 '24 19:08 WillAyd

A lot of failures seem related to the handling of to_numeric with errors="ignore", which was removed for the 3.0 release.

Prior to https://github.com/pandas-dev/pandas/pull/57361, this is what a to_numeric call would yield:

>>> pd.to_numeric('-47393996303418497800', errors="ignore")
'-47393996303418497800'

with all of the backport, it looks like the new behavior casts it to a float:

>>> pd.to_numeric('-47393996303418497800', errors="ignore")
-4.73939963e+19

Aug 21 '24 20:08 WillAyd

@mroeschke could I get some of your help on the test failures? A lot of them look related to NumPy and have to do with how it handles overflows, casting, and some syntax things with numexpr.

Any chance you saw those on main and know what the fix is? https://github.com/pandas-dev/pandas/actions/runs/10514684927/job/29133089833?pr=59513#step:8:2013

Aug 22 '24 20:08 WillAyd

@WillAyd looking at the failures in the first build, i think #58484 may need to be backported

Aug 22 '24 20:08 jbrockmendel

I don't remember seeing failures related to eval but I'll dig a little bit more.

In addition to what @jbrockmendel mentioned, you'll need https://github.com/pandas-dev/pandas/pull/59545 and the nomkl install part of https://github.com/pandas-dev/pandas/pull/59553

Aug 22 '24 21:08 mroeschke

Getting close here

___________________ TestBasic.test_write_index[fastparquet] ____________________
[gw2] linux -- Python 3.12.5 /home/runner/micromamba/envs/test/bin/python3.12
[XPASS(strict)] fastparquet write into index

Aug 26 '24 16:08 jbrockmendel

I don't remember seeing failures related to eval but I'll dig a little bit more.

Already reported in May (und unfortunately not picked up by anyone) here: #58548. It tracks the failures in main and sees a fix somewhere between a3e751c6b4 and 236d89b856.

Fixed by backport here: #59535

Aug 27 '24 18:08 bnavigator

It tracks the failures in main and sees a fix somewhere between a3e751c and 236d89b.

I don't believe that issue was directly solved, but it was at least removed from the test suite as part of changes in https://github.com/pandas-dev/pandas/pull/56709/files#diff-f145e441b5820d235a78589ee9ee6c9c49fea0de6ca659a972b61b6aa29f9df8 that replaced the fixture of:

@pytest.mark.parametrize("dtype", [np.float32, np.float64])

with a fixture that instead uses:

[float, "float32", "float64"]

Aug 27 '24 19:08 WillAyd

@WillAyd

Mind if I push some fixes for the non string dtype builds to your branch? I finally got to taking a look at this PR and I have a couple of fixes locally.

Sep 07 '24 04:09 lithomas1

Sure by all means. FWIW I've been trying to keep the order of cherry picked commits maintained and put any custom patches at the end

Sep 07 '24 13:09 WillAyd

Sure by all means. FWIW I've been trying to keep the order of cherry picked commits maintained and put any custom patches at the end

Ah cool.

I'll first push to another branch first just do test my changes, but if it passes CI, I'll reintegrate here (as I don't want to break things by force pushing immediately).

Seems to be around 200 failures to go on the string dtype builds (after clearing some massively parametrized failing tests). We're close :)

EDIT 2: Around 60 failures each for both future strings builds now.

Sep 08 '24 19:09 lithomas1

@WillAyd - thought I had set upstream to my repo, but was apparently not. Last commit undoes the previous two.

Sep 15 '24 12:09 rhshadrach

For many of the now failing builds, the ones I checked have one failing test due to a new release of pytz. https://github.com/pandas-dev/pandas/pull/59016 might resolve, but maybe we just want to xfail the test?

Not sure I understand the overall strategy here: for the many xfails, it's not clear to me which we want to fix before merging if any.

Sep 15 '24 14:09 rhshadrach

(tagging build so we get some CI signal from the wheel builders as well)

Sep 20 '24 18:09 lithomas1

@WillAyd looking into some of the failures, and pushed a few commits:

first commits is fixing up a backported PR's commit (feel free to merge that with the actual commit in a next rebase)
second and fourth (coming) commit are general fixes to tests (for tests that for some reason (e.g. removed) didn't need a fix on main, but need a fix here). Feel free to combine those all in a single commit
third commit is backporting https://github.com/pandas-dev/pandas/pull/59487, which seemed to be missing here. The actual code change in that PR does not apply on this branch, so in practice this PR has become a test-only change. But those test changes were still needed to get a bunch of tests passing

With those changes, the number of failures went from 916 to around 140

Sep 20 '24 22:09 jorisvandenbossche

Seeing several of the CoW related tests failing, that raises the question if we want to be testing the string dtype with CoW enabled as well or not. In general, it should of course work with CoW enabled or disabled (so people can test both independently), but I am not sure that is worth the extra effort to also test both explicitly ..

Just for testing, I pushed a commit that enables CoW on the string CI build, and that gave a net reduction of 49 failures to 32 failures.
(so it is also not the biggest difference)

Sep 23 '24 14:09 jorisvandenbossche

FYI, some of the remaining failures are because of the "FutureWarning: Dtype inference on a pandas object (Series, Index, ExtensionArray) is deprecated ..." warning being triggered.
@lithomas1 had a commit related to that in https://github.com/lithomas1/pandas/pull/46/commits/01b5d26c559dfddc8a4fd30ddf89d579a336c594, but I am not entirely sure that is the correct fix. I think generally we should still raise the warning also for object -> str conversion, because the warning is still relevant for users (with pandas 3.0 we will not infer str from object dtype when the constructor argument is a pandas object). So I do think that those failures require a bit of investigation for why they are triggering that warning and if something needs to be fixed in the code to avoid triggering it.

Sep 23 '24 14:09 jorisvandenbossche

Looks to be about 10 failures with the message:

FutureWarning: <class 'pandas.core.arrays.string_.StringArrayNumpySemantics'>._reduce will require a `keepdims` parameter in the future

I think mostly stemming from calling sum on an array of strings. @jorisvandenbossche do you think we should just catch these warnings for now or is there something in the branch we'd like to implement?

Sep 24 '24 00:09 WillAyd

@WillAyd you can probably use https://github.com/lithomas1/pandas/pull/46/commits/277afe309d0a64cd9f0db9d40df387cbb71e4114 for the keepdims issue

Sep 24 '24 16:09 jorisvandenbossche

https://github.com/pandas-dev/pandas/pull/59893 might need backporting here for the ARM job timezone failures

Sep 26 '24 00:09 mroeschke

Green! ;)

I will do a cleanup of the commits a bit, before (rebase) merging

Oct 03 '24 20:10 jorisvandenbossche

pandas pandas copied to clipboard

Initial Backport of string changes for 2.3 release

pandas
pandas copied to clipboard