PERF: Bypass chunking/validation logic in StringDtype.__from_arrow__
Instead of converting each chunk to a StringArray (after casting it to a numpy array) and then concatenating, use pyarrow to concatenate the chunks and convert the result to numpy in one step.
Finally, bypass the validation logic (unneeded, since the values were already validated on parquet write) by initializing the array through NDArrayBacked instead of the StringArray constructor.
This removes most of the performance overhead seen in #47345. There is still a slight overhead compared to object string arrays because of the None -> NA conversion. Leaving that conversion out still produced NA values in the example I gave (and would actually make this faster than the object case), but that behaviour is not consistent, so the conversion stays in.
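For illustration, a minimal sketch of the approach (not the exact diff in this PR; the helper name is made up, and NDArrayBacked is an internal pandas class, so details may differ between versions):

```python
import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.compute as pc

from pandas._libs.arrays import NDArrayBacked  # internal pandas API
from pandas.arrays import StringArray


def from_arrow_sketch(array) -> StringArray:
    """Convert a pyarrow string (Chunked)Array to a StringArray in one pass."""
    if isinstance(array, pa.ChunkedArray):
        # Let pyarrow concatenate the chunks (pa.concat_arrays needs pyarrow >= 3.0)
        # instead of converting each chunk to numpy and concatenating those.
        if array.num_chunks == 0:
            array = pa.array([], type=pa.string())
        else:
            array = pa.concat_arrays(array.chunks)
    # Single conversion to an object-dtype numpy array.
    values = np.asarray(array.to_numpy(zero_copy_only=False), dtype=object)
    # pyarrow nulls come back as None; StringArray uses pd.NA.
    mask = pc.is_null(array).to_numpy(zero_copy_only=False)
    values[mask] = pd.NA
    # Initialize through NDArrayBacked to skip StringArray's per-value validation.
    result = StringArray.__new__(StringArray)
    NDArrayBacked.__init__(result, values, pd.StringDtype())
    return result
```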
- [x] closes #47345
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
Good point, I hadn't considered this. No - afaict this code requires pyarrow 3.0 (pyarrow.concat_arrays as well as array.to_numpy(zero_copy_only=False) were both introduced in 3.0), whereas https://pandas.pydata.org/pandas-docs/stable/getting_started/install.html states pyarrow 1.0.1 as the minimum version.
I understand this performance issue alone may not be sufficient reason to bump a version, but, in general, what would be the requirements for that? There's just 5 months between the two releases.
Edit: Looking at how pyarrow has been bumped in accordance with new pandas versions in the past, it feels like moving to pyarrow 3 for pandas 1.5 could be reasonable (current pyarrow is version 8).
Could you open an issue about bumping pyarrow? Then we can discuss it there and move forward from that.
Any way of doing this without requiring 3.0? Otherwise this would have to wait for a bit.
I think it's possible to implement something that's already a little better than what's in 1.4 without requiring pyarrow 3. However, it's probably wise to switch to the approach in this PR once pandas does require pyarrow 3. I could make another PR later this week, if that's not too late for this release - and this one could be kept open for 1.5.1.
Depends on the nature of the change; we don't backport anything big to a release candidate.
This one would have to wait for 1.6; we avoid performance changes on 1.5.x.
In that case, I think it's fine to just wait for 1.6 and make this change directly. One can work around the performance impact by using object strings until then.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
We just increased the minimum pyarrow version to 6.0, so we could finish this.
Excellent, I'll revisit this soon!
Edit: I recently found that pyarrow's to_pandas() method can be the bottleneck when loading large parquet files that are read as heavily chunked arrays. I think implementing similar logic (using pyarrow's own methods instead of concatenating lists of numpy arrays) for other dtypes might drastically improve read performance. Would it make sense to open a larger PR containing all those changes (if I can show improvements), or add those here?
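For context, a rough way to check whether the to_pandas() conversion, rather than the parquet read itself, dominates (the file path is just a placeholder):

```python
import time

import pyarrow.parquet as pq

t0 = time.perf_counter()
table = pq.read_table("large_strings.parquet")  # placeholder path
t1 = time.perf_counter()
df = table.to_pandas()  # conversion step that can dominate for heavily chunked columns
t2 = time.perf_counter()

print(f"read_table: {t1 - t0:.2f}s, to_pandas: {t2 - t1:.2f}s")
print(f"chunks in first column: {table.column(0).num_chunks}")
```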
> Would it make sense to open a larger PR containing all those changes (if I can show improvements), or add those here?
Smaller, single-topic PRs would be preferred.
I think this is ready then - I just updated the whatsnew entry; the code change stays the same.
I also briefly checked whether the same idea would improve performance for the other dtypes, but it didn't. There may be places where one could switch to pyarrow concatenation, but the ones I checked (integer/numerical) didn't yield performance improvements (and may add some memory overhead).
Can you merge main?
@timlod there is a merge conflict here, but since the rc is now cut, this would probably need the release note moved to 2.1.
Thanks for sticking with this @timlod