CoW: Use weakref callbacks to track dead references
- [x] closes #55245
- [x] Tests added and passed if fixing a bug or adding a new feature
- [x] All code checks passed.
- [x] Added type annotations to new arguments/methods/functions.
- [ ] Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.
This is an alternative solution to #55518. What is happening here:

- Uses the `weakref.ref` callback to count the number of dead references present in the `referenced_blocks` list (see the sketch below).
- Takes ~20% more time in the constructor because of the function declaration!
  - The function declaration can be moved into `add_reference` and `add_index_reference`, but that just moves the performance hit to those methods.
- Everything is O(1) now, including `has_reference` (`_clear_dead_references` is amortized to O(1)).
- If the dead references number more than 256 and make up over 50% of the list, they are pruned.
- The implication is that the slow, linear cleanup of dead references happens next to the GC calls, which I think is more intuitive behavior.

(also check other possibilities in #55008)
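To make the idea concrete, here is a minimal sketch of the mechanism. The attribute and method names mirror the ones discussed above (`referenced_blocks`, `add_reference`, `has_reference`, `_clear_dead_references`), but the class name `BlockRefsSketch` and the details are illustrative, not the actual pandas implementation:

```python
import weakref


class BlockRefsSketch:
    """Illustrative sketch only; not the actual pandas BlockValuesRefs."""

    def __init__(self, blk=None) -> None:
        self.referenced_blocks: list[weakref.ref] = []
        self._n_dead = 0  # dead references, counted by the callback

        # Declaring this closure here is the ~20% constructor overhead
        # mentioned above; moving it into add_reference just moves the cost.
        def _on_dead(ref: weakref.ref) -> None:
            self._n_dead += 1

        self._on_dead = _on_dead
        if blk is not None:
            self.add_reference(blk)

    def add_reference(self, blk) -> None:
        # The callback fires when `blk` is collected, so the dead count
        # stays accurate without ever scanning referenced_blocks.
        self.referenced_blocks.append(weakref.ref(blk, self._on_dead))
        self._clear_dead_references()

    def _clear_dead_references(self) -> None:
        # Prune only when dead refs exceed 256 AND over half the list,
        # so the linear rebuild is amortized to O(1) per insertion.
        if self._n_dead > 256 and self._n_dead * 2 > len(self.referenced_blocks):
            self.referenced_blocks = [
                ref for ref in self.referenced_blocks if ref() is not None
            ]
            self._n_dead = 0

    def has_reference(self) -> bool:
        # O(1): live references = total - dead. The tracked object itself
        # holds one reference, hence the "> 1".
        return len(self.referenced_blocks) - self._n_dead > 1
```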
Let me know if I should continue working on this solution.
Assuming this is an alternate solution to @phofl's PR in #55518
#55518 needs to be backported. This is something that we can consider for main and 2.2, but I want to test this more before we rely on it.
I just fixed the tests. Any thoughts about it @phofl @WillAyd?
Any news?
Personally I would be more partial to exploring https://github.com/pandas-dev/pandas/issues/55631 (IMO I think the mental model of using a set would be simpler)
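I haven't worked out exactly what #55631 proposes, but for illustration, a set-based tracker built on `weakref.WeakSet` might look roughly like this (a hypothetical sketch; the class name and details are mine, not from the issue):

```python
import weakref


class SetRefsSketch:
    """Hypothetical illustration of the set-based mental model."""

    def __init__(self) -> None:
        # Dead entries disappear from a WeakSet automatically, so no
        # explicit dead-reference bookkeeping is needed.
        self.referenced_blocks: weakref.WeakSet = weakref.WeakSet()

    def add_reference(self, blk) -> None:
        self.referenced_blocks.add(blk)

    def has_reference(self) -> bool:
        # The tracked object itself holds one entry, hence "> 1";
        # len() on a WeakSet is O(1).
        return len(self.referenced_blocks) > 1
```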
Can you run the `ctors.py` benchmarks?
Sorry for the delay. Here are the benchmarks:
| Change | Before [593fa855] | After [98addada] | Ratio | Benchmark (Parameter) |
|---|---|---|---|---|
| + | 5.74±0.2ms | 7.37±0.2ms | 1.29 | frame_ctor.FromArrays.time_frame_from_arrays_int |
| + | 7.65±0.08ms | 8.63±0.2ms | 1.13 | frame_ctor.FromArrays.time_frame_from_arrays_sparse |
| + | 1.43±0.07ms | 1.47±0.06ms | 1.03 | frame_ctor.FromRecords.time_frame_from_records_generator(1000) |
How does this correspond to the other timings in the issue?
> How does this correspond to the other timings in the issue?
Not sure if I fully understand what is being asked, but the ~20% increase comes from the overhead in the constructor. Where this PR does better is mainly `has_reference`, which becomes O(1) independent of the number of references. Insertions should also be a little faster because there are no periodic calls to the cleaning method, but that is probably not meaningful in any way.
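For illustration, here is how the O(1) check behaves, reusing the hypothetical `BlockRefsSketch` class from the earlier comment (again a sketch, not the real pandas API):

```python
import gc


class Block:  # stand-in for a pandas block
    pass


blk = Block()
refs = BlockRefsSketch(blk)  # the tracked block registers itself

other = Block()
refs.add_reference(other)
print(refs.has_reference())  # True: another block shares the values

del other
gc.collect()  # the weakref callback has fired; the dead count is updated
print(refs.has_reference())  # False, computed without scanning the list
```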
Updates?
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
Hey @phofl, what is the current status of this PR?
@phofl?
Hi, @phofl, can you give an update?