UnROOT.jl icon indicating copy to clipboard operation
UnROOT.jl copied to clipboard

Add a copy(::LazyBranch) method to prevent branch materialization on …

Open grasph opened this issue 2 years ago • 14 comments

For a LazyBranch br, Base.copy(br) was returning a vector instead of a LazyBranch and leading to uncessary read. This PR fix the issue.

grasph avatar Oct 13 '23 09:10 grasph

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 86.33%. Comparing base (799b290) to head (de28e08). Report is 24 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #286      +/-   ##
==========================================
- Coverage   87.82%   86.33%   -1.49%     
==========================================
  Files          18       18              
  Lines        2366     2445      +79     
==========================================
+ Hits         2078     2111      +33     
- Misses        288      334      +46     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar Oct 13 '23 09:10 codecov[bot]

I'm not so sure about this one. For example, copy(reinterpret(...)) also changes type and return a Vector(). I think it could be argued that current behavior is the correct one already

Moelf avatar Oct 13 '23 09:10 Moelf

Arrow.jl

julia> d = Arrow.Table("/tmp/a.feather");

julia> d.x #this is memory-mapped
100-element Arrow.Primitive{Int64, Vector{Int64}}:
...

julia> copy(d.x)
100-element Vector{Int64}:
...

Moelf avatar Oct 13 '23 09:10 Moelf

According the Base.copy documentation it must "create a shallow copy of x: the outer structure is copied, but not all internal values."

grasph avatar Oct 13 '23 09:10 grasph

but then you also have this from Julia itself being weird/inconsistent:

julia> a = rand(2,2)'
2×2 adjoint(::Matrix{Float64}) with eltype Float64:
 0.651229  0.952613
 0.723043  0.00959459

julia> copy(a)
2×2 Matrix{Float64}:
 0.651229  0.952613
 0.723043  0.00959459

julia> a = rand(2)'
1×2 adjoint(::Vector{Float64}) with eltype Float64:
 0.29677  0.177136

julia> copy(a)
1×2 adjoint(::Vector{Float64}) with eltype Float64:
 0.29677  0.177136

edit: it seems the adjoint of vector is a corner case because it's behavior is different from one-row matrix, but the intention here is to change type and return Array.

Let's hold off for a bit and I will ask other Julia folks on Slack about design/expectation. Potentially request a Documentation update

Moelf avatar Oct 13 '23 09:10 Moelf

another thing is I just realized we no longer need mutable struct LazyBranch, it can be immutable.

But after making it immutable it breaks CI and seems like allocation in certain operation is increased somehow.

If we can make it immutable then clearly copy should just be a no-op. Except in this case we need clarification from others regarding copy(<:AbstractVector).

Moelf avatar Oct 13 '23 10:10 Moelf

My idea was that the LazyBranch was conceptually immutable and this would avoid unnecessary materialization with DataFrame if copycols=false is not passed.

Looking at the DataFrames documentation, the behavior of Arrow.Table is to allow conversion of immutable to mutable columns: "[to] get mutable columns from an immutable input table (like Arrow.Table), pass copycols=true explicitly." Changing the immutable nature of the object is arguable, although here it has some practical use here. The issue is that copycols=true is the default, which can be very expansive in HEP use case.

The case of copy(A::Transpose) and copy(A::Adjoint) is special. According to the documentation it "eagerly evaluate[s] the lazy matrix transpose/adjoint.". For sure return an object of the same type as the original is not wrong, it is changing the type which is an exception.

grasph avatar Oct 13 '23 10:10 grasph

The issue is that copycols=true is the default, which can be very expansive in HEP use case

But it is what it is, user of DataFrame needs to know what's default behavior. The problem of overriding copy is what user should do if they indeed want copycols=true?

I agree with your argument but I don't know if this is overly intrusive and confusing (i.e. what if user wants a mutable copy of a column?)

Moelf avatar Oct 13 '23 11:10 Moelf

Concerning the method copy(<:AbstractVector), is is not documented. So I expect it can be considered as fallback implementation and should not be considered as providing the current behavior for all subtypes.

There is a Base.copymutable() function to turn an immutable into a mutable e.g, it turns a tuple to a vector.

grasph avatar Oct 13 '23 11:10 grasph

I got this comment from @vtjnash so maybe we should just return the original object, what do you think?

But maybe the consideration is that, iterating the copy() shouldn't change buffer of the original?

image

Moelf avatar Oct 13 '23 16:10 Moelf

I think both options w and w/o copy of the buffer are valid. Independent buffer should provide better performance in case of concurrent access on two copies as it will prevent buffer invalidation, while no copy option minimizes memory allocations.

About this PR in general, since it is not clear what the best option is, I propose to:

  • postpone the decision to specialize the copy function until we have a better view of the use cases;
  • document the possibility for the user to disable the materialization by defining himself the copy method as a no-op.

That way, we will keep the options opened. What do you think of this plan ?

Philippe.

grasph avatar Oct 15 '23 17:10 grasph

that sounds good to me!

I personally think no-op is okay, because I *think vast majority use cases are similar to when you want to do something in a different container type, e.g. DataFrame, the user probably stops interacting with the original LazyBranch variable.

To me preventing full materialization seems useful enough

Moelf avatar Oct 15 '23 17:10 Moelf

then that is a violation of the copy API, which states that operations on b do not affect a

ah ok we can't do no-op, so I think we know exactly what we can do in this case.

Moelf avatar Oct 16 '23 16:10 Moelf

This is really difficult since copying something lazy needs to decouple the structure from its laziness in order to fulfil what copy signals: having something which I can operate on without modifying the original thing.

We need to either break the laziness and materialise everything inside the structure (but then it's not a lazy thing anymore) or we create a completely different object with it's own references and independent caches, which sounds like a source of happy bugs 🙈

I don't know... I think changing the type is also a bit of a surprise, even though it makes sense in certain applications.

I agree that we should figure out what a user expects to be served when doing a copy of lazy branches.

tamasgal avatar Oct 16 '23 16:10 tamasgal