Imath
Imath copied to clipboard
Make vectors trivially relocatable.
Hello,
I'm sorry for making the same issue three times. I made Vec2, Vec3 and Vec4 trivially relocatable. I didn't change the behavior of them. Could you please merge this PR?
@AlexMWells Sorry to drag you into this, but I know that in the past you have gone super deep on how tweaks to these classes can help or inhibit vectorization in loops. Is there anything here that would concern you?
Historically with vectorizing compilers, a data type can have its members turned into scalars with a "scalar replacement of aggregates" optimization pass. Once a data type has become scalars, its members can live in registers vs. being forced to standard data layout and being kept on the stack. This is important to minimize the movement between stack and registers when vectorizing as that is when an AOS/SOA transformation might be required and generate extra instructions. One thing that can disqualify SROA is passing the address of the structure to a function call, as that would require the standard data layout and the object exist on the stack vs registers.
Now the wrinkle is the "default" copy constructor and possibly other default operators is that a compiler could implement a memcpy(*this, &src, sizeof(T). And the taking the address of "src" could disqualify SROA and end up forcing it to exist on the stack in standard data layout.
Perhaps another optimization pass would convert the memcpy to assignments before SROA is applied, but those can often be specialized (just for 2 or 3 data members in a specific pattern). So defensive programming technique was to explicitly provide those constructors and operators with member-wise assignment/copies vs. letting the default possibly use memcpy/memset.
Now fast forward many years and std::is_trivially_copyable was introduced in c++11 and then compilers/libraries start adding it to memcpy and other places with warnings, etc.
In short I think you might want to keep both ways of having those default or not controllable by a define.
Thank you for your advices. I'll fix to keep both ways.
I fixed codes following an advice, but an assertion error occured. One of the causes seems that return Vec2 (x / l, y / l)
in Vec2::normalized behaves strangely. Unfortunately, I couldn't fix it.
I found that only changing copy constructor from user-defined to default caused an assertion error, so I give up making Vec3
trivially relocatable.
Thank you very much!
Here is the summary as I understand it:
Using =default
makes it trivially moveable, and that might have a number of advantages for the vast majority of users.
But there are a few specialized situations where it could sabotage certain auto-vectorization cases. Those situations can be locally fixed if there's a way to include the header files in such a way that it doesn't use the =default versions in a module-local way (say, by defining a symbol before including the Imath header). OSL is one of these. If Imath itself does not allow this, we might have to do something even uglier like replicate the Imath headers inside OSL and modify them to allow it.
I'm sorry, but what is OSL?
Open Shading Language
Thank you!
We discussed this in today's technical steering committee meeting. We support this addition in principle, the ultimate objective seems sound, but we're concerned about adding additional complexity through another batch of #if's in the code without a thorough analysis of the benefits. As @AlexMWells points out, the actual performance impact is complicated to assess for all cases.
@sukeya, can you describe more about the motivation for the change? Was this motivated by a specific use case? Did you do a performance analysis that quantifies the impact?
We also discussed briefly whether the same objective could be achieved by adding the trivially_relocatable attribute to the class definition itself, rather than adding new class methods.
We are planning a v3.2 release in mid-August, which will be cut from the main branch, so we'd like to wait until after that before merging the PR. That will provide more time for evaluation before official release.
Thanks for the contribution and thanks for your patience.
I see. I will do a performance analysis.
I talk about the motivation for the change. I have been writing a snow simulator which reads polygon meshes from an Alembic file exported by Blender. I think that the exported file generally stores a pair of a transform matrix (Imath::M44f
) and a polygon mesh (each type of a point and a normal is std::vector<Imath::V3f>
) at each time, so I have to calculate the transformed polygon mesh from them. However, my simulator will heavily use CPU, so I would like to assign this task to GPU asynchronously.
Thank you.
Adding the trivially_relocatable attribute is good for me.
I have done a performance analysis. The result is the following:
GPU | ||
---|---|---|
Test case | Avg time [ms] | Branch name in my forked repository |
user defined | 17.5 | make_vector_trivially_relocatable_performance_test |
default | 12.0 | defalut_ver |
default and async | 11.8 | async_ver |
CPU | ||
---|---|---|
Test case | Avg time [ms] | Branch name |
user defined | 2.4 | tbb_ver |
default | 2.25 | tbb_defalut_ver |
The environment is the following:
OS | Ubuntu 22.04 |
CPU | Intel Core i7 8700 |
GPU | NVIDIA GeForce RTX 2060 (12GB) |
Memory | DDR4-2666 32GB |
Compiler | GCC 12.1.0 |
The test data is a pair of 1 million 3 dimension vector generated random and a 4 dimension matrix. Let p
be a 3 dimension vector and M
be a 4 dimension matrix, the test performs p * M
for all vector.
The programs needs oneTBB(2021.5.0 or higher) and CUDA(12.2), and must be cloned recursively.
I didn't calculate the deviation of each test case, but felt that each sample in the test cases using CPU and default and async
test case using GPU is in a range [0.95 * avg, 1.05 * avg].
Can you clarify what exactly is the difference between "default" and "user defined"?
I see.
"defalut" means that the following functions of vectors are declared as default
:
- move constructors and move assignment operators,
- copy constructors and copy assignment operators.
"user defined" means that vectors use current implementation of said functions.
The performance difference on GPU is striking.
@lgritz Since you & I debated it earlier in this thread, a half year ago, and pursuant to @Quuxplusone's comments, I'm still of the opinion that we shouldn't guard default with IMATH_VEC_USE_DEFAULT_CONSTRUCTOR
. Since we last discussed this PR we've mostly collectively moved forward to C++17, and I just point that out in a spirit of moving forward with language features, and not keeping the old business around.
@meshula Do we not still have the problems that @AlexMWells detailed, where the = default
might derail certain vectorization that we rely on? Making sure we could selectively have it the way that's better for vectorization was the point of suggesting the guard in the first place. That's unrelated to C++17.
The website check failed because of a change in required versions at readthedocs. This is fixed in the main branch. @sukeya , could you possibly merge the main branch with yours to pick up those fixes?
@Quuxplusone and @sukeya, I need some more time to refresh my understanding of the proposal, but I think the consensus is it's headed in the right direction.
We'd ideally like to have an analysis like @AlexMWells describes above. Some of the current constructs were set in place after such a careful expert analysis, although that was a long time ago with different compilers. It's definitely worth modernizing, but also good to confirm there aren't unintended consequences.
@Quuxplusone and @sukeya, I need some more time to refresh my understanding of the proposal, but I think the consensus is it's headed in the right direction.
We'd ideally like to have an analysis like @AlexMWells describes above. Some of the current constructs were set in place after such a careful expert analysis, although that was a long time ago with different compilers. It's definitely worth modernizing, but also good to confirm there aren't unintended consequences.
FWIW, I believe I see now the scenario that @AlexMWells is worried about: Imagine a compiler that does escape analysis after inlining, and also optimizes a trivial constructor into memcpy, and also its escape analysis does not understand the semantics of memcpy (i.e. that its pointer arguments do not escape). Then we'd have this: https://godbolt.org/z/YjKcd1Pf1 Notice that Clang optimizes the return; but if you change memcpy
to not_memcpy
, it stops optimizing. This indicates that Clang's escape analysis does understand that memcpy is special and doesn't escape its arguments. Of the compilers on Godbolt, it looks to me like GCC 4.1.2 doesn't understand memcpy but GCC 4.4.7+ do. All Godbolt-available versions of Clang (3+) and MSVC (19.14+) understand memcpy.
I don't understand precisely how this kind of thing would interact with vectorization, but based on these observations I'd be surprised if memcpy interacted poorly with anything (in a recent enough compiler).
I know another surprising pitfall where if a library skips running a trivial (pseudo-)destructor, lifetime analysis might not realize the object is dead; but my test case for that got fixed in GCC 11, and only ever happened on GCC AFAIK anyway. I bring it up merely as another case where making something more trivial can cause worse performance — but I'm quite confident that I wouldn't let that case stop me from making anything more trivial.
Anyway, my main contribution to this PR is to say that "trivially relocatable" should just be "trivial" everywhere it appears in this PR; and to invite people interested in relocatability to comment in support of the Clang implementation https://github.com/llvm/llvm-project/pull/84621 . :)