ldc perf: optimize indexing in inner loops

perf: optimize indexing in inner loops

Open ryuukk opened this issue 6 months ago • 2 comments

Benchmark: https://github.com/attractivechaos/plb2/pull/6

According to note [1], clang is able to optimize them, shouldn't LDC be able to as well?

[1] - https://github.com/attractivechaos/plb2?tab=readme-ov-file#optimizing-inner-loops

Jan 03 '24 11:01 ryuukk

Please add more details here. I have to search different pages and source code to figure out exactly what you mean and what difference you observe between Clang and LDC.

First thing that comes to mind is that for the D source, the optimizer is not able to prove that writing to c[i][j] does not change contents a or b, i.e. that c does not alias a or b; without that proof, it is invalid to hoist a[i][k] or b[k] from the loop. In the C source, the optimizer can see calls to malloc and calloc which guarantee that c can never alias a or b. In D, the call to _d_newarraymiTX (auto c = new double[][](m, p);) is IIRC not recognized (nor marked) as a memory allocation function, thus the optimizer does not get the information to prove that things don't alias. My first attempt to improve the situation would be to replace the allocation with something that the optimizer does recognize.

Jan 03 '24 12:01 JohanEngelen

Please add more details here. I have to search different pages and source code to figure out exactly what you mean and what difference you observe between Clang and LDC.

Sorry, i thought using the diff would be enough, because my lack of expertise would have made me choose incorrect wording

https://github.com/attractivechaos/plb2/pull/6/files

Particularly the 2nd commit makes things even faster:

https://github.com/attractivechaos/plb2/pull/6/commits/7dc32f643ed786996c65dd0954bffebd6510e98b

Result in the comment in the 2nd paragraph: https://github.com/attractivechaos/plb2/pull/6#issue-2063736667

First thing that comes to mind is that for the D source, the optimizer is not able to prove that writing to c[i][j] does not change contents a or b, i.e. that c does not alias a or b; without that proof, it is invalid to hoist a[i][k] or b[k] from the loop. In the C source, the optimizer can see calls to malloc and calloc which guarantee that c can never alias a or b. In D, the call to _d_newarraymiTX (auto c = new double[][](m, p);) is IIRC not recognized (nor marked) as a memory allocation function, thus the optimizer does not get the information to prove that things don't alias. My first attempt to improve the situation would be to replace the allocation with something that the optimizer does recognize.

Thanks for the clear explanation

PR got merged, result now is equal to what C with clang produce https://github.com/attractivechaos/plb2?tab=readme-ov-file#appendix-timing-on-apple-m1-macbook-pro

Jan 03 '24 14:01 ryuukk

ldc ldc copied to clipboard

perf: optimize indexing in inner loops

ldc
ldc copied to clipboard