~270 times faster writing (or maybe not writing?) with arr[i] than with *(arr+i)?
I just stumbled into this and it's not actually causing me a specific problem. Bug or not? Maybe.
- [x] Checked for duplicates (I tried to.)
Describe the bug
Looping over array writes using arr[i] is about 270 times faster than with *(arr+i). I am not claiming this is definitely a real difference in write performance; that's complicated.
Expected behavior
I think there's room to quibble, especially for an interpreter, over what "identical" means below, but I'd expect much more similar results.
According to the standard (7.6.1.2 in a recent version):
"The expression E1[E2] is identical (by definition) to *((E1)+(E2)), except that in the case of an array operand, the result is an lvalue if that operand is an lvalue and an xvalue otherwise."
To Reproduce
Just run this in cling:
```cpp
#include <iostream>
#include <chrono>

void benchmark() {
    const int size = (int)(1E7);
    volatile int* arr = new volatile int[size]; // create a dynamic array
    const int multiplier = 10; // loop the loop this many times

    // Prefill, to page in the memory:
    for (int i = 0; i < size; i++) {
        arr[i] = size - i;
    }

    auto start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++) {
        for (int i = 0; i < size; ++i) {
            arr[i] = i;
        }
    }
    auto end = std::chrono::high_resolution_clock::now();
    auto duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    std::chrono::duration<double> duration = std::chrono::duration<double>(duration_ns);
    std::cout << "Indexed: " << duration.count() << " seconds" << std::endl;

    volatile int* arr2 = new volatile int[size]; // second dynamic array
    start = std::chrono::high_resolution_clock::now();
    for (int j = 0; j < multiplier; j++) {
        for (int i = 0; i < size; ++i) {
            // arr2[i] = i; // swap these lines for a control test
            *(arr2 + i) = i;
        }
    }
    end = std::chrono::high_resolution_clock::now();
    duration_ns = std::chrono::duration_cast<std::chrono::nanoseconds>(end - start);
    duration = std::chrono::duration<double>(duration_ns);
    std::cout << "Pointer math " << duration.count() << " seconds" << std::endl;

    delete[] arr;
    delete[] arr2;
}

int main() {
    benchmark();
}
```
I get some fluctuation, but this is a typical output:

```
Indexed: 0.104759 seconds
Pointer math 27.4305 seconds
```
Setup
```
------------------------------------------------------------------
| Welcome to ROOT 6.30/04 https://root.cern |
| (c) 1995-2024, The ROOT Team; conception: R. Brun, F. Rademakers |
| Built for linuxx8664gcc on Jul 26 2024, 08:45:57 |
| From heads/master@tags/v6-30-04 |
| With c++ (Ubuntu 9.4.0-1ubuntu1~20.04.2) 9.4.0 |
| Try '.help'/'.?', '.demo', '.license', '.credits', '.quit'/'.q' |
------------------------------------------------------------------
```
I compiled ROOT myself. This is an Ubuntu VM running in VMware Workstation on Windows 10. I tried to keep system load low and to reproduce the results consistently.
Additional context
For what it's worth, I also made a version that allocates via a memory-mapped file, to maybe prevent the loop writes from being optimized away; the results were the same.
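For reference, this is roughly the shape of that variant; a minimal sketch assuming POSIX mmap, with the file name and helper name my own:

```cpp
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>
#include <cstddef>

// Map a file-backed buffer of `count` ints; this replaces the new[]
// allocation in the benchmark above. "bench.bin" is an arbitrary scratch file.
volatile int* map_buffer(std::size_t count) {
    int fd = open("bench.bin", O_RDWR | O_CREAT, 0644);
    if (fd < 0) return nullptr;
    std::size_t bytes = count * sizeof(int);
    if (ftruncate(fd, bytes) != 0) { close(fd); return nullptr; }
    void* p = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd); // the mapping survives the close
    return p == MAP_FAILED ? nullptr : static_cast<volatile int*>(p);
}
```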
Actually understanding what's going on here is not so simple. The only way I know to be certain the writes are not optimized away is to isolate them in separately compiled binary functions, so that the interpreter cannot know what the callee is doing. I wrote a version like that as well. To avoid excessive function-call overhead, I compiled the inner loops as separate compilation units (.so files) using cling, and interpreted the outer code and outer loop; see the sketch below. That gave 0.4 s for both forms, which is also similar to what I get with g++ at default optimization.
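A minimal sketch of what I mean by isolating the inner loops (function names are mine; build as a shared object, e.g. with cling's .so generation or g++ -shared -fPIC):

```cpp
// inner.cpp -- built as a separate shared object, so the interpreter
// running the outer loop cannot see these bodies and prove the writes dead.
extern "C" void fill_indexed(volatile int* arr, int size) {
    for (int i = 0; i < size; ++i)
        arr[i] = i;        // indexed form
}

extern "C" void fill_pointer(volatile int* arr, int size) {
    for (int i = 0; i < size; ++i)
        *(arr + i) = i;    // pointer-arithmetic form
}
```

The interpreted side then just declares these and calls one per multiplier iteration, so the call overhead is amortized over the whole inner loop.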
So it's tempting to say that, for the posted version, (a) the interpreter is just really slow, and (b) it optimizes away the writes in the indexed case. Even so, this would still be unexpected behavior to me, in that it treats the two forms differently.
But this may not be the case. The 0.1 s posted above is not the same as removing the loops entirely: that gives about 1 ns. Also, the time scales proportionally with the total number of writes, across both inner and outer loops, so the loops are surely not being optimized away; the actual writes might be. But finally, in the modular version (which I can provide), g++ -O3 gives ~0.04 s (4E-2 s) for each form, and again that scales correctly with both loop lengths. This tempts the conclusion that the 0.1 s in the original result could be real, and that the difference is down to optimization, but not to optimizing away the writes.
The total range here, real or not, is from 0.04 s to 27 s, a factor of ~675.
For what it's worth, for fun I wrote an assembly version of the inner loop using ~~maximally unrolled~~ maximally gratuitous SSE, the core of the code (which might even be right) being:
```asm
.PLOOP:
    movdqu  [rdi + rsi + 0],   xmm8
    movdqu  [rdi + rsi + 16],  xmm9
    movdqu  [rdi + rsi + 32],  xmm10
    movdqu  [rdi + rsi + 48],  xmm11
    movdqu  [rdi + rsi + 64],  xmm12
    movdqu  [rdi + rsi + 80],  xmm13
    movdqu  [rdi + rsi + 96],  xmm14
    movdqu  [rdi + rsi + 112], xmm15
    paddd   xmm8,  xmm0
    paddd   xmm9,  xmm1
    paddd   xmm10, xmm2
    paddd   xmm11, xmm3
    paddd   xmm12, xmm4
    paddd   xmm13, xmm5
    paddd   xmm14, xmm6
    paddd   xmm15, xmm7
    add     rsi, 128
    cmp     rdx, rsi
    jne     .PLOOP
    cmp     ecx, eax
    je      .END
```
And this gives about 0.045 s for the same 1E8 writes, and I'm 100% sure these are not optimized away, obviously. Being in a VM, it's likely I/O limited. But I think this sets the performance limit in this context.
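For comparison, the same store pattern is expressible from C++ with SSE2 intrinsics; a rough sketch (my naming, unrolled 2x rather than 8x, and without the volatile qualifier):

```cpp
#include <emmintrin.h> // SSE2: _mm_set_epi32, _mm_add_epi32, _mm_storeu_si128

// Fill arr[0..size) with 0,1,2,... via unaligned 16-byte stores,
// mirroring the movdqu/paddd loop above.
void fill_sse(int* arr, int size) {
    __m128i v0   = _mm_set_epi32(3, 2, 1, 0); // {0,1,2,3} in memory order
    __m128i v1   = _mm_set_epi32(7, 6, 5, 4); // {4,5,6,7}
    __m128i step = _mm_set1_epi32(8);         // both vectors advance by 8
    int i = 0;
    for (; i + 8 <= size; i += 8) {
        _mm_storeu_si128(reinterpret_cast<__m128i*>(arr + i),     v0);
        _mm_storeu_si128(reinterpret_cast<__m128i*>(arr + i + 4), v1);
        v0 = _mm_add_epi32(v0, step);
        v1 = _mm_add_epi32(v1, step);
    }
    for (; i < size; ++i) arr[i] = i;         // scalar tail
}
```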