go.arm64
liblink, cmd/7l: support more efficient relocations
We might want to investigate an R_ADDRARM64 relocation that would not require pool literals.
Right now, MOV g(SB), R is assembled as:
MOV pool(PC), TMP
MOV (TMP), R
...
DWORD   g(SB)
So it uses 8 bytes of instructions and 8 bytes of pool. Note that the pool literals are not coalesced (perhaps this can be implemented). We can say that the cost of using R_ADDR is:
Cost(n * R_ADDR) = {8n+8n bytes, 2n loads}
A potential R_ADDRARM64 might do this instead:
ADRP    g(SB), TMP
MOV off(TMP), R
The ADRP instruction computes the address of the 4 KiB page that holds g(SB); the final MOV can then encode the offset within that page as an immediate, since it falls in the C_ADDCON0 class.
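To make the page/offset split concrete, here is a minimal sketch in Go of the arithmetic a linker would have to do for such a relocation. The names `splitADRP` and `resolve` are hypothetical and are not part of liblink; the sketch also ignores the range check that a real implementation would need (ADRP can only reach about ±4 GiB).

```go
package main

import "fmt"

// splitADRP splits the distance from the instruction at pc to the symbol at
// target into the two pieces the two-instruction sequence needs:
//
//	pages: signed number of 4 KiB pages between pc's page and target's page
//	       (the immediate ADRP adds to the page of PC)
//	lo12:  offset of target inside its page (the immediate of the final MOV)
//
// Hypothetical helper for illustration only.
func splitADRP(pc, target uint64) (pages int64, lo12 uint32) {
	const pageMask = 0xfff
	pages = (int64(target&^pageMask) - int64(pc&^pageMask)) >> 12
	lo12 = uint32(target & pageMask)
	return pages, lo12
}

// resolve reconstructs the address the instruction pair computes at run time.
func resolve(pc uint64, pages int64, lo12 uint32) uint64 {
	return uint64(int64(pc&^0xfff)+pages<<12) + uint64(lo12)
}

func main() {
	pc, target := uint64(0x401234), uint64(0x5a3e78)
	pages, lo12 := splitADRP(pc, target)
	fmt.Printf("pages=%d lo12=%#x resolved=%#x\n", pages, lo12, resolve(pc, pages, lo12))
}
```

Because only the page delta depends on where the instruction ends up, the low 12 bits can be patched without knowing the final PC.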
For MOV $g(SB), R the generated code is:
ADRP    g(SB), R
ADR g(SB), R
Note that we don't even need the temporary register, and we don't even do any loads.
In both cases we don't need pool literals at all, and the sequences perform only one or zero loads!
Cost(n * R_ADDRARM64) = {8n bytes, 0.5n loads}
It might be worth doing this. This will affect #49.
Note that if we do this, once we solve #6, which is mostly solved, the only remaining role of pool literals is for huge literals. We could easily put them in the data segment instead, and remove the pool completely. I don't know if that's an argument for, or against this change.
The referenced commit largely fixes this issue. The only remaining thing is to optimize the three-instruction form back to the two-instruction form:
ADRP addr(SB), R27
ADR addr(SB), R27
MOV 0(R27), Rx
The reason I didn't do it is that the ELF standard defines separate relocation types for what is effectively the same relocation: ADR, ADD, MOV, MOVB, MOVH, ... each have their own relocation type, which is unnecessary because the change made to the instruction is actually the same. (This design is most probably motivated by the desire that rela relocations shouldn't need to look at the bytes being relocated, but that doesn't hold even with all those relocation types, because the register is still encoded only in the original instruction.)
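As a rough illustration of that argument, the sketch below patches the low 12 bits of an address into either an ADD (immediate) or an integer load/store with unsigned offset, using a single routine that inspects the instruction word. `patchLo12` is a hypothetical helper, not something in liblink, and it deliberately covers only these two instruction classes. The one real difference it has to handle is that load/store offsets are scaled by the access size, which it reads from bits 31:30 of the instruction.

```go
package main

import "fmt"

// patchLo12 writes the low 12 bits of addr into the imm12 field (bits 21:10)
// of insn. Hypothetical helper for illustration; it handles only ADD
// (immediate) and integer load/store (unsigned offset) encodings.
func patchLo12(insn uint32, addr uint64) (uint32, error) {
	lo12 := uint32(addr & 0xfff)
	shift := uint(0)
	// Integer load/store, unsigned-offset form:
	// bits 29:27 = 111, bit 26 (V) = 0, bits 25:24 = 01.
	if insn&0x3f000000 == 0x39000000 {
		shift = uint(insn >> 30) // access size: 0 = byte ... 3 = doubleword
	}
	align := uint32(1) << shift
	if lo12&(align-1) != 0 {
		return 0, fmt.Errorf("offset %#x is not aligned to %d bytes", lo12, align)
	}
	return insn&^(0xfff<<10) | (lo12>>shift)<<10, nil
}

func main() {
	const addr = 0x5a3e78
	for _, insn := range []uint32{
		0x91000000, // ADD  X0, X0, #0
		0xF9400000, // LDR  X0, [X0]
		0x39400000, // LDRB W0, [X0]
	} {
		patched, err := patchLo12(insn, addr)
		fmt.Printf("%#08x -> %#08x (err=%v)\n", insn, patched, err)
	}
}
```

The relocation type would only need to say "low 12 bits of the address"; the scale and the register fields both come from the instruction being patched.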
Note that this is still better than the status quo: each load now needs 3 instructions but no constant pool, whereas the status quo is 2 instructions plus an 8-byte constant pool entry. We have saved 1 memory load and at least 4 bytes (8 bytes if the constant pool entry needs extra padding for alignment).
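The per-reference numbers discussed in this thread can be tallied with a small Go sketch. The `form` type and the table entries are only a restatement of the figures above (4-byte instructions, 8-byte uncoalesced pool entries, no alignment padding), not measurements.

```go
package main

import "fmt"

// form describes the per-reference cost of one way to load a global's value.
type form struct {
	name      string
	insns     int // number of 4-byte instructions
	poolBytes int // constant-pool bytes (uncoalesced, before alignment padding)
	loads     int // memory loads executed
}

// bytes returns the total code plus pool size for one reference.
func (f form) bytes() int { return 4*f.insns + f.poolBytes }

// forms lists the three sequences discussed in the thread.
var forms = []form{
	{"pool literal (status quo)", 2, 8, 2},
	{"ADRP + ADR + MOV", 3, 0, 1},
	{"ADRP + MOV off(TMP)", 2, 0, 1},
}

func main() {
	for _, f := range forms {
		fmt.Printf("%-26s %2d bytes, %d loads\n", f.name, f.bytes(), f.loads)
	}
}
```

Under these assumptions the three-instruction form already saves 4 bytes and one load per reference, and the two-instruction form would save a further 4 bytes.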