mesa-lima
mesa-lima copied to clipboard
Recombine FS inputs into vectors
NIR splits outputs and inputs in nir_lower_io_to_scalar_early() and unfortunately for FS it means increased number of instructions and increased register pressure.
We need to recombine inputs from scalar to vectors.
See https://www.mail-archive.com/[email protected]/msg189216.html
Right, I also meet this problem after 18.0 rebase. But due to want to focus on kernel, I haven't solve it from the root. Now I'm doing 18.1 rebase, so will cover these NIR changes in a better way.
kmscube -M rgba fails in lima-18.0 branch due to this issue with "ppir: ppir: regalloc fail"
impl main {
decl_reg vec4 32 r0
decl_reg vec2 32 r1
block block_0:
/* preds: */
vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */ /* vVaryingColor */
vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */ /* vVaryingColor */
vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (1, 2) /* base=1 */ /* component=2 */ /* vVaryingColor */
vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (1, 3) /* base=1 */ /* component=3 */ /* vVaryingColor */
r0.x = imov ssa_1
r0.y = imov ssa_2.x
r0.z = imov ssa_3.x
r0.w = imov ssa_4.x
vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */ /* vTexCoord */
vec1 32 ssa_7 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */ /* vTexCoord */
r1.x = imov ssa_6
r1.y = imov ssa_7.x
vec4 32 ssa_9 = tex r1 (coord), 0 (texture) 0 (sampler)
vec4 32 ssa_10 = fmul r0, ssa_9
intrinsic store_output (ssa_10, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */ /* gl_FragColor */
/* succs: block_0 */
block block_0:
}
========prog========
-------block------
const 0 ssa0
st_col 15 new
mul 14 ssa10
mov 5 reg0
ld_var 1 ssa1
mov 6 reg0
ld_var 2 ssa2
mov 7 reg0
ld_var 3 ssa3
mov 8 reg0
ld_var 4 ssa4
ld_tex 13 ssa9
mov 11 reg1
ld_var 9 ssa6
mov 12 reg1
ld_var 10 ssa7
====================
ppir: ppir_lower_texture create load_coords node 16 for 13
========prog========
-------block------
st_col 15 new
mul 14 ssa10
mov 5 reg0
ld_var 1 ssa1
mov 6 reg0
ld_var 2 ssa2
mov 7 reg0
ld_var 3 ssa3
mov 8 reg0
ld_var 4 ssa4
ld_tex 13 ssa9
ld_coords 16 new
mov 11 reg1
ld_var 9 ssa6
mov 12 reg1
ld_var 10 ssa7
====================
ppir: node_to_instr create move 17 from store 15
ppir: insert_load_tex: create move 18 for 13
======ppir instr list======
vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null 14 null 17 null null null |
001: null null null null null null 5 null null |
002: 1 null null null null null null null null |
003: null null null null null null 6 null null |
004: 2 null null null null null null null null |
005: null null null null null null 7 null null |
006: 3 null null null null null null null null |
007: null null null null null null 8 null null |
008: 4 null null null null null null null null |
009: 16 13 null null null 18 null null null |
010: null null null null null null 11 null null |
011: 9 null null null null null null null null |
012: null null null null null null 12 null null |
013: 10 null null null null null null null null |
------------------------
======ppir instr depend======
[0[1[2]][3[4]][5[6]][7[8]][9[10[11]][12[13]]]]
------------------------
ppir: ppir: regalloc fail
In fact, even we don't recombine scalar to vector, ppir should not fail, but only generate longer code. This "regalloc fail" is indeed the ppir need to implement reg spill when out of regs.
As far as I understand we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where they're stored.
I guess it's here: https://github.com/yuq/mesa-lima/blob/lima-18.0/include/drm-uapi/lima_drm.h#L107
Each PP can have a memory stack which I guess is used to store tmp.
There're 2 stack_address, one in lima_pp_frame_reg, another in drm_lima_m400_pp_frame/drm_lima_m450_pp_frame.
I'm not sure what's the difference between them.
lima_pp_frame_reg one is dummy, drm_lima_m400_pp_frame one is used, one for each PP.
@yuq how do you tell which one is dummy?
In this function: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L303
LIMA_PP_FRAME & LIMA_PP_STACK are per PP, so the lima_pp_frame_reg will be set to new value before task start.
@yuq, what about LIMA_PP_STACK_SIZE?
And why LIMA_PP_STACK is not used for mali450?
LIMA_PP_STACK is used by mali450 here: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L321
Just because bcast will set same address to all PPs, so I have to set it individually for each PP.
LIMA_PP_STACK_SIZE is same for all PPs, so the lima_pp_frame_reg one is not dummy.
I'll have some time to work on lima again starting by the end of this week. If nobody is currently working on this, I could pick it up and work on it.
From what I understand there are two issues,
- recombine scalars to vec4 to restore the earlier behavior and
- implement register spilling to avoid this kind of problem in the future
Is that correct? Anybody working in any of these?
Just to be sure though, are we sure already that we want (1) (as in the title of this issue), even with (2) implemented? Or is it something that we would need to benchmark?
I think you are right. 2 is needed anyway for correctness and 1 is for better performance. But 1 needs to be done also because I wrote ppir for vec4, not for scalar, some refine or recombine may be needed for correctness.