
Recombine FS inputs into vectors

Open · anarsoul opened this issue 6 years ago • 15 comments

NIR splits outputs and inputs into scalars in nir_lower_io_to_scalar_early(), and unfortunately for the FS this means an increased number of instructions and increased register pressure.

We need to recombine inputs from scalars back into vectors.

See https://www.mail-archive.com/[email protected]/msg189216.html
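
To illustrate the idea, here is a minimal toy sketch in plain C (not the real NIR pass; the struct and field names are made up): group the per-component load_input intrinsics that share the same base back into a single vecN load, so a vec4 varying plus a vec2 texcoord become 2 vector loads instead of 6 scalar loads plus the movs that reassemble them.

#include <stdio.h>

struct scalar_load {
   int base;      /* input slot: e.g. vTexCoord = 0, vVaryingColor = 1 */
   int component; /* 0..3 */
};

int main(void)
{
   /* Scalar loads as nir_lower_io_to_scalar_early() leaves them for a
    * shader with a vec4 color varying and a vec2 texcoord. */
   struct scalar_load loads[] = {
      {1, 0}, {1, 1}, {1, 2}, {1, 3},
      {0, 0}, {0, 1},
   };
   int n = sizeof(loads) / sizeof(loads[0]);

   /* The widest component seen per base gives the vector width to rebuild. */
   int width[8] = {0};
   for (int i = 0; i < n; i++) {
      int c = loads[i].component + 1;
      if (c > width[loads[i].base])
         width[loads[i].base] = c;
   }

   /* Emit one vector load per base: 2 vector loads replace 6 scalar loads
    * plus the imov chains that reassemble the full vectors. */
   for (int base = 0; base < 8; base++) {
      if (width[base])
         printf("vec%d load_input base=%d\n", width[base], base);
   }
   return 0;
}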

anarsoul avatar May 24 '18 00:05 anarsoul

Right, I also met this problem after the 18.0 rebase. But since I wanted to focus on the kernel, I haven't solved it at the root. Now I'm doing the 18.1 rebase, so I will cover these NIR changes in a better way.

yuq avatar May 24 '18 02:05 yuq

kmscube -M rgba fails in the lima-18.0 branch due to this issue with "ppir: ppir: regalloc fail":

anarsoul avatar May 25 '18 06:05 anarsoul

impl main {
	decl_reg vec4 32 r0
	decl_reg vec2 32 r1
	block block_0:
	/* preds: */
	vec1 32 ssa_0 = load_const (0x00000000 /* 0.000000 */)
	vec1 32 ssa_1 = intrinsic load_input (ssa_0) () (1, 0) /* base=1 */ /* component=0 */	/* vVaryingColor */
	vec1 32 ssa_2 = intrinsic load_input (ssa_0) () (1, 1) /* base=1 */ /* component=1 */	/* vVaryingColor */
	vec1 32 ssa_3 = intrinsic load_input (ssa_0) () (1, 2) /* base=1 */ /* component=2 */	/* vVaryingColor */
	vec1 32 ssa_4 = intrinsic load_input (ssa_0) () (1, 3) /* base=1 */ /* component=3 */	/* vVaryingColor */
	r0.x = imov ssa_1
	r0.y = imov ssa_2.x
	r0.z = imov ssa_3.x
	r0.w = imov ssa_4.x
	vec1 32 ssa_6 = intrinsic load_input (ssa_0) () (0, 0) /* base=0 */ /* component=0 */	/* vTexCoord */
	vec1 32 ssa_7 = intrinsic load_input (ssa_0) () (0, 1) /* base=0 */ /* component=1 */	/* vTexCoord */
	r1.x = imov ssa_6
	r1.y = imov ssa_7.x
	vec4 32 ssa_9 = tex r1 (coord), 0 (texture) 0 (sampler)
	vec4 32 ssa_10 = fmul r0, ssa_9
	intrinsic store_output (ssa_10, ssa_0) () (0, 15, 0) /* base=0 */ /* wrmask=xyzw */ /* component=0 */	/* gl_FragColor */
	/* succs: block_0 */
	block block_0:
}

========prog========
-------block------
const 0 ssa0
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      mov 11 reg1
        ld_var 9 ssa6
      mov 12 reg1
        ld_var 10 ssa7
====================
ppir: ppir_lower_texture create load_coords node 16 for 13
========prog========
-------block------
st_col 15 new
  mul 14 ssa10
    mov 5 reg0
      ld_var 1 ssa1
    mov 6 reg0
      ld_var 2 ssa2
    mov 7 reg0
      ld_var 3 ssa3
    mov 8 reg0
      ld_var 4 ssa4
    ld_tex 13 ssa9
      ld_coords 16 new
        mov 11 reg1
          ld_var 9 ssa6
        mov 12 reg1
          ld_var 10 ssa7
====================
ppir: node_to_instr create move 17 from store 15
ppir: insert_load_tex: create move 18 for 13
======ppir instr list======
      vary texl unif vmul smul vadd sadd comb stor const0|1
*000: null null null 14   null 17   null null null | 
 001: null null null null null null 5    null null | 
 002: 1    null null null null null null null null | 
 003: null null null null null null 6    null null | 
 004: 2    null null null null null null null null | 
 005: null null null null null null 7    null null | 
 006: 3    null null null null null null null null | 
 007: null null null null null null 8    null null | 
 008: 4    null null null null null null null null | 
 009: 16   13   null null null 18   null null null | 
 010: null null null null null null 11   null null | 
 011: 9    null null null null null null null null | 
 012: null null null null null null 12   null null | 
 013: 10   null null null null null null null null | 
------------------------
======ppir instr depend======
[0[1[2]][3[4]][5[6]][7[8]][9[10[11]][12[13]]]]
------------------------
ppir: ppir: regalloc fail

anarsoul avatar May 25 '18 06:05 anarsoul

In fact, even if we don't recombine scalars into vectors, ppir should not fail, only generate longer code. This "regalloc fail" really means that ppir needs to implement register spilling for when it runs out of registers.
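
A minimal sketch of the retry loop such spilling usually needs (plain C with made-up names, not the actual ppir code): when allocation fails, pick a value, spill it to memory (store after its def, reload before each use), and try again until it fits or nothing is left to spill.

#include <stdbool.h>
#include <stdio.h>

struct comp { int num_regs; int pressure; int spilled; };

/* Pretend allocator: succeeds once pressure fits in the register file. */
static bool try_regalloc(struct comp *c)
{
   return c->pressure <= c->num_regs;
}

/* Pretend spiller: move one value to memory (store after its def, reload
 * before each use), which lowers register pressure by one at the cost of
 * extra store/load instructions. */
static bool spill_one(struct comp *c)
{
   if (c->pressure == 0)
      return false; /* nothing left to spill */
   c->pressure--;
   c->spilled++;
   return true;
}

int main(void)
{
   struct comp c = { .num_regs = 6, .pressure = 9, .spilled = 0 };

   while (!try_regalloc(&c)) {
      if (!spill_one(&c)) {
         fprintf(stderr, "ppir: regalloc fail\n");
         return 1;
      }
   }
   printf("allocated after spilling %d value(s)\n", c.spilled);
   return 0;
}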

yuq avatar May 25 '18 07:05 yuq

As far as I understand, we need to reverse engineer how temporaries work first. I see store and load temporary instructions in the doc, but I don't understand where the temporaries are actually stored.

anarsoul avatar May 25 '18 07:05 anarsoul

I guess it's here: https://github.com/yuq/mesa-lima/blob/lima-18.0/include/drm-uapi/lima_drm.h#L107

Each PP can have a memory stack, which I guess is used to store temporaries.

yuq avatar May 25 '18 07:05 yuq

There are two stack_address fields, one in lima_pp_frame_reg and another in drm_lima_m400_pp_frame/drm_lima_m450_pp_frame.

I'm not sure what the difference between them is.

anarsoul avatar May 25 '18 17:05 anarsoul

The lima_pp_frame_reg one is a dummy; the drm_lima_m400_pp_frame one is the one actually used, one per PP.

yuq avatar May 25 '18 23:05 yuq

@yuq how do you tell which one is the dummy?

anarsoul avatar May 26 '18 07:05 anarsoul

In this function: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L303

LIMA_PP_FRAME & LIMA_PP_STACK are per PP, so the lima_pp_frame_reg value will be overwritten with a new value before the task starts.

yuq avatar May 26 '18 12:05 yuq

@yuq, what about LIMA_PP_STACK_SIZE?

anarsoul avatar May 26 '18 18:05 anarsoul

And why isn't LIMA_PP_STACK used for mali450?

anarsoul avatar May 26 '18 18:05 anarsoul

LIMA_PP_STACK is used by mali450 here: https://github.com/yuq/linux-lima/blob/lima-4.17-rc4/drivers/gpu/drm/lima/lima_pp.c#L321

It's just that bcast would set the same address for all PPs, so I have to set it individually for each PP.

LIMA_PP_STACK_SIZE is the same for all PPs, so the lima_pp_frame_reg one is not a dummy.
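
So the setup amounts to something like this (a minimal sketch with made-up numbers, not code from the driver): one stack buffer, the same LIMA_PP_STACK_SIZE for every PP, and a distinct LIMA_PP_STACK base address per PP so they don't trample each other.

#include <stdint.h>
#include <stdio.h>

int main(void)
{
   uint32_t stack_base = 0x10000000; /* hypothetical GPU VA of the buffer */
   uint32_t stack_size = 0x1000;     /* same size for every PP */
   int num_pp = 4;                   /* e.g. a Mali-450 MP4 */

   for (int pp = 0; pp < num_pp; pp++) {
      /* Each PP gets its own slice of the shared stack buffer. */
      uint32_t stack_address = stack_base + (uint32_t)pp * stack_size;
      printf("PP%d: LIMA_PP_STACK = 0x%08x, LIMA_PP_STACK_SIZE = 0x%x\n",
             pp, (unsigned)stack_address, (unsigned)stack_size);
   }
   return 0;
}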

yuq avatar May 27 '18 00:05 yuq

I'll have some time to work on lima again starting at the end of this week. If nobody is currently working on this, I could pick it up.

From what I understand there are two issues:

  1. recombine scalars to vec4 to restore the earlier behavior and
  2. implement register spilling to avoid this kind of problem in the future

Is that correct? Is anybody working on either of these?

Just to be sure though, are we already certain that we want (1) (as in the title of this issue), even with (2) implemented? Or is that something we would need to benchmark?

enunes avatar Jun 05 '18 02:06 enunes

I think you are right. (2) is needed anyway for correctness and (1) is for better performance. But (1) also needs to be done because I wrote ppir for vec4, not for scalars, so some refinement or recombination may be needed for correctness as well.

yuq avatar Jun 05 '18 02:06 yuq