Handle extraction of field to var with multiple usage and calls in between

Open PLukas2018 opened this issue 8 months ago • 0 comments

DiffKemp currently does not handle very well refactoring containing extraction of 'expression' (respectively struct field access) to a variable that is used in multiple places with calls to functions in between. An example of refactoring can be found when comparing sched_submit_work function between kernel linux-4.18.0-305.el8 and linux-4.18.0-348.el8.

 static inline void sched_submit_work(struct task_struct *tsk)
 {
+	unsigned int task_flags;
+
 	if (!tsk->state)
 		return;
 
+	task_flags = tsk->flags;
 	/*
 	 * If a worker went to sleep, notify and ask workqueue whether
 	 * it wants to wake up a task to maintain concurrency.
 	 * As this function is called inside the schedule() context,
 	 * we disable preemption to avoid it calling schedule() again
 	 * in the possible wakeup of a kworker and because wq_worker_sleeping()
 	 * requires it.
 	 */
-	if (tsk->flags & (PF_WQ_WORKER | PF_IO_WORKER)) { // in LLVM IR: first load
+	if (task_flags & (PF_WQ_WORKER | PF_IO_WORKER)) { // in LLVM IR: first load
 		preempt_disable();
-		if (tsk->flags & PF_WQ_WORKER)  // in LLVM IR: second load
+		if (task_flags & PF_WQ_WORKER)  // in LLVM IR: reusing first load
 			wq_worker_sleeping(tsk);
 		else
 			io_wq_worker_sleeping(tsk);
 		preempt_enable_no_resched();
 	}
 
 	if (tsk_is_pi_blocked(tsk))
 		return;
 
 	/*
 	 * If we are going to sleep and we have plugged IO queued,
 	 * make sure to submit it to avoid deadlocks.
 	 */
 	if (blk_needs_flush_plug(tsk))
 		blk_schedule_flush_plug(tsk);
 }

The older LLVM IR contains multiple loads and the newer reuses the previous load.

DiffKemp in the past handled this refactoring by replacing the latter load with the previous if it was not stored in the value (field/variable) between the load instructions. This was/is done in maySkipLoad method:

https://github.com/diffkemp/diffkemp/blob/395902b69cb259da633f8b1ec3f6423a53b4191d/diffkemp/simpll/DifferentialFunctionComparator.cpp#L618-L703

This was changed in 307, because such implementation could cause false negatives because it does not consider that struct field value can be changed inside called functions. Currently, if between loads is called a function, the loads are not 'unified'. This is implemented in mayStoreTo function:

https://github.com/diffkemp/diffkemp/blob/395902b69cb259da633f8b1ec3f6423a53b4191d/diffkemp/simpll/Utils.cpp#L879-L894

DiffKemp could potentially handle such kinds of refactorings by updating mayStoreTo to not return true for all call instructions but only for those that have parameter the structure whose field the load instruction access. The problem is that the function can potentially change the field even if the structure is not passed by parameters, because e.g. pointer to the struct/field could be saved in a global variable and the value could be changed in the called function. As mentioned by @viktormalik one possible approach to solve this could be to check (probably recursively) that the called functions do not use the store instruction at all -- this should eliminate some false positives.

Note: The skipping of load instruction should be also done carefully because in some cases even without store/call instruction the load could give a different value if the variable is volatile.

Apr 16 '25 16:04 PLukas2018