
Refactored data clumps with the help of LLMs (research project)

Open compf opened this issue 10 months ago • 4 comments

Description

Hello maintainers,

I am conducting a master's thesis project focused on improving code quality through automated refactoring of data clumps, assisted by Large Language Models (LLMs).

Data clump definition

A data clump exists if

  1. Two methods (in the same or in different classes) have at least three common parameters and one of those methods does not override the other, or
  2. At least three fields in a class are common with the parameters of a method (in the same or in a different class), or
  3. Two different classes have at least three common fields.

See also the following UML diagram as an example: [Image: Example data clump]
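Rule 1 of the definition above can be approximated in a few lines of Java. This is only a minimal sketch and not the detection tool used in the thesis: the class `DataClumpCheck`, the method `shareDataClump`, and the choice to model each parameter as a "type name" string are all assumptions made for illustration, and the override check from rule 1 is left out.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of data-prepper or of the thesis tooling.
// Each parameter is modeled as a "type name" string such as "String host".
final class DataClumpCheck {
    // Threshold taken from the definition: at least three common parameters.
    static final int MIN_COMMON = 3;

    // Returns true if the two parameter lists share at least MIN_COMMON entries.
    // Assumption: the "neither method overrides the other" condition is checked elsewhere.
    static boolean shareDataClump(List<String> first, List<String> second) {
        Set<String> common = new HashSet<>(first);
        common.retainAll(second); // keep only parameters present in both lists
        return common.size() >= MIN_COMMON;
    }
}
```

Parameter order does not matter in this sketch, because the comparison is done on sets.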

I believe these refactorings can contribute to the project by reducing complexity and improving the readability of your source code.
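To illustrate the kind of change meant here, the following sketch shows a data clump before and after extraction. All names (`ClumpedClient`, `Endpoint`, `RefactoredClient`) are hypothetical and do not come from the data-prepper codebase or from the actual diff of this pull request.

```java
// Before: both methods repeat the same three parameters (a data clump under rule 1).
class ClumpedClient {
    static String connect(String host, int port, int timeoutMillis) {
        return "connect " + host + ":" + port + " (timeout " + timeoutMillis + " ms)";
    }

    static String ping(String host, int port, int timeoutMillis) {
        return "ping " + host + ":" + port + " (timeout " + timeoutMillis + " ms)";
    }
}

// After: the shared parameters are extracted into one small immutable value class.
final class Endpoint {
    final String host;
    final int port;
    final int timeoutMillis;

    Endpoint(String host, int port, int timeoutMillis) {
        this.host = host;
        this.port = port;
        this.timeoutMillis = timeoutMillis;
    }

    // Formatting logic that was duplicated before now lives in one place.
    String describe() {
        return host + ":" + port + " (timeout " + timeoutMillis + " ms)";
    }
}

class RefactoredClient {
    static String connect(Endpoint endpoint) {
        return "connect " + endpoint.describe();
    }

    static String ping(Endpoint endpoint) {
        return "ping " + endpoint.describe();
    }
}
```

Whether such an extraction improves readability depends on whether the new class carries meaning of its own, for example through a method like `describe()`, rather than being a bare container.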

Pursuant to the EU AI Act, I fully disclose the use of LLMs in generating these refactorings, emphasizing that all changes have undergone human review for quality assurance.

Even if you decide not to integrate my changes into your codebase (which is perfectly fine), I ask you to fill out a feedback survey, which will be scientifically evaluated to determine the acceptance of AI-supported refactorings. You can find the feedback survey at https://campus.lamapoll.de/Data-clump-refactoring/en

Thank you for considering my contribution. I look forward to your feedback. If you have any other questions or comments, feel free to write a comment or email me at [email protected].

Best regards,
Timo Schoemaker
Department of Computer Science
University of Osnabrück

Check List

  • [ ] New functionality includes testing.
  • [ ] New functionality has a documentation issue. Please link to it in this PR.
    • [ ] New functionality has javadoc added
  • [x] Commits are signed with a real name per the DCO

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license. For more information on following Developer Certificate of Origin and signing off your commits, please check here.

compf avatar Apr 19 '24 11:04 compf

@compf Thanks a lot for your contribution. Applying LLMs to improve our project's code quality is an interesting and innovative idea. I have the following questions:

  1. Which LLM did you use? Are there any references or links?
  2. In the long term (not necessarily in this PR), is it possible to integrate it into a GitHub automated CI/CD workflow?

chenqi0805 avatar May 13 '24 16:05 chenqi0805

@compf Excellent! Thank you very much for your contribution. I will take a look at the diff and provide feedback.

kkondaka avatar May 13 '24 17:05 kkondaka

> @compf Thanks a lot for your contribution. Applying LLMs to improve our project's code quality is an interesting and innovative idea. I have the following questions:
>
>   1. Which LLM did you use? Are there any references or links?
>   2. In the long term (not necessarily in this PR), is it possible to integrate it into a GitHub automated CI/CD workflow?

Thank you for the questions.

  1. For your project I used GPT-4-1106 from OpenAI.
  2. This is not the main part of my master's thesis. However, the general idea of my project should be integrable into GitHub Actions or similar CI/CD processes. The issue is that using an LLM very often leads to uncompilable code that must be fixed by a human in the loop. Currently, the only reliable integration would be to let an LLM suggest only the name of the extracted class and to perform the refactoring with another tool. But I am hopeful that LLMs will improve in the future and handle the refactoring better.

compf avatar May 14 '24 23:05 compf

Thanks for the contribution. I am a little doubtful about this refactoring. It creates a new, very abstract class, OperationParameters, that is just a container for two abstract maps. I outlined which functions would give it more meaning for me.

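A possible extension would be to wrap BiFunction<Object, Object, Number> and similar types in functional interfaces:

import java.util.function.BiFunction;

// Named alias for the generic BiFunction type; adds no new methods.
@FunctionalInterface
interface Operation extends BiFunction<Object, Object, Number> {
}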

This kind of extension would make the new class OperationParameters more readable and aid understanding of the code. Extracting the class alone falls a little short, in my opinion.
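As a rough sketch of how such a named interface might read in use: the `Operation` interface follows the suggestion above, while `OperationTable`, its string keys, and its `apply` method are assumptions made for illustration. The thread does not show the real fields of OperationParameters, so this is not its actual shape.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.BiFunction;

// Named alias for the generic function type, as suggested in the review.
@FunctionalInterface
interface Operation extends BiFunction<Object, Object, Number> {
}

// Hypothetical container: the key type and the lookup method are assumptions.
final class OperationTable {
    // Reads as "symbol to operation" instead of a nested generic type.
    private final Map<String, Operation> operations = new HashMap<>();

    OperationTable() {
        operations.put("+", (a, b) ->
                Double.valueOf(((Number) a).doubleValue() + ((Number) b).doubleValue()));
        operations.put("*", (a, b) ->
                Double.valueOf(((Number) a).doubleValue() * ((Number) b).doubleValue()));
    }

    // Looks up the operation by symbol and applies it to both operands.
    Number apply(String symbol, Object left, Object right) {
        Operation operation = operations.get(symbol);
        if (operation == null) {
            throw new IllegalArgumentException("Unknown operation: " + symbol);
        }
        return operation.apply(left, right);
    }
}
```

The named type gives the map declaration a domain meaning and offers a natural place for documentation or default methods later.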

Thank you very much for the feedback :)

compf avatar May 14 '24 23:05 compf