Refactor the code_quality transform to follow the new transform style

Open ian-cho opened this issue 8 months ago • 4 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/code_quality

Feature

Refactor the code_quality codebase into the new format by following the Data Prep Kit Developer Tutorial.

Are you willing to submit a PR?

  • [x] Yes I am willing to submit a PR!

ian-cho • Mar 24 '25 06:03

Hi @touma-I, I created an issue here for the code_quality conversion. Here is the new structure: https://ibm.ent.box.com/notes/1811891238254. I will try to complete an initial conversion this week. (CC @issei-ibm)

ian-cho • Mar 25 '25 14:03

Refactoring steps

  • [x] Step 1: Identify 12 quality-related annotations (from readme.txt)

    • Line annotation:
      • line_mean: average line length
      • line_max: longest line length
      • total_num_lines: total lines
      • avg_longest_lines: average of the top-n longest lines
    • Character/Token annotation:
      • calculate_char_token_ratio: number of characters / number of tokens
      • alphanum_frac: fraction of alphanumeric characters in the sample
    • Keyword detection:
      • autogenerated: check if file is autogenerated (using keywords in the first few lines)
      • config_or_test: check if file is a config or test file
    • Heuristics:
      • has_no_keywords: file has none of: function, class, for loop, while loop
      • has_few_assignments: file uses '=' fewer times than a defined minimum
    • Text Format:
      • is_xml: check if input is XML content
      • is_html: check if input is HTML based on text-to-code ratio
  • [x] Step 2: Refactor code_quality_transform.py into dpk_code_quality/transform.py

    • No deep changes made
  • [x] Step 3-4: Refactor CodeQualityTransformConfiguration into dpk_code_quality/runtime.py

    • Added three classes:
      • CodeQualityConfiguration
      • CodeQualityRuntime
      • CodeQuality
  • [x] Step 5: Implement RayTransformRuntimeConfiguration

    • Added two classes:
      • CodeQualityRayTransformConfiguration
      • CodeQuality
  • [x] Step 6: Develop test for Python (not for Ray)

  • [ ] Step 7: Makefile

    • Tutorial does not provide a complete implementation
    • Help needed
  • [x] Step 8: Create Readme file
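
For readers unfamiliar with the Step 1 annotations, here is a minimal sketch of a few of them. It is not the code in dpk_code_quality/transform.py. The function `line_stats`, the `top_n` and `minimum` defaults, and the handling of empty input are assumptions made for illustration only.

```python
def line_stats(text: str, top_n: int = 3) -> dict:
    """Line-level annotations for one source file (illustrative only)."""
    lengths = [len(line) for line in text.splitlines()]
    if not lengths:
        # Assumed behaviour for empty input: report zeros.
        return {
            "line_mean": 0.0,
            "line_max": 0,
            "total_num_lines": 0,
            "avg_longest_lines": 0.0,
        }
    # The top-n longest lines; top_n is a hypothetical parameter name.
    longest = sorted(lengths, reverse=True)[:top_n]
    return {
        "line_mean": sum(lengths) / len(lengths),
        "line_max": max(lengths),
        "total_num_lines": len(lengths),
        "avg_longest_lines": sum(longest) / len(longest),
    }


def alphanum_frac(text: str) -> float:
    """Fraction of alphanumeric characters in the sample."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def has_few_assignments(text: str, minimum: int = 2) -> bool:
    """True if '=' occurs fewer than `minimum` times (assumed default)."""
    return text.count("=") < minimum


sample = "x = 1\ny = 22\nprint(x + y)\n"
stats = line_stats(sample, top_n=2)
print(stats)
print(alphanum_frac(sample))
print(has_few_assignments(sample, minimum=3))
```

Counting every '=' is deliberately crude here: it also matches comparison operators such as '==', so the real transform may well count assignments differently.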

ian-cho • Mar 27 '25 05:03

Hi @touma-I @shahrokhDaijavad, here is the PR for my initial refactoring of code_quality: https://github.com/data-prep-kit/data-prep-kit/pull/1170. I do not expect the current PR to pass all the tests :) because several components, such as the Makefile and probably the KFP-related parts, need your help. These parts are somewhat beyond my knowledge. Thanks a lot!

ian-cho • Mar 27 '25 06:03

@ian-cho Thanks for your work so far. You are right that the tutorial does not provide information about the Makefile. We need to add that information, but for now, I think you can mimic https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/tokenization2arrow/Makefile, which uses all default command-line parameters (as you do), to create two make targets (one each for the Python and Ray runtimes).

shahrokhDaijavad • Mar 27 '25 18:03

Merged in PR #1191.

swith005 • Jun 24 '25 19:06