Refactor the code_quality transform to follow the new transform style
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/code_quality
Feature
Refactor the code_quality codebase to the new format by following the Data Prep Kit Developer Tutorial.
Are you willing to submit a PR?
- [x] Yes I am willing to submit a PR!
Hi @touma-I, I created an issue here for the code_quality conversion. Here is the new structure: https://ibm.ent.box.com/notes/1811891238254. I will try to complete an initial conversion this week. (CC @issei-ibm)
Refactoring steps
- [x] Step 1: Identify 12 quality-related annotations (from `readme.txt`)
  - Line annotation:
    - `line_mean`: average line length
    - `line_max`: longest line length
    - `total_num_lines`: total lines
    - `avg_longest_lines`: average of the top-n longest lines
  - Character/Token annotation:
    - `calculate_char_token_ratio`: number of characters / number of tokens
    - `alphanum_frac`: fraction of alphanumeric characters in the sample
  - Keyword detection:
    - `autogenerated`: check if file is autogenerated (using keywords in the first few lines)
    - `config_or_test`: check if file is a config or test file
  - Heuristics:
    - `has_no_keywords`: file has none of: function, class, for loop, while loop
    - `has_few_assignments`: file uses '=' less than a defined minimum
  - Text Format:
    - `is_xml`: check if input is XML content
    - `is_html`: check if input is HTML based on text-to-code ratio
- [x] Step 2: Refactor `code_quality_transform.py` into `dpk_code_quality/transform.py`
  - No deep changes made
- [x] Step 3-4: Refactor `CodeQualityTransformConfiguration` into `dpk_code_quality/runtime.py`
  - Added three classes: `CodeQualityConfiguration`, `CodeQualityRuntime`, `CodeQuality`
- [x] Step 5: Implement RayTransformRuntimeConfiguration
  - Added two classes: `CodeQualityRayTransformConfiguration`, `CodeQuality`
- [x] Step 6: Develop test for Python (not for Ray)
- [ ] Step 7: Makefile
  - Tutorial does not provide a complete implementation
  - Help needed
- [x] Step 8: Create Readme file
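For readers unfamiliar with the annotations listed in Step 1, here is a minimal sketch of how a few of them could be computed. It is not the transform's actual code: the function names only mirror the annotation names above, and the whitespace tokenizer, the `top_n` and `minimum` defaults, and the Python-style keyword list are all assumptions made for illustration.

```python
from statistics import mean


def line_stats(text: str, top_n: int = 3) -> dict:
    """Line annotations: mean/max length, line count, mean of the top-n longest lines."""
    lines = text.splitlines()
    # Fall back to a single zero so mean() and max() work on empty input.
    lengths = [len(line) for line in lines] or [0]
    longest = sorted(lengths, reverse=True)[:top_n]
    return {
        "line_mean": mean(lengths),
        "line_max": max(lengths),
        "total_num_lines": len(lines),
        "avg_longest_lines": mean(longest),
    }


def alphanum_frac(text: str) -> float:
    """Fraction of characters in the sample that are alphanumeric."""
    if not text:
        return 0.0
    return sum(ch.isalnum() for ch in text) / len(text)


def char_token_ratio(text: str) -> float:
    """Number of characters divided by number of tokens."""
    # Assumption: whitespace splitting stands in for the real tokenizer.
    tokens = text.split()
    return len(text) / len(tokens) if tokens else 0.0


def has_few_assignments(text: str, minimum: int = 2) -> bool:
    """True when '=' appears fewer than `minimum` times (hypothetical default)."""
    return text.count("=") < minimum


def has_no_keywords(text: str) -> bool:
    """True when none of the function/class/loop keywords appear."""
    # Assumption: a Python-style keyword list; the real check may differ per language.
    keywords = ("def ", "class ", "for ", "while ")
    return not any(keyword in text for keyword in keywords)


sample = "x = 1\nfor i in range(3):\n    print(i)\n"
stats = line_stats(sample, top_n=2)
```

With the three-line `sample` above, `stats["line_max"]` is 18 and `stats["total_num_lines"]` is 3. A real implementation would presumably compute these per row of the input table and append them as new columns.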
Hi @touma-I @shahrokhDaijavad here is the PR https://github.com/data-prep-kit/data-prep-kit/pull/1170 of my initial refactoring for code_quality.
I would not expect the current PR to pass all the tests :) because several components, such as the Makefile and probably the KFP-related parts, need your help. These parts are beyond my current knowledge. Thanks a lot!
@ian-cho Thanks for your work so far. You are right that the tutorial does not provide information about the Makefile. We need to add that information, but for now, I think you can mimic https://github.com/data-prep-kit/data-prep-kit/blob/dev/transforms/universal/tokenization2arrow/Makefile, which uses all default command-line parameters (like you do) to create two make targets (for the Python and Ray runtimes).
merged in PR #1191