data-prep-kit
[Bug] Semantic sort for repos doesn't seem to do as described
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other, Other
What happened + What you expected to happen
Hi IBM Team,
Thank you very much for open-sourcing your repository and working hard to make it modular and reusable.
I ran the following code for repo-level semantic ordering and faced a few problems.
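import os
import sys  # needed for sys.argv below
import pyarrow.parquet as pq
from data_processing.utils import ParamsUtils
from data_processing.data_access import DataAccessLocal
# RayTransformLauncher is used below; this import path is assumed and may differ between data-prep-kit versions
from data_processing_ray.runtime.ray import RayTransformLauncher
from repo_level_ordering.ray.src.repo_level_order_transform import RepoLevelOrderRayTransformConfiguration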
# create parameters
input_folder = os.path.abspath("sample_1000_rows")
output_folder = os.path.abspath("repo_level_output")
local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}
worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 1,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}
repo_level_params = {
    "repo_lvl_sorting_algo": "SORT_SEMANTIC_NORMALISED",
    "repo_lvl_store_type": "ray",
    "repo_lvl_store_backend_dir": "./mystore",
    "repo_lvl_language_column": "language",
    "repo_lvl_sorting_enabled": True,
    "repo_lvl_output_by_langs": True,
    "repo_lvl_combine_rows": True,
}
if __name__ == "__main__":
    # Set the simulated command line args
    sys.argv = ParamsUtils.dict_to_req(d=params | repo_level_params)
    # create launcher
    launcher = RayTransformLauncher(RepoLevelOrderRayTransformConfiguration())
    # Launch the ray actor(s) to process the input
    launcher.launch()
Running the code above raised a few questions and problems:
(1) The following columns are required but are not mentioned in the README: contents, ext, document_id.
(2) What is the code_location dictionary meant to do?
(3) Changing repo_lvl_output_by_langs and repo_lvl_combine_rows does not seem to have any effect.
(4) The semantic sort seems to be based on the depth of the path, not on the dependencies/imports within the contents as described in the Granite long context paper. Is this the case? In the test data, I also noticed that robotics-paper_ark2022_3T1R-master/run_evaluation.m is placed before all of the dependencies it calls during evaluation, e.g. robotics-paper_ark2022_3T1R-master/dimsynth/robot_names.m. This behaviour is consistent with the Python repos I am testing on, where main.py is listed before its dependencies in deeper folders.
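To make the difference in (4) concrete, here is a minimal sketch of the two orderings. It is not the transform's implementation: order_by_depth and order_by_dependencies are hypothetical helpers, and the dependency edge is written by hand from the example above rather than parsed from the file contents. It only uses Python's standard graphlib module (3.9+).

```python
from graphlib import TopologicalSorter

REPO = "robotics-paper_ark2022_3T1R-master"
paths = [
    f"{REPO}/dimsynth/robot_names.m",
    f"{REPO}/run_evaluation.m",
]

# Hand-written for illustration: each file maps to the set of files it calls.
# A real implementation would have to extract these edges from the contents.
deps = {
    f"{REPO}/run_evaluation.m": {f"{REPO}/dimsynth/robot_names.m"},
    f"{REPO}/dimsynth/robot_names.m": set(),
}


def order_by_depth(file_paths):
    """Shallowest paths first, ties broken alphabetically (the behaviour I observe)."""
    return sorted(file_paths, key=lambda p: (p.count("/"), p))


def order_by_dependencies(dependencies):
    """Topological order: every file appears after the files it depends on."""
    return list(TopologicalSorter(dependencies).static_order())


depth_order = order_by_depth(paths)
dependency_order = order_by_dependencies(deps)

if __name__ == "__main__":
    print("by depth:       ", depth_order)
    print("by dependencies:", dependency_order)
```

With the depth-based ordering, run_evaluation.m comes first, which matches what I see in the output. With the dependency-based ordering, dimsynth/robot_names.m comes first, which is what I expected from the description of the semantic sort.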
Reproduction script
transforms/code/repo_level_ordering/ray/test-data/expected/repo2.parquet
Anything else
No response
OS
Ubuntu
Python
3.10.x
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!