[Bug] Semantic sort for repos doesn't seem to do as described

Open MisterKloudy opened this issue 4 months ago • 0 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/Other, Other

What happened + What you expected to happen

Hi IBM Team,

Thank you very much for open-sourcing your repository and working hard to make it modular and reusable.

I ran the following code for repo-level semantic ordering and faced a few problems.

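import os
import sys  # needed for sys.argv below
import pyarrow.parquet as pq
from data_processing.utils import ParamsUtils
from data_processing.data_access import DataAccessLocal
# Launcher import was missing from the original snippet; this path is assumed from the data-prep-kit Ray runtime package
from data_processing_ray.runtime.ray import RayTransformLauncher
from repo_level_ordering.ray.src.repo_level_order_transform import RepoLevelOrderRayTransformConfiguration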

# create parameters
input_folder = os.path.abspath("sample_1000_rows")
output_folder = os.path.abspath("repo_level_output")

local_conf = {
    "input_folder": input_folder,
    "output_folder": output_folder,
}

worker_options = {"num_cpus": 0.8}
code_location = {"github": "github", "commit_hash": "12345", "path": "path"}
params = {
    # where to run
    "run_locally": True,
    # Data access. Only required parameters are specified
    "data_local_config": ParamsUtils.convert_to_ast(local_conf),
    # orchestrator
    "runtime_worker_options": ParamsUtils.convert_to_ast(worker_options),
    "runtime_num_workers": 1,
    "runtime_pipeline_id": "pipeline_id",
    "runtime_job_id": "job_id",
    "runtime_creation_delay": 0,
    "runtime_code_location": ParamsUtils.convert_to_ast(code_location),
}


repo_level_params = {
    "repo_lvl_sorting_algo": "SORT_SEMANTIC_NORMALISED",
    "repo_lvl_store_type": "ray",
    "repo_lvl_store_backend_dir": "./mystore",
    "repo_lvl_language_column": "language",
    "repo_lvl_sorting_enabled": True,
    "repo_lvl_output_by_langs": True,
    "repo_lvl_combine_rows": True,
}

if __name__ == "__main__":
    # Set the simulated command line args
    sys.argv = ParamsUtils.dict_to_req(d=params | repo_level_params)
    # create launcher
    launcher = RayTransformLauncher(RepoLevelOrderRayTransformConfiguration())
    # Launch the ray actor(s) to process the input
    launcher.launch()

I have a few questions/problems that surfaced after running the above code:

(1) The following columns are required but are not mentioned in the README: contents, ext, document_id.

(2) What is the code_location dictionary meant to do?

(3) Changing repo_lvl_output_by_langs and repo_lvl_combine_rows does not seem to have any effect.

(4) The semantic sort seems to be based on the depth of the path and not on the dependencies/imports within the contents, as described in the Granite long context paper. Is this the case? In the test data I also noticed that robotics-paper_ark2022_3T1R-master/run_evaluation.m is placed before all of the dependencies it calls during evaluation, e.g. robotics-paper_ark2022_3T1R-master/dimsynth/robot_names.m. This behaviour is consistent with the Python repos I'm testing on, where main.py is listed before the dependencies in deeper folders.
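To make point (4) concrete, here is a minimal sketch of the two orderings being contrasted. It is not the transform's actual implementation: the `deps` map is hand-written for illustration (it only encodes the one call relationship mentioned above), and `order_by_depth` and `order_by_dependencies` are hypothetical helper names. The dependency-based variant uses the standard library's `graphlib.TopologicalSorter`.

```python
from graphlib import TopologicalSorter

# Illustrative dependency map: file -> set of files it calls/imports.
# Only the relationship described in this issue is encoded here.
deps = {
    "robotics-paper_ark2022_3T1R-master/run_evaluation.m": {
        "robotics-paper_ark2022_3T1R-master/dimsynth/robot_names.m",
    },
    "robotics-paper_ark2022_3T1R-master/dimsynth/robot_names.m": set(),
}


def order_by_depth(paths):
    """Hypothetical helper: shallowest paths first, ties broken alphabetically."""
    return sorted(paths, key=lambda p: (p.count("/"), p))


def order_by_dependencies(dep_map):
    """Hypothetical helper: each file appears after the files it depends on."""
    return list(TopologicalSorter(dep_map).static_order())


if __name__ == "__main__":
    print("by depth:       ", order_by_depth(list(deps)))
    print("by dependencies:", order_by_dependencies(deps))
```

With this input, the depth-based ordering puts run_evaluation.m first (the behaviour I am observing), while the dependency-based ordering puts it last, after dimsynth/robot_names.m (the behaviour I expected from the description).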

Reproduction script

transforms/code/repo_level_ordering/ray/test-data/expected/repo2.parquet

Anything else

No response

OS

Ubuntu

Python

3.10.x

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

MisterKloudy, Oct 04 '24 09:10