[Feature] Add functionality to organise files by repository structure
Search before asking
- [X] I searched the issues and found no similar issues.
Component
Transforms/Other
Feature
Add a new module to organise code files using information from the structure of a repo.
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
To develop the Repo-Level Ordering transform for data-prep-kit, we require the following approach:
- Iterate over all files, run a groupby, and make a list of all repos.
- Update each repo name and the list of files in which it appears into a store (the store here represents a data structure similar to a key/value store: [str, List[str]]).
- After processing all files and populating the store, iterate over the keys of the store; for each repo key, read the files corresponding to that key, filter the rows by that repo, and save only those rows into the output (see the sketch after this list).
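A minimal Python sketch of the first part, assuming parquet inputs that carry a repo_name column; a plain dict stands in for the store, and populate_store/input_files are illustrative names, not existing code:

```python
from collections import defaultdict
import pyarrow.parquet as pq

def populate_store(input_files: list[str]) -> dict[str, list[str]]:
    # Maps repo name -> list of input files mentioning that repo.
    store: dict[str, list[str]] = defaultdict(list)
    for path in input_files:
        # Read only the repo_name column to keep memory usage low.
        table = pq.read_table(path, columns=["repo_name"])
        for repo in table.column("repo_name").unique().to_pylist():
            store[repo].append(path)
    return store
```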
So the above algorithm has two stages:
- Stage 1: populate files per repo into a store.
- Stage 2: read the list of files from the store and filter according to the repo.
In this specific use case the store has only writes in Stage 1 and only reads in Stage 2 (Stage 2 is sketched below).
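A matching Stage 2 sketch, again with illustrative names (write_per_repo, output_dir) and the same assumed repo_name column:

```python
import os
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def write_per_repo(store: dict[str, list[str]], output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    for repo, files in store.items():
        parts = []
        for path in files:
            table = pq.read_table(path)
            # Keep only the rows belonging to this repo.
            parts.append(table.filter(pc.equal(table.column("repo_name"), repo)))
        merged = pa.concat_tables(parts)
        # One parquet file per repository; flatten "org/repo" style names.
        out_path = os.path.join(output_dir, repo.replace("/", "_") + ".parquet")
        pq.write_table(merged, out_path)
```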
So we can have three approaches to implement the store:

- Using the Ray object store (multi-actor); this uses the memory of the nodes and is constrained by the memory available on the cluster and by the network.
- Using a filesystem/S3 as the backend store (folders as keys and files as lists of values); constrained by the network.
- Using an external store, e.g. etcd.
For processing large data, as in our case, we can go with the second approach.
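As an illustration of the second approach, a minimal local-filesystem store where each repo key becomes a folder and each file path becomes an empty marker file inside it; FSStore and the percent-encoding of keys are assumptions here, and an S3 variant would map keys to object prefixes the same way:

```python
import os
from urllib.parse import quote, unquote

class FSStore:
    """Folders as keys, marker files as values (local-disk illustration)."""

    def __init__(self, root: str):
        self.root = root

    def put(self, repo: str, file_path: str) -> None:
        # Percent-encode the key so "org/repo" stays a single folder name.
        key_dir = os.path.join(self.root, quote(repo, safe=""))
        os.makedirs(key_dir, exist_ok=True)
        # An empty marker file records one input file for this repo.
        open(os.path.join(key_dir, quote(file_path, safe="")), "w").close()

    def keys(self) -> list[str]:
        return [unquote(name) for name in os.listdir(self.root)]

    def get(self, repo: str) -> list[str]:
        key_dir = os.path.join(self.root, quote(repo, safe=""))
        return [unquote(name) for name in os.listdir(key_dir)]
```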
@shivdeep-singh-ibm I agree that option i is constrained by the Ray object store memory, and option iii introduces a new service requirement along with the associated service setup. IMHO option ii is the best solution for this multi-stage processing, though we may see multiple reads and writes to external storage.
I am sorry, I am missing something here. What exactly are we trying to produce here?
@blublinsky There is a code transform requirement which runs a groupby on the data with respect to the repo_name column and then runs a sorting algorithm (semantic_sort or sort_by_filename) on the grouped data. It writes the output as one parquet file per repo.
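A hedged sketch of that flow, where sort_by_filename is a simple stand-in that orders rows by an assumed "path" column (the real semantic_sort would differ):

```python
import os
import pyarrow as pa
import pyarrow.compute as pc
import pyarrow.parquet as pq

def sort_by_filename(table: pa.Table) -> pa.Table:
    # Stand-in sorting algorithm: order a repo's rows by an assumed
    # "path" column; the actual transform may use semantic_sort instead.
    return table.sort_by([("path", "ascending")])

def transform(table: pa.Table, output_dir: str) -> None:
    os.makedirs(output_dir, exist_ok=True)
    # Group rows by repo_name and emit one parquet file per repo.
    for repo in table.column("repo_name").unique().to_pylist():
        group = table.filter(pc.equal(table.column("repo_name"), repo))
        out = os.path.join(output_dir, repo.replace("/", "_") + ".parquet")
        pq.write_table(sort_by_filename(group), out)
```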
So basically it creates a single arrow table per repository, right? And how does it know the repository name? Is it a separate column? Final question: should it be part of code2parquet?
> So basically it creates a single arrow table per repository, right? And how does it know the repository name? Is it a separate column? Final question: should it be part of code2parquet?
Yes, it needs a repository name, repo_name, which is expected to be in the data. As of now this feature is not in code2parquet, but I think it should somehow come from it.
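For illustration, a minimal input table carrying the expected repo_name column; the other columns here are assumptions:

```python
import pyarrow as pa

table = pa.table({
    "repo_name": ["org/repo-a", "org/repo-a", "org/repo-b"],
    "path": ["src/main.py", "README.md", "lib/util.py"],
    "content": ["...", "...", "..."],
})
```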
The other issues linked to this development are: