
[Feature] Add functionality to organise files by repository structure

Open Bytes-Explorer opened this issue 1 year ago • 7 comments

Search before asking

  • [X] I searched the issues and found no similar issues.

Component

Transforms/Other

Feature

Add a new module to organise code files using information from the structure of a repo.

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

Bytes-Explorer avatar Jun 17 '24 11:06 Bytes-Explorer

To develop a Repo-Level Ordering transform for data-prep-kit, the following approach seems to be required:

  • Iterate over all files, run a groupby, and build a list of all repos.
  • For each repo name, record in a store the list of files in which it is found.
  • The store here is a data structure similar to a key/value store: repo_name -> List[str] of file paths.
  • After processing all files and populating the store, iterate over the store's keys; for each repo key, read the files corresponding to that key, filter the rows by that repo, and save only those rows to the output.

So the algorithm above has two stages:

  • Stage 1: populate the files per repo into a store.
  • Stage 2: read the list of files from the store and filter them according to repo.

In this specific use case, the store only receives writes in Stage 1 and only reads in Stage 2.

So, we have three approaches to implementing the store:

    1. Ray object store (multi-actor): uses the memory of the nodes; constrained by the memory available on the cluster and by the network.
    2. Filesystem/S3 as the store backend (folders as keys, files as lists of values): constrained by the network.
    3. An external store, e.g. etcd.

For processing large data, as in our case, we can go with the second approach.

shivdeep-singh-ibm avatar Jun 20 '24 05:06 shivdeep-singh-ibm

@shivdeep-singh-ibm I agree: option i is constrained by the Ray object store memory, and option iii introduces a new service requirement with the associated setup and maintenance. IMHO option ii is the best solution for this multi-stage processing, though we may see multiple reads and writes to external storage.

Param-S avatar Jun 20 '24 06:06 Param-S

I am sorry, I am missing something here. What exactly are we trying to produce here?

blublinsky avatar Jun 23 '24 19:06 blublinsky

@blublinsky There is a code transform requirement which runs a groupby on the data with respect to the repo_name column and then runs a sorting algorithm (semantic_sort or sort_by_filename) on the grouped data. It writes one output parquet file per repo.
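The per-repo ordering step described above could look roughly like this. A hypothetical sketch: rows are dicts, sort_by_filename is modelled as sorting on an assumed "path" column, and semantic_sort would plug in as an alternative key function; the real transform operates on arrow tables and writes one parquet per repo.

```python
from collections import defaultdict

def order_by_repo(rows, sort_key="path"):
    """Group rows by repo_name, then order each group.

    Sorting on the file path stands in for sort_by_filename;
    a semantic_sort would supply a different ordering key.
    """
    groups = defaultdict(list)
    for row in rows:
        groups[row["repo_name"]].append(row)   # groupby repo_name
    # One ordered group per repo; each would become one output parquet.
    return {repo: sorted(group, key=lambda r: r[sort_key])
            for repo, group in groups.items()}
```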

shivdeep-singh-ibm avatar Jun 24 '24 04:06 shivdeep-singh-ibm

So basically it creates a single arrow table per repository, right? And how does it know the repository name? Is it a separate column? Final question: should it be part of code2parquet?

blublinsky avatar Jun 24 '24 07:06 blublinsky

> So basically it creates a single arrow table per repository, right? And how does it know the repository name? is it a separate column? Final question. Should it be part of code2parquet

Yes, it needs a repository name, repo_name, which is expected to be present in the data. As of now this feature is not in code2parquet, but I think it should somehow come from it.

shivdeep-singh-ibm avatar Jul 18 '24 14:07 shivdeep-singh-ibm