Daya Guo
You don't have to collect questions. Some entities like "唐朝" (Tang Dynasty) or "李白" (Li Bai) also work well.
We first parse the dependencies between files, e.g., A->B, B->C, B->D. Then we rearrange the file positions based on their dependencies, e.g., A, B, C, D. For file paths, we add them...
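A minimal sketch of the dependency-parsing step for Python files. The function name and the regex-based import matching are illustrative only; a real pipeline would resolve package paths properly and handle each supported language's import syntax.

```python
import re
from pathlib import Path


def parse_dependencies(repo_root):
    """Build a file-level dependency map from Python import statements.

    Returns deps[f] = set of files (by module stem) that f depends on.
    A sketch only: it matches top-level `import x` / `from x import y`
    lines against module names found in the repo, ignoring packages,
    relative imports, and name collisions across directories.
    """
    repo_root = Path(repo_root)
    files = {p.stem: p for p in repo_root.rglob("*.py")}
    pattern = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)
    deps = {}
    for name, path in files.items():
        found = set()
        for match in pattern.findall(path.read_text(errors="ignore")):
            top = match.split(".")[0]  # top-level module name
            if top in files and top != name:
                found.add(top)
        deps[name] = found
    return deps
```

Given files `a.py` (importing `b`) and `b.py` (importing `c`), this yields the edge chain a depends on b, b depends on c, which the rearrangement step can then turn into the order c, b, a.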
Only for Python, Java, C#, C, and C++.
Just the above languages. Other languages employ file-level dedup.
Not yet. We will try to evaluate the model on repo-level benchmarks. On function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.
We will use public datasets like [RepoCoder](https://arxiv.org/pdf/2303.12570.pdf) and [CrossCodeEval](https://crosscodeeval.github.io/) to evaluate.
First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until...
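The selection loop above can be sketched as follows. The function name and the `deps` dictionary format are assumptions for illustration: `deps[f]` is the set of files `f` depends on, so a file's in-degree is its number of not-yet-placed dependencies.

```python
import random
from collections import defaultdict


def dependency_order(deps):
    """Repeatedly emit the file with the smallest in-degree,
    choosing randomly among ties.

    Because we take the minimum in-degree (not strictly zero),
    the loop also terminates when the graph contains cycles.
    """
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indeg = {f: len(deps.get(f, ())) for f in nodes}
    dependents = defaultdict(set)  # reverse edges: dep -> files needing it
    for f, ds in deps.items():
        for d in ds:
            dependents[d].add(f)

    order, remaining = [], set(nodes)
    while remaining:
        lowest = min(indeg[f] for f in remaining)
        candidates = sorted(f for f in remaining if indeg[f] == lowest)
        chosen = random.choice(candidates)  # random tie-break
        order.append(chosen)
        remaining.remove(chosen)
        for f in dependents[chosen]:  # chosen is now placed
            if f in remaining:
                indeg[f] -= 1
    return order
```

For the earlier example (B depends on A; C and D depend on B), this always emits A first, then B, then C and D in random order.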
Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph...
The term "independent subgraph" refers to a weakly connected subgraph: first convert the directed graph into an undirected graph, then divide it into connected subgraphs. That is,...
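A minimal sketch of that split, under the same assumed `deps` format as above (`deps[f]` = set of files `f` depends on); the function name is illustrative. Edge direction is discarded, then a BFS collects each connected component.

```python
from collections import defaultdict, deque


def weakly_connected_components(deps):
    """Split a directed dependency graph into weakly connected
    subgraphs: treat every edge as undirected, then find the
    connected components with a breadth-first search."""
    undirected = defaultdict(set)
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    for f, ds in deps.items():
        for d in ds:  # ignore direction
            undirected[f].add(d)
            undirected[d].add(f)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            component.add(u)
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components
```

Each returned component can then be ordered and concatenated as one independent training sample.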
> Regarding repo-level concatenation, I have a related question.
>
> In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating...