Daya Guo
You don't have to collect questions. Some entities like "唐朝" (Tang Dynasty) or "李白" (Li Bai) also work well.
We first parse the dependencies between files, e.g., A->B, B->C, B->D. Then we rearrange the file positions based on their dependencies, e.g., A, B, C, D. For file paths, we add them...
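A minimal sketch of the dependency-parsing step for Python files. The function name and the regex-based import matching are illustrative only; a real pipeline would resolve package paths properly and handle each supported language's import syntax.

```python
import re
from pathlib import Path


def parse_dependencies(repo_root):
    """Build a file-level dependency map from Python import statements.

    Returns deps[f] = set of files (by module stem) that f depends on.
    A sketch only: it matches top-level `import x` / `from x import y`
    lines against module names found in the repo, ignoring packages,
    relative imports, and name collisions across directories.
    """
    repo_root = Path(repo_root)
    files = {p.stem: p for p in repo_root.rglob("*.py")}
    pattern = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)
    deps = {}
    for name, path in files.items():
        found = set()
        for match in pattern.findall(path.read_text(errors="ignore")):
            top = match.split(".")[0]  # top-level module name
            if top in files and top != name:
                found.add(top)
        deps[name] = found
    return deps
```

Given files `a.py` (importing `b`) and `b.py` (importing `c`), this yields the edge chain a depends on b, b depends on c, which the rearrangement step can then turn into the order c, b, a.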
Only for Python, Java, C#, C, and C++.
Just the above languages. Other languages employ file-level dedup.
Not yet. We will try to evaluate the model on repo-level benchmarks. On function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.
We will use public datasets like [RepoCoder](https://arxiv.org/pdf/2303.12570.pdf) and [CrossCodeEval](https://crosscodeeval.github.io/) to evaluate.
First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until...
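The selection loop above can be sketched as follows. The function name and the `deps` dictionary format are assumptions for illustration: `deps[f]` is the set of files `f` depends on, so a file's in-degree is its number of not-yet-placed dependencies.

```python
import random
from collections import defaultdict


def dependency_order(deps):
    """Repeatedly emit the file with the smallest in-degree,
    choosing randomly among ties.

    Because we take the minimum in-degree (not strictly zero),
    the loop also terminates when the graph contains cycles.
    """
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    indeg = {f: len(deps.get(f, ())) for f in nodes}
    dependents = defaultdict(set)  # reverse edges: dep -> files needing it
    for f, ds in deps.items():
        for d in ds:
            dependents[d].add(f)

    order, remaining = [], set(nodes)
    while remaining:
        lowest = min(indeg[f] for f in remaining)
        candidates = sorted(f for f in remaining if indeg[f] == lowest)
        chosen = random.choice(candidates)  # random tie-break
        order.append(chosen)
        remaining.remove(chosen)
        for f in dependents[chosen]:  # chosen is now placed
            if f in remaining:
                indeg[f] -= 1
    return order
```

For the earlier example (B depends on A; C and D depend on B), this always emits A first, then B, then C and D in random order.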
Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph...
The term "independent subgraph" refers to a weakly connected subgraph: first convert the directed graph into an undirected graph, then divide it into connected subgraphs. That is,...
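A minimal sketch of that split, under the same assumed `deps` format as above (`deps[f]` = set of files `f` depends on); the function name is illustrative. Edge direction is discarded, then a BFS collects each connected component.

```python
from collections import defaultdict, deque


def weakly_connected_components(deps):
    """Split a directed dependency graph into weakly connected
    subgraphs: treat every edge as undirected, then find the
    connected components with a breadth-first search."""
    undirected = defaultdict(set)
    nodes = set(deps) | {d for ds in deps.values() for d in ds}
    for f, ds in deps.items():
        for d in ds:  # ignore direction
            undirected[f].add(d)
            undirected[d].add(f)

    seen, components = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, queue = set(), deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            component.add(u)
            for v in undirected[u]:
                if v not in seen:
                    seen.add(v)
                    queue.append(v)
        components.append(component)
    return components
```

Each returned component can then be ordered and concatenated as one independent training sample.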
> Regarding repo-level concatenation, I have a related question.
>
> In a batch, one sample may contain multi docs from different files, such as repo_a/file_a and repo_a/file_b. When concatenating...