
Repo level concatenation of data

Open Bytes-Explorer opened this issue 2 years ago • 38 comments

Can you share more details on the technique for the repo-level concatenation part?

Bytes-Explorer avatar Nov 21 '23 15:11 Bytes-Explorer

We first parse the dependencies between files, e.g. A->B, B->C, B->D. Then we rearrange the file positions based on their dependencies, e.g. A, B, C, D. As for file paths, we prepend them to each code file as a comment. An example is shown in https://github.com/deepseek-ai/DeepSeek-Coder#4-repository-level-code-completion

guoday avatar Nov 24 '23 15:11 guoday
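For a concrete picture of the parsing step, here is a minimal sketch for Python sources. It is an illustrative assumption rather than the actual DeepSeek pipeline: the regex, the path handling, and the edge convention (an edge X -> Y meaning file Y imports file X, so a file's in-degree counts its local dependencies) are all choices made for this sketch.

import re
from collections import defaultdict

# Hypothetical dependency parser: `files` maps a path such as
# "repo/utils.py" to its source text. Returns graph[X] = the set of
# files that import X, i.e. edges point from dependency to dependent.
def parse_dependencies(files):
    module_of = {p.rsplit("/", 1)[-1].removesuffix(".py"): p for p in files}
    graph = defaultdict(set)
    import_re = re.compile(r"^\s*(?:from|import)\s+([\w.]+)", re.MULTILINE)
    for path, source in files.items():
        graph[path]  # touch the key so files with no local imports are kept
        for name in import_re.findall(source):
            root = name.split(".")[0]
            if root in module_of and module_of[root] != path:
                graph[module_of[root]].add(path)  # dependency -> dependent
    return graph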

Thank you for your response. Is this done for all languages in the data?

Bytes-Explorer avatar Nov 27 '23 05:11 Bytes-Explorer

Only for Python, Java, C#, C, and C++.

guoday avatar Nov 27 '23 06:11 guoday

Thank you @guoday

Bytes-Explorer avatar Nov 27 '23 06:11 Bytes-Explorer

@guoday Do you then do repo-level dedup for all programming languages, or just the languages above?

Bytes-Explorer avatar Nov 27 '23 14:11 Bytes-Explorer

Just the above languages. Other languages use file-level dedup.

guoday avatar Nov 28 '23 04:11 guoday

@guoday Thank you for your prompt responses. I was curious whether you did any ablation studies/evaluations to understand if repo-level concatenation helps model performance in a significant way.

Bytes-Explorer avatar Nov 28 '23 08:11 Bytes-Explorer

Not yet. We will try to evaluate the model on repo-level benchmarks. On function-level benchmarks, repo-level concatenation neither helps nor hurts model performance.

guoday avatar Nov 28 '23 12:11 guoday

Do you have your own repo-level benchmark, or do you use a standard one?

Bytes-Explorer avatar Nov 28 '23 12:11 Bytes-Explorer

We will use public datasets like RepoCoder and CrossCodeEval to evaluate.

guoday avatar Nov 28 '23 12:11 guoday

Ok thanks, was aware of those. Once again, appreciate your prompt responses. I look forward to reading the technical report from your group. Thanks!

Bytes-Explorer avatar Nov 28 '23 13:11 Bytes-Explorer

[image: dependency graph of the parsed file dependencies]

Hello, I would like to know the details of the data concatenation. Assuming the parsed dependency structure is the one in the picture, what are the concatenation results? Is it ACF, ADF, ADG, BCF, BDF, BDG, BE (7 pieces)?

Casi11as avatar Nov 29 '23 08:11 Casi11as

First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example there are many possibilities, one of which could be BACDFGE.

guoday avatar Nov 29 '23 08:11 guoday
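A hypothetical reconstruction of the ordering step just described (the function name, signature, and tie-breaking details are assumptions): repeatedly take the remaining node with the smallest in-degree, breaking ties at random. Because it takes the minimum rather than insisting on in-degree zero, the loop also terminates if the graph happens to contain a cycle.

import random

# graph[x] = the set of files that depend on x (edges x -> dependents),
# so files tend to be placed after the dependencies they rely on.
def dependency_order(graph, seed=None):
    rng = random.Random(seed)
    indegree = {node: 0 for node in graph}
    for node in graph:
        for dependent in graph[node]:
            indegree[dependent] = indegree.get(dependent, 0) + 1
    order, remaining = [], set(indegree)
    while remaining:
        lowest = min(indegree[n] for n in remaining)
        node = rng.choice(sorted(n for n in remaining if indegree[n] == lowest))
        order.append(node)
        remaining.remove(node)
        for dependent in graph.get(node, ()):
            if dependent in remaining:
                indegree[dependent] -= 1
    return order

On a plausible reconstruction of the pictured graph, {"A": {"C", "D"}, "B": {"C", "D", "E"}, "C": {"F"}, "D": {"F", "G"}}, one run can indeed produce BACDFGE.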

First, we select the file with the smallest in-degree; if there are multiple files with the smallest in-degree, we randomly choose one. This process is repeated until a dependency order is obtained. For your example there are many possibilities, one of which could be BACDFGE.

In other words, will all the files of the same language in a repo be concatenated into just one sample?

Casi11as avatar Nov 29 '23 09:11 Casi11as

Theoretically, yes. However, to shorten the sample length, we parse a repository in advance and then divide it into multiple independent subgraphs based on dependencies, with each independent subgraph regarded as a sample.

guoday avatar Nov 29 '23 09:11 guoday

Thanks! So what are the rules for dividing into subgraphs? Taking the picture I posted above as an example, what subgraphs will it be divided into?

Casi11as avatar Nov 29 '23 09:11 Casi11as

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?

If there is an attention mask, how can the concatenation serve the purpose of capturing repository context? If there isn't, training by simply splicing different files together end to end seems somewhat peculiar.

slamandar avatar Nov 29 '23 09:11 slamandar

The term "independent subgraph" refers to a weakly connected subgraph. First, convert the directed graph into an undirected graph, and then divide the graph into multiple connected subgraphs. That is, in each subgraph, any two vertices should be connected by edges within the subgraph. In your example, it is a connected subgraph, with only one subgraph, which is itself. The following is the code to divide the graph into subgraphs.

from collections import defaultdict

# convert the directed graph into an undirected graph
def to_undirected(graph):
    undirected_graph = defaultdict(set)
    for node in graph:
        undirected_graph[node]  # touch the key so isolated nodes are kept
        for neighbor in graph[node]:
            undirected_graph[node].add(neighbor)
            undirected_graph[neighbor].add(node)
    return undirected_graph

# Use DFS to find all connected subgraphs.
def dfs(graph, node, visited, subgraph):
    visited[node] = True
    subgraph.add(node)
    for neighbor in graph[node]:
        if not visited[neighbor]:
            dfs(graph, neighbor, visited, subgraph)

# obtain all subgraphs
def get_subgraphs(graph):
    undirected_graph = to_undirected(graph)
    visited = {node: False for node in undirected_graph}
    subgraphs = []
    for node in undirected_graph:
        if not visited[node]:
            subgraph = set()
            dfs(undirected_graph, node, visited, subgraph)
            subgraphs.append(subgraph)
    return subgraphs

guoday avatar Nov 29 '23 09:11 guoday
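For reference, a quick check of the helper above on a plausible reconstruction of the pictured graph (the edge set is inferred from the seven paths listed earlier, so treat it as an assumption):

graph = {
    "A": ["C", "D"],
    "B": ["C", "D", "E"],
    "C": ["F"],
    "D": ["F", "G"],
}
print(get_subgraphs(graph))
# [{'A', 'B', 'C', 'D', 'E', 'F', 'G'}] (set order may vary): one weakly
# connected component, so this repository would become a single sample.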

Regarding repo-level concatenation, I have a related question.

In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a?

If there is an attention mask, how can the concatenation serve the purpose of capturing repository context? If there isn't, training by simply splicing different files together end to end seems somewhat peculiar.

If file_b depends on file_a, why would there be a need for an attention mask to prevent file_b from attending to file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

guoday avatar Nov 29 '23 09:11 guoday

Regarding repo-level concatenation, I have a related question. In a batch, one sample may contain multiple documents from different files, such as repo_a/file_a and repo_a/file_b. When concatenating these files into one sample for pre-training and calculating the loss, will there still be an attention mask to prevent file_b from attending to file_a? If there is an attention mask, how can the concatenation serve the purpose of capturing repository context? If there isn't, training by simply splicing different files together end to end seems somewhat peculiar.

If file_b depends on file_a, why would there be a need for an attention mask to prevent file_b from attending to file_a? Conversely, if file_b doesn't depend on file_a, we wouldn't concatenate these files into a single sample.

In specific scenarios, such as the one described in https://github.com/deepseek-ai/DeepSeek-Coder/issues/43#issuecomment-1831433765, the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

guoday avatar Nov 29 '23 09:11 guoday
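To make the masking point concrete: each concatenated subgraph is treated as one ordinary causal-LM sample, so the only mask is the standard lower-triangular one and tokens attend across file boundaries. A minimal PyTorch-style sketch, illustrative rather than the actual training code:

import torch

# One sample = the files of one subgraph, concatenated in dependency order.
# There is no per-file block mask: a token in file E may attend to file A.
def causal_mask(seq_len):
    # True where attention is allowed: token i attends to positions j <= i.
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))

print(causal_mask(6)[4])  # token 4 sees positions 0..4, whichever files they came from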

Thank you for your prompt and detailed response!

One last question.

In one sample, is there a need for a special token between the concatenated files, so that the model can distinguish that there are multiple files and, in some downstream scenarios, avoid generating code like "import package" after the main content?

slamandar avatar Nov 29 '23 10:11 slamandar

In fact, a special token would normally be required. However, we incorporate comments such as #utils.py and #model.py before each file to indicate to the model that the code completion is at the repository level.

guoday avatar Nov 29 '23 11:11 guoday
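A small sketch of what one such sample could look like. The file names and contents here are made up; only the "# <path>" comment convention comes from the answer above and the README example:

# Illustrative assembly of one repo-level sample.
files_in_order = [
    ("utils.py", "def add(a, b):\n    return a + b\n"),
    ("model.py", "from utils import add\n\nresult = add(1, 2)\n"),
]
sample = "".join(f"# {path}\n{content}\n" for path, content in files_in_order)
print(sample)
# # utils.py
# def add(a, b):
#     return a + b
#
# # model.py
# from utils import add
# ...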

Completely understand. Thanks again for your quick response!

slamandar avatar Nov 30 '23 02:11 slamandar

@guoday I was also wondering what you do with the other files, like build files or metadata files? Thanks

Bytes-Explorer avatar Nov 30 '23 05:11 Bytes-Explorer

@guoday Thanks for the details above. It was quite helpful. One follow-up question.

Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.

vaisaxena avatar Dec 04 '23 10:12 vaisaxena

In specific scenarios, such as the one described in #43 (comment), the input sequence is BACDFGE. In this case, even though file E does not directly depend on file A, file E is still allowed to attend to file A. This approach enables the model to effectively utilize contextual information from various files within the repository, thereby enhancing its overall comprehension and performance.

@guoday But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Do you run get_subgraphs and then re-order at the repo level again?

dongs0104 avatar Dec 18 '23 02:12 dongs0104

@guoday But nodes A and B are not connected in the directed graph, so I have a question: what if A and B contain similar contents? Do you run get_subgraphs and then re-order at the repo level again?

Nodes A and B are connected in the undirected graph, which means they end up in the same input sequence. If A and B have similar contents, B can leverage the content of A as additional context to enhance the completion process (assuming B follows A in the sequence). We do not re-order these nodes.

guoday avatar Dec 18 '23 07:12 guoday

Truly remarkable work! I am curious about the advantages of repo concatenation in your training process. Do you first pre-train on file-level code (with a 4K window) and then continue training on repo-level code (with a 16K window)? What would happen if you pre-trained on repo-level code with a 4K window first?

reignianor avatar Jan 09 '24 08:01 reignianor

@guoday Thanks for the details above. It was quite helpful. One follow-up question.

Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.

I have the same doubts.

zte-tcb avatar Jan 15 '24 01:01 zte-tcb

@guoday Thanks for the details above. It was quite helpful. One follow-up question. Do you take care of cycles that may appear in the dependency graph of the files? How do you handle that? This is the case where A->B, B->C, C->A.

I have the same doubts.

Actually, couldn't the dependencies of files within the same repository be represented as a DAG? The case you describe, A->B, B->C, C->A, should be impossible, since it would cause a circular reference problem.

juncaofish avatar Jan 20 '24 03:01 juncaofish
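As a closing note on the cycle question: circular imports do occur in practice (Python, for instance, tolerates some forms of them), so the dependency graph is not always a DAG. If the smallest-in-degree selection described earlier is read literally, though, it never gets stuck: in a cycle A->B, B->C, C->A every node has in-degree 1, so the minimum is simply 1 instead of 0, one node is picked at random, and the rest then resolve normally. Using the hypothetical dependency_order sketch from earlier in this thread:

# A 3-cycle does not deadlock: all nodes start with in-degree 1, one is
# chosen at random, and the remaining two then drop to in-degree 0.
print(dependency_order({"A": {"B"}, "B": {"C"}, "C": {"A"}}))
# one possible output: ['B', 'C', 'A']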