
How is AST path converted to vector?

Open abitrolly opened this issue 3 years ago • 10 comments

Came here from https://youtu.be/EJ8okcxL2Iw?t=426

The talk says there are token vectors and path vectors.

I know what an AST is, but I am not proficient enough in AI to say what the token vector is. Is it a list of all tokens encountered? Or a list of all possible token pairs?

It is also not clear what a path vector is. If I understand it right, a vector should contain numbers, not symbols from the AST path. So how is an AST path converted to a path vector? Does the path vector include the leaves, which are tokens/symbols?

abitrolly avatar Jul 28 '22 18:07 abitrolly

Hi @abitrolly , Thank you for your interest in our work!

Every unique token and unique path in the training data is allocated with a randomly initialized vector, and we remember the mapping between every token/path into its ID (an integer), and between every ID to the allocated vector. Initially, these vectors are meaningless (since they are random), but then, these vectors are trained as part of the neural network.
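In code, this mapping might look like the following sketch (names and sizes are illustrative, not the actual code2vec implementation):

```python
import numpy as np

EMBEDDING_DIM = 4  # code2vec uses 128; kept tiny here for readability

# Step 1: every unique token/path string is mapped to an integer ID.
vocab = {}
def get_id(symbol):
    if symbol not in vocab:
        vocab[symbol] = len(vocab)
    return vocab[symbol]

# Step 2: every ID is allocated a row in a randomly initialized matrix.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(10, EMBEDDING_DIM))

token_id = get_id("x")
path_id = get_id("NameExpr^AssignExpr_IntegerLiteralExpr")
token_vector = embeddings[token_id]  # meaningless until training adjusts it
```

The rows start out as random noise; training (below in this thread) is what makes them meaningful.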

This is a common practice in neural networks, also known as "word embeddings". I recommend also reading the paper: https://urialon.cswp.cs.technion.ac.il/wp-content/uploads/sites/83/2018/12/code2vec-popl19.pdf

Best, Uri

urialon avatar Jul 28 '22 20:07 urialon

Hi Uri. Thanks for the fast reply!

I am not really familiar with embeddings, so while reading the paper right now, the reference to "continuous vectors for representing snippets of code" doesn't make it any clearer. I need a low-level example like the one requested in #102 to get a picture of what this "code embedding" is composed of. And a 3D movie of the process. :D

It would also be interesting to see what "a parallel vocabulary of vectors of the labels" looks like, and how this vector arithmetic with words actually works.

That's already a lot of questions, and I haven't even started with chapter "1.1 Applications" :D So I'd better skip directly to "2.1 Motivating Example", which contains the following explanation.

Path extraction. First, each query method in the training corpus is parsed to construct an AST. Then, the AST is traversed and syntactic paths between AST leaves are extracted. Each path is represented as a sequence of AST nodes, linked by up and down arrows, which symbolize the up or down link between adjacent nodes in the tree. The path composition is kept with the values of the AST leaves it is connecting, as a tuple we refer to as a path-context.

Right now I understand the process as follows.

graph LR;
.java --> AST --> L["leaves and paths"] --> E["vector AKA code embedding"];

Chapter 3 gives some hard mathematical abstractions. Thankfully I have played a bit with https://docs.python.org/3/library/ast.html to modify Python sources with Python, so I won't break my head on the formal AST definition and can skip over it. :D At the end there is a useful example of "leaves and paths".

This expression.

x = 7

It is transformed into the leaves x and 7 and the path (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr) between them.

x, (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr), 7

So this thing ^^^ is called Path-Context.
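In plain Python, that path-context could be written as a tuple of strings (my own sketch; the arrows are just notation, encoded here as ^ for up and _ for down):

```python
# One path-context extracted from `x = 7`.
path_context = (
    "x",                                       # left leaf token
    "NameExpr^AssignExpr_IntegerLiteralExpr",  # the path between the leaves
    "7",                                       # right leaf token
)

left, path, right = path_context
```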

The paper then gives some limitations.

To limit the size of the training data and reduce sparsity [...] we limit the paths by maximum length [...] and limit the maximum width - the maximal difference in child index between two child nodes of the same intermediate node. These values are determined empirically as hyperparameters of our model.

This needs an example. My guess: the length of a path reflects the depth of nested instructions, and the width is the distance between two statements in the same method without nesting. But I may be wrong here.

"values are determined empirically as hyperparameters of our model" breaks my mind. Does it mean these parameters are chosen during dataset preprocessing to keep the dataset small?

Finally, chapter "4 MODEL" explains the high-level view and jumps directly to "a Bag of Path-Contexts", while I am still puzzled how a single path-context consisting of mathematical symbols is represented in data at a low level. For example, this one.

x, (NameExpr ↑ AssignExpr ↓ IntegerLiteralExpr), 7

How does it look in Python code?

And then, if I understand it right, everything in this tuple needs to be a number, because Python does not understand math symbols like these, and because training a model requires arithmetic operations. That tuple of numbers is what I meant by vector when writing the title of this issue.

There are a lot of questions about the bag, about the size of the every-to-every dict of leaves, and about what "node" and "itself" are in "pairs that contain a node and itself", but that falls out of scope for this question.

Originally I was going to ask whether the vector is this.

id, token1, token2, vector
---------------------------------
1,  1,      2,      [1,2,3]

Where token1 is a leaf and the number is its index in the leaves table, and the vector numbers are indexes into a table that lists all syntax constructs. So for x = 7 the tables would be.

id, token
---------
1,  x
2,  7

id, syntax
----------
1,  NameExpr_UP
2,  AssignExpr_DOWN
3,  IntegerLiteralExpr
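Written out as a sketch (purely my guess, with hypothetical names — whether this matches the real code2vec encoding is exactly the open question):

```python
# Leaf tokens and syntax constructs, indexed as in the tables above.
token_table = {1: "x", 2: "7"}
syntax_table = {1: "NameExpr_UP", 2: "AssignExpr_DOWN", 3: "IntegerLiteralExpr"}

# The guessed row layout: (id, token1, token2, vector of syntax indexes).
row = (1, 1, 2, [1, 2, 3])

def decode(row):
    """Turn the integer row back into readable strings."""
    _, t1, t2, path_ids = row
    return token_table[t1], [syntax_table[i] for i in path_ids], token_table[t2]
```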

abitrolly avatar Aug 01 '22 06:08 abitrolly

Hi @abitrolly , Sorry for the delayed response.

I am not sure where to begin. Do you have any concrete questions? Did you have a chance to look at the video?

I also recommend basic neural-NLP lectures such as Stanford's course.

Best, Uri

urialon avatar Aug 11 '22 20:08 urialon

I am still trying to find the time to read the paper to the end and figure out what embeddings are, and how the text fields of a path-context are converted into numerical values for calculations.

abitrolly avatar Aug 11 '22 21:08 abitrolly

Did you have a chance to look at the video?

Yes. I came here after watching it.

abitrolly avatar Aug 11 '22 21:08 abitrolly

The main idea is to assign a random vector to every kind of symbol, whether that symbol is a path or a token.

Then, during training, these vectors are modified such that the loss term that we define is minimized. That's basically how neural networks are trained, in a nutshell.
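As a toy sketch of that idea (numpy, with a squared-distance stand-in loss — nothing to do with the actual code2vec loss):

```python
import numpy as np

rng = np.random.default_rng(42)
vector = rng.normal(size=3)           # randomly initialized symbol vector
target = np.array([1.0, 0.0, -1.0])   # stand-in for what the loss "wants"

lr = 0.1
for _ in range(200):
    grad = 2 * (vector - target)      # gradient of the squared-distance loss
    vector -= lr * grad               # gradient-descent update
```

After the loop, the once-random vector has moved close to the target; in the real model, "the target" is implicit in the loss over the whole training set.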

urialon avatar Aug 12 '22 00:08 urialon

Aha. That makes it clearer, thanks. Now I need to find the place in the code where the assignment of random vectors takes place and see how these symbols are represented in Python.

Then the next question would be: what algorithm is used to adjust the vector weights during training? But that's probably already covered by the paper.

abitrolly avatar Aug 12 '22 07:08 abitrolly

This is where the random vectors are initialized:

https://github.com/tech-srl/code2vec/blob/master/tensorflow_model.py#L206-L220

The reader converts string inputs into integer indices, and these integer indices allow looking up the specific random vector: https://github.com/tech-srl/code2vec/blob/master/path_context_reader.py#L205-L207

The algorithm used to adjust the vector weights during training is stochastic gradient descent + backpropagation, but that's a very common idea in training neural networks and is not unique to our work. It is covered in almost any neural networks tutorial or online course.

Best, Uri

urialon avatar Aug 12 '22 14:08 urialon

Writing down permalinks to avoid reference drift.

https://github.com/tech-srl/code2vec/blob/e7547de9febe969cde7c64f20ebaa0ca5f4673b6/tensorflow_model.py#L206-L220

Found the docs: https://www.tensorflow.org/api_docs/python/tf/compat/v1/get_variable. So, for example, path_vocab is a TensorFlow variable named PATHS_VOCAB with shape (???, 128), where ??? is probably the number of lines in the vocabulary. The vocabulary is loaded here https://github.com/tech-srl/code2vec/blob/e7547de9febe969cde7c64f20ebaa0ca5f4673b6/vocabularies.py#L183-L184 and 128 is the width in floats of each path vector.
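A sketch of what this lookup amounts to (numpy instead of TensorFlow, with made-up sizes):

```python
import numpy as np

VOCAB_SIZE = 5   # stand-in for the unknown vocabulary length
WIDTH = 128      # as in PATHS_VOCAB's shape (vocab_size, 128)

rng = np.random.default_rng(0)
paths_vocab = rng.normal(size=(VOCAB_SIZE, WIDTH))  # randomly initialized table

# The reader turns each path string into an integer index...
path_ids = np.array([3, 1, 3])
# ...and indexing selects whole rows, as tf.nn.embedding_lookup does.
path_vectors = paths_vocab[path_ids]
```

Note that the same index always selects the same 128-float row, regardless of how long the underlying path string was.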

The puzzling thing here is that there seems to be no 1:1 mapping between AST path components and floats. Even if an AST path is shorter, it still gets a 128-wide vector. So the network doesn't really have granularity at the AST level; paths are treated just as whole strings. Am I right?

https://github.com/tech-srl/code2vec/blob/e7547de9febe969cde7c64f20ebaa0ca5f4673b6/path_context_reader.py#L205-L207


With gradient descent, the goal is to find some minimum output value for a given set of inputs. So here I guess the output value is the name of a function. During training, symbols and paths that are close to each other receive similar weights, which can serve as coordinates in 3D (or whatever-D) space, and that allows querying the space for different properties. This is my understanding of it so far.

abitrolly avatar Aug 12 '22 17:08 abitrolly

So the network doesn't really have granularity at the AST level. Paths are treated just as whole strings

This is correct, and was addressed in a follow-up paper, Code2seq.

the goal is to find some minimum output value for a given set of inputs. So here I guess the output value is the name of a function.

Yes, but the minimum value that we're trying to find is the negative (log-) probability of predicting the right method name.
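As a sketch, that negative log-probability is the usual softmax cross-entropy (numpy, illustrative names):

```python
import numpy as np

def neg_log_prob(logits, target_index):
    """Softmax cross-entropy: the quantity gradient descent minimizes."""
    shifted = np.exp(logits - logits.max())  # subtract max for numerical stability
    probs = shifted / shifted.sum()
    return -np.log(probs[target_index])

scores = np.array([2.0, 0.5, -1.0])  # model scores over candidate method names
loss_right = neg_log_prob(scores, target_index=0)  # right name scored highest
loss_wrong = neg_log_prob(scores, target_index=2)  # right name scored lowest
```

The loss is small when the model puts high probability on the right method name, and large otherwise.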

During training symbols and paths that are close to each other receive similar weighs, which can serve as some coordinates in 3D (or whatever-D space), and that allows to query the space for different properties.

This sounds almost correct, but I cannot completely confirm it, since I don't exactly understand what you mean by "query the space for different properties".


urialon avatar Aug 13 '22 02:08 urialon