Code2Vec

Giving relevant names to variables, functions and methods is one of the most important rules of code quality. Indeed, code is a hybrid language: it has to be understandable by a computer AND by a human. A computer does not care whether you name a function 'f' or 'bubble_sort', or a variable 'x' or 'input_list', but the next developer who reads your code and tries to understand it (it might be you in the future) will be grateful for meaningful names. Here, I will present my understanding of the code2vec model architecture, which makes it possible to effectively embed a code snippet in order to predict a function name.


How to effectively embed a code snippet to predict a function name

Path-attention model

The first step is to parse the code snippet and transform it into an Abstract Syntax Tree (AST). We extract all terminal nodes, which are the leaves of the AST, then create all the paths between these terminal nodes. We now have a collection of tuples of the form (starting_node, path, ending_node), each called a path-context. Depending on the computational power and memory available, we can sample the number of such tuples we want to keep. The label associated with each collection of tuples is the name of the function the paths were extracted from.


Here is an example: 


def add(a, b):
    return a + b


This function is parsed into the following AST:


Terminal nodes are in the blue squares; in red, a path-context is highlighted:

(a, BINARY_OPERATOR:+|RETURN_STATEMENT|BLOCK|FUNCTION_DECLARATION|PARAMETER, b)

The input of our model will be all the path-contexts we can extract from this AST, and the associated label will be the function name, "add".
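
To make this concrete, here is a minimal sketch of path-context extraction for Python code, using the built-in ast module. The real code2vec extractors are language-specific and richer (operators, for instance, appear inside the paths, as in the example above); the terminal types, helper names and path format below are simplifications of mine, not the paper's.

import ast
import itertools

# Node types treated as terminals in this simplified sketch.
TERMINALS = (ast.Name, ast.arg, ast.Constant)

def collect_leaves(tree):
    """Return (leaf_name, root-to-leaf list of node-type labels) pairs."""
    leaves = []
    def walk(node, ancestors):
        label = type(node).__name__
        if isinstance(node, TERMINALS):
            name = getattr(node, "id", None) or getattr(node, "arg", None) \
                   or repr(getattr(node, "value", label))
            leaves.append((str(name), ancestors + [label]))
            return
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [label])
    walk(tree, [])
    return leaves

def path_contexts(source):
    """Yield (start_leaf, path, end_leaf) for every pair of terminal nodes."""
    leaves = collect_leaves(ast.parse(source))
    for (a, pa), (b, pb) in itertools.combinations(leaves, 2):
        i = 0  # length of the common ancestor prefix
        while i < min(len(pa), len(pb)) and pa[i] == pb[i]:
            i += 1
        # Walk up from a to the lowest common ancestor, then down to b.
        path = list(reversed(pa[i:])) + [pa[i - 1]] + pb[i:]
        yield a, "|".join(path), b

for context in path_contexts("def add(a, b):\n    return a + b"):
    print(context)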


Machine learning models only understand numbers, so we have to convert the preceding string triplets into integers. To do this, we create three vocabularies: one for the paths, one for the terminal nodes, and one for the labels. We add two special tokens to each, one for padding and one for unknown words. We sort the node names by the number of times they appear in the training dataset, and do the same for the paths and labels. We choose a vocabulary size and keep only the top node names, paths and labels. The most common word is then converted to a 1, the second most common to a 2, and so on. In the end we have three mappings from strings to integers: one for node names, one for paths, one for labels.
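
Here is a small sketch of how such a vocabulary could be built (the helper names are hypothetical). Note that the exact indexing convention varies between implementations; here padding gets index 0 and unknown words index 1, so frequent tokens start at 2.

from collections import Counter

PAD, UNK = "<PAD>", "<UNK>"

def build_vocab(tokens, max_size):
    """Map the max_size most frequent tokens to integers, most common first."""
    vocab = {PAD: 0, UNK: 1}
    for token, _ in Counter(tokens).most_common(max_size):
        vocab[token] = len(vocab)
    return vocab

def encode(token, vocab):
    """Convert a string to its integer id, falling back to the unknown token."""
    return vocab.get(token, vocab[UNK])

# One such vocabulary is built per component: node names, paths and labels.
node_vocab = build_vocab(["a", "b", "a", "input_list"], max_size=10000)
print(encode("a", node_vocab))       # most common token -> smallest id
print(encode("unseen", node_vocab))  # out-of-vocabulary -> <UNK>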

Embedding

Thanks to three distinct embedding layers (equivalent to fully-connected layers applied to one-hot inputs), the model learns a numeric vector of size d for each component of the path-context. Each component is thus no longer represented by an arbitrary integer but by a vector that can take continuous values. This gives the model a much better understanding of the semantics of the words: words with similar meanings end up with vectors pointing in the same direction.

After that, the three vectors are concatenated into a single vector of size 3*d. We apply one last fully-connected layer to "compress" this vector into our representation of one path-context, a vector of size d called the "combined context vector". This last layer allows the model to give more or less weight to the same path depending on its starting and ending nodes.
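
Below is a sketch of this per-path-context encoder, written in PyTorch (the reference implementation uses TensorFlow). Following the paper, the starting and ending nodes share one embedding table and the compressing layer uses a tanh activation; the class and parameter names are mine.

import torch
import torch.nn as nn

class PathContextEncoder(nn.Module):
    def __init__(self, n_nodes, n_paths, d=128):
        super().__init__()
        # One embedding table for terminal nodes (shared by start and end),
        # one for paths; index 0 is the padding token.
        self.node_embed = nn.Embedding(n_nodes, d, padding_idx=0)
        self.path_embed = nn.Embedding(n_paths, d, padding_idx=0)
        self.combine = nn.Linear(3 * d, d)  # compresses 3*d down to d

    def forward(self, start, path, end):
        # start, path, end: integer tensors of shape (batch, n_contexts)
        x = torch.cat([self.node_embed(start),
                       self.path_embed(path),
                       self.node_embed(end)], dim=-1)  # (batch, n_contexts, 3*d)
        return torch.tanh(self.combine(x))             # combined context vectors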

Aggregating multiple contexts into a single vector representation with attention

In order to aggregate the information from all path-contexts, we compute a weighted average over the combined context vectors using an attention mechanism. This means that during training, the model learns to recognise relevant context vectors and to give them more importance. That's it! We have a vector representation of our code snippet! Cool, right?
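
Here is a sketch of that attention mechanism, continuing the hypothetical PyTorch example above: a single learned vector scores each combined context vector, a softmax normalises the scores into weights, and the weighted sum gives the final code vector.

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    def __init__(self, d=128):
        super().__init__()
        self.attention = nn.Parameter(torch.randn(d))  # learned global attention vector

    def forward(self, contexts, mask):
        # contexts: (batch, n_contexts, d); mask: (batch, n_contexts), True = real context
        scores = contexts @ self.attention                 # one score per context
        scores = scores.masked_fill(~mask, float("-inf"))  # padding gets zero weight
        weights = torch.softmax(scores, dim=-1)            # attention weights
        return (weights.unsqueeze(-1) * contexts).sum(dim=1)  # code vector: (batch, d)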


So now that we have seen the whole architecture of the model, we have to give it an objective so it can learn proper weights. Our objective is to predict a function name. We add a final layer whose size is the number of distinct function names found in our training dataset. By applying the softmax function, we obtain an output where each entry corresponds to the probability that a specific function name should be assigned to the input code snippet.
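
A sketch of this prediction head and of the training objective, again in PyTorch with illustrative sizes. Note that nn.CrossEntropyLoss applies the softmax internally, so the layer itself outputs raw scores (logits):

import torch
import torch.nn as nn

d, n_labels = 128, 50000          # embedding size and label-vocabulary size (illustrative)
predict = nn.Linear(d, n_labels)  # one score per known function name

code_vectors = torch.randn(32, d)         # stand-in for a batch of aggregated code vectors
targets = torch.randint(n_labels, (32,))  # indices of the true function names

logits = predict(code_vectors)
loss = nn.CrossEntropyLoss()(logits, targets)
loss.backward()  # gradients flow back through the whole model during training

probs = torch.softmax(logits, dim=-1)  # probability of each function name
print(probs.argmax(dim=-1))            # indices of the predicted names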


In conclusion, I don't think such a model can replace the work of a skilled developer, but it can be useful to have a tool that suggests better function names when needed. If you want to learn more about this model, go check out https://code2vec.org/!
