code2vec
code2vec for Python 3, made for the NL2ML project
This is my version of code2vec for Python 3. For now it works only with the Keras implementation. The main changes:
- Added a Jupyter notebook for preprocessing code snippets
- Support for Python 3 code, thanks to the JetBrains astminer
- Support for code embeddings (i.e., the vector before the last dense layer, which originally was available only in the TF implementation)
- Export of target and token embeddings by running a shell script
- Retrieval of the top 10 synonyms for a given label
The rest of the README is largely the same as the original code2vec README, with some changes reflecting my implementation. Note that the original work offers far more (including models already trained on Java), so I strongly recommend working with it as well. Some file and folder names here are specific to this fork, but they are easy to work around.
Code2vec
A neural network for learning distributed representations of code. This is built on top of the implementation of the model described in:
Uri Alon, Meital Zilberstein, Omer Levy and Eran Yahav, "code2vec: Learning Distributed Representations of Code", POPL'2019 [PDF]
October 2018 - The paper was accepted to POPL'2019!
April 2019 - The talk video is available here.
July 2019 - A tf.keras model implementation was added.
An online demo is available at https://code2vec.org/.
Only the Keras version is supported for now.

Table of Contents
- Requirements
- Quickstart
- Configuration
- Features
- Citation
Requirements
On Ubuntu:
- Python3 (>=3.6). To check the version:
python3 --version
- TensorFlow - version 2.0.0 (install). To check the TensorFlow version (a combined Python check also appears after this list):
python3 -c 'import tensorflow as tf; print(tf.__version__)'
- If you are using a GPU, you will need CUDA 10.0 (download) as this is the version that is currently supported by TensorFlow. To check CUDA version:
nvcc --version
- For GPU: cuDNN (>=7.5) (download). To check the cuDNN version:
cat /usr/include/cudnn.h | grep CUDNN_MAJOR -A 2
- For creating a new dataset (or any operation that requires parsing a new code example) - JetBrains astminer (their CLI is already included in this repository)
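In addition to the per-item shell checks above, here is a combined check from Python that TensorFlow 2.0 is installed and can see the GPU (a minimal sketch; the shell commands above remain the authoritative checks):
# Combined sanity check: TensorFlow version and GPU visibility.
import tensorflow as tf
print(tf.__version__)              # expected: 2.0.x
print(tf.test.is_gpu_available())  # True if TensorFlow can access a GPU (CUDA/cuDNN set up correctly)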
Quickstart
Step 0: Cloning this repository
git clone https://github.com/Kirili4ik/code2vec
cd code2vec
Step 1: Creating a new dataset from Python sources
In order to have a preprocessed dataset to train a network on, you should create a new dataset of your own. It consists of three folders: train, test, and validation (a splitting sketch is shown below).
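If you start from a single folder of .py files, here is a minimal sketch of how one might split it into the three folders (the paths raw_python_files and my_dataset, and the 80/10/10 split, are purely illustrative):
import random
import shutil
from pathlib import Path

src = Path('raw_python_files')   # hypothetical folder containing your .py files
dst = Path('my_dataset')         # preprocess.sh should then point at these subfolders
files = sorted(src.rglob('*.py'))
random.seed(0)
random.shuffle(files)
n = len(files)
splits = {'train': files[:int(0.8 * n)],
          'validation': files[int(0.8 * n):int(0.9 * n)],
          'test': files[int(0.9 * n):]}
for name, part in splits.items():
    (dst / name).mkdir(parents=True, exist_ok=True)
    for f in part:
        shutil.copy(f, dst / name / f.name)  # note: flattens any nested directory structure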
Creating and preprocessing a new Python dataset
In order to create and preprocess a new dataset (for example, to compare code2vec to another model on another dataset):
- Edit the file preprocess.sh using the instructions there, pointing it to the correct training, validation and test directories.
- Run the preprocess.sh file:
source preprocess.sh
Step 2: Training a model
You should train a new model using a preprocessed dataset.
Training a model from scratch
To train a model from scratch:
- Edit the file train.sh to point it to the right preprocessed data. By default, it points to my "my_dataset" dataset that was preprocessed in the previous step.
- Before training, you can edit the configuration hyper-parameters in the file config.py, as explained in Configuration.
- Run the train.sh script:
source train.sh
Notes:
- By default, the network is evaluated on the validation set after every training epoch.
- The newest 10 model checkpoints are kept (older ones are deleted automatically). This can be changed (see MAX_TO_KEEP in Configuration), but keeping more consumes more disk space.
- By default, the network trains for 20 epochs. These settings can be changed by editing the file config.py. Because of the simplicity of the model, you may need a very large amount of data.
Step 3: Evaluating a trained model
Once the score on the validation set stops improving, you can stop the training process (by killing it) and pick the iteration that performed best on the validation set. Suppose iteration #8 is the chosen model; run:
python3 code2vec.py --framework keras --load models/my_first_model/saved_model --test data/my_dataset/my_dataset.test.c2v
Step 4: Manual examination of a trained model
To manually examine a trained model, run:
source my_predict.sh
After the model loads, follow the instructions: edit the file Input.py, paste in a Python method or code snippet, and examine the model's predictions and attention scores (an example snippet follows).
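A toy method you might paste into Input.py (purely illustrative; any self-contained Python function works):
def matrix_multiply(a, b):
    # The model predicts a label/name for this method and shows which
    # path-contexts received the most attention.
    result = [[0] * len(b[0]) for _ in range(len(a))]
    for i in range(len(a)):
        for j in range(len(b[0])):
            for k in range(len(b)):
                result[i][j] += a[i][k] * b[k][j]
    return result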
Step 5: Getting embeddings
Follow Step 4, and the embedding for your snippet will be written to the EMBEDDINGS.txt file.
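To use the vector programmatically, a minimal sketch, assuming EMBEDDINGS.txt stores the code vector(s) as whitespace-separated floats, one vector per line (adjust the parsing if your file differs):
import numpy as np

vectors = np.loadtxt('EMBEDDINGS.txt', ndmin=2)   # one code vector per row
print(vectors.shape)  # e.g. (1, 384) with the default CODE_VECTOR_SIZE (see Configuration)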
Step 6: Look at synonyms
Run:
python3 my_find_synonim.py --label 'linear|algebra'
Replace 'linear|algebra' with any other label and look at the labels closest to it.
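Here "closest" means nearest target-label embeddings by cosine similarity. The same lookup can be sketched with gensim on the exported targets.txt (see the embedding-export section below; label formatting depends on how the embeddings were exported):
from gensim.models import KeyedVectors

targets = KeyedVectors.load_word2vec_format('targets.txt', binary=False)
print(targets.most_similar('linear|algebra', topn=10))  # 10 nearest labels by cosine similarity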
Configuration
Changing hyper-parameters is possible by editing the file config.py.
Here are some of the parameters and their description:
config.NUM_TRAIN_EPOCHS = 20
The maximum number of epochs to train the model. Stopping earlier must be done manually (by killing the process).
config.SAVE_EVERY_EPOCHS = 1
After how many training epochs the model should be saved.
config.TRAIN_BATCH_SIZE = 1024
Batch size in training.
config.TEST_BATCH_SIZE = config.TRAIN_BATCH_SIZE
Batch size in evaluating. Affects only the evaluation speed and memory consumption, does not affect the results.
config.TOP_K_WORDS_CONSIDERED_DURING_PREDICTION = 10
Number of words with highest scores in y_hat to consider during prediction and evaluation.
config.NUM_BATCHES_TO_LOG_PROGRESS = 100
Number of batches (during training / evaluating) to complete between two progress-logging records.
config.NUM_TRAIN_BATCHES_TO_EVALUATE = 100
Number of training batches to complete between model evaluations on the test set.
config.READER_NUM_PARALLEL_BATCHES = 4
The number of threads enqueuing examples to the reader queue.
config.SHUFFLE_BUFFER_SIZE = 10000
Size of the buffer the reader uses to shuffle examples during training. A bigger buffer provides better randomness, but requires more memory and may harm training throughput.
config.CSV_BUFFER_SIZE = 100 * 1024 * 1024 # 100 MB
The buffer size (in bytes) of the CSV dataset reader.
config.MAX_CONTEXTS = 200
The number of contexts to use in each example.
config.MAX_TOKEN_VOCAB_SIZE = 1301136
The max size of the token vocabulary.
config.MAX_TARGET_VOCAB_SIZE = 261245
The max size of the target words vocabulary.
config.MAX_PATH_VOCAB_SIZE = 911417
The max size of the path vocabulary.
config.DEFAULT_EMBEDDINGS_SIZE = 128
Default embedding size to be used for token and path if not specified otherwise.
config.TOKEN_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE
Embedding size for tokens.
config.PATH_EMBEDDINGS_SIZE = config.DEFAULT_EMBEDDINGS_SIZE
Embedding size for paths.
config.CODE_VECTOR_SIZE = config.PATH_EMBEDDINGS_SIZE + 2 * config.TOKEN_EMBEDDINGS_SIZE
Size of code vectors.
config.TARGET_EMBEDDINGS_SIZE = config.CODE_VECTOR_SIZE
Embedding size for target words.
config.MAX_TO_KEEP = 10
Keep this number of newest trained versions during training.
config.DROPOUT_KEEP_RATE = 0.75
The dropout keep rate used during training (i.e., a dropout probability of 0.25).
config.SEPARATE_OOV_AND_PAD = False
Whether to treat <OOV> and <PAD> as two different special tokens whenever possible.
Features
Code2vec supports the following features:
Releasing the model (not verified in this fork)
If you wish to keep a trained model for inference only (without the ability to continue training it) you can release the model using:
python3 code2vec.py --load models/my_first_model/saved_model --release
This will save a copy of the trained model with the '.release' suffix. A "released" model usually takes 3x less disk space.
Exporting the trained token vectors and target vectors
The exported embeddings are saved without subtoken delimiters ("toLower" is saved as "tolower").
In order to export embeddings from a trained model, use:
source my_get_embeddings.sh
This creates two files, tokens.txt and targets.txt.
This saves the token/target embedding matrices in word2vec text format, in which the first line is <vocab_size> <dimension>, and each of the following lines contains <word> <float_1> <float_2> ... <float_dimension>.
These word2vec files can be manually parsed or easily loaded and inspected using the gensim python package:
python3
>>> from gensim.models import KeyedVectors as word2vec
>>> vectors_text_path = 'models/java14_model/targets.txt' # or: 'models/java14_model/tokens.txt'
>>> model = word2vec.load_word2vec_format(vectors_text_path, binary=False)
>>> model.most_similar(positive=['equals', 'to|lower']) # or: 'tolower', if using the downloaded embeddings
>>> model.most_similar(positive=['download', 'send'], negative=['receive'])
Citation
code2vec: Learning Distributed Representations of Code
@article{alon2019code2vec,
author = {Alon, Uri and Zilberstein, Meital and Levy, Omer and Yahav, Eran},
title = {Code2Vec: Learning Distributed Representations of Code},
journal = {Proc. ACM Program. Lang.},
issue_date = {January 2019},
volume = {3},
number = {POPL},
month = jan,
year = {2019},
issn = {2475-1421},
pages = {40:1--40:29},
articleno = {40},
numpages = {29},
url = {http://doi.acm.org/10.1145/3290353},
doi = {10.1145/3290353},
acmid = {3290353},
publisher = {ACM},
address = {New York, NY, USA},
keywords = {Big Code, Distributed Representations, Machine Learning},
}