code2vec
How to create code embeddings from Java codebase and store it in a vector database?
Hi there team code2vec,
I am working on a personal project. My aim is to store a Java codebase in a vector database to run similarity searches and retrieve code files from the db relevant to my query. Queries can be of the type:
- Method creating database pool connection.
- Entity class linked to 'Subjects' table
Basically a query will be an activity performed by the codebase and I should return the package, classname, (and method if required).
My plan is to vectorize these search queries using a vectorizer present in your codebase, perform similarity search and return results.
My questions are:
- How can I generate vectors for Java code using your pretrained model?
- Will it be a good idea to vectorize an English query for similarity search?
Hi @shankernamami , Thank you for your interest in our work!
See this part of the README: https://github.com/tech-srl/code2vec#exporting-the-code-vectors-for-the-given-code-examples
See also these newer models/papers:
- https://huggingface.co/neulab/codebert-java
- https://arxiv.org/pdf/2207.05987.pdf
Best, Uri
@urialon Thank you! this answers my questions : )
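For readers following along, here is a minimal sketch of the similarity-search step described in the question. It assumes the exported vectors are available as plain text with one whitespace-separated vector per line, and that the query vector comes from the same model as the code vectors; neither point is confirmed in this thread. The helper names (`parse_vectors`, `cosine`, `top_k`) and the labels are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def parse_vectors(text: str) -> List[List[float]]:
    """Parse one whitespace-separated vector per non-empty line (assumed layout)."""
    return [
        [float(value) for value in line.split()]
        for line in text.splitlines()
        if line.strip()
    ]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(
    query: Sequence[float],
    vectors: Sequence[Sequence[float]],
    labels: Sequence[str],
    k: int = 3,
) -> List[Tuple[str, float]]:
    """Rank labelled vectors by cosine similarity to the query, best first."""
    scored = [(label, cosine(query, vec)) for label, vec in zip(labels, vectors)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Each label would typically be the package, class, and method name of the snippet on the corresponding line. A vector database would replace the linear scan in `top_k` with an index, but the ranking criterion is the same.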
Hi, I used the flag described in the README section linked above, "--export_code_vectors". However, doing so gives me the following error:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Expect 201 fields but have 2 in record
[[node IteratorGetNext (defined at /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1751) ]]
My command was
python3 code2vec.py --export_code_vectors --test new-data/test/AdministeredCommentsDto.cs --load models/csharp14m/saved_model_iter173.release
where "new-data/test/AdministeredCommentsDto.cs" is the path to the code snippet whose embeddings I am trying to create. I suspect I am passing the wrong type of input file. Guidance on this would be highly appreciated.
Thanks
Hi @asyed79gatech , Thank you for your interest in our work.
I believe that you haven't run the preprocess.sh script on the data.
However, in general, I recommend using the newer https://github.com/neulab/code-bert-score project. It is based on Huggingface, which is actively maintained.
Best, Uri
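As a rough way to check whether a file looks preprocessed before passing it to `--test`, the sketch below counts space-separated fields per line. The figure 201 is taken from the error message above. Reading it as one target label plus 200 path contexts is an assumption about the model's default configuration, not something stated in this thread. The function names are hypothetical.

```python
from typing import Iterable, List, Tuple

# Taken from the error message; assumed to mean 1 label + 200 contexts.
EXPECTED_FIELDS = 201


def count_fields(line: str) -> int:
    """Count single-space-separated fields, keeping empty (padding) fields."""
    return len(line.rstrip("\r\n").split(" "))


def find_bad_lines(
    lines: Iterable[str], expected: int = EXPECTED_FIELDS
) -> List[Tuple[int, int]]:
    """Return (line_number, field_count) for lines with an unexpected field count."""
    bad = []
    for number, line in enumerate(lines, start=1):
        fields = count_fields(line)
        if fields != expected:
            bad.append((number, fields))
    return bad
```

A raw source line such as `public class` splits into only a couple of fields, so it is reported here, consistent with an error of the form "Expect 201 fields but have 2".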