WANG Yue

Results: 38 comments by WANG Yue

Hi, for each stage (either the MSP or NTP task) of pretraining, we employ a small proportion of the training data as a held-out validation set and monitor the corresponding loss (either...
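
Purely as an illustration (not the authors' actual pretraining pipeline), one way to carve out such a held-out split and monitor its loss with the Hugging Face `Trainer`; the corpus file name and split ratio below are placeholders:

```python
# Sketch only: hold out a small fraction of the pretraining corpus as a
# validation set and log its loss periodically during training.
from datasets import load_dataset
from transformers import TrainingArguments

# "pretrain_corpus.jsonl" is a hypothetical placeholder for your own data.
dataset = load_dataset("json", data_files="pretrain_corpus.jsonl")["train"]
splits = dataset.train_test_split(test_size=0.01, seed=42)  # ~1% held out
train_set, valid_set = splits["train"], splits["test"]

args = TrainingArguments(
    output_dir="checkpoints",
    evaluation_strategy="steps",  # compute validation loss every eval_steps
    eval_steps=1000,
    logging_steps=100,
)
# Pass train_set / valid_set to a Trainer configured for the MSP or NTP
# objective and watch the eval loss curve for each pretraining stage.
```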

Hi, I remember that the training portion is not the same as the non-valid/test portion; it is actually smaller. You can try to verify this. For data filtering, we might...

Hi, for embedding extraction with CodeT5, we suggest following the BART approach of feeding the sequence to both the encoder and the decoder. Then you can employ either the last decoder...
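
A minimal sketch of this with Hugging Face Transformers, assuming the public `Salesforce/codet5-base` checkpoint; taking the last decoder position (versus mean pooling over all positions) is just one of the possible choices:

```python
# BART-style embedding extraction with CodeT5: feed the same sequence to the
# encoder and the decoder, then read out the decoder's last hidden state.
import torch
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

code = "def add(a, b):\n    return a + b"
inputs = tokenizer(code, return_tensors="pt")

with torch.no_grad():
    outputs = model(
        input_ids=inputs.input_ids,
        attention_mask=inputs.attention_mask,
        decoder_input_ids=inputs.input_ids,  # same sequence fed to the decoder
        output_hidden_states=True,
    )

last_hidden = outputs.decoder_hidden_states[-1]  # (batch, seq_len, hidden)
embedding = last_hidden[:, -1, :]                # last position as the embedding
# Alternatively, mean-pool: last_hidden.mean(dim=1)
```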

You can send me an email to request the Twitter dataset, with a promise not to disclose it publicly (considering the copyright issues). Thanks.

Hi, although CodeT5 is pretrained on single functions, it should be able to transfer to encoding multiple methods, although this has not been tested yet.

Hi there, we have provided an example [finetuning script](https://github.com/salesforce/CodeT5/blob/main/CodeT5%2B/tune_codet5p_seq2seq.py); please see [here](https://github.com/salesforce/CodeT5/blob/main/CodeT5+/README.md#how-to-finetune-using-your-own-data) for more details. For bigger models such as 2B and 6B, please use DeepSpeed for training acceleration.
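
As a rough sketch (not the linked script itself), DeepSpeed can be enabled through the Hugging Face trainer by pointing it at a DeepSpeed config file; `ds_config.json` and the hyperparameters below are placeholders:

```python
# Sketch of enabling DeepSpeed when finetuning a CodeT5+ checkpoint via the
# Hugging Face Seq2SeqTrainer; supply your own datasets and DeepSpeed config.
from transformers import (
    AutoTokenizer,
    AutoModelForSeq2SeqLM,
    Seq2SeqTrainingArguments,
    Seq2SeqTrainer,
)

checkpoint = "Salesforce/codet5p-770m"  # swap in a larger checkpoint as needed
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

args = Seq2SeqTrainingArguments(
    output_dir="finetuned-codet5p",
    per_device_train_batch_size=4,
    num_train_epochs=3,
    fp16=True,
    deepspeed="ds_config.json",  # hands optimizer/ZeRO partitioning to DeepSpeed
)

# trainer = Seq2SeqTrainer(model=model, args=args, tokenizer=tokenizer,
#                          train_dataset=your_train_set, eval_dataset=your_valid_set)
# trainer.train()
```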

Hello, I have received your email and will take a look as soon as possible.

Hi there, `` is also a special token, and here it is a kind of unexpected output. We would suggest using the `codet5p-220m` and `codet5p-770m` models in the finetuning setting....
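
As a side illustration (not from the original thread), you can inspect which tokens the tokenizer treats as special and strip them from generated output when decoding:

```python
# Inspect special tokens and drop them from generated output,
# assuming the public codet5p-220m checkpoint.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "Salesforce/codet5p-220m"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

print(tokenizer.all_special_tokens)  # e.g. <s>, </s>, <pad>, sentinel tokens

inputs = tokenizer("def hello():", return_tensors="pt")
outputs = model.generate(**inputs, max_length=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))  # special tokens removed
```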

Hello, may I know which model you are using? The CodeT5+ 2B/6B/16B models are further finetuned on Python code and are more suitable for Python code generation/completion.
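
For reference, a minimal completion sketch under the assumption that these checkpoints follow the standard Hugging Face `generate` API and load with `trust_remote_code=True`:

```python
# Python code completion with a CodeT5+ 2B-style checkpoint (sketch only).
import torch
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

checkpoint = "Salesforce/codet5p-2b"
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(
    checkpoint, torch_dtype=dtype, trust_remote_code=True
).to(device)

encoding = tokenizer("def print_hello_world():", return_tensors="pt").to(device)
# Feed the prompt to the decoder as well for completion-style generation.
encoding["decoder_input_ids"] = encoding["input_ids"].clone()
outputs = model.generate(**encoding, max_length=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```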

Hi TejaswiniiB, we did not test it, but we believe it should give proper embeddings.