How to add new languages

Open hxue3 opened this issue 3 years ago • 8 comments

I am wondering if I can add new languages for code translation. For example, I want to translate COBOL code to Python.

Do you have any tips? Could you briefly explain what I should do?

hxue3 avatar Oct 20 '21 17:10 hxue3

I'll detail the steps you'd need to take to translate from COBOL to Python:

  • Get monolingual COBOL and Python data using Google BigQuery or some other source.
  • Preprocess your data using our preprocess.py pipeline. You will need to add COBOL as a supported language and write a CobolProcessor class inheriting from LangProcessor with at least tokenize_code, detokenize_code and extract_functions methods. Tokenizing and detokenizing may require no extra work if you inherit from TreeSitterLangProcessor and use a grammar such as https://github.com/Neppord/tree-sitter-cobol. Check JavaProcessor for an example of what to do. Note that your tokenize_code method must at least return its output on a single line for the pipeline to work; the rest is not strictly necessary. You can also try translating at file level, in which case you won't need to extract functions.
  • (Optional) Build valid/test datasets for COBOL to Python with a few examples of parallel functions and some test scripts where you can insert a solution by replacing a #TOFILL comment and then run the script. If you don't create the test scripts, you can just set --eval_computation false.
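
To make the processor step more concrete, here is a minimal, self-contained sketch of a COBOL processor. It is not the repository's code: `LangProcessorStandIn` is a hypothetical stand-in for `LangProcessor`, the method signatures are assumptions, and the regex tokenizer is a naive substitute for a tree-sitter grammar. It only illustrates the two points above: tokenized output that fits on a single line, and file-level translation where `extract_functions` returns the whole program.

```python
import re
from abc import ABC, abstractmethod
from typing import List


class LangProcessorStandIn(ABC):
    """Hypothetical stand-in for the repository's LangProcessor base class.

    The real class and its exact signatures may differ; check JavaProcessor.
    """

    @abstractmethod
    def tokenize_code(self, code: str) -> List[str]:
        ...

    @abstractmethod
    def detokenize_code(self, tokens: List[str]) -> str:
        ...

    @abstractmethod
    def extract_functions(self, code: str) -> List[str]:
        ...


# Naive COBOL-like tokenizer: literals, decimals, hyphenated words, punctuation.
TOKEN_RE = re.compile(
    r"""
    "[^"\n]*"                          # double-quoted literal
  | '[^'\n]*'                          # single-quoted literal
  | \d+\.\d+                           # decimal number
  | [A-Za-z0-9]+(?:-[A-Za-z0-9]+)*     # words, including hyphenated names
  | \S                                 # any other single character
    """,
    re.VERBOSE,
)


class CobolProcessorSketch(LangProcessorStandIn):
    def tokenize_code(self, code: str) -> List[str]:
        # No token can contain a newline, so " ".join(tokens) is one line.
        return TOKEN_RE.findall(code)

    def detokenize_code(self, tokens: List[str]) -> str:
        # Rebuild readable text: one sentence per line, split on "." tokens.
        lines, current = [], []
        for tok in tokens:
            if tok == ".":
                lines.append(" ".join(current) + ".")
                current = []
            else:
                current.append(tok)
        if current:
            lines.append(" ".join(current))
        return "\n".join(lines)

    def extract_functions(self, code: str) -> List[str]:
        # File-level translation: treat the whole program as a single unit.
        return [" ".join(self.tokenize_code(code))]


if __name__ == "__main__":
    proc = CobolProcessorSketch()
    source = 'DISPLAY "HELLO WORLD".\nMOVE 3.14 TO WS-PI.\n'
    print(" ".join(proc.tokenize_code(source)))
```

A real processor would replace the regex with the tree-sitter grammar and follow whatever return types the pipeline actually expects.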

Then train a model. At this point the steps are the same as for C++ to Python if you created the datasets properly.

  • Train an MLM model on COBOL and Python
  • Train a TransCoder model on COBOL and Python functions: DAE and BT steps in both directions. If you have no parallel valid/test sets you can create mock ones with just one non-empty line for each or deactivate the MT evaluation in the code.
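
For the mock valid/test sets, a minimal sketch could look like the following. The function name and the file naming pattern are hypothetical and are not the names the training code expects; look those up in the repository before using this. The only property taken from the advice above is that each file contains exactly one non-empty line.

```python
import tempfile
from pathlib import Path
from typing import Dict


def write_mock_parallel_sets(
    out_dir: str,
    src_line: str,
    tgt_line: str,
    src_lang: str = "cobol",
    tgt_lang: str = "python",
) -> Dict[str, Path]:
    """Write one-line mock valid/test files for both languages.

    The "<split>.<src>-<tgt>.<lang>.tok" pattern is a placeholder, not the
    naming scheme used by the actual pipeline.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written: Dict[str, Path] = {}
    for split in ("valid", "test"):
        for lang, line in ((src_lang, src_line), (tgt_lang, tgt_line)):
            text = line.strip()
            if not text or "\n" in text:
                raise ValueError("each mock file needs exactly one non-empty line")
            path = out / f"{split}.{src_lang}-{tgt_lang}.{lang}.tok"
            path.write_text(text + "\n", encoding="utf-8")
            written[f"{split}.{lang}"] = path
    return written


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        files = write_mock_parallel_sets(tmp, "DISPLAY 1 .", "print ( 1 )")
        for key, path in sorted(files.items()):
            print(key, path.name)
```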

baptisteroziere avatar Oct 21 '21 15:10 baptisteroziere

@brozi

I need a processor for C. Should I follow the same steps?

raffian avatar Dec 14 '21 22:12 raffian

Yes, and you can probably adapt the C++ processor for C: https://github.com/facebookresearch/CodeGen/blob/c83433217fdba964d1f15aa4d45a78c75d6bfa12/codegen_sources/preprocessing/lang_processors/cpp_processor.py

baptisteroziere avatar Dec 15 '21 18:12 baptisteroziere

@brozi

Thanks for the quick reply.

Would you be willing to share the BigQuery queries used to extract monolingual cpp data from the GitHub public dataset? The whitepaper suggests I'll need to do that for C, but I can probably reuse the Java dataset under data/test_dataset as-is, given that Java is our target, correct?

raffian avatar Dec 15 '21 20:12 raffian

@raffian @brozi did you ever suss out what the queries should look like? Hoping to leverage this to train a Delphi <-> Python3 transcoder model.

Thanks!

arpieb avatar Mar 20 '23 19:03 arpieb

@arpieb

Unfortunately, no - I gave up on this approach.

I switched my focus to ANTLR4. It's mature, stable, and comes with lots of grammars for parsing nearly every programming language ever created, though surprisingly Delphi is missing from the list. https://github.com/antlr/grammars-v4

I found this one. It's unofficial, I guess, so your mileage may vary. https://github.com/gotthardsen/Delphi-ANTRL4-Grammar/blob/master/Delphi.g4

If you go down the ANTLR4 path for code translation, take my advice and do the training at Udemy; it's worth it. Understanding ANTLR fundamentals is essential to using it effectively. Otherwise, you'll just get frustrated with it. https://www.udemy.com/course/antlr-programming-masterclass-with-python

Good luck, raffian

raffian avatar Mar 20 '23 19:03 raffian

@arpieb

Best place to start with ANTLR: https://www.antlr.org/

raffian avatar Mar 21 '23 15:03 raffian

@raffian thanks for the links, will check them out! Really hoping to find something that can intelligently perform the translation, or at least a large part of it.

arpieb avatar Mar 21 '23 20:03 arpieb