NLP for Programming Code

Natural Language Processing is the current hot topic for many machine learning practitioners. With Hugging Face democratizing access to state-of-the-art pre-trained models and new advances arriving almost daily, now seems like the perfect time to get started. Applied to programming code, these models can find and fix bugs, generate code from natural language input, or even translate between programming languages.

In this blog post, I will give you an overview of NLP Methods for programming code understanding, focused on models for Python.

Code Understanding

The Problem of Data

In 2019, GitHub published CodeSearchNet (Husain et al., 2019), a challenge, benchmark, and dataset for six programming languages, consisting of functions with documentation from open source projects on GitHub. The dataset and baseline models focus on code search. A self-attention model encoding both the code and the search query yielded the best overall result, measured by the Mean Reciprocal Rank (MRR) of the correct code snippet among 999 distractors.
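To make the metric concrete, here is a minimal sketch of how a Mean Reciprocal Rank could be computed. The function names and the toy scores are my own assumptions for illustration; the real benchmark ranks the correct snippet against 999 distractors per query, not two.

```python
def reciprocal_rank(scores, correct_index):
    """Return 1 / rank of the correct snippet, where rank 1 is the best score."""
    correct_score = scores[correct_index]
    # Rank = 1 + number of candidates scored strictly higher than the correct one.
    rank = 1 + sum(1 for s in scores if s > correct_score)
    return 1.0 / rank


def mean_reciprocal_rank(queries):
    """queries: list of (scores, correct_index) pairs, one per search query."""
    return sum(reciprocal_rank(s, i) for s, i in queries) / len(queries)


# Toy example with three candidates per query (hypothetical similarity scores).
toy_queries = [
    ([0.9, 0.1, 0.3], 0),  # correct snippet ranked first  -> 1/1
    ([0.2, 0.8, 0.5], 2),  # correct snippet ranked second -> 1/2
]
mrr = mean_reciprocal_rank(toy_queries)
print(mrr)
```

An MRR of 1.0 would mean the correct snippet is always ranked first; the value drops quickly as the correct snippet slips down the list.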

In March 2021, Microsoft published a benchmark dataset called CodeXGLUE (Lu et al., 2021) that contains ten tasks across 14 datasets, covering several programming languages (including Python, Java, C++). CodeXGLUE is an adaptation of the General Language Understanding Evaluation (GLUE) benchmark to programming code. The tasks are grouped into four categories, shown in the table below:

Category | Task | Description | Dataset for Python
Code-Code | Clone Detection | Semantic Similarity | No
Code-Code | Defect Detection | Function Vulnerability | No
Code-Code | Cloze Test | Masked Token Prediction | Yes (CT-all/CT-min/max)
Code-Code | Code Completion | Predict following Tokens | Yes (Py150)
Code-Code | Code Refinement | Bug Fixing | No
Code-Code | Code Translation | Translation between Programming Languages | No
Text-Code | Code Search | Search code based on NL-Input | Yes (CodeSearchNet)
Text-Code | Text-to-Code Generation | Generate Code based on NL-Input | No
Code-Text | Code Summarization | Generate Documentation based on code | Yes (CodeSearchNet)
Text-Text | Documentation Translation | Translate documentation between NLs | -

Microsoft also published two strong baseline models, CodeBERT (Feng et al., 2020) and CodeGPT (GPT by Brown et al. (2020) trained on programming code). CodeBERT is used for all problems that require understanding, and CodeGPT for completion and generation. The pre-trained models are readily available on Hugging Face.

In May 2021, IBM published CodeNet (Puri et al., 2021), a novel and extensive dataset built on student submissions in more than 50 programming languages, including metadata on footprint and errors.

Networks using AST

Abstract Syntax Trees represent program code as an abstract tree. In Python, ASTs are just one standard library import away. While it seems worthwhile to use this structure, I have not seen recent models utilizing this meta information.
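As a small illustration of how close ASTs are in Python, the following sketch parses a function with the standard library `ast` module and lists the node types it contains. The example function `add` is made up for this purpose.

```python
import ast

source = "def add(a, b):\n    return a + b\n"

# Parse the source text into a tree of ast.AST nodes.
tree = ast.parse(source)

# Walk the whole tree and collect the type name of every node.
node_types = [type(node).__name__ for node in ast.walk(tree)]
print(node_types)

# The first statement of the module is the function definition itself.
function_node = tree.body[0]
print(function_node.name, [arg.arg for arg in function_node.args.args])
```

The node types (FunctionDef, Return, BinOp, ...) are exactly the kind of structural information that a plain token sequence does not expose.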


Based on the BERT transformer (Devlin et al., 2018), Microsoft trained CodeBERT (Feng et al., 2020) on both bimodal and unimodal data of programming code and natural language code documentation. Bimodal data refers to parallel natural language-code pairs, while unimodal data contains only one type (programming code or natural language). Pretraining of CodeBERT is done via Masked Language Modelling (MLM, on bimodal data) and Replaced Token Detection (RTD, on both unimodal and bimodal data). Feng et al. (2020) leave open the question of how best to integrate Abstract Syntax Trees (AST) and whether this might improve performance.
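To give a feeling for the MLM objective on a bimodal pair, here is a minimal sketch in plain Python. The special tokens, the 15% masking rate, and the helper names are assumptions for illustration and not taken from the CodeBERT code base; the real model additionally works on subword tokens.

```python
import random

MASK = "[MASK]"
SPECIAL = {"[CLS]", "[SEP]", "[EOS]"}


def build_bimodal_input(nl_tokens, code_tokens):
    """Concatenate a natural language segment and a code segment."""
    return ["[CLS]", *nl_tokens, "[SEP]", *code_tokens, "[EOS]"]


def mask_tokens(tokens, mask_prob=0.15, seed=0):
    """Randomly hide tokens; labels hold the original token at masked positions."""
    rng = random.Random(seed)
    masked, labels = [], []
    for tok in tokens:
        if tok not in SPECIAL and rng.random() < mask_prob:
            masked.append(MASK)
            labels.append(tok)   # the model has to predict this token
        else:
            masked.append(tok)
            labels.append(None)  # position does not contribute to the loss
    return masked, labels


pair = build_bimodal_input(
    "return the sum of two numbers".split(),
    "def add ( a , b ) : return a + b".split(),
)
masked_pair, mlm_labels = mask_tokens(pair)
print(masked_pair)
```

Because documentation and code sit in the same sequence, the model can use the docstring to recover a masked code token and vice versa.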


CodeTrans by Elnaggar et al. (2021) is one of the most recently published papers. The group from TU Munich, Google, and Nvidia trained T5 models on the same or similar datasets as described above, with a focus on transfer and multi-task learning. The T5 models were trained in three different sizes (small, base, large), with the training taking up to 87 days. CodeTrans with transfer learning or multi-task learning was able to outperform CodeBERT in all categories and languages, with the CodeTrans multi-task base model performing best for Python. Again, the pre-trained models are available on Hugging Face.


DeepDebug

Drain et al. (2021) from Microsoft build on their DeepDev PyMT5 transformer architecture (Clement et al., 2020), which seems like a successor of CodeGPT specifically for Python. A dataset was generated by crawling GitHub (what a coincidence that GitHub was acquired by Microsoft in 2018) for commit messages with common bug keywords (for example “Fix Bug”). The model was trained on data in both directions, from buggy code to fixed code and from fixed code to buggy code. This idea, called back-translation, is already widely used in NLP.

Normalization of programming code can be difficult, and it has been shown that datasets with duplicate code can be problematic (Allamanis, 2018). To avoid these problems, the DeepDev tokenizer strips comments, standardizes whitespace, and replaces string and numeric literals with placeholders. The data is also augmented with synthetic bugs during preprocessing: by running the neural edit model backward, from fixed to buggy code, DeepDebug can introduce synthetic (neural) bugs. Additionally, they employ a rule-based (heuristic) system.

DeepDebug increases performance (bugs found) while significantly reducing false positives compared to prior methods by Lutellier et al. (2020). When Pytest stack traces are available, the performance increases even further to 97% (Top-10 Success Rate). DeepDebug is said to be open-sourced, but currently neither code nor dataset is available (as of 2021-06-11).
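The normalization step can be approximated with Python's standard library `tokenize` module. This is only a sketch of the idea, not the DeepDev tokenizer: the placeholder names such as `<STR>` and `<NUM>` and the function `normalize` are my own assumptions.

```python
import io
import tokenize

# Placeholder names are illustrative assumptions, not DeepDev's vocabulary.
PLACEHOLDERS = {
    tokenize.STRING: "<STR>",
    tokenize.NUMBER: "<NUM>",
    tokenize.NEWLINE: "<NEWLINE>",
    tokenize.INDENT: "<INDENT>",
    tokenize.DEDENT: "<DEDENT>",
}
# Comments, blank-line markers and the end marker carry no program logic.
SKIPPED = {tokenize.COMMENT, tokenize.NL, tokenize.ENDMARKER}


def normalize(source: str) -> list:
    """Return a token list without comments, layout noise, or concrete literals."""
    tokens = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type in SKIPPED:
            continue
        tokens.append(PLACEHOLDERS.get(tok.type, tok.string))
    return tokens


snippet = "def area(r):\n    # circle area\n    return 3.14 * r * r   \n"
print(normalize(snippet))
```

Two snippets that differ only in comments, spacing, or literal values end up with the same token sequence, which makes near-duplicates much easier to spot.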

Code Generation

Code Completion can be seen as a subtask of Code Generation, starting from your input. Since this blog post was first published in July 2021, a lot has happened, first of all the release of GitHub Copilot.

GitHub Copilot

GitHub Copilot is a tool that allows end-users without any machine learning skills to generate code in their favorite editor, such as Visual Studio Code. GitHub Copilot can be seen as an automatic peer reviewer and co-programmer that generates code based on function names and documentation. It is powered by a novel neural network called Codex (Chen et al., 2021), built by OpenAI and based on GPT-3 (Brown et al., 2020). As it is a proprietary model, most of the details of the current version are unknown.

Copilot is available for free for open-source projects and students. Everyone else pays $10/month.


The overall goal of Codex (Chen et al., 2021) was to translate docstrings of Python functions into functioning implementations and back, providing a solution that could function as a co-programmer. The training data contains 159 GB of Python programming code from openly available sources (e.g. GitHub repositories) and natural language. Based on Codex, Codex-S was fine-tuned only on functions and docstrings, with the data collected from competitive programming websites and continuous integration. Codex-D was fine-tuned to generate docstrings for existing programming code, in order to better understand the model's intentions.


Google created its own internal system for autocompletion with a similar approach. As with the Codex model, the base model is a Transformer network, here with “only” 0.5B parameters (due to latency considerations) and trained on internal code. While the training procedure is comparable (masking random code lines, leaving the rest as context), the data is different. The other models are trained on freely available code, whereas here only curated, quality code (tested and peer-reviewed) is used for training. Following a data-driven machine learning life cycle, this could increase the overall quality of the generated code. One also has to consider the use case of the system: as it is used solely internally, Google can allow the model to be more specific and focused on its own programming style and interfaces.

Additionally, Google combines its semantic code-completion engine with the results of the machine learning model: results that appear both in the semantic engine and in the ML model prediction are re-ranked by boosting their order. All suggestions are also checked for semantic correctness using the same semantic engine and cached abstract syntax trees (AST). In a hands-on test, this hybrid approach reduced context switching by 7% and overall development iteration time by 6%.
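The re-ranking idea can be sketched in a few lines. How Google actually boosts the suggestions is not public in detail, so the additive `boost`, the function name `rerank`, and the example suggestions are purely hypothetical.

```python
def rerank(ml_suggestions, semantic_suggestions, boost=1.0):
    """Move suggestions that the semantic engine also proposes to the front.

    ml_suggestions: list of (completion_text, model_score) pairs.
    semantic_suggestions: iterable of completion texts from the semantic engine.
    boost: hypothetical additive bonus for suggestions found in both sources.
    """
    semantic = set(semantic_suggestions)
    boosted = []
    for text, score in ml_suggestions:
        if text in semantic:
            score = score + boost
        boosted.append((text, score))
    # sorted() is stable, so ties keep the original model order.
    ranked = sorted(boosted, key=lambda pair: pair[1], reverse=True)
    return [text for text, _ in ranked]


ml = [("foo()", 0.5), ("bar()", 0.4), ("baz()", 0.3)]
ranked_suggestions = rerank(ml, {"baz()"})
print(ranked_suggestions)
```

In this toy example the lowest-scored ML suggestion moves to the top because the semantic engine confirms it.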


CodeGPT has not officially been published on its own, as it is basically a fine-tuned GPT-2. Some implementation and training details can be found here.

Code Similarity

Code Similarity is important for plagiarism detection, but also for clustering code examples based on their implemented functionality. The similarity of code can be measured by simple distance measures, like the Jaccard distance, but this is unsatisfactory: a short distance does not mean that two programs have the same functionality or solve the same problem. Puri et al. (2021) propose using a Siamese network on token sequences to measure similarity based on implemented functionality.
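A toy example shows why the Jaccard distance alone is unsatisfactory. The three snippets and the whitespace-based tokenization below are made up for illustration.

```python
def jaccard_distance(a_tokens, b_tokens):
    """1 - |A ∩ B| / |A ∪ B| on the sets of tokens."""
    a, b = set(a_tokens), set(b_tokens)
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)


add = "def f ( a , b ) : return a + b".split()
sub = "def f ( a , b ) : return a - b".split()
add_verbose = "def total ( x , y ) : result = x + y return result".split()

# Different functionality, but almost identical tokens.
d_different = jaccard_distance(add, sub)
# Same functionality, but different names and structure.
d_equivalent = jaccard_distance(add, add_verbose)
print(d_different, d_equivalent)
```

The addition and subtraction functions end up closer to each other (2/11) than the two functionally equivalent additions (8/15), which is exactly the failure mode a learned similarity is meant to avoid.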

Current Open Questions

  • How can Abstract Syntax Trees help to improve accuracy?
  • Why does CodeTrans not add extra whitespace tokens, like the four-space indent typical for Python? DeepDev does, and it increases throughput.
    • Which models employ this technique?


Miltiadis Allamanis. The adverse effects of code duplication in machine learning models of code. CoRR, 2018. arXiv:1812.06469.

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners. CoRR, 2020. arXiv:2005.14165.

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde, Jared Kaplan, Harri Edwards, Yura Burda, Nicholas Joseph, Greg Brockman, and others. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021.

Colin B. Clement, Dawn Drain, Jonathan Timcheck, Alexey Svyatkovskiy, and Neel Sundaresan. PyMT5: multi-mode translation of natural language and Python code with transformers. arXiv preprint arXiv:2010.03150, 2020.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: pre-training of deep bidirectional transformers for language understanding. CoRR, 2018. arXiv:1810.04805.

Dawn Drain, Colin B. Clement, Guillermo Serrato, and Neel Sundaresan. DeepDebug: fixing Python bugs using stack traces, backtranslation, and code skeletons. CoRR, 2021. arXiv:2105.09352.

Ahmed Elnaggar, Wei Ding, Llion Jones, Tom Gibbs, Tamas Feher, Christoph Angerer, Silvia Severini, Florian Matthes, and Burkhard Rost. CodeTrans: towards cracking the language of silicone's code through self-supervised deep learning and high performance computing. CoRR, 2021. arXiv:2104.02443.

Zhangyin Feng, Daya Guo, Duyu Tang, Nan Duan, Xiaocheng Feng, Ming Gong, Linjun Shou, Bing Qin, Ting Liu, Daxin Jiang, and Ming Zhou. CodeBERT: A pre-trained model for programming and natural languages. CoRR, 2020. arXiv:2002.08155.

Hamel Husain, Ho-Hsiang Wu, Tiferet Gazit, Miltiadis Allamanis, and Marc Brockschmidt. CodeSearchNet challenge: evaluating the state of semantic code search. CoRR, 2019. arXiv:1909.09436.

Shuai Lu, Daya Guo, Shuo Ren, Junjie Huang, Alexey Svyatkovskiy, Ambrosio Blanco, Colin B. Clement, Dawn Drain, Daxin Jiang, Duyu Tang, Ge Li, Lidong Zhou, Linjun Shou, Long Zhou, Michele Tufano, Ming Gong, Ming Zhou, Nan Duan, Neel Sundaresan, Shao Kun Deng, Shengyu Fu, and Shujie Liu. CodeXGLUE: A machine learning benchmark dataset for code understanding and generation. CoRR, 2021.

Thibaud Lutellier, Hung Viet Pham, Lawrence Pang, Yitong Li, Moshi Wei, and Lin Tan. CoCoNuT: combining context-aware neural translation models using ensemble for program repair. In Proceedings of the 29th ACM SIGSOFT International Symposium on Software Testing and Analysis, 101–114. 2020.

Ruchir Puri, David S. Kung, Geert Janssen, Wei Zhang, Giacomo Domeniconi, Vladimir Zolotov, Julian Dolby, Jie Chen, Mihir R. Choudhury, Lindsey Decker, Veronika Thost, Luca Buratti, Saurabh Pujar, and Ulrich Finkler. Project CodeNet: A large-scale AI for code dataset for learning a diversity of coding tasks. CoRR, 2021. arXiv:2105.12655.
