Generative AI systems have acquired capabilities previously unimaginable, such as producing reams of plausibly human text, summarizing complicated documents, suggesting novel drug formulations, or creating works of art inspired by any number of human artists or styles. Now, large language models, a form of generative AI, have been brought to bear on the very technology that underpins them: computer coding.
Amazon CodeWhisperer is a new cloud-based capability provided by Amazon Web Services that uses machine learning and large language models to make developers’ lives easier and boost their productivity.
CodeWhisperer works within a developer’s primary workspace, known as an integrated development environment (IDE). As developers build their code, they typically leave notes or comments in natural language describing, for example, the purpose of the next block of code or, indeed, the overall purpose of the program. The system looks at not only the code already produced in the IDE but also the developer’s comments and then, in real time, suggests what it predicts would be a useful next chunk of code.
“CodeWhisperer is not just auto-completing a few words or a line of code,” says senior applied-science manager Parminder Bhatia, who leads the CodeWhisperer science team. “It can generate 15, 20, 30 lines, all on the fly. And this is not code copied and pasted from elsewhere; it has been created and customized to suit the developer’s intent, incorporating coding best practices.”
“Innovation occurs when developers spend time on novel and creative work,” says Bing Xiang, director of applied science at the AI Labs of Amazon Web Services. “Generative AI like CodeWhisperer can easily handle the undifferentiated coding and reserve human interaction for high-judgement situations.”
This sort of assistance has only just become possible, Bhatia adds. “AI has accelerated in the last five years to the point at which these large models can understand and reason sufficiently to provide contextualized recommendations.” And the more code and notes a developer produces, Bhatia explains, the better CodeWhisperer understands the intention of that code, so its suggestions become better tailored and more nuanced.
The downside of using public datasets to train AI models like CW, of course, is that they can reflect undesirable aspects of the wider world, including imperfect security, toxicity, and unfairness or bias toward specific groups; they can also reveal personal identifiable information.
“At Amazon CodeWhisperer, we take such concerns seriously,” says Ramesh Nallapati, senior principal scientist at AWS AI Labs. “We design our system to help remove security vulnerabilities in a developer’s entire project. We also address the toxicity and fairness of the generated code by evaluating it in real time and taking necessary steps to reduce exposure to the user from such content.
“In addition to toxicity and bias filtering, CodeWhisperer’s reference tracker feature can also identify instances where code generations may be similar to particular training data. The developer can then inspect the reference repository and make a decision whether or not to use the code, including whether to take a dependency or license from the reference repository.”
One of the other challenges the team faced in developing the system involves both sustainability and speed. For CodeWhisperer to be any use to developers, its suggestions need to appear in a split second. A good idea arriving 20 seconds too late would be a distraction, not a help. The challenge is that running large models requires serious computational resources — not ideal when time is of the essence.
“We deal with the latency problem by leveraging a variety of techniques, including model quantization and memory access reduction techniques developed in-house, which allow for multiple recommendations without incurring extra latency cost,” says Xiang. “These efficiencies also boost the sustainability of the tool.”
CodeWhisperer is just one of a raft of projects with generative AI and large language models at their heart that Xiang’s extensive science team is working on. Their topics range from search and recommendations to question answering and information extraction.
With the aim of supporting the wider machine learning (ML) community in developing code-generating models, Xiang’s team has developed a benchmarking tool supporting the evaluation of code generation abilities in 10+ programming languages. To achieve this, the team developed a novel transpiler — a programming-language conversion tool — that automatically converts the input texts and test cases of a popular Python benchmarking dataset (Most Basic Programming Problems, or MBPP) into their multi-lingual counterparts. They describe the resulting collection of benchmarking datasets, which they call MBXP, in a paper that is currently under conference submission but available as a preprint on arXiv.
The tool can be used not only to evaluate the quality of generated code in a variety of programming languages but also to explore the broader aspects of code-producing language models. For example, it can be used to probe the question of how well large language models can generalize to other programming languages on which they have not been specifically trained (spoiler alert: surprisingly well, in some cases).
“Multilingual evaluation also enables us to discover intriguing capabilities of language models, such as their zero-shot translation abilities, where a model can use a reference code in language A to help write code in language B more accurately,” says Ben Athiwaratkun, an ML scientist at Amazon and first author on the paper. “MBXP allows us to investigate other aspects of code generation models, such as robustness to input, code insertion abilities, or the effects of few-shot samples on reducing syntax errors, all in a multilingual fashion.”
By publicly releasing this multilingual code evaluation benchmark, the team hopes to accelerate research in this nascent field. “And because the language conversion is automated,” Athiwaratkun says, “we can easily expand the benchmark to include new programming languages in the future, without the need for an extensive annotation loop.”
The CodeWhisperer product and these research-focused innovations are just the beginning of what ML can do for software developers, Bhatia explains. “Just as large language models can reliably translate spoken languages, we can expect the same to follow for translating between programming languages,” he says. “Today, not only can CodeWhisperer produce code on the basis of natural-language comments, but it is also making inroads toward summarizing in natural language what a given piece of code is intended to do.”
What this is heading toward, in some sense, is the democratization and demystification of coding. Ultimately, the power of coding will not reside solely in the capacity of an individual or group to painstakingly piece code together.
Consider the proliferation of generative-AI art. Now, anyone with an imagination can create incredible artworks with just a few prompt words expressing an artistic intention. The automation of coding hasn’t advanced as far, but AI’s increasingly high-level comprehension of both coding and natural language will not only boost the professional capability of developers but also open up coding to a much wider audience. “This is a giant effort,” says Bhatia. “This is a paradigm shift.”