It’s Like GPT-3 but for Code—Fun, Fast, and Full of Flaws

OpenAI’s new tool can autocomplete lines of programming or conjure software from a simple prompt. It could also riddle the internet with even more bugs.
Illustration: Simoul Alva

Code pours from Feross Aboukhadijeh’s fingers.

As a devotee of the open source software movement, he has written immensely popular web apps, peer-to-peer file exchanges, and more than 100 other pieces of code that he has given away, all in the 10 years since he graduated from college. Lately, though, Aboukhadijeh has entered a new kind of flow state, helped along by a tool called Copilot. It’s a piece of artificially intelligent software that does some of the typing, and the thinking, for him.


Built by OpenAI, the private research lab, and GitHub, the Microsoft-owned website where programmers share code, the tool is essentially autocomplete for software development. Much as Gmail tries to finish a sentence as you write it, Copilot offers to complete a chunk of your program. The tool was released last summer to a select group of coders.

Aboukhadijeh quickly discovered that Copilot was good, almost unsettlingly so. He would begin typing a line of code, and within a few seconds the AI would figure out where he was headed—then, boom, the next four or five full lines would show up as light gray text, which he could accept by hitting Tab. When he saw it produce clean code that did exactly what he was intending, he found it a bit uncanny. “How is it getting these predictions?” he recalls wondering. “Some of them are really eerie.”

For weeks, Aboukhadijeh kept Copilot turned on while he worked. He discovered that it had other impressive tricks; it could even understand commands he wrote in basic English. If he simply typed into his code editor “Write a function that capitalizes every word in a document,” Copilot would assemble that code all by itself. He’d check to make sure it didn’t have errors; sometimes it did.

What’s more, the tool was improving his code. At one point, for example, Aboukhadijeh needed his software to recognize several different formats of text documents, so he ponderously listed all the formats, one by one, in his code. Copilot instead recommended a single, pithy command that elegantly swept them all together.
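
A rough sketch of the kind of rewrite he’s describing—the file formats and function names here are hypothetical, not Aboukhadijeh’s actual code:

def is_text_document_verbose(name):
    # The ponderous version: listing every format, one by one.
    return (name.endswith(".txt") or name.endswith(".md")
            or name.endswith(".rst") or name.endswith(".csv"))

def is_text_document_pithy(name):
    # The kind of single, swept-together check Copilot might suggest:
    # str.endswith accepts a whole tuple of suffixes at once.
    return name.endswith((".txt", ".md", ".rst", ".csv"))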

“I was like, how did it even … ?” he says, trailing off in stupefaction. He doesn’t think he’ll ever turn Copilot off.

Nor is he alone: Nine months after Copilot’s launch, tens of thousands of programmers have tried out the software. I spoke to 15 coders who’ve used it, and most, like Aboukhadijeh, found that it dramatically accelerates their pace—even as they were sometimes freaked out by how good it is. (“Just mind-blowing,” as Mike Krieger, who coded the original Instagram, put it.) Granted, they also noticed it making errors, ranging from boneheaded to distressingly subtle. GitHub and OpenAI have been tracking Copilot’s performance through anonymized data on how many suggested lines coders accept and how much they then store on GitHub. They’ve found that the AI writes a remarkable 35 percent of its users’ newly posted code.

Ever since computers came to be, people have hunted for ways to make them easier to program. The very first American programmers, the women who created instructions for the ENIAC machine in 1945, had an almost laughably difficult job: They had to build logic with wires. In the ’50s, tapes and punch cards made the work slightly easier. Then came programming languages with English-like syntax, some of which—such as Basic or Cobol—were explicitly designed to encourage neophytes. By the ’90s, languages such as Python automated some of the most gnarly, frustrating parts of coding, like memory management. In the 2000s, the open source movement created a generation of programmers who rarely write things from scratch.

Suffice it to say, the Hollywood image of a coder frantically typing out reams of code on their own hasn’t been true for years. By stitching together chunks written by others, individuals can crank out apps wildly more sophisticated than would have been possible 20 years ago. Copilot promises to be the next significant step in this decades-long trajectory.

With Copilot, OpenAI is also offering a first peek at a world where AI predicts increasingly complex forms of thinking. In a couple of years, says Oege de Moor, GitHub Next’s vice president, coders “will just be sketching the architectural design. You’ll describe the functionality, and the AI will fill in the details.”

Follow this road, and it’s not too long a jaunt until Copilot-style AI is in the hands of the billions of people who can’t code at all. The engineers at OpenAI have already built demos that let a layperson write simple apps just by describing what they want: “Make me a personal website with PayPal embedded for payments,” or “Write an app that finds travel expenses in my bank statements and puts them in a spreadsheet.” That service isn’t public, but a clever startup could use the AI behind Copilot to build it. We could become a world simply awash in code, a sort of Gutenbergian eruption where anyone—from artists to bureaucrats to criminals to high school students—can automate their lives in a heartbeat.

Hanging on the walls of OpenAI’s offices in San Francisco are a series of paintings by Ilya Sutskever, the company’s cofounder and chief scientist. They’re an artistic stab at representing how a deep neural network processes information. Sutskever’s style is what you might call Graph Theory Surrealist: One painting shows a document with a bare eyeball staring down, hanging on its eyestalk and connected to a cluster of circles and lines. When I visited the offices in November, Greg Brockman, another cofounder, told me that Sutskever “does ’em all on his iPad Pro.” Striking but also a bit jarring, they capture some of the tension at the heart of OpenAI’s mission: building highly advanced AI in hopes of harnessing its power for good, while trying to ensure it doesn’t become a world-ending alien robot force.

Brockman, a bouncy sales-type guy, is a self-taught coder. Before OpenAI, he worked on “a dozen really bad startups,” including a dating app that built a profile of you based on web activity you allowed it to record. It was so creepy he wouldn’t even install the app himself. His luck turned when he became an early employee at Stripe, where he had a very successful run.

Brockman had long been fascinated by AI, though, and in 2015 he met with a group of other obsessives, including Elon Musk, Y Combinator head Sam Altman, Wojciech Zaremba (an AI veteran of Google and Facebook), and Sutskever, who’d left Google. The deep-learning revolution, where neural nets absorb patterns from data, had just taken off. Silicon Valley was abuzz with predictions that someone would suddenly unveil an “artificial general intelligence,” an AI that could outthink humans.

The OpenAI founders called this the foom moment, like the sound effect of a cinematic explosion. The group worried about existential Skynet-style risk, but also more intermediate dangers, such as how an AGI might centralize geopolitical power or give malign actors more ways to wreak havoc. They also, however, figured that a wildly capable AI could have massive upsides, perhaps by solving scientific problems to help deal with climate change or improve medical care.

They decided to create OpenAI, originally as a nonprofit, to help humanity plan for that moment—by pushing the limits of AI themselves. They’d craft powerful new systems, then let outside developers sample those concoctions. That way, everyday people could get a realistic sense of AI’s oncoming impact, bit by bit. OpenAI executives, meanwhile, figured they could learn how to minimize some of the technology’s known harms, such as neural nets’ penchant for absorbing bias from their training sets.

To critics, this approach sounded counterintuitive—if not flat-out reckless, much like biolab researchers trying to predict future pandemics by prodding viruses to evolve faster. Brockman sees that as a limited view. In their founding charter, the men argued that “policy and safety advocacy alone would be insufficient” to contain ultrasmart AI. Brockman’s take is that to learn the real risks and benefits, you need hands-on experience. “You can’t just build this in a lab—you really need to be rolling your systems out and seeing how they impact the world,” he says.

The founders began hiring AI talent and biting off challenges. They created a neural network that trounced humans at the video game Dota 2, a feat that required the AI to master rapid-fire strategy changes. They also created MuseNet, an AI trained on so much pop music that you could use it to generate a song about your cat in the style of the Beatles.

But what made the world perk up and pay attention was the company’s AI for writing spookily realistic English. In 2019, Brockman and his colleagues released a tool called GPT-2. Trained on 40 gigabytes of text scraped off the internet, and with 1.5 billion parameters (a rough metric of how sophisticated a language model is), GPT-2 deciphered the patterns of how English words combine into sentences and paragraphs. For its time, it was unusually good at autocompleting sentences. But then OpenAI came up with GPT-3, a model with 100 times more parameters. Feed it a few words or a sentence and it would pour out an essay that often enough sounded nearly human. Ask GPT-3 a question and it would (again, quite often) give you a breezy and factually correct answer. They debated internally about how miscreants might misuse the tools, such as by creating spam, spreading political disinfo, or posting terabytes of harassment at light speed.

They decided to keep GPT-3 on a leash. Interested software developers could pay for access to it. That way, if the OpenAI folks didn’t like how someone was using GPT-3, they could easily revoke access.

By late 2020, developers had observed something unexpected about GPT-3. The AI wasn’t just good at autocompleting sentences. It could also autocomplete computer code. One of the first to notice was Sharif Shameem, the founder of a cloud gaming company. He had experimented with GPT-3 by feeding it plain English descriptions of simple web page elements (“a button shaped like a watermelon”) and discovered that GPT-3 would generate the right HTML. “This is mind-blowing,” he wrote, showing it off in a tweet that racked up 11,000 retweets, many by similarly gobsmacked coders.

The OpenAI folks had noticed GPT-3’s side hustle too. It wasn’t a terribly proficient programmer, and it couldn’t do anything complex. But “I’d gotten it to write lines of simple Python,” Brockman told me.

What was going on? It turned out that when GPT-3 was trained on those bazillion documents scraped off the web, a lot of them were pages on which nerds had posted their computer code. That meant the AI had learned patterns not just in English but also in HTML, Python, and countless other languages.

Zaremba, the OpenAI cofounder, got to thinking. After all, the lab’s goal was to eventually create a general humanlike AI that could reason in language and in logic, and understand facts about the world. Nobody knew how to do that. But maybe getting an AI that could program would be a useful interim step, since code involves lots of math and logic. At minimum, it would be a powerful new product for OpenAI to unleash on the world—more proof that AI is on the march.

“Holy shit,” Zaremba thought, as he watched GPT-3 crudely generate lines of code. “We could do this now.”

Greg Brockman's first AI project, built after high school, was a chatbot that discussed the weather.

Photograph: OPENAI

Wojciech Zaremba once led a team that built a robotic hand that could solve a Rubik's cube.

Photograph: OPENAI

In the summer of 2020, Zaremba and his team got to work on their code-writing AI. They needed to teach it two skills—how to predict lines of code in various programming languages and how to translate human-speak into machine-speak. That is, programmers should be able to give the AI a simple instruction:

// create a timer set for three seconds

And the AI should read it and crank out the right code. For instance, in JavaScript:

setTimeout(function () {}, 3000);

To develop these skills, Zaremba’s team would need to train the AI on an absolute ton of computer code. It was easy to know where to find all that. Some 73 million programmers have posted their code on GitHub, and very often it’s open source, available for anyone to use. There’s also a huge amount of code posted on websites such as Stack Overflow, a discussion board where coders ask each other for help.

All that code was amazingly well set up for training an AI. That’s because code often includes comments—bits of English written by the programmer to explain what they’re up to. (Like: “// set a timer for 3 seconds.”) Comments exist both to help others understand how a program works and to remind the authors what the heck is going on when, months later, they need to revisit their code. Programmers also write “Readme” documents that summarize what an entire program does.

In other words, software developers had served up an incredible platter of annotated training data. OpenAI’s neural network would see the English description next to the computer code and learn to associate the two. Normally AI creators spend months or years painstakingly curating such one-to-one mappings in their training data.

Over the winter of 2020 and 2021, Zaremba and his team made quick progress. To get the AI working, they discovered they needed to boost the model’s ability to understand context—the equivalent of working memory. If you (or a computer) read a piece of software, you might find that a function on line 87 relies on a variable updated on line 14. You’d need to hop back and forth in the code for any of it to make sense. Plain old written language is sensitive to context too, but not to the same degree. So Zaremba let the code-writing AI use three times as much computer memory as GPT-3 got when analyzing text.

Within a few weeks, the OpenAI engineers were seeing signs of success. When Zaremba gave the AI simple problems a student might encounter in first-year computer science—such as “Calculate the Fibonacci sequence”—it nailed them. They decided to call the new model Codex, after the original spine-bound book.
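
A typical solution to that exercise runs to just a few lines of Python—a sketch of the sort of thing Codex was producing, not its verbatim output:

def fibonacci(n):
    # Return the first n numbers of the Fibonacci sequence.
    sequence = []
    a, b = 0, 1
    for _ in range(n):
        sequence.append(a)
        a, b = b, a + b
    return sequence

print(fibonacci(10))  # [0, 1, 1, 2, 3, 5, 8, 13, 21, 34]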

Soon they gave it to OpenAI’s staff to try out. Brockman noticed something pleasing: When he’d tried communicating with early versions of GPT-2 and GPT-3, they had seemed like “unruly children” that veered easily off target. This was different. “It felt like Codex actually wanted to listen to me in a way that GPT-3 didn’t,” he says. (It’s probably, he suspects, because of how straightforward the training data was.) Brockman also noticed that staff kept using Codex’s suggestions in their daily coding. That didn’t happen with GPT-3.

“You could see exactly where it worked,” says Katie Mayer, an OpenAI developer who works on Codex. But they could also see that it produced tons of bugs: A majority of its suggestions included something slightly off, like a wrong variable name. Every day, Zaremba and Brockman retrained the model, tweaking parameters to nudge the error rate lower. By the summer of 2021, 30 percent of its suggestions recommended the right code and were bug-free—hardly perfect but close enough, they figured, that coders worldwide could get some value from it. They called it Copilot, a virtual “pair programmer” that worked alongside you.

As with GPT-3, Mayer and the team decided to offer it as a service. Microsoft would host Copilot in its cloud servers. The tech giant had become a major investor in OpenAI in 2019, when the founders realized that training AI required wildly expensive computer-processing time. To attract capital, OpenAI’s leaders created a for-profit division of their organization, with a promise that investors could eventually make money on OpenAI’s discoveries. Microsoft put in $1 billion and became OpenAI’s sole provider of cloud computing. Critics argued that, by chasing profit, OpenAI had “sold its soul”; the founders countered that their charter, which promised that their “primary fiduciary duty is to humanity,” was still the guiding principle.

Either way, Microsoft became a central hub in Copilot’s debut. To use the tool, coders would need to install a plug-in to Visual Studio Code, Microsoft’s editing tool for writing code. As the programmers worked, the plug-in would watch what they typed and send it to the Microsoft cloud, and the AI would send back suggestions.

Before releasing it, OpenAI’s security team tried to grapple with the potential abuses ahead. For example, coders often sloppily leave private details in their code—phone numbers, names, emails—and Codex, having absorbed all those, might spit them back out while generating code. The security team set up filters to try to strip those out. They also worried Codex would help make malware easier to write, though when they tried to write malware themselves they didn’t think the tool helped enough to be dangerous. A more realistic worry, they decided, was “scaled abuse.” Someone using the AI could, say, rapidly author a cloud of Twitter bots to harass a female politician or spread disinformation during an emerging news event. To prevent that, they added “rate limiting,” intentionally slowing the pace at which Codex offers suggestions. It “walks at the speed of a human,” as Matt Knight, OpenAI’s head of security, told me.

On June 29, 2021, they let coders worldwide have their first crack at Copilot.

Illustration: Simoul Alva

I was one of them. I’m a journalist, not a software developer, but I do a bit of coding too—sometimes to build artistic side projects, such as a search engine for weird old books, and sometimes to help with my journalism. I’ll write scrapers to automatically grab information from websites, or I’ll transform and analyze textual data. Would Copilot give me superpowers?

Not at first. I began by writing some code for a server to render web pages, and Copilot did a respectable job of predicting what I wanted. But it didn’t blow me away. Simple autocomplete code-writing tools have existed for years, and I’d long been using and enjoying one called TabNine. Copilot did not seem remarkably better.

But then I remembered I could just chat with the AI. That’s when things took off.

One evening I’d downloaded a company’s annual report as a PDF, and I wanted to write code in Python that would open the file, locate every dollar figure, and show it to me. I suspected there was some simple way for Python to open and read PDFs, but I didn't know of it. Maybe Copilot did?

So, using Visual Studio Code, I wrote a plain ol’ comment describing the first step I needed done:

# write a function that opens a pdf document and returns the text

For about four seconds, nothing happened. Then Copilot wrote this:

def pdf_to_text(filename):
    pdf = PyPDF2.PdfFileReader(open(filename, "rb"))
    text = ""
    for i in range(pdf.getNumPages()):
        text += pdf.getPage(i).extractText()
    return text

In one glance, I could see that it did exactly what I’d asked for. Looking more closely, I saw that Copilot had invoked a chunk of open source Python code—PyPDF2—that I’d never heard of. When I Googled it, I learned that PyPDF2 was, indeed, designed specifically to read PDF files. It was a strange feeling. I, the human, was learning new techniques from the AI. I plowed ahead, writing more prompts urging Copilot to produce more and more of my software. It complied, like a helpful genie: My wish was its command.

In about five minutes I’d finished the whole piece of software. It worked perfectly. And it saved me at least a half-hour of work—time I would have spent Googling to discover PyPDF2, then reading up on how to use it. Copilot blasted right through that mountain for me.
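
Roughly speaking, the whole thing amounted to the Copilot-written function above plus a regular expression—something like this sketch, where the file name and pattern are illustrative rather than my exact code (newer releases of PyPDF2 have since renamed these methods):

import re
import PyPDF2

def pdf_to_text(filename):
    # The function Copilot generated, as shown above.
    pdf = PyPDF2.PdfFileReader(open(filename, "rb"))
    text = ""
    for i in range(pdf.getNumPages()):
        text += pdf.getPage(i).extractText()
    return text

# Pull every dollar figure, e.g. "$1,250,000" or "$3.50", out of the report.
text = pdf_to_text("annual_report.pdf")  # illustrative file name
for amount in re.findall(r"\$[\d,]+(?:\.\d+)?", text):
    print(amount)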

Over the next few days, I wrote little pieces of software—a script to clean up data for a visualization, a scraper for pulling posts off a forum—in a blur. Copilot wasn’t always successful; sometimes it suggested code that worked but wasn’t what I was after. Other times it got the idea right but messed up the names of the variables.

I also had to develop a new skill: I learned how to talk to the AI. This meant being incredibly precise. Much like the genie of legend, Copilot would do exactly what I asked, so if I made the wrong monkey’s-paw wish, I got it.

Other developers told me the same thing. They began to develop a “theory of mind” about how Copilot works, the better to communicate with it.

“There’s a little bit of an art to it,” says Andrej Karpathy, an AI guru. Currently the head of AI for Tesla, he was a founding researcher at OpenAI. “It’s a very foreign intelligence, right? It’s not something you’re used to. It’s not like a human theory of mind. It’s like an alien artifact that came out of this massive optimization.” Karpathy was among the first to try Copilot. Initially he found it “gimmicky and a little bit distracting” and set it aside. But when he tried it again in the late fall of 2021, he began figuring out how best to interact with it. “I’m pretty impressed and kind of excited,” he concludes.

In long discussions online, coders debated the tool. Some weren’t thrilled that code they had put on GitHub for other humans to use had been masticated by an AI and turned into a potentially lucrative product for its owner. They wondered about the legality too. Sure, they’d posted the code as open source; but does treating it as training data count as a fair use? De Moor at GitHub says he believes they’re in the clear. But as the Stanford legal scholar Mark Lemley has written, this question hasn’t yet gone before a judge, so no one can be sure.

But many coders were stoked that Copilot replaced some of their incessant Googling. “Ninety-five percent of the code that I write has already been written by someone,” says Rob van Haaren, the founder of Prophit.ai, a startup that helps companies navigate tax codes.

Copilot even seems to have picked up knowledge about specific fields. Maria Nattestad is a software engineer at Google and, on the side, the author of a popular app that makes eye-catching visualizations from bioinformatics data. She discovered that Copilot knows about DNA. When she wrote code to organize genetic data, the AI showed that it understands that codons—the three-letter units of DNA or RNA that specify amino acids—have a length of three, and it could generate a list of them on its own.
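
Generating that list takes only a few lines—a sketch of the sort of thing the AI produced on its own, not Nattestad’s actual code:

from itertools import product

# All 64 possible codons: every three-letter combination of the four DNA bases.
BASES = "ACGT"
codons = ["".join(combo) for combo in product(BASES, repeat=3)]

print(len(codons))  # 64
print(codons[:4])   # ['AAA', 'AAC', 'AAG', 'AAT']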

Nattestad has also discovered that Copilot makes rookie mistakes; it once composed a sprawling list of “if-then” statements, violating the basic “Don’t repeat yourself” principle of coding. Still, she uses the AI every time she works on personal projects, because it helps her move at a blistering pace. “I’ll be like, ‘I was planning to work on this all evening,’” she says. “Then I find the whole thing is done in less than an hour.”

Mind you, Nattestad only uses Copilot when she codes as a hobby. She never uses it at work at Google, because Copilot is constantly communicating with Microsoft’s servers—and Google can’t allow its code to leave the building. Karpathy can’t use the tool at Tesla for the same reason. “This is Tesla IP, right? We protect this code,” he tells me. It’s one of the tensions in OpenAI’s strategy for bringing advanced AI to the masses. In its charter, OpenAI vowed to prevent the tech from becoming centralized and benefiting only a narrow slice of society. While theoretically anyone can get permission to use Copilot and GPT-3, OpenAI’s entire business model is deeply centralized, running through a Microsoft server with access that OpenAI can revoke at any instant.

For the moment, Copilot is not threatening too many power structures. Today the main concern might be its errors. Karpathy has seen it generate code with subtle bugs that, depending on the context, could be anywhere from trivial to catastrophic. At one point Copilot generated a seven-line chunk of code that was accurate except for one character: Copilot had used a greater-than sign (“>”) where it should have used a greater-than-or-equal-to sign (“>=”). This mistake produced what’s known as a “fencepost” bug, a common flub in which an operation falls one short of, or goes one more than, what’s intended (like a fence with the wrong number of posts). When Karpathy tweeted the example, Brendan Eich—the inventor of JavaScript and CEO of the browser company Brave—replied with concern. “You caught the fencepost bug,” he wrote, “but how many likely users would?”
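
Karpathy’s exact code isn’t shown, but the species of bug is easy to illustrate with a hypothetical countdown—a “>” where “>=” was intended leaves the result one short:

def countdown_correct(n):
    # Intended behavior: return n, n-1, ..., 1, 0 -- that's n + 1 values.
    values = []
    while n >= 0:  # ">=" keeps the final 0
        values.append(n)
        n -= 1
    return values

def countdown_buggy(n):
    values = []
    while n > 0:  # ">" where ">=" was intended: the 0 never shows up
        values.append(n)
        n -= 1
    return values

print(countdown_correct(3))  # [3, 2, 1, 0]
print(countdown_buggy(3))    # [3, 2, 1] -- one short: the classic fencepost bug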

Hammond Pearce, a computer engineering professor at New York University, led a team that studied how Copilot wrote code in scenarios that ought to be secure. He found that a full 40 percent of the time, it produced software that was vulnerable—in particular, to SQL injection, a well-known attack that allows bad actors to insert malicious code. In the worst case, attackers could gain total control of a victim’s servers.
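
The pattern is well documented. Here’s a minimal sketch of the difference between a vulnerable query and a safe one, using Python’s built-in sqlite3 module (illustrative code, not the researchers’ test cases):

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT, email TEXT)")
conn.execute("INSERT INTO users VALUES ('alice', 'alice@example.com')")

def find_user_vulnerable(name):
    # Pasting user input straight into the SQL string: a crafted name like
    # "x' OR '1'='1" changes the query's meaning -- classic SQL injection.
    query = "SELECT email FROM users WHERE name = '" + name + "'"
    return conn.execute(query).fetchall()

def find_user_safe(name):
    # Parameterized query: the driver treats the input as data, never as SQL.
    return conn.execute("SELECT email FROM users WHERE name = ?", (name,)).fetchall()

print(find_user_vulnerable("x' OR '1'='1"))  # returns every row in the table
print(find_user_safe("x' OR '1'='1"))        # returns nothing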

Nattestad summed up Copilot’s output as a dice roll: When it works, it’s great. But when it fails it can fail badly. “You clearly have to know what you’re doing, because otherwise you’re just doing a really crappy job faster,” she told me.

I’d heard a related concern in my conversations with developers: that Copilot will make them sloppy or will blunt their skills. There’s some truth to it, agrees Brian Kernighan, a computer scientist at Princeton and a pioneer of programming languages back in the ’70s. Today’s programmers are far less likely to know the gnarly details of how a computer’s memory or processor works. The fear of deskilling in software development is old. But the gain in productivity, he figures, is worth it: “For most people, most of the time, it’s just a wonderful trade-off.”

Perhaps more dramatic is how Copilot may change the structure of coding work, says Pamela Mishkin, a researcher at OpenAI. Over time, the emphasis will shift to “How do you check the work of the model?” she says. “It switches you from being a writer into an editor.”

Some coders I spoke to had a more universal worry: that eventually, a Copilot-like AI might render their jobs obsolete. There are, after all, now several companies producing AI that writes code, including TabNine and one recently debuted by Alphabet’s AI research outfit, DeepMind. There was something deeply—perhaps gorgeously—ironic about hearing the makers of software nervously fearing pink slips delivered by software itself. The authors of automation are getting a taste of the icy fear that comes from watching machines go after your labor.

Illustration: Simoul Alva

Some coding jobs might well disappear—but the number of code creators could shoot way up. Everybody could start weaving bits of code-writing into their lives.

This thought sank in when I joined a Zoom call with Andrew Mayne, a novelist and programmer who works for OpenAI in a sort of publicity role. He showed me a prototype they’ve created for nonexperts to talk to Codex. Mayne started by typing a silly command: “Build me a website for a cat lawyer.” Codex duly began writing the HTML. It grabbed a photo of a cat, plunked it in place, and even tossed in some text. (“Mr. Whiskers, I’m a lawyer.”) Mayne laughed. Then he tried a more serious application: “Create a Python app that gets the price of bitcoin.” A few seconds later, that code appeared too, and it worked.

As I watched his demo, I got to thinking about what might happen if everyone—not just programmers—could automate the dull stuff in their lives by writing little throwaway programs. As Zaremba notes, Codex could make programming so easy that this sort of casual, automate-my-life scripting could explode. It would be like what happened with HTML: In the ’90s, creating web pages was a manual-labor task and thus the province of either coders or those who could afford to hire one. But once blogging tools made building a website point-and-click easy, the internet erupted with personalized sites—mom-and-pop pizza restaurants, ardent fans of bands. Zaremba imagines a similarly catalytic effect if code-writing AI were, say, built into voice assistants. In the middle of cooking dinner, you might ask the assistant to tackle a tedious part of your day job: “Every Tuesday at 3 pm, take the sales figures from my boss’s Word memo, make a chart, and email it to everyone on my team.” And it could whip up the half-dozen lines of code on command.

Maybe this is what it will be like for people to get AI superpowers, not just in headline-generating ways (“AI Conquers Cancer!”) but in deeply mundane ones (“AI Lets Area Man Optimize His Spreadsheet”).

The OpenAI researchers used to worry that supersmart AI would arrive suddenly, utterly transforming society and threatening the place of humans. Brockman now thinks there will be no single foom. Instead, there’ll be a series of smaller ones. Humans will have some years to adjust, he predicts, as more competent models arrive, one by one.

For now, though, Copilot is more a hint of the future than the future itself. After four months of using Copilot, I found that the tool hasn’t transformed the way I write software. But I can feel how it’s gently trickling into my habits of mind while programming—how I’m learning to quickly assess its ideas, batting away the dumb ones and pouncing on the great ones, as if I were talking to a shoulder-surfing colleague. Perhaps the future of AI is going to be this constant dance, and dialog, with the machine.

Either way, I haven’t turned it off.


This article appears in the April 2022 issue.
