Code-generating systems like DeepMind’s AlphaCode, Amazon’s CodeWhisperer and OpenAI’s Codex, which powers GitHub’s Copilot service, provide a tantalizing look at what’s possible with AI today within the realm of computer programming. But so far, only a handful of such AI systems have been made freely available to the public and open sourced — reflecting the commercial incentives of the companies building them.
In a bid to change that, AI startup Hugging Face and ServiceNow Research, ServiceNow’s R&D division, today launched BigCode, a new project that aims to develop “state-of-the-art” AI systems for code in an “open and responsible” way. The goal is to eventually release a data set large enough to train a code-generating system, which will then be used to create a prototype — a 15-billion-parameter model, larger in size than Codex (12 billion parameters) but smaller than AlphaCode (~41.4 billion parameters) — using ServiceNow’s in-house graphics card cluster. In machine learning, parameters are the parts of an AI system learned from historical training data and essentially define the skill of the system on a problem, such as generating code.
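To give a rough sense of what a parameter count measures, the sketch below tallies the learned weights and biases in a small stack of fully connected layers. The layer sizes are hypothetical and chosen only for illustration; they say nothing about how the BigCode prototype, Codex or AlphaCode are actually built.

```python
def dense_layer_params(n_in: int, n_out: int) -> int:
    """Parameters in one fully connected layer: a weight matrix plus one bias per output."""
    return n_in * n_out + n_out


def mlp_params(layer_sizes: list[int]) -> int:
    """Total parameters in a stack of fully connected layers with the given widths."""
    return sum(
        dense_layer_params(n_in, n_out)
        for n_in, n_out in zip(layer_sizes, layer_sizes[1:])
    )


# Hypothetical layer widths, for illustration only.
sizes = [512, 2048, 512]
total = mlp_params(sizes)
print(f"{total:,} learned parameters")
```

A model described as having 15 billion parameters has on the order of 15 billion such learned numbers, spread across many more and much larger layers.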
Inspired by Hugging Face’s BigScience effort to open source highly sophisticated text-generating systems, BigCode will be open to anyone who has a professional AI research background and can commit time to the project, say the organizers. The application form went live this afternoon.
“In general, we expect applicants to be affiliated with a research organization (either in academia or industry) and work on the technical/ethical/legal aspects of [large language models] for coding applications,” ServiceNow wrote in a blog post. “Once the [code-generating system] is trained, we’ll evaluate its capabilities … We’ll strive to make evaluation easier and broader so that we can learn more about the [system’s] capabilities.”
In collaboratively developing a code-generating system, which will be open sourced under a license that’ll allow developers to reuse it subject to certain terms and conditions, BigCode is seeking to address some of the controversies that have arisen around AI-powered code generation, particularly regarding fair use. The nonprofit Software Freedom Conservancy, among others, has criticized GitHub and OpenAI for using public source code, not all of which is under a permissive license, to train and monetize Codex. Codex is available through OpenAI’s paid API, while GitHub recently began charging for access to Copilot. For their part, GitHub and OpenAI continue to assert that Codex and Copilot don’t run afoul of any license terms.
The BigCode organizers say they’ll take pains to ensure only files from repositories with permissive licenses go into the aforementioned training data set. Along the way, they say, they’ll work to establish “responsible” AI practices for training and sharing code-generating systems of all types, soliciting feedback from relevant stakeholders before making policy pronouncements.
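As an illustration of what license-based filtering could look like in practice, here is a minimal sketch that keeps only repositories whose declared license appears on an allowlist. The allowlist, the record format and the repository names are all assumptions made for this example; the organizers have not described their actual filtering pipeline.

```python
# Hypothetical allowlist of SPDX license identifiers treated as permissive.
# This is an assumption for illustration, not BigCode's actual list.
PERMISSIVE_LICENSES = frozenset(
    {"mit", "apache-2.0", "bsd-2-clause", "bsd-3-clause", "isc", "unlicense"}
)


def is_permissive(spdx_id):
    """Return True if the declared license is on the allowlist (case-insensitive)."""
    return spdx_id is not None and spdx_id.strip().lower() in PERMISSIVE_LICENSES


def filter_permissive(repos):
    """Keep only repository records whose 'license' field is permissive."""
    return [repo for repo in repos if is_permissive(repo.get("license"))]


# Made-up repository records; a missing license is treated as not permissive.
repos = [
    {"name": "a/tool", "license": "MIT"},
    {"name": "b/app", "license": "GPL-3.0-only"},
    {"name": "c/lib", "license": "Apache-2.0"},
    {"name": "d/misc", "license": None},
]

kept = filter_permissive(repos)
print([repo["name"] for repo in kept])
```

Treating repositories with no declared license as excluded is the conservative choice here, since the absence of a license does not grant permission to reuse the code.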
ServiceNow and Hugging Face provided no timeline for when the project might reach completion. But they expect it to explore several forms of code generation over the next few months, including systems that auto-complete and synthesize code from code snippets and natural language descriptions, and that work across a wide range of domains, tasks and programming languages.