Why GitHub Copilot's Generated Code Should Be Fair Use
According to U.S. Copyright law

May 10, 2023

Computers are now (kinda) capable of coding themselves!

While that statement is a bit hyperbolic, pair programming plugins such as GitHub Copilot are proving to be powerful tools in the software engineering space. As a computer science major excited by natural language processing (NLP), it’s incredible to see models that can generate useful, functioning code.

Copilot doesn’t replace the programmer (yet), but as code “autocomplete” that can finish functions, I’ve found it immensely useful in personal projects over the last year. Yet, the legality surrounding the use of tools such as Copilot is still extraordinarily fuzzy, despite what Microsoft may claim.

I've recently been learning about copyright and fair use as part of Yale’s “Law, Tech, & Culture”. This class has shown me how the law, code, and norms surrounding remixing in fair use are all based on human reimagining; we have not yet fully determined how these decisions should be applied to machine-generated output.

In this post, I'm going to argue that under current law, code such as that created by GitHub Copilot is derivative work. However, I will also argue that such generated code should be considered fair use, and therefore legal, under the Copyright Act.

Is it “fair use” to train AI on code you don’t own?

One of the primary reasons it matters whether the output of these algorithms is derivative of the training data is that the data used to train (read: make it work) any particular model is rarely owned by the model's creator. Examples include the massive image datasets behind Stable Diffusion and ImageNet-trained models, and the huge corpus of web text used for GPT-3.

GitHub proclaims on its site that “training machine learning models on publicly available data is considered fair use across the machine learning community.” But there is little legal precedent to support that statement.

You could argue that previous cases such as Authors Guild v. Google support this view: there, Google made digital copies of millions of books without the permission of rights holders and successfully defended that as transformative fair use. However, it’s clear that there is something fundamentally different between that situation and one in which a developer feeds all of that text into a model as training data.

While I am not certain that precedent supports the idea that training on any public data is fair use, I certainly support it myself. If a court were to declare that training models on publicly available data isn’t acceptable, the difficulty and cost of training AI would skyrocket. This would be a massive impediment to AI research and innovation - the thing that copyright is intended to promote and protect.

As an interesting aside, if a court were to rule that these models are not being trained on public data lawfully, the rest of this discussion could be moot. This is because section 103 of the Copyright Act states that copyright covers derivative works, but that protection does not extend to any part of a work in which preexisting material has been used unlawfully.

Still, even if we assume that the use of public training data is fair, that doesn’t necessarily mean the code returned by a model such as Copilot is not derivative.

Is AI generated code derivative of existing code?

While other tech bros and researchers in the AI space will hate this take, I believe the only viable conclusion you can come to based on current law is that GitHub Copilot and other tools powered by large language models (LLMs) are currently generating derivative works.

Here’s why: when you really consider how these neural networks function, they’re essentially being trained to autocomplete different functions based on a text prompt. Having learned from the mountains of training data that I just discussed (and compressed it into literally billions of parameters), an algorithm can regurgitate a slightly rephrased line of code that best fits the current context.
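To make "autocomplete learned from training data" concrete, here is a deliberately tiny sketch. It is not how Copilot works internally (that is a large neural network); it is a toy bigram model, and the corpus, function names, and prompt are all made up for illustration. What it shows is the basic shape of the idea: the completion is assembled entirely from patterns seen in the training snippets.

```python
from collections import Counter, defaultdict

def train_bigram_model(corpus_lines):
    """Count, for every token, which tokens followed it in the training data."""
    follow_counts = defaultdict(Counter)
    for line in corpus_lines:
        tokens = line.split()
        for current, following in zip(tokens, tokens[1:]):
            follow_counts[current][following] += 1
    return follow_counts

def autocomplete(model, prompt, max_new_tokens=3):
    """Greedily append the most frequent follower of the last token."""
    tokens = prompt.split()
    if not tokens:
        return prompt
    for _ in range(max_new_tokens):
        followers = model.get(tokens[-1])
        if not followers:
            break  # this token was never followed by anything in training
        tokens.append(followers.most_common(1)[0][0])
    return " ".join(tokens)

# Hypothetical "training data": three tiny whitespace-tokenized snippets.
TOY_CORPUS = [
    "for i in range ( n ) :",
    "for item in items :",
    "for i in range ( len ( xs ) ) :",
]

model = train_bigram_model(TOY_CORPUS)
completion = autocomplete(model, "for i")
print(completion)  # for i in range (
```

Every token in the completion was copied from somewhere in the toy corpus, which is the intuition behind calling the output derivative.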

A popular argument in favor of treating the output of LLMs as original is comparing models to humans. Everything everyone makes is derivative of previous works!

If I learn how to paint by admiring Picasso and Van Gogh, and then I make my own paintings in those styles, am I infringing on their works? If I grow up singing Taylor Swift and then I write my own songs inspired by her signature bridges, that’s not copyright infringement, right?! So why would it be any different for a generative AI algorithm?

My answer is a bit technical, but essentially, computers are incapable of being truly creative. The illusion is impressive, but if you step behind the curtain and identify the specific ‘hash’ (in practice, the random seed, along with the same prompt and settings) that returns a specific output, you'll be able to obtain that same output every single time.
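A minimal sketch of that repeatability, assuming the model picks each next token by sampling from a probability distribution with a pseudo-random generator. The token probabilities and the `sample_tokens` helper below are hypothetical, not taken from any real model:

```python
import random

# Hypothetical next-token probabilities a model might assign after "return".
NEXT_TOKEN_PROBS = {"x": 0.5, "None": 0.3, "result": 0.2}

def sample_tokens(seed, count=5):
    """Draw `count` tokens using a generator initialized with `seed`."""
    rng = random.Random(seed)  # private generator, so global state is untouched
    tokens = list(NEXT_TOKEN_PROBS)
    weights = list(NEXT_TOKEN_PROBS.values())
    return rng.choices(tokens, weights=weights, k=count)

first_run = sample_tokens(seed=42)
second_run = sample_tokens(seed=42)
print(first_run == second_run)  # True: same seed, same "creative" choices
```

The apparent randomness comes entirely from the seed. Fix it, along with the input and the settings, and the sequence of choices is reproduced exactly.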

Additionally, the differences between input data and output can sometimes be imperceptible, to the point of being blatant plagiarism. This is troubling in the case of Copilot, where it can return solutions that may have been unique and protected by an open-source attribution license.
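One rough way to put a number on "imperceptible differences" is a text similarity ratio. The sketch below uses Python's standard `difflib` module. Both snippets are invented for illustration, and a character-level ratio is only a crude proxy, not a legal test for copying:

```python
import difflib

def similarity(a: str, b: str) -> float:
    """Return a ratio in [0, 1]; 1.0 means the two strings are identical."""
    return difflib.SequenceMatcher(None, a, b).ratio()

# Hypothetical "original" code and a "generated" version with one rename.
original = (
    "def invert(node):\n"
    "    node.left, node.right = node.right, node.left\n"
)
generated = (
    "def invert(root):\n"
    "    root.left, root.right = root.right, root.left\n"
)

print(f"similarity: {similarity(original, generated):.2f}")
```

A renamed variable is enough to make the two strings differ, yet most programmers would call these the same function.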

For all these reasons it is clear that generated code is derivative of the codebases it was trained on.

So generated code is derivative. Does that mean I should need to credit someone when I use it?

Despite being derivative of other works, in my view, generated code should be fair use. You should not need to cite any source or attribute any license - MIT, CC, AGPL-3.0, Apache-2.0, or otherwise. I’ll make the case using the four factors that 17 U.S.C. § 107 (“Limitations on exclusive rights: Fair use”) defines to determine fair use:

“Purpose and character” of generated code

This is fairly straightforward - while the intent of each individual code completion will be different, every time you use a pair programming tool such as GitHub Copilot, you are autocompleting a snippet of code.

The purpose of the code is to finish your line, function, code block, etc. The code itself can vary dramatically in length, complexity, and how derivative it is of other works.

What mostly matters here is how transformative the code is. Notably, for a pair programming tool, the vast majority of the code it helps author is standard web components, data structures, and algorithms (since that’s what it is good at). Scenarios in which truly novel code is written are few and far between.

In Oracle v. Google, it was determined that Google’s use of certain Java Application Programming Interfaces (APIs) was lawful fair use. This was literally Google winning a case in which they reused code that was derivative of pre-existing software. I do not view Copilot’s re-use or re-implementation of software interfaces written by others as meaningfully different.

When more niche code is generated, it is typically at least somewhat transformative of wherever it was sourced from. Moreover, the purpose the code serves in the greater project will be different.

Generated code could be used for a for-profit venture or a side project, meaning it’s difficult to say how commercial the use is except on a case-by-case basis. Because it is transformative when necessary and not always commercial, the first factor is in favor of fair use.

“Nature” of generated code

This factor is immediately followed in every fair use court case by an acknowledgement that it is usually irrelevant and so I will ignore it too.

“Amount and substantiality” of generated code “in relation to the copyrighted whole”

In the case of Copilot and many other LLMs, the data they are trained on is literally millions of code repositories. When code is auto-completed, it is an agglomeration of any number of these sources. It is simply infeasible for the end user to attempt to ascertain where each piece of code may have been drawn from, and then attribute that source.

The level of requested attribution may also vary dramatically - some code may have been inspired by public repositories with a strict license, and some by repositories that don’t even have one. Even in cases where an entire line is copied verbatim, it’s being included in an entirely new code file that you’re writing, the majority of which is original. Therefore the percentage of code in your overall project that is derivative of any one work is minuscule and insignificant.

Due to the gargantuan number of potential sources and the rarity of large and truly unique code snippets from any one source, as well as the infeasibility of citing auto-completed code in practice, factor three is strongly in favor of fair use.

“Effect of the use of” generated code “on the potential market for the copyrighted work”

Again, because this code has thousands of potential sources that could be infringed upon, it’s difficult to ascertain how much impact it could have on all of them.

However, to build on what I discussed for factor one, the majority of the use cases for a product such as Copilot involve pieces of code that are not novel - “public pieces of information” such as functions creating linked lists or inverting binary trees. It is therefore unlikely that any generated code is going to make an impact on the potential market for the work it is derivative of.

The original project that a programmer is working on and that the auto-completed code is a part of will also not likely impact that available market. And even if that project does end up competing, it will not be because of the generated code - the programmer would have written that anyway without Copilot, or copied it off someplace online such as Stack Overflow.

For these reasons, I find that the fourth factor is also in favor of declaring generated code fair use.

A culture of remixing and reimagining

“Remix culture” is core to the open-source programming community. It’s “in many ways a return to the local, participatory cultures that existed before the relatively recent rise of the mass culture industries,” and it holds that “we must all continue to see ourselves as makers, doers, and creators.”

As also discussed in the Oracle v. Google case, this idea of encouraging the reuse of code is “a custom that underlies most of the internet and personal computing technologies we use every day.”

While I’ve outlined that the remixed code generated by pair programming tools such as GitHub Copilot is derivative work, I have also made a strong case for that code being fair use. And it needs to be fair use to allow for the continued pace of innovation and progress in this space that we’ve seen over the last few years.

Generative text, images, and code will change the way we read, watch, and program. Let’s ensure our laws support that future.