Meta admits using pirated books to train AI, but won't pay for it

Lee Duna@lemmy.nz · 10 months ago

Meta admits using pirated books to train AI, but won't pay for it

BreadstickNinja@lemmy.world · 10 months ago

Critical to understanding whether this applies is to understand “use” in the first place. I would argue it’d even more important because it’s a threshold question in whether you even need to read 107.

17 U.S. Code § 106 - Exclusive rights in copyrighted works Subject to sections 107 through 122, the owner of copyright under this title has the exclusive rights to do and to authorize any of the following: (1)to reproduce the copyrighted work in copies or phonorecords; (2)to prepare derivative works based upon the copyrighted work; (3)to distribute copies or phonorecords of the copyrighted work to the public by sale or other transfer of ownership, or by rental, lease, or lending; (4)in the case of literary, musical, dramatic, and choreographic works, pantomimes, and motion pictures and other audiovisual works, to perform the copyrighted work publicly; (5)in the case of literary, musical, dramatic, and choreographic works, pantomimes, and pictorial, graphic, or sculptural works, including the individual images of a motion picture or other audiovisual work, to display the copyrighted work publicly; and (6)in the case of sound recordings, to perform the copyrighted work publicly by means of a digital audio transmission.

Copyright protects just what it sounds like- the right to “copy” or reproduce a work along the examples given above. It is not clear that use in training AI falls into any of these categories. The question mainly relates to items 1 and 2.

If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1. If you put a model into an output loop you can get it to reproduce small sections of training data that include passages from copyrighted works, although of course nowhere near the full corpus can be retrieved because the model doesn’t contain any thing close to a full data set - the models are much too small and that’s also not how transformers architecture works. But in some cases, models can preserve and output brief sections of text or distorted images that appear highly similar to at least portions of training data. Even so, it’s not clear that this is protected under copyright law because they are small snippets that are not substitutes for the original work, and don’t affect the market for it.

Case 2 would be relevant if an LLM were classified as a derivative work. But LLMs are also not derivative works in the conventional definition, which is things like translated or abridged versions, or different musical arrangements in the case of music.

For these reasons, it is extremely unclear whether copyright protections are even invoked, becuase the nature of the use in model training does not clearly fall under any of the enumerated rights. This is not the first time this has happened, either - the DMCA of 1998 amended the Copyright Act of 1976 to add cases relating to online music distribution as the previous copyright definitions did not clearly address online filesharing.

There are a lot of strong opinions about the ethics of training models and many people are firm believers that either it should or shouldn’t be allowed. But the legal question is much more hazy, because AI model training was not contemplated even in the DMCA. I’m watching these cases with interest because I don’t think the law is at all settled here. My personal view is that an act of congress would be necessary to establish whether use of copyrighted works in training data, even for purposes of developing a commercial product, should be one of the enumerated protections of copyright. Under current law, I’m not certain that it is.

TWeaK@lemm.ee · 10 months ago

(1)to reproduce the copyrighted work in copies or phonorecords

The works are copied in their entirey and reproduced in the training database. AI businesses do not deny this is copying, but instead claim it is research and thus has a fair use exemption.

I argue it is not research, but product development - and furthermore, unlike traditional R&D, it is not some prototype that is different and separate from the commercial product. The prototype is the commercial product.

(2)to prepare derivative works based upon the copyrighted work

AI can and has reproduced significant portions of copyrighted work, even in spite of the fact that the finished product allegedly does not include the work in its database (it just read the training database).

Furthermore, even if a human genuinely and honestly believes they’re writing something original, that does not matter when they reproduce work that they have read before. What determines copyright infringement is the similarity of the two works.

If you read through the court filings against OpenAI and Stability AI, much of the argument is based around trying to make a claim under case 1.

The position that I take is that the arguments made against OpenAI and Stability AI in court are not complete. They’re not quite good enough. However, that doesn’t mean there isn’t a valid argument that is good enough. I just hope we don’t get a ruling in favour of AI businesses simply because the people challenging them didn’t employ the right ammunition.

With regards to Case 2, I refer back to my comment about the similarity of the work. The argument isn’t that the LLM itself is an infringement of copyright, but that the LLM, as designed by the business, infringes copyright in the same way a human would.

I definitely agree it is all extremely unclear. However, I maintain that the textual definition of the law absolutely still encompasses the feeling that peoples’ work is being ripped off for a commercial venture. Because it is so commercial, original authors are being harmed as they will not see any benefit from the commercial profits.

I would also like to point you to my other comment, which I put a lot of time into and where I expanded on many other points (link to your instance’s version): https://lemmy.world/comment/6706240