I’m sympathetic to the NYT, even if the AI isn’t reproducing their IP verbatim.
AI companies need to acknowledge that their LLMs would be worthless without training data and compensate/credit the sources appropriately.
It’s not just that it circumvents the paywall, it makes up random nonsense and then claims the NYT said it.
I’ve never got why people don’t see this about AI. When it “works” it’s just spitting out what a human was paid (a very low wage) to write. When it has to come up with something that hasn’t been written, it just slaps nonsense together.
It’s not real AI, it’s just a next-generation search engine that gives unreliable results.
You just don’t notice if you don’t already know what you’re asking.
Even though these LLMs work by just figuring out the next word (token) that makes sense, they are still able to generate things that no human has ever written before. It isn’t just copy-pasting stuff together.
I use GPT-4 on a daily basis for coding, and the way it spits out complex code templates/snippets, which are unique to the problem, is just not possible without the model having some level of intelligence. Of course it hallucinates now and then, but so do most coders.
Thanks for reading the article and addressing the claims instead of making up stuff to be mad about…
Oh wait… 🙄
Large language models are just like humans.
…humans don’t accidentally plagiarize whole articles. They also understand the difference between theft and fair use, and AI has been shown to not respect that distinction multiple times. You can also sue humans for damages when they steal from you. Apparently LLMs are immune to legal liability because oopsie poopsie mistakes happen uwu.
LLMs are cool and useful, but if they’re harming the data sources they wouldn’t exist without, shouldn’t we do something?
I’ve been teaching academic writing for the last ten years and would strongly object to your first two assertions 😄
Lmao yeah, fair enough.
Edit: I think the important word is “accidentally” on that first point. 😉
Never gonna happen.
The NYT might win some money based on what Microsoft published, but only to the same extent as if a human wrote that and Microsoft published it. Copyright will never be an issue for training data because training is just scanning text and guessing the next token. Consuming an entire library to make up anything you ask for is pretty goddamn transformative.
Oh, does the model know the names of characters in a popular book? So do Google and Wikipedia. Try framing a law that’s cool with Google having a whole searchable plain-text copy of a book, so it can go ‘this book?’ when you search for a quote, but forbids OpenAI from having the essence of that book distilled somewhere in its terabyte of inscrutable numbers.
This fight is over.