If we assume LLMs are as revolutionary as you're suggesting, then how is model collapse an easy problem to solve? If Google is a relic of the past and the internet is filled with AI-generated content, where will the training data come from? We can't replace human-generated content with AI-generated content without an inevitable model collapse.
Oh and btw, good luck with differentiating between human-generated and AI-generated. Already, social media sites are being cluttered with AI-generated content, Amazon book publishing is being cluttered with shit-tier LLM-generated “books” (cheap imitations), and if academia goes the same way, and entertainment too, as many speculate, there’s hardly anything left.
Oh and btw, good luck with differentiating between human-generated and AI-generated.
One easy way to do this is to check if it was generated before 2023. Not so much AI-generated content from before then.
Amazon book publishing is being cluttered with shit-tier LLM-generated “books”
So filter the books based on how “shit tier” they are.
In the end, what’s needed to train AIs is good content. If some of that good content is itself AI-generated, who cares? You need to be selective in how you pick training material anyway.
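As a sketch of what “being selective” could look like (everything here is a placeholder: the `AI_ERA` cutoff, the `quality_score` heuristic, and the 0.7 threshold are invented for illustration, not anyone’s actual pipeline):

```python
from datetime import date

AI_ERA = date(2023, 1, 1)  # crude cutoff: little LLM output existed before this

def quality_score(text: str) -> float:
    """Stand-in for a real scorer; an actual pipeline would use dedup,
    trained quality classifiers, perplexity filters, reader ratings, etc."""
    words = text.split()
    return len(set(words)) / max(len(words), 1)  # rewards varied vocabulary

def keep(doc: dict) -> bool:
    """Trust pre-AI-era text outright; after that, keep a document only if
    it clears a quality bar, no matter who (or what) wrote it."""
    return doc["published"] < AI_ERA or quality_score(doc["text"]) >= 0.7

corpus = [
    {"published": date(2019, 6, 1), "text": "an old human-written essay"},
    {"published": date(2024, 3, 5), "text": "buy buy buy buy buy buy buy"},
]
print([keep(d) for d in corpus])  # [True, False]
```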
LLMs need updated training data to stay relevant. And how exactly are you going to curate high-quality data when it’s on the order of terabytes or even petabytes?

Yes. So add relevant new data along with the older stuff. The problem is not that AI-generated content is magically “poison” somehow. Model collapse happens when rare data gets lost over repeated generations of training on AI-generated data.
A simple way to imagine it is training an AI by showing it random coloured marbles out of a bucket and then asking it to fill the next AI’s bucket with new marbles to train on. If there’s just one single blue marble in the first bucket then it’s easily possible that the AI will fail to put a blue marble in the second bucket, after which there will never be a blue marble again if that’s all that subsequent AIs have to train off of. But if each time you train a new AI you reuse half the marbles from the first bucket again, you can have that blue marble show back up again in future AIs.
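Here’s a quick toy simulation of that bucket story (all numbers are arbitrary, just to show the mechanics): each generation’s bucket is resampled from the previous one, optionally with a fixed fraction re-drawn from the original human bucket.

```python
import random

def next_bucket(prev, size, anchor, anchor_frac):
    """Build the next generation's bucket: a fixed fraction is re-drawn from
    the original (anchor) bucket, the rest is resampled, with replacement,
    from the previous generation's output."""
    n_anchor = int(size * anchor_frac)
    return random.choices(anchor, k=n_anchor) + random.choices(prev, k=size - n_anchor)

original = ["red"] * 999 + ["blue"]  # one rare "blue marble" in 1,000

for anchor_frac in (0.0, 0.5):
    survived = 0
    for _ in range(200):        # 200 independent runs
        bucket = original
        for _ in range(20):     # 20 generations of retraining
            bucket = next_bucket(bucket, len(original), original, anchor_frac)
        survived += bucket.count("blue") > 0
    print(f"mixing {anchor_frac:.0%} original data: blue survives in {survived}/200 runs")
```

On a typical run the rare colour dies out in most of the no-mixing trials but keeps reappearing when half of each bucket comes from the original data.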
If LLMs are as revolutionary as the zealots believe, then there will be fewer and fewer blue marbles in the universe with each iteration. So either the bucket gets smaller or the ratio of blue marbles gets smaller.
But if each time you train a new AI you reuse half the marbles from the first bucket again, you can have that blue marble show back up again in future AIs.
The original bucket containing the blue marble isn’t going anywhere. It still exists. The blue marble will always be available to mix into future AIs. All you have to do is make sure you’re using some historical data (or otherwise guaranteed “human-generated”) along with whatever new unvetted stuff you’re using.
So then you’re back to locking LLMs to the year 2023. Their usefulness is severely limited if you can’t train them on new data.

I said:

All you have to do is make sure you’re using *some* historical data (or otherwise guaranteed “human-generated”) *along with* whatever new unvetted stuff you’re using.

Emphasis added. Please read more carefully; this is getting repetitive. You keep assuming that the AI will be trained either entirely on old data or entirely on new data, and that’s just not the case.

And what happens when “whatever new unvetted stuff” is primarily composed of AI-generated content?