Could Reddit's data be "poisoned" to prevent its use in training AI?

nodsocket@lemmy.world · 8 months ago


Natanael@slrpnk.net · 8 months ago

Training on synthetic data is not a quality improvement. It’s just an edge case reducer for a small set of edge cases, by decreasing “overfitting”, and it can only achieve even that if you’re very, very careful about what you add and how. If you’re ONLY training on AI generated data, repeatedly, then the model does start to degrade and lose coherence after a few generations of training.
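
To make the "degrades after a few generations" point concrete, here is a toy sketch, not a claim about how any real model is trained. It assumes the "model" is just a Gaussian fitted to its training set, and each generation is trained on samples drawn from the previous generation's fit. The function name `final_spread` and the `real_fraction` knob are made up for illustration.

```python
import random
import statistics


def final_spread(generations, sample_size, real_fraction=0.0, seed=0):
    """Toy model: repeatedly refit a Gaussian to samples from the previous fit.

    real_fraction is a hypothetical knob: the share of each generation's
    training set drawn fresh from the true N(0, 1) source instead of from
    the previous model. Returns the fitted standard deviation at the end.
    """
    rng = random.Random(seed)
    n_real = round(sample_size * real_fraction)
    n_synth = sample_size - n_real

    # Generation 0 is trained on real data only.
    data = [rng.gauss(0.0, 1.0) for _ in range(sample_size)]
    mu, sigma = statistics.fmean(data), statistics.pstdev(data)

    for _ in range(generations):
        # Synthetic samples come from the previous generation's fit.
        synthetic = [rng.gauss(mu, sigma) for _ in range(n_synth)]
        # Real samples always come from the original distribution.
        real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
        data = synthetic + real
        mu, sigma = statistics.fmean(data), statistics.pstdev(data)

    return sigma


if __name__ == "__main__":
    print("synthetic only:", final_spread(1000, 50))
    print("half real data:", final_spread(1000, 50, real_fraction=0.5))
```

In this toy setup the synthetic-only run loses almost all of its spread, because each refit slightly underestimates the variance and the errors compound, while the run that keeps mixing in fresh real samples does not collapse. It only illustrates the mechanism; it says nothing about how much real data an actual LLM needs.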

FaceDeer@kbin.social · 8 months ago

Which is why nobody trains on ONLY AI generated data.

Really, experts have thought of this stuff already. Because they’re experts. Synthetic data means that the amount of “real” data required is much less, so giant repositories like Reddit aren’t so important.

Natanael@slrpnk.net · 8 months ago

No, “much less” training data isn’t possible with synthetic data. That’s not what it’s there for. The experts would tell you as much if you asked them.