One of the arguments made for Reddit’s API changes is that Reddit is now a go-to source of LLM training data (e.g. for ChatGPT).
I haven’t seen much discussion of this and would like to hear people’s opinions. Are you concerned about your posts being used for LLM training? Do you not care? Would you prefer that your comments be available to train open-source LLMs?
(I will post my personal opinion in a comment so it can be up/down voted separately)
Archived Reddit posts will certainly be used for that for years to come regardless. What I am curious about is how you feel about your posts contributing to the output of an LLM (independent of API usage costs).
LLMs can be specialized for a task by training them further on a curated set of data. For example, an LLM trained further on your posts will sound more like you than it did before that training. Does it bother you that someone may use your posts for this purpose?
Well, these AIs are already being trained on public figures, and there isn’t much those people can do about it unless someone livestreams with the AI impersonating them, which could let them identify who is behind it. How would anyone figure out that there’s an LLM out there that speaks just like them? It’s similar to fine-tuning AIs on artists’ work to create art that mimics their style. It can be frustrating, but there isn’t much anyone can do short of installing surveillance software on every computer. In summary, I don’t mind, because I won’t even find out.