Europe's world-first AI rules are set for final approval. Here's what happens next

@girlfreddy@lemmy.ca · 4 months ago

@WalnutLum@lemmy.ml · 4 months ago

Doesn’t seem like it outside this:

Developers of general purpose AI models — from European startups to OpenAI and Google — will have to provide a detailed summary of the text, pictures, video and other data on the internet that is used to train the systems as well as follow EU copyright law.

Which makes me think that it’ll be used to require models to truly open their “source”

The FOSS community really needs to come up with a better definition and licensing model for LLMs and other neural networks, though. I’ve repeatedly seen people refer to freely provided pre-trained models as “open source”.

AIs aren’t truly open source unless their training code and training data are fully provided. Anything else is at most semi-obfuscated and definitely not “open”.

@General_Effort@lemmy.world · 4 months ago

Which makes me think that it’ll be used to require models to truly open their “source”

I forgot to mention: That’s unlikely. It only requires a “summary”, which will be of limited use for reverse engineering the big models. It does, however, provide a club with which to beat small developers.

I don’t think many people who publish finetunes on Hugging Face (think GitHub for AI models) will bother with this. I’m not sure what that would mean for the legality of HF on the whole.

@WalnutLum@lemmy.ml · 4 months ago

HF already has mechanisms for sharing datasets through the hub so I don’t think this would be a big lift for them legally

@General_Effort@lemmy.world · 4 months ago

Yes, and some of those datasets might be illegal in some EU countries, but that’s not the point. You need to have the copyright summary so that the model is compliant with EU regulations. Just hosting them for free download is probably fine, if I understand correctly.

@muntedcrocodile@lemmy.world · 4 months ago

Damn, it’s actually helping FOSS. Another good one by the EU. Yeah, people calling the Llama models FOSS is just plain wrong and gives the Zucc more credit than he deserves.

@General_Effort@lemmy.world · 4 months ago

Why do you need the training data? To me, if you can use it and modify it as you wish then it’s open source. If you need a copy of the training data then that’s a problem, even outside the EU.

Many (all?) of the so-called open source models have “ethical” restrictions on use, so technically not open. It’s close enough to me, for now. In the future, such clauses will become an issue. Imagine if printing presses came with restrictions on what you can and can’t print.

@9bananas@lemmy.world · 4 months ago

all models carry bias (see recent gemini headlines for an extreme example), and what exactly those are can range from important to extremely important, depending on the use case!

it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!
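a toy illustration of that point, not taken from any real model: the sketch below trains the same tiny network on the same four data points twice and changes only the random seed used for initialization. the network size, learning rate, epoch count and the `train` helper are all assumptions made up for this example.

```python
import math
import random

# Toy dataset (XOR): identical for every training run below.
DATA = [((0.0, 0.0), 0.0), ((0.0, 1.0), 1.0), ((1.0, 0.0), 1.0), ((1.0, 1.0), 0.0)]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def train(seed, hidden=2, epochs=2000, lr=0.5):
    """Train a tiny one-hidden-layer network with plain SGD; only the seed varies."""
    rng = random.Random(seed)
    # w1[j] = [weight for x0, weight for x1, bias] of hidden unit j
    w1 = [[rng.uniform(-1, 1) for _ in range(3)] for _ in range(hidden)]
    # w2 = one weight per hidden unit, followed by the output bias
    w2 = [rng.uniform(-1, 1) for _ in range(hidden + 1)]
    for _ in range(epochs):
        for (x0, x1), target in DATA:
            h = [sigmoid(w[0] * x0 + w[1] * x1 + w[2]) for w in w1]
            out = sigmoid(sum(w2[j] * h[j] for j in range(hidden)) + w2[hidden])
            # Gradient of the squared error with respect to the output pre-activation.
            d_out = (out - target) * out * (1.0 - out)
            for j in range(hidden):
                # Hidden-unit gradient, computed before w2[j] is updated.
                d_h = d_out * w2[j] * h[j] * (1.0 - h[j])
                w2[j] -= lr * d_out * h[j]
                w1[j][0] -= lr * d_h * x0
                w1[j][1] -= lr * d_h * x1
                w1[j][2] -= lr * d_h
            w2[hidden] -= lr * d_out
    return w1, w2


model_a = train(seed=1)
model_b = train(seed=2)
print("seed 1 hidden weights:", model_a[0])
print("seed 2 hidden weights:", model_b[0])
```

same data, same code, different seed, different weights. re-running with the same seed reproduces the same weights, which is the part you can only check if you have the data and the training recipe.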

these are just 2 examples, there’s many more.

also, you are thinking of LLMs, which is just one kind of model. this legislation applies to all AI models, not just LLMs!

(and your definition of open source is…unique.)

@General_Effort@lemmy.world · 4 months ago

all models carry bias (see recent gemini headlines for an extreme example), and what exactly those are can range from important to extremely important, depending on the use case!

it’s also important if you want to iterate on a model: if you use the same data set and train the model slightly differently, you could end up with entirely different models!

Meaning what?

(and your definition of open source is…unique.)

I omitted requirements on freely sharing it as implied, but otherwise?

@9bananas@lemmy.world · 4 months ago

Meaning what?

meaning the model’s training data is what lets you work around or improve on that bias. without the training data, that’s (borderline) impossible. so in order to tweak models and develop them further, you need to know what exactly went into the model, or you’ll waste a lot of time guessing around.

I omitted requirements on freely sharing it as implied, but otherwise?

you disregarded half of what makes an AI model. the half that actually results in a working model. without the training data, you’d only have some code that does…something.

and that something is entirely dependent on the training data!

so it’s essential, not optional, for any kind of “open source” AI, because without it you’re working with a black box. which is by definition NOT open source.

@General_Effort@lemmy.world · 4 months ago

@WalnutLum@lemmy.ml

Asking for the training data is more like asking for detailed design documentation in addition to source code, so that others can rewrite the code from scratch.

Neural networks are inherently black boxes. Knowing the training data does little to change that. Given the sheer volume of data used in training the interesting models, more than very high level knowledge is not possible in any case.

There are open datasets, as well as open models. If open source models are only those trained on open datasets, then we need a new word for the status of most models. As it is, “open source model” and “open source dataset” are pretty clear. There’s no need to make it complicated.

If it is also a requirement that the data itself should be downloadable, then open source AI would be illegal in many countries. Much of the data will be under copyright, meaning that it can’t be shared in many countries. E.g., the original Stable Diffusion was trained on an open dataset. The dataset only contained links to images, since sharing the actual images would have been illegal in their jurisdiction. Link rot being what it is, the original data became unavailable pretty quickly. It has been alleged that some of the links pointed to CSAM, so now even the links are a hot potato.

meaning the model’s training data is what lets you work around or improve on that bias. without the training data,

Do you have any source that explains how this would work?

@WalnutLum@lemmy.ml · 4 months ago

Open sourcing the training method without open sourcing the training data is essentially like making only part of your full source open to the public.

Even going as far as making your training method source available, and a pre-trained kernel available (like what Mistral does) is essentially the same as what a lot of open source-adjacent companies provide.

A pre-trained neural kernel is effectively no different from a pre-compiled binary library (like a DLL). So what these companies are providing is closed-source binaries alongside the compilation instructions for them. But without the data that trained the kernel it can hardly be called “open source”, as the actual “source” of the logic behind the kernel (the training data) is still closed to the public.

You can fine-tune and re-train and re-quantize the models all you want, but you’re not really manipulating the “source” if all you have is the GPTQ or safetensors or some other pre-trained set of weights.
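To make the “binary blob” comparison concrete, here is a minimal sketch of what a weights file looks like from the inside. It assumes the safetensors layout as I understand it (an 8-byte little-endian header length, a JSON header mapping tensor names to dtype, shape and byte offsets, then raw bytes). The `build_file` and `read_header` helpers and the tensor names are made up for illustration, and it uses only the standard library rather than the real `safetensors` package.

```python
import json
import struct


def build_file(tensors):
    """Pack {name: list of floats} into a safetensors-style byte string."""
    header = {}
    payload = b""
    for name, values in tensors.items():
        raw = struct.pack(f"<{len(values)}f", *values)  # little-endian float32
        header[name] = {
            "dtype": "F32",
            "shape": [len(values)],
            "data_offsets": [len(payload), len(payload) + len(raw)],
        }
        payload += raw
    header_bytes = json.dumps(header).encode("utf-8")
    # 8-byte unsigned little-endian header length, then header, then raw data.
    return struct.pack("<Q", len(header_bytes)) + header_bytes + payload


def read_header(blob):
    """Return the JSON header: names, dtypes, shapes and offsets, nothing else."""
    (header_len,) = struct.unpack("<Q", blob[:8])
    return json.loads(blob[8:8 + header_len].decode("utf-8"))


# Hypothetical tensor names and values, purely for demonstration.
blob = build_file({"layer0.weight": [0.12, -0.5, 0.33], "layer0.bias": [0.01]})
for name, info in read_header(blob).items():
    print(name, info["dtype"], info["shape"], info["data_offsets"])
```

Everything you can inspect is tensor names, shapes and numbers. Nothing in the file records which documents or images produced those numbers, which is the sense in which it resembles a compiled artifact rather than source.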