• artificialfish@programming.dev
    2 days ago

    You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).

    You MIGHT even be able to do that while masking the data using hashing.
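One way that could work, as a hypothetical sketch: the dataset owner publishes only SHA-256 hashes of fixed-size word windows, and anyone can check a model's output against that hash set without ever seeing the raw text. All names here (`CHUNK_WORDS`, `build_index`, `was_memorized`) are made up for illustration, and the 8-word window size is an arbitrary assumption:

```python
import hashlib

CHUNK_WORDS = 8  # assumed window size for exact-match detection

def _hashes(text):
    # Hash every sliding window of CHUNK_WORDS words.
    words = text.lower().split()
    for i in range(len(words) - CHUNK_WORDS + 1):
        chunk = " ".join(words[i:i + CHUNK_WORDS])
        yield hashlib.sha256(chunk.encode()).hexdigest()

def build_index(corpus_docs):
    # Publish only the hashes; the original text never leaves the owner.
    index = set()
    for doc in corpus_docs:
        index.update(_hashes(doc))
    return index

def was_memorized(output, index):
    # True if any CHUNK_WORDS-word window of the output appears
    # verbatim (after lowercasing) somewhere in the corpus.
    return any(h in index for h in _hashes(output))
```

The obvious limitation is that hashing only catches verbatim windows; any paraphrase produces a different hash.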

    • tal@lemmy.today
      2 days ago

      You could always just do reverse search on the open dataset to see if it’s an exact copy (or over a threshold).

      True, but an “exact copy” almost certainly isn’t what gets produced – and a derivative work doesn’t have to be an exact copy of the original; it can just be something that looks a lot like part of the original. So you’d want a pretty good chance of finding derivative works, not just exact copies.
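      The “over a threshold” idea from the parent comment is the usual answer to this: compare overlapping word n-grams and flag outputs whose similarity to an original exceeds some cutoff, so near-copies score high even when no byte-for-byte match exists. A minimal sketch, assuming word 3-grams and an arbitrary 0.5 threshold (both choices are illustrative, not standard values):

      ```python
      def shingles(text, n=3):
          # Set of all n-word sequences ("shingles") in the text.
          words = text.lower().split()
          return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}

      def jaccard(a, b):
          # Overlap between two shingle sets: 1.0 = identical, 0.0 = disjoint.
          sa, sb = shingles(a), shingles(b)
          if not sa or not sb:
              return 0.0
          return len(sa & sb) / len(sa | sb)

      def looks_derivative(output, original, threshold=0.5):
          return jaccard(output, original) >= threshold
      ```

      Note the tension with the hashing idea, though: fuzzy similarity like this needs the shingles themselves (or a locality-sensitive sketch of them, e.g. MinHash), not opaque cryptographic hashes, so it reveals more about the dataset than an exact-match check does.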

      And that would mean that anyone who generates a model would need to provide access to their training corpus, which is gonna be huge – the models, which are themselves large, are a tiny fraction of the size of the training set – and I’m sure that some people generating models aren’t gonna want to provide their full training corpus.