• fkn
    4 · 11 months ago

I don’t know if I agree with everything you wrote, but I think the argument about LLMs basically transforming the text is important.

Converting written text into numbers doesn’t fundamentally change the text. It’s still the author’s original work, just translated into a vector format. Reproduction of that vector format is still reproduction without citation.

  • ayaya
      6 · 11 months ago (edited)

But it’s not just converting them into a different format. It’s not even storing that information at all. It can’t actually reproduce anything from the dataset unless the dataset is really small or the model is completely overfitted, neither of which applies to GPT, given how massive it is.

Each neuron, which represents a word or a phrase, is a set of weights. One source makes a neuron go up by 0.000001% and then another source makes it go down by 0.000001%. And then you repeat that millions and millions of times. The model has absolutely zero knowledge of any specific source in its training data; it only knows how often different words and phrases occur next to each other. Or for images it only knows that certain pixels are weighted to be certain colors. Etc.
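A toy sketch of that accumulation (made-up numbers, nothing like a real training loop): each "source" nudges a weight by a tiny amount, and after millions of updates only the accumulated value remains, with no record of any individual source.

```python
# Toy illustration: a single weight nudged by many tiny updates.
# All values are invented for illustration, not real training math.
weight = 0.5
learning_rate = 1e-8  # each source moves the weight by a tiny fraction

# Alternate "up" and "down" nudges, millions of times.
for source_gradient in [+1.0, -1.0] * 1_000_000:
    weight += learning_rate * source_gradient

# No individual source is stored -- only the final weight survives.
print(weight)
```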

    • fkn
        -1 · 11 months ago

This is a misunderstanding on your part. While some neurons are trained this way, word2vec and doc2vec are not these mechanisms. LLMs are extensions of these models, and while there are certainly some aspects of what you are describing, there is a transcription into vector formats.

This is the power of vectorization of language (among other things). The one-to-one mapping between words and vectors (and likewise from sentences to documents and so forth) allows models to describe the distance between words or phrases using Euclidean geometry.
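The distance idea can be sketched in a few lines. These 3-d "embeddings" are invented values purely for illustration; real word2vec vectors have hundreds of learned dimensions.

```python
import numpy as np

# Toy 3-d word vectors -- invented values for illustration only.
vectors = {
    "king":  np.array([0.8, 0.6, 0.1]),
    "queen": np.array([0.7, 0.7, 0.2]),
    "apple": np.array([0.1, 0.2, 0.9]),
}

def euclidean(a: str, b: str) -> float:
    # Straight-line (Euclidean) distance between two word vectors.
    return float(np.linalg.norm(vectors[a] - vectors[b]))

print(euclidean("king", "queen"))  # small: related words sit close together
print(euclidean("king", "apple"))  # larger: unrelated words sit far apart
```

In a trained model, semantically related words end up with small distances like this, which is exactly the geometric structure being described.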

      • ayaya
          2 · 11 months ago (edited)

          I was trying to make it as simple as possible. The format is irrelevant. The model is still storing nothing but weights at the end of the day. Storing the relationships between words and sentences is not the same thing as storing works in a different format which is what your original comment implied.

        • fkn
            -1 · 11 months ago

            I’m sorry you failed to grasp how it works in this context.

      • discodoubloon
          0 · 11 months ago

          You made me really interested in this concept so I asked GPT-4 what the furthest word away from the word “vectorization” would be.

          Interesting game! If we’re aiming for a word that’s conceptually, contextually, and semantically distant from “vectorization,” I’d pick “marshmallow.” While “vectorization” pertains to complex computational processes and mathematics, “marshmallow” is a soft, sweet confectionery. They’re quite far apart in terms of their typical contexts and meanings.

          It honestly never ceases to surprise me. I’m gonna play around with some more. I do really like the idea that it’s essentially a word calculator.

        • fkn
            3 · 11 months ago

            Try asking it how the vectorization of king and queen are related.
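The king/queen relation is the classic word-vector analogy: king − man + woman lands near queen. A minimal sketch with invented 2-d vectors (dimensions chosen by hand as "royalty" and "maleness"; real embeddings learn such directions from data):

```python
import numpy as np

# Hand-picked 2-d toy embeddings: [royalty, maleness].
vec = {
    "king":  np.array([1.0,  1.0]),
    "queen": np.array([1.0, -1.0]),
    "man":   np.array([0.0,  1.0]),
    "woman": np.array([0.0, -1.0]),
}

# The classic analogy: subtract "man", add "woman".
result = vec["king"] - vec["man"] + vec["woman"]

# The nearest word to the result vector is "queen".
nearest = min(vec, key=lambda w: float(np.linalg.norm(vec[w] - result)))
print(nearest)  # -> queen
```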