https://github.com/oobabooga/text-generation-webui/commit/9f40032d32165773337e6a6c60de39d3f3beb77d

ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.

It is a highly optimized model loader for GPTQ models and an alternative to options like AutoGPTQ or GPTQ-for-LLaMA, providing faster text generation speeds.

With this update, anyone running GGML exclusively might find some interesting results switching over to a GPTQ-quantized model and testing the changes. I haven’t had a chance to try it myself yet, but I will post some of my own benchmarks and results if I find the time.

I for one am excited to see the efficiency battles begin. Getting compute requirements down is going to be one of the most important hurdles to overcome.
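For a rough sense of why GPTQ quantization matters for VRAM, here is a back-of-the-envelope sketch (my own arithmetic, not from the commit or benchmarks): the weights of a 7B-parameter model at 4 bits take roughly a quarter of the memory of fp16.

```python
def weight_vram_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate VRAM for just the model weights, in GB.

    Ignores activation memory, the KV cache, and framework overhead,
    so real-world usage will be noticeably higher.
    """
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total / 1e9

fp16_gb = weight_vram_gb(7, 16)  # ~14 GB in half precision
gptq_gb = weight_vram_gb(7, 4)   # ~3.5 GB at 4-bit GPTQ
print(f"fp16: {fp16_gb:.1f} GB, 4-bit: {gptq_gb:.1f} GB")
```

This is only the weight storage; the actual speedups ExLlama claims come from its optimized kernels, not from the quantization math itself.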

  • ArkyonVeil@lemmy.world · 1 year ago

    I’m actually playing around with ExLlama. IIRC it works with pretty much every model, and it can be a real game changer, especially for long conversations, code, or stories.

    Unfortunately, there is still the unavoidable problem of the context length burning VRAM like there’s no tomorrow. You either get a decent AI with the attention span of a goldfish, or an idiot AI that can remember three times as much stuff as before.

    Handy progress, but ultimately there is still ground to cover.

    • Blaed@lemmy.world (OP) · 1 year ago (edited)

      I keep hearing about this ExLlama! I really have to try it. Glad to hear it’s going well for you.

      I think it’s only a matter of time until context length is no longer an issue. I’m curious to see how RWKV develops, its infinite context length is interesting.

      I hope they make some major breakthroughs. I like the idea of a super-massive RNN, but a transformer with infinite context length could be a game changer for both architectures.

      • ArkyonVeil@lemmy.world · 1 year ago

        It would be absolutely awesome. With infinite context length, handling models would get much easier: I could be lazy and, instead of training a LoRA, just use an entire book’s style as a reference right there in the prompt.

        For programmers, just dump the entire codebase, or Documentation.

        Of course, all this is only possible if VRAM is less of a bottleneck than it currently is, and if the model can reliably reference information across an arbitrarily large context. (There’s not much use in having a huge context if performance degrades, the model loses its marbles, or it forgets key pieces of information along the way.)
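The VRAM-vs-context trade-off discussed in this thread comes largely from the KV cache, which grows linearly with context length. A rough sketch of that growth (my own estimate using LLaMA-7B-like shapes, not measured numbers):

```python
def kv_cache_gb(context_len: int, n_layers: int = 32, n_heads: int = 32,
                head_dim: int = 128, bytes_per_elem: int = 2) -> float:
    """Approximate fp16 KV-cache size in GB for a single sequence.

    Defaults assume LLaMA-7B-like shapes: 2 tensors (K and V) per
    layer, each of shape [context_len, n_heads, head_dim].
    """
    elems = 2 * n_layers * context_len * n_heads * head_dim
    return elems * bytes_per_elem / 1e9

print(kv_cache_gb(2048))  # ~1.07 GB at a 2k context
print(kv_cache_gb(8192))  # ~4.29 GB at 8k: 4x the context, 4x the cache
```

This linear growth is why tripling the remembered context, as described above, burns VRAM so quickly regardless of how small the quantized weights are.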