https://github.com/oobabooga/text-generation-webui/commit/9f40032d32165773337e6a6c60de39d3f3beb77d
ExLlama is an extremely optimized GPTQ backend for LLaMA models. It features much lower VRAM usage and much higher speeds due to not relying on unoptimized transformers code.
It is a highly optimized model loader for GPTQ models. It’s an alternative to options like AutoGPTQ or GPTQ-for-LLaMA, and provides faster text generation speeds.
With this update, anyone running GGML exclusively might find some interesting results by switching over to a GPTQ-quantized model and testing the changes. I haven’t had a chance yet myself, but I will post some of my own benchmarks and results if I find the time for it.
I for one am excited to see the efficiency battles begin. Getting compute down is going to be one of the most important hurdles to overcome.
A comment from a Reddit user (Fuzzlewhumper) regarding these changes:
I would be curious to see if the efficiency change is that drastic. I will do my best to include my findings in the larger model benchmark post I am piecing together.
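For anyone who wants to run a similar comparison in the meantime, here is a rough sketch of how generation speed could be measured in tokens per second. It is only an illustration: `tokens_per_second` and `fake_backend` are hypothetical names I made up, and the stand-in backend just sleeps instead of calling ExLlama, AutoGPTQ, or any real loader. You would swap in a function that calls whichever backend you are testing.

```python
import time
from typing import Callable, List


def tokens_per_second(
    generate: Callable[[str, int], List[int]],
    prompt: str,
    max_new_tokens: int,
    runs: int = 3,
) -> float:
    """Return the average generation speed (tokens/s) over several runs.

    `generate` is an assumed interface: it takes a prompt and a token budget
    and returns the list of newly generated token ids.
    """
    total_tokens = 0
    total_seconds = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt, max_new_tokens)
        total_seconds += time.perf_counter() - start
        total_tokens += len(tokens)
    return total_tokens / total_seconds


def fake_backend(prompt: str, max_new_tokens: int) -> List[int]:
    """Hypothetical stand-in for a real model loader; it only simulates work."""
    time.sleep(0.01)
    return list(range(max_new_tokens))


if __name__ == "__main__":
    speed = tokens_per_second(fake_backend, "Hello", max_new_tokens=20)
    print(f"{speed:.1f} tokens/s")
```

Running the same prompt and token budget against each backend, and averaging over a few runs, keeps the comparison reasonably fair. VRAM usage would need to be tracked separately.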