With reported 3x speed gains and limited degradation in output quality, the method targets one of the biggest pain points in production AI systems: latency at scale.
Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, which it describes as the fastest reasoning LLM and the first reasoning dLLM. Mercury 2 ...
Researchers from the University of Maryland, Lawrence Livermore, Columbia and TogetherAI have developed a training technique that triples LLM inference speed without auxiliary models or infrastructure ...
The Taalas HC1 running the Llama 3.1 8B model can deliver near-instantaneous responses, even for detailed queries like a ...
Nvidia researchers developed dynamic memory sparsification (DMS), a technique that compresses the KV cache in large language models by up to 8x while maintaining reasoning accuracy — and it can be ...
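The snippet doesn't describe the mechanism, so purely as an illustration of the general idea behind KV-cache sparsification (not Nvidia's DMS itself), here is a minimal sketch that evicts cached entries with low cumulative attention; the function name, shapes, and keep ratio are illustrative assumptions:

```python
import numpy as np

def evict_kv(keys, values, attn_scores, keep_ratio=0.125):
    """Generic KV-cache sparsification sketch (NOT Nvidia's DMS):
    keep the cache entries that received the most cumulative attention,
    shrinking the cache by roughly 1/keep_ratio (8x at 0.125)."""
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # cumulative attention each cached token has received across queries
    importance = attn_scores.sum(axis=0)               # shape: (seq_len,)
    keep = np.sort(np.argsort(importance)[-n_keep:])   # keep top entries, in order
    return keys[keep], values[keep]

# toy usage: 1024 cached tokens, head dimension 64
rng = np.random.default_rng(0)
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))
scores = rng.random((1024, 1024))
k_small, v_small = evict_kv(k, v, scores)
print(k_small.shape)  # (128, 64) -> 8x fewer cached entries
```

At a keep ratio of 0.125 the cache shrinks about 8x, which is the scale of saving the article cites; the accuracy-preservation claim rests on DMS's own method, not on this simplified eviction rule.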
Anthropic updates tool calling to reduce token use; its tool search cuts token consumption by up to 80%, making larger tool sets practical.
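The announcement only quotes the savings; as a rough, library-agnostic sketch of the underlying idea of deferring tool definitions behind a search step (this is not Anthropic's API; `TOOLS`, `search_tools`, and the keyword matching are invented for illustration):

```python
# Hypothetical illustration of deferred tool loading: keep one lightweight
# "search_tools" tool in the prompt and inject full schemas only for the tools
# whose descriptions match the model's query, instead of sending all of them.
TOOLS = {
    "get_weather": "Look up the current weather for a city.",
    "create_invoice": "Create a draft invoice for a customer.",
    "search_tickets": "Search the support-ticket backlog by keyword.",
}

def search_tools(query: str, limit: int = 3) -> list[str]:
    """Naive keyword match standing in for a real embedding/BM25 search."""
    words = query.lower().split()
    scored = [(sum(w in desc.lower() for w in words), name)
              for name, desc in TOOLS.items()]
    return [name for score, name in sorted(scored, reverse=True)[:limit] if score]

# Only the matched tools' full schemas are sent on the next model turn,
# which is where the token savings come from.
print(search_tools("draft an invoice"))  # ['create_invoice']
```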
Generative artificial intelligence startup Writer Inc. today released its newest state-of-the-art enterprise-focused large language model, Palmyra X5, an adaptive reasoning model that features a 1 ...
The evaluation framework was developed to address a critical bottleneck in the AI industry: the absence of consistent, transparent methods to measure memory quality. Today's agents rely on a ...
Users running a quantized 7B model on a laptop expect 40+ tokens per second. A 30B MoE model on a high-end mobile device ...
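For context on why 40+ tokens per second is a plausible laptop target: single-stream decode is roughly memory-bandwidth bound, so throughput is capped by bandwidth divided by the bytes read per token. The figures below (4-bit weights, ~200 GB/s) are assumptions for a back-of-envelope check, not numbers from the article:

```python
# Back-of-envelope decode-throughput ceiling (all figures assumed):
#   tokens/s <= usable_memory_bandwidth / bytes_read_per_token
params = 7e9                                  # 7B parameters
bytes_per_param = 0.5                         # ~4-bit quantization
weight_bytes = params * bytes_per_param       # ~3.5 GB streamed per generated token
laptop_bandwidth = 200e9                      # assumed ~200 GB/s laptop SoC bandwidth
print(laptop_bandwidth / weight_bytes)        # ~57 tokens/s ceiling, ignoring KV reads
```

Under those assumptions the ceiling sits near 57 tokens/s, so 40+ tokens/s leaves headroom for KV-cache reads and other overhead.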
Use the vitals package with ellmer to evaluate and compare the accuracy of LLMs, including writing evals to test local models ...