With reported 3x speed gains and limited degradation in output quality, the method targets one of the biggest pain points in production AI systems: latency at scale.
Inception, the company behind the first commercial diffusion large language models (dLLMs), today announced the launch of Mercury 2, which it describes as the fastest reasoning LLM and the first reasoning dLLM. Mercury 2 ...
Researchers from the University of Maryland, Lawrence Livermore, Columbia and TogetherAI have developed a training technique that triples LLM inference speed without auxiliary models or infrastructure ...
The Taalas HC1 running the Llama 3.1 8B model can deliver near-instantaneous responses, even for detailed queries like a ...
Nvidia researchers developed dynamic memory sparsification (DMS), a technique that compresses the KV cache in large language models by up to 8x while maintaining reasoning accuracy — and it can be ...
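The snippet doesn't describe the mechanism, so purely as an illustration of the general idea behind KV-cache sparsification (not Nvidia's DMS itself), here is a minimal sketch that evicts cached entries with low cumulative attention; the function name, shapes, and keep ratio are illustrative assumptions:

```python
import numpy as np

def evict_kv(keys, values, attn_scores, keep_ratio=0.125):
    """Generic KV-cache sparsification sketch (NOT Nvidia's DMS):
    keep the cache entries that received the most cumulative attention,
    shrinking the cache by roughly 1/keep_ratio (8x at 0.125)."""
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # cumulative attention each cached token has received across queries
    importance = attn_scores.sum(axis=0)               # shape: (seq_len,)
    keep = np.sort(np.argsort(importance)[-n_keep:])   # keep top entries, in order
    return keys[keep], values[keep]

# toy usage: 1024 cached tokens, head dimension 64
rng = np.random.default_rng(0)
k = rng.standard_normal((1024, 64))
v = rng.standard_normal((1024, 64))
scores = rng.random((1024, 1024))
k_small, v_small = evict_kv(k, v, scores)
print(k_small.shape)  # (128, 64) -> 8x fewer cached entries
```

At a keep ratio of 0.125 the cache shrinks about 8x, which is the scale of saving the article cites; the accuracy-preservation claim rests on DMS's own method, not on this simplified eviction rule.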
Anthropic updates tool calling to reduce token use; its tool search cuts token consumption by up to 80%, making larger tool sets practical.
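The announcement only quotes the savings; as a rough, library-agnostic sketch of the underlying idea of deferring tool definitions behind a search step (this is not Anthropic's API; `TOOLS`, `search_tools`, and the keyword matching are invented for illustration):

```python
# Hypothetical illustration of deferred tool loading: keep one lightweight
# "search_tools" tool in the prompt and inject full schemas only for the tools
# whose descriptions match the model's query, instead of sending all of them.
TOOLS = {
    "get_weather": "Look up the current weather for a city.",
    "create_invoice": "Create a draft invoice for a customer.",
    "search_tickets": "Search the support-ticket backlog by keyword.",
}

def search_tools(query: str, limit: int = 3) -> list[str]:
    """Naive keyword match standing in for a real embedding/BM25 search."""
    words = query.lower().split()
    scored = [(sum(w in desc.lower() for w in words), name)
              for name, desc in TOOLS.items()]
    return [name for score, name in sorted(scored, reverse=True)[:limit] if score]

# Only the matched tools' full schemas are sent on the next model turn,
# which is where the token savings come from.
print(search_tools("draft an invoice"))  # ['create_invoice']
```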
Generative artificial intelligence startup Writer Inc. today released its newest state-of-the-art enterprise-focused large language model, Palmyra X5, an adaptive reasoning model that features a 1 ...
The evaluation framework was developed to address a critical bottleneck in the AI industry: the absence of consistent, transparent methods to measure memory quality. Today's agents rely on a ...
Users running a quantized 7B model on a laptop expect 40+ tokens per second. A 30B MoE model on a high-end mobile device ...
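For context on why 40+ tokens per second is a plausible laptop target: single-stream decode is roughly memory-bandwidth bound, so throughput is capped by bandwidth divided by the bytes read per token. The figures below (4-bit weights, ~200 GB/s) are assumptions for a back-of-envelope check, not numbers from the article:

```python
# Back-of-envelope decode-throughput ceiling (all figures assumed):
#   tokens/s <= usable_memory_bandwidth / bytes_read_per_token
params = 7e9                                  # 7B parameters
bytes_per_param = 0.5                         # ~4-bit quantization
weight_bytes = params * bytes_per_param       # ~3.5 GB streamed per generated token
laptop_bandwidth = 200e9                      # assumed ~200 GB/s laptop SoC bandwidth
print(laptop_bandwidth / weight_bytes)        # ~57 tokens/s ceiling, ignoring KV reads
```

Under those assumptions the ceiling sits near 57 tokens/s, so 40+ tokens/s leaves headroom for KV-cache reads and other overhead.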
Use the vitals package with ellmer to evaluate and compare the accuracy of LLMs, including writing evals to test local models ...