Production LLM serving is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack drives your tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet.
This comparison focuses on four widely used stacks:
- vLLM
- NVIDIA TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy

1. vLLM, PagedAttention as the open baseline
Core idea
vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence.
Instead of allocating one big KV region per request, vLLM:
- Divides KV cache into fixed size blocks
- Maintains a block table that maps logical tokens to physical blocks
- Shares blocks between sequences wherever prefixes overlap
This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.
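To make the block-table idea concrete, here is a minimal, hypothetical sketch of paged KV bookkeeping; the class names, block size constant, and allocation policy are invented for illustration and are not vLLM internals.

```python
# Hypothetical illustration of PagedAttention-style KV block management.
# This is NOT vLLM's internal API; names and structure are invented for clarity.

BLOCK_SIZE = 16  # tokens per KV block (illustrative; the real block size is configurable)

class KVBlockAllocator:
    """Hands out fixed-size physical block ids and tracks free blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Per-request block table: logical block index -> physical block id."""
    def __init__(self):
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self, allocator: KVBlockAllocator) -> None:
        # A new physical block is only needed when the last block is full, so
        # KV memory grows in BLOCK_SIZE-token steps instead of being reserved
        # up front for the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

allocator = KVBlockAllocator(num_blocks=1024)
seq = Sequence()
for _ in range(40):            # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token(allocator)
print(seq.block_table)         # three physical block ids, not one contiguous region
```

Prefix sharing falls out of the same structure: two sequences with a common prompt prefix can point their leading block-table entries at the same physical blocks.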
Throughput and latency
The vLLM paper reports 2–4× higher throughput than systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences.
Key properties for operators:
- Continuous batching (also called inflight batching) merges incoming requests into existing GPU batches instead of waiting for fixed batch windows.
- On typical chat workloads, throughput scales close to linearly with concurrency until KV memory or compute saturates.
- P50 latency remains low for moderate concurrency, but P99 can degrade once queues are long or KV memory is tight, especially for prefill heavy queries.
vLLM exposes an OpenAI compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is widely used as an open baseline.
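As a concrete example, a locally launched vLLM server can be queried with the standard OpenAI Python client; the base URL, port, and model name below are assumptions for a local deployment, not fixed values.

```python
# Querying a local vLLM server through its OpenAI-compatible endpoint.
# Assumes the server was started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# The model name and port are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```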
KV and multi tenant
- PagedAttention gives near zero KV waste and flexible prefix sharing within and across requests.
- Each vLLM process serves one model; multi tenant and multi model setups are usually built with an external router or API gateway that fans out to multiple vLLM instances (a minimal routing sketch follows)
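A hypothetical fan-out router could look like the sketch below; the endpoint map, model names, and hand-rolled forwarding are illustrative only, since real deployments typically use an API gateway, Ray Serve, or a Kubernetes ingress.

```python
# Hypothetical gateway that fans requests out to per-model vLLM instances.
# Endpoints and model names are placeholders, not a prescribed architecture.
import requests

MODEL_ENDPOINTS = {
    "llama-8b": "http://vllm-llama:8000/v1/chat/completions",
    "qwen-14b": "http://vllm-qwen:8000/v1/chat/completions",
}

def route_chat(model: str, messages: list[dict], **params) -> dict:
    """Pick the vLLM instance serving `model` and forward the OpenAI-style payload."""
    url = MODEL_ENDPOINTS[model]            # raises KeyError for unknown models
    payload = {"model": model, "messages": messages, **params}
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()

# reply = route_chat("llama-8b", [{"role": "user", "content": "hello"}], max_tokens=64)
```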
2. TensorRT-LLM, hardware maximum on NVIDIA GPUs
Core idea
TensorRT-LLM is NVIDIA’s optimized inference library for its GPUs. The library provides custom attention kernels, inflight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding.
It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.
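Recent TensorRT-LLM releases expose a high-level Python LLM API; the sketch below assumes that interface, uses a placeholder Hugging Face model id, and leaves out engine-build and quantization options, whose exact names vary by version, so treat it as an outline rather than a drop-in script.

```python
# Sketch of offline generation with TensorRT-LLM's high-level LLM API.
# Assumes a recent TensorRT-LLM release; the model id is a placeholder and
# quantization / engine-build options differ by version, so check the docs
# for your installed release before relying on exact argument names.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # builds or loads a TensorRT engine

sampling = SamplingParams(max_tokens=128, temperature=0.2)

outputs = llm.generate(
    ["Explain inflight batching in two sentences."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```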
Measured performance
NVIDIA’s H100 vs A100 evaluation is the most concrete public reference:
- On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests, with ~100 ms time to first token.
- H100 FP8 achieves up to 4.6× higher max throughput and 4.4× faster first token latency than A100 on the same models.
For latency sensitive modes:
- TensorRT-LLM on H100 can drive TTFT below 10 ms in batch 1 configurations, at the cost of lower overall throughput.
These numbers are model and shape specific, but they give a realistic scale.
Prefill vs decode
TensorRT-LLM optimizes both phases:
- Prefill benefits from high throughput FP8 attention kernels and tensor parallelism
- Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion
The result is very high tokens/s across a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.
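A useful mental model is that end-to-end latency splits into a prefill term (roughly input tokens divided by prefill throughput) plus a decode term (output tokens divided by per-sequence decode rate); the sketch below uses made-up throughput numbers purely to show the arithmetic.

```python
# Back-of-envelope latency model for a single request, using invented numbers.
# Real prefill and decode rates depend on model, GPU, batch size, and kernels,
# so measure them on your own hardware before drawing conclusions.

def estimate_latency(input_tokens: int,
                     output_tokens: int,
                     prefill_tok_per_s: float = 20_000.0,   # assumed prefill rate
                     decode_tok_per_s: float = 60.0):       # assumed per-sequence decode rate
    ttft = input_tokens / prefill_tok_per_s                 # time to first token ~ prefill
    decode = output_tokens / decode_tok_per_s               # streaming generation time
    return ttft, ttft + decode

ttft, total = estimate_latency(input_tokens=4_000, output_tokens=300)
print(f"TTFT ~ {ttft*1000:.0f} ms, total ~ {total:.1f} s")
# -> TTFT ~ 200 ms, total ~ 5.2 s with these assumed rates
```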
KV and multi tenant
TensorRT-LLM provides:
- Paged KV cache with configurable layout
- Support for long sequences, KV reuse and offloading
- Inflight batching and priority aware scheduling primitives
NVIDIA pairs this with Ray based or Triton based orchestration patterns for multi tenant clusters. Multi model support is done at the orchestrator level, not inside a single TensorRT-LLM engine instance.
3. Hugging Face TGI v3, long prompt specialist and multi backend gateway
Core idea
Text Generation Inference (TGI) is a Rust and Python based serving stack that adds:
- HTTP and gRPC APIs
- Continuous batching scheduler
- Observability and autoscaling hooks
- Pluggable backends, including vLLM style engines, TensorRT-LLM, and other runtimes
Version 3 focuses on long prompt processing through chunking and prefix caching.
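For example, a running TGI server can be queried over its HTTP `/generate` endpoint; the host, port, and generation parameters below are assumptions for a local test deployment.

```python
# Minimal client call against a TGI server's /generate endpoint.
# Assumes TGI is already running locally (for example via the official Docker
# image) and listening on port 8080; adjust the URL and parameters for your setup.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is prefix caching in one sentence?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.2},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```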
Long prompt benchmark vs vLLM
The TGI v3 docs give a clear benchmark:
- On long prompts with more than 200,000 tokens, a conversation reply that takes 27.5 s in vLLM can be served in about 2 s in TGI v3.
- This is reported as a 13× speedup on that workload.
- TGI v3 is able to process about 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching.
The mechanism is:
- TGI keeps the original conversation context in a prefix cache, so subsequent turns only pay for incremental tokens
- Cache lookup overhead is on the order of microseconds, negligible relative to prefill compute
This is a targeted optimization for workloads where prompts are extremely long and reused across turns, for example RAG pipelines and analytic summarization.
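Conceptually, the cache behaves like a map from a prompt-prefix key to already computed KV state, so only the new suffix has to be prefetched and prefilled; the sketch below is a toy illustration of that idea, not TGI's actual prefix data structure, which operates on tokenized KV blocks rather than strings.

```python
# Toy illustration of prefix caching: reuse KV state for a previously seen
# prompt prefix and only "prefill" the new suffix. This is a conceptual model,
# not TGI's real implementation.
import hashlib

class PrefixCache:
    def __init__(self):
        self._store: dict[str, object] = {}   # prefix hash -> cached KV state

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def lookup(self, prompt: str, known_prefix: str):
        """Return (cached_kv, remaining_text_to_prefill)."""
        key = self._key(known_prefix)
        if prompt.startswith(known_prefix) and key in self._store:
            return self._store[key], prompt[len(known_prefix):]
        return None, prompt                    # cache miss: prefill everything

    def store(self, prefix: str, kv_state) -> None:
        self._store[self._key(prefix)] = kv_state

cache = PrefixCache()
conversation = "SYSTEM ... very long shared context ..."
cache.store(conversation, kv_state="<KV blocks for the long context>")

kv, to_prefill = cache.lookup(conversation + "\nUSER: next question", conversation)
print(to_prefill)   # only the new turn is paid for at prefill time
```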
Architecture and latency behavior
Key components:
- Chunking, very long prompts are split into manageable segments for KV and scheduling
- Prefix caching, data structure to share long context across turns
- Continuous batching, incoming requests join batches of already running sequences
- PagedAttention and fused kernels in the GPU backends
For short chat style workloads, throughput and latency are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.
Multi backend and multi model
TGI is designed as a router plus model server architecture. It can:
- Route requests across many models and replicas
- Target different backends, for example TensorRT-LLM on H100 plus CPU or smaller GPUs for low priority traffic
This makes it suitable as a central serving tier in multi tenant environments.
4. LMDeploy, TurboMind with blocked KV and aggressive quantization
Core idea
LMDeploy from the InternLM ecosystem is a toolkit for compressing and serving LLMs, centered around the TurboMind engine. It focuses on:
- High throughput request serving
- Blocked KV cache
- Persistent batching (continuous batching)
- Quantization of weights and KV cache
Relative throughput vs vLLM
The project states:
- "LMDeploy delivers up to 1.8× higher request throughput than vLLM", supported by persistent batching, blocked KV cache, dynamic split and fuse, tensor parallelism, and optimized CUDA kernels.
KV, quantization and latency
LMDeploy includes:
- Blocked KV cache, similar to paged KV, which helps pack many sequences into VRAM
- Support for KV cache quantization, typically int8 or int4, to cut KV memory and bandwidth
- Weight only quantization paths such as 4 bit AWQ
- A benchmarking harness that reports token throughput, request throughput, and first token latency
This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid range GPUs with aggressive compression while still maintaining good tokens/s.
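As an illustration, LMDeploy's Python pipeline can be configured with a quantized KV cache through the TurboMind backend config; the model id below is a placeholder, and option names such as `quant_policy` and `cache_max_entry_count` follow recent LMDeploy docs but should be checked against your installed version.

```python
# Sketch of serving an open model with LMDeploy's TurboMind backend and an
# int8-quantized KV cache. The model id is a placeholder; option names may
# differ between versions, so verify against your installed release.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,              # 8 -> int8 KV cache, 4 -> int4 KV cache, 0 -> disabled
    cache_max_entry_count=0.8,   # fraction of free GPU memory reserved for KV blocks
    tp=1,                        # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)

responses = pipe(["Give one advantage of a blocked KV cache."])
print(responses[0].text)
```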
Multi model deployments
LMDeploy provides a proxy server able to handle:
- Multi model deployments
- Multi machine, multi GPU setups
- Routing logic to select models based on request metadata
So architecturally it sits closer to TGI than to a single engine.
What to use when?
- If you want maximum throughput and very low TTFT on NVIDIA GPUs
  - TensorRT-LLM is the primary choice
  - It uses FP8 and lower precision, custom kernels, and speculative decoding to push tokens/s and keep TTFT under 100 ms at high concurrency and under 10 ms at low concurrency
- If you are dominated by long prompts with reuse, such as RAG over large contexts
  - TGI v3 is a strong default
  - Its prefix cache and chunking give up to 3× token capacity and a 13× latency reduction versus vLLM in published long prompt benchmarks, without extra configuration
- If you want an open, simple engine with strong baseline performance and an OpenAI style API
  - vLLM remains the standard baseline
  - PagedAttention and continuous batching make it 2–4× faster than older stacks at similar latency, and it integrates cleanly with Ray and K8s
- If you target open models such as InternLM or Qwen and value aggressive quantization with multi model serving
  - LMDeploy is a good fit
  - Blocked KV cache, persistent batching, and int8 or int4 KV quantization give up to 1.8× higher request throughput than vLLM on supported models, with a router layer included
In practice, many teams mix these systems, for example using TensorRT-LLM for high volume proprietary chat, TGI v3 for long context analytics, and vLLM or LMDeploy for experimental and open model workloads. The key is to align throughput, latency tails, and KV behavior with the actual token distributions in your traffic, then compute cost per million tokens from measured tokens/s on your own hardware.
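As a worked example of that last step, cost per million output tokens is just the GPU's hourly price divided by the tokens it actually delivers per hour; the GPU price and throughput below are assumptions, not benchmark results.

```python
# Cost per million output tokens from measured throughput, with made-up inputs.
# Replace gpu_cost_per_hour and tokens_per_second with your own cloud price
# and the throughput you measure on your real traffic mix.

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Assumed numbers: a $2.50/hour GPU sustaining 2,500 output tokens/s across all requests.
print(f"${cost_per_million_tokens(2.50, 2_500):.2f} per 1M output tokens")
# -> $0.28 per 1M output tokens under these assumptions
```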
References
- vLLM / PagedAttention
- TensorRT-LLM performance and overview
- HF Text Generation Inference (TGI v3) long-prompt behavior
- LMDeploy / TurboMind

