Production LLM serving is now a systems problem, not a generate() loop. For real workloads, the choice of inference stack drives your tokens per second, tail latency, and ultimately cost per million tokens on a given GPU fleet.
This comparison focuses on four widely used stacks:
- vLLM
- NVIDIA TensorRT-LLM
- Hugging Face Text Generation Inference (TGI v3)
- LMDeploy

1. vLLM, PagedAttention as the open baseline
Core idea
vLLM is built around PagedAttention, an attention implementation that treats the KV cache like paged virtual memory rather than a single contiguous buffer per sequence.
Instead of allocating one big KV region per request, vLLM:
- Divides KV cache into fixed size blocks
- Maintains a block table that maps logical tokens to physical blocks
- Shares blocks between sequences wherever prefixes overlap
This reduces external fragmentation and lets the scheduler pack many more concurrent sequences into the same VRAM.
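To make the block-table idea concrete, here is a minimal, hypothetical sketch of paged KV bookkeeping; the class names, block size constant, and allocation policy are invented for illustration and are not vLLM internals.

```python
# Hypothetical illustration of PagedAttention-style KV block management.
# This is NOT vLLM's internal API; names and structure are invented for clarity.

BLOCK_SIZE = 16  # tokens per KV block (illustrative; the real block size is configurable)

class KVBlockAllocator:
    """Hands out fixed-size physical block ids and tracks free blocks."""
    def __init__(self, num_blocks: int):
        self.free_blocks = list(range(num_blocks))

    def allocate(self) -> int:
        if not self.free_blocks:
            raise MemoryError("KV cache exhausted; request must wait or be preempted")
        return self.free_blocks.pop()

    def release(self, block_id: int) -> None:
        self.free_blocks.append(block_id)

class Sequence:
    """Per-request block table: logical block index -> physical block id."""
    def __init__(self):
        self.block_table: list[int] = []
        self.num_tokens = 0

    def append_token(self, allocator: KVBlockAllocator) -> None:
        # A new physical block is only needed when the last block is full, so
        # KV memory grows in BLOCK_SIZE-token steps instead of being reserved
        # up front for the maximum sequence length.
        if self.num_tokens % BLOCK_SIZE == 0:
            self.block_table.append(allocator.allocate())
        self.num_tokens += 1

allocator = KVBlockAllocator(num_blocks=1024)
seq = Sequence()
for _ in range(40):            # 40 tokens -> ceil(40 / 16) = 3 blocks
    seq.append_token(allocator)
print(seq.block_table)         # three physical block ids, not one contiguous region
```

Prefix sharing falls out of the same structure: two sequences with a common prompt prefix can point their leading block-table entries at the same physical blocks.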
Throughput and latency
The vLLM paper reports 2–4× higher throughput than systems like FasterTransformer and Orca at similar latency, with larger gains for longer sequences.
Key properties for operators:
- Continuous batching (also called inflight batching) merges incoming requests into existing GPU batches instead of waiting for fixed batch windows.
- On typical chat workloads, throughput scales close to linearly with concurrency until KV memory or compute saturates.
- P50 latency remains low for moderate concurrency, but P99 can degrade once queues are long or KV memory is tight, especially for prefill heavy queries.
vLLM exposes an OpenAI compatible HTTP API and integrates well with Ray Serve and other orchestrators, which is why it is widely used as an open baseline.
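As a concrete example, a locally launched vLLM server can be queried with the standard OpenAI Python client; the base URL, port, and model name below are assumptions for a local deployment, not fixed values.

```python
# Querying a local vLLM server through its OpenAI-compatible endpoint.
# Assumes the server was started with something like:
#   vllm serve meta-llama/Llama-3.1-8B-Instruct --port 8000
# The model name and port are placeholders for your own deployment.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")  # vLLM ignores the key

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "Summarize PagedAttention in one sentence."}],
    max_tokens=128,
    temperature=0.2,
)
print(response.choices[0].message.content)
```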
KV and multi tenant
- PagedAttention gives near zero KV waste and flexible prefix sharing within and across requests.
- Each vLLM process serves one model; multi tenant and multi model setups are usually built with an external router or API gateway that fans out to multiple vLLM instances (a minimal routing sketch follows)
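A hypothetical fan-out router could look like the sketch below; the endpoint map, model names, and hand-rolled forwarding are illustrative only, since real deployments typically use an API gateway, Ray Serve, or a Kubernetes ingress.

```python
# Hypothetical gateway that fans requests out to per-model vLLM instances.
# Endpoints and model names are placeholders, not a prescribed architecture.
import requests

MODEL_ENDPOINTS = {
    "llama-8b": "http://vllm-llama:8000/v1/chat/completions",
    "qwen-14b": "http://vllm-qwen:8000/v1/chat/completions",
}

def route_chat(model: str, messages: list[dict], **params) -> dict:
    """Pick the vLLM instance serving `model` and forward the OpenAI-style payload."""
    url = MODEL_ENDPOINTS[model]            # raises KeyError for unknown models
    payload = {"model": model, "messages": messages, **params}
    resp = requests.post(url, json=payload, timeout=120)
    resp.raise_for_status()
    return resp.json()

# reply = route_chat("llama-8b", [{"role": "user", "content": "hello"}], max_tokens=64)
```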
2. TensorRT-LLM, hardware maximum on NVIDIA GPUs
Core idea
TensorRT-LLM is NVIDIA’s optimized inference library for its GPUs. The library provides custom attention kernels, inflight batching, paged KV caching, quantization down to FP4 and INT4, and speculative decoding.
It is tightly coupled to NVIDIA hardware, including FP8 tensor cores on Hopper and Blackwell.
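Recent TensorRT-LLM releases expose a high-level Python LLM API; the sketch below assumes that interface, uses a placeholder Hugging Face model id, and leaves out engine-build and quantization options, whose exact names vary by version, so treat it as an outline rather than a drop-in script.

```python
# Sketch of offline generation with TensorRT-LLM's high-level LLM API.
# Assumes a recent TensorRT-LLM release; the model id is a placeholder and
# quantization / engine-build options differ by version, so check the docs
# for your installed release before relying on exact argument names.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # builds or loads a TensorRT engine

sampling = SamplingParams(max_tokens=128, temperature=0.2)

outputs = llm.generate(
    ["Explain inflight batching in two sentences."],
    sampling,
)
for out in outputs:
    print(out.outputs[0].text)
```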
Measured performance
NVIDIA’s H100 vs A100 evaluation is the most concrete public reference:
- On H100 with FP8, TensorRT-LLM reaches over 10,000 output tokens/s at peak throughput for 64 concurrent requests, with ~100 ms time to first token.
- H100 FP8 achieves up to 4.6× higher max throughput and 4.4× faster first token latency than A100 on the same models.
For latency sensitive modes:
- TensorRT-LLM on H100 can drive TTFT below 10 ms in batch 1 configurations, at the cost of lower overall throughput.
These numbers are model and shape specific, but they give a realistic scale.
Prefill vs decode
TensorRT-LLM optimizes both phases:
- Prefill benefits from high throughput FP8 attention kernels and tensor parallelism
- Decode benefits from CUDA graphs, speculative decoding, quantized weights and KV, and kernel fusion
The result is very high tokens/s across a wide range of input and output lengths, especially when the engine is tuned for that model and batch profile.
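A useful mental model is that end-to-end latency splits into a prefill term (roughly input tokens divided by prefill throughput) plus a decode term (output tokens divided by per-sequence decode rate); the sketch below uses made-up throughput numbers purely to show the arithmetic.

```python
# Back-of-envelope latency model for a single request, using invented numbers.
# Real prefill and decode rates depend on model, GPU, batch size, and kernels,
# so measure them on your own hardware before drawing conclusions.

def estimate_latency(input_tokens: int,
                     output_tokens: int,
                     prefill_tok_per_s: float = 20_000.0,   # assumed prefill rate
                     decode_tok_per_s: float = 60.0):       # assumed per-sequence decode rate
    ttft = input_tokens / prefill_tok_per_s                 # time to first token ~ prefill
    decode = output_tokens / decode_tok_per_s               # streaming generation time
    return ttft, ttft + decode

ttft, total = estimate_latency(input_tokens=4_000, output_tokens=300)
print(f"TTFT ~ {ttft*1000:.0f} ms, total ~ {total:.1f} s")
# -> TTFT ~ 200 ms, total ~ 5.2 s with these assumed rates
```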
KV and multi tenant
TensorRT-LLM provides:
- Paged KV cache with configurable layout
- Support for long sequences, KV reuse and offloading
- Inflight batching and priority aware scheduling primitives
NVIDIA pairs this with Ray based or Triton based orchestration patterns for multi tenant clusters. Multi model support is done at the orchestrator level, not inside a single TensorRT-LLM engine instance.
3. Hugging Face TGI v3, long prompt specialist and multi backend gateway
Core idea
Text Generation Inference (TGI) is a Rust and Python based serving stack that adds:
- HTTP and gRPC APIs
- Continuous batching scheduler
- Observability and autoscaling hooks
- Pluggable backends, including vLLM style engines, TensorRT-LLM, and other runtimes
Version 3 focuses on long prompt processing through chunking and prefix caching.
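For example, a running TGI server can be queried over its HTTP `/generate` endpoint; the host, port, and generation parameters below are assumptions for a local test deployment.

```python
# Minimal client call against a TGI server's /generate endpoint.
# Assumes TGI is already running locally (for example via the official Docker
# image) and listening on port 8080; adjust the URL and parameters for your setup.
import requests

resp = requests.post(
    "http://localhost:8080/generate",
    json={
        "inputs": "What is prefix caching in one sentence?",
        "parameters": {"max_new_tokens": 64, "temperature": 0.2},
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["generated_text"])
```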
Long prompt benchmark vs vLLM
The TGI v3 docs give a clear benchmark:
- On long prompts with more than 200,000 tokens, a conversation reply that takes 27.5 s in vLLM can be served in about 2 s in TGI v3.
- This is reported as a 13× speedup on that workload.
- TGI v3 is able to process about 3× more tokens in the same GPU memory by reducing its memory footprint and exploiting chunking and caching.
The mechanism is:
- TGI keeps the original conversation context in a prefix cache, so subsequent turns only pay for incremental tokens
- Cache lookup overhead is on the order of microseconds, negligible relative to prefill compute
This is a targeted optimization for workloads where prompts are extremely long and reused across turns, for example RAG pipelines and analytic summarization.
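Conceptually, the cache behaves like a map from a prompt-prefix key to already computed KV state, so only the new suffix has to be prefetched and prefilled; the sketch below is a toy illustration of that idea, not TGI's actual prefix data structure, which operates on tokenized KV blocks rather than strings.

```python
# Toy illustration of prefix caching: reuse KV state for a previously seen
# prompt prefix and only "prefill" the new suffix. This is a conceptual model,
# not TGI's real implementation.
import hashlib

class PrefixCache:
    def __init__(self):
        self._store: dict[str, object] = {}   # prefix hash -> cached KV state

    @staticmethod
    def _key(prefix: str) -> str:
        return hashlib.sha256(prefix.encode()).hexdigest()

    def lookup(self, prompt: str, known_prefix: str):
        """Return (cached_kv, remaining_text_to_prefill)."""
        key = self._key(known_prefix)
        if prompt.startswith(known_prefix) and key in self._store:
            return self._store[key], prompt[len(known_prefix):]
        return None, prompt                    # cache miss: prefill everything

    def store(self, prefix: str, kv_state) -> None:
        self._store[self._key(prefix)] = kv_state

cache = PrefixCache()
conversation = "SYSTEM ... very long shared context ..."
cache.store(conversation, kv_state="<KV blocks for the long context>")

kv, to_prefill = cache.lookup(conversation + "\nUSER: next question", conversation)
print(to_prefill)   # only the new turn is paid for at prefill time
```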
Architecture and latency behavior
Key components:
- Chunking, very long prompts are split into manageable segments for KV and scheduling
- Prefix caching, data structure to share long context across turns
- Continuous batching, incoming requests join batches of already running sequences
- PagedAttention and fused kernels in the GPU backends
For short chat style workloads, throughput and latency are in the same ballpark as vLLM. For long, cacheable contexts, both P50 and P99 latency improve by an order of magnitude because the engine avoids repeated prefill.
Multi backend and multi model
TGI is designed as a router plus model server architecture. It can:
- Route requests across many models and replicas
- Target different backends, for example TensorRT-LLM on H100 plus CPU or smaller GPUs for low priority traffic
This makes it suitable as a central serving tier in multi tenant environments.
4. LMDeploy, TurboMind with blocked KV and aggressive quantization
Core idea
LMDeploy from the InternLM ecosystem is a toolkit for compressing and serving LLMs, centered around the TurboMind engine. It focuses on:
- High throughput request serving
- Blocked KV cache
- Persistent batching (continuous batching)
- Quantization of weights and KV cache
Relative throughput vs vLLM
The project states:
- "LMDeploy delivers up to 1.8× higher request throughput than vLLM", supported by persistent batching, blocked KV cache, dynamic split and fuse, tensor parallelism, and optimized CUDA kernels.
KV, quantization and latency
LMDeploy includes:
- Blocked KV cache, similar to paged KV, which helps pack many sequences into VRAM
- Support for KV cache quantization, typically int8 or int4, to cut KV memory and bandwidth
- Weight only quantization paths such as 4 bit AWQ
- A benchmarking harness that reports token throughput, request throughput, and first token latency
This makes LMDeploy attractive when you want to run larger open models like InternLM or Qwen on mid range GPUs with aggressive compression while still maintaining good tokens/s.
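As an illustration, LMDeploy's Python pipeline can be configured with a quantized KV cache through the TurboMind backend config; the model id below is a placeholder, and option names such as `quant_policy` and `cache_max_entry_count` follow recent LMDeploy docs but should be checked against your installed version.

```python
# Sketch of serving an open model with LMDeploy's TurboMind backend and an
# int8-quantized KV cache. The model id is a placeholder; option names may
# differ between versions, so verify against your installed release.
from lmdeploy import pipeline, TurbomindEngineConfig

engine_cfg = TurbomindEngineConfig(
    quant_policy=8,              # 8 -> int8 KV cache, 4 -> int4 KV cache, 0 -> disabled
    cache_max_entry_count=0.8,   # fraction of free GPU memory reserved for KV blocks
    tp=1,                        # tensor parallel degree
)

pipe = pipeline("internlm/internlm2_5-7b-chat", backend_config=engine_cfg)

responses = pipe(["Give one advantage of a blocked KV cache."])
print(responses[0].text)
```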
Multi model deployments
LMDeploy provides a proxy server able to handle:
- Multi model deployments
- Multi machine, multi GPU setups
- Routing logic to select models based on request metadata
So architecturally it sits closer to TGI than to a single engine.
What to use when?
- If you want maximum throughput and very low TTFT on NVIDIA GPUs
  - TensorRT-LLM is the primary choice
  - It uses FP8 and lower precision, custom kernels, and speculative decoding to push tokens/s and keep TTFT under 100 ms at high concurrency and under 10 ms at low concurrency
- If you are dominated by long prompts with reuse, such as RAG over large contexts
  - TGI v3 is a strong default
  - Its prefix cache and chunking give up to 3× token capacity and a 13× latency reduction versus vLLM in published long prompt benchmarks, without extra configuration
- If you want an open, simple engine with strong baseline performance and an OpenAI style API
  - vLLM remains the standard baseline
  - PagedAttention and continuous batching make it 2–4× faster than older stacks at similar latency, and it integrates cleanly with Ray and K8s
- If you target open models such as InternLM or Qwen and value aggressive quantization with multi model serving
  - LMDeploy is a good fit
  - Blocked KV cache, persistent batching, and int8 or int4 KV quantization give up to 1.8× higher request throughput than vLLM on supported models, with a router layer included
In practice, many teams mix these systems, for example using TensorRT-LLM for high volume proprietary chat, TGI v3 for long context analytics, and vLLM or LMDeploy for experimental and open model workloads. The key is to align throughput, latency tails, and KV behavior with the actual token distributions in your traffic, then compute cost per million tokens from measured tokens/s on your own hardware.
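As a worked example of that last step, cost per million output tokens is just the GPU's hourly price divided by the tokens it actually delivers per hour; the GPU price and throughput below are assumptions, not benchmark results.

```python
# Cost per million output tokens from measured throughput, with made-up inputs.
# Replace gpu_cost_per_hour and tokens_per_second with your own cloud price
# and the throughput you measure on your real traffic mix.

def cost_per_million_tokens(gpu_cost_per_hour: float, tokens_per_second: float) -> float:
    tokens_per_hour = tokens_per_second * 3600
    return gpu_cost_per_hour / tokens_per_hour * 1_000_000

# Assumed numbers: a $2.50/hour GPU sustaining 2,500 output tokens/s across all requests.
print(f"${cost_per_million_tokens(2.50, 2_500):.2f} per 1M output tokens")
# -> $0.28 per 1M output tokens under these assumptions
```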
References
- vLLM / PagedAttention
- TensorRT-LLM performance and overview
- HF Text Generation Inference (TGI v3) long-prompt behavior
- LMDeploy / TurboMind

