vllm.v1.attention.ops.triton_quant_kv ¶
Per-mode KV cache quantization backends.
The core attention kernel (`vllm.v1.attention.ops.triton_unified_attention`) handles modes NONE, FP8_PER_TENSOR, INT8_PER_TOKEN_HEAD, and FP8_PER_TOKEN_HEAD directly via constexpr branches. Backends registered here own:
- the write side for any mode that needs more than a plain copy (per-token-head absmax, asymmetric INT4 with zero-point packing, INT2 Lloyd-Max + Hadamard, …); and
- the attention read side for sub-byte packed modes (INT4 / INT2) whose inner loop is structurally different from the core kernel (split-dot, centroid lookup, etc.).
Adding a new quantization mode¶
- Add a new value to `KVQuantMode` in `vllm/v1/kv_cache_interface.py`.
- Add a new entry to `_MODULES` below mapping the mode to a module path.
- Create a new file under `quant_kv/` that defines a subclass of `QuantKVBackend` and calls `register` at module level. If the mode can use the core attention kernel, override only `reshape_and_cache` / `allocate_scale_caches`; otherwise also override `unified_attention`.
Modules:
| Name | Description |
|---|---|
base | Backend protocol for KV cache quantization modes. |
int8_fp8_per_token_head | INT8 and FP8 per-token-head KV cache quantization backends. |
packed_per_token_head | Sub-byte per-token-head KV cache quantization backends (INT4 + INT2). |
QuantKVBackend ¶
Bases: ABC
Cache write + attention read for one KV quantization mode.
Subclasses implement `reshape_and_cache` and `unified_attention` for a single `KVQuantMode`, and call `vllm.v1.attention.ops.triton_quant_kv.register` at module import. The dispatcher in `triton_unified_attention` and `triton_reshape_and_cache_flash` looks up the backend lazily on first use, so unused modes pay zero import or compile cost.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
allocate_scale_caches ¶
allocate_scale_caches(
num_blocks: int,
block_size: int,
num_kv_heads: int,
device: device,
) -> tuple[Tensor | None, Tensor | None]
Allocate aux per-(token, head) scale buffers.
Default: when `needs_scale_caches` is True, allocate one float32 per (block, slot, kv_head) for both K and V, the layout shared by every per-token-head mode. INT8 / FP8 store one absmax-derived scale there; INT4 packs the zero-point steganographically into the low 4 mantissa bits of that scale; INT2 stores norm / d^1.5. Modes that need a different shape or dtype override this method. Modes that don't need scale caches at all (`needs_scale_caches = False`) get (None, None).
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
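A small sketch of the default layout described above, shapes only: one float32 scale per (block, slot, kv_head) for each of K and V. The (block, slot, head) axis ordering is an assumption read off the docstring, and `scale_cache_shapes` is an illustrative helper, not the real method (which allocates tensors on `device`).

```python
# Illustrative helper: compute the default per-token-head scale-cache shapes.
# Axis order (num_blocks, block_size, num_kv_heads) is assumed from the
# "one float32 per (block, slot, kv_head)" description above.
def scale_cache_shapes(
    num_blocks: int, block_size: int, num_kv_heads: int
) -> tuple[tuple[int, int, int], tuple[int, int, int]]:
    shape = (num_blocks, block_size, num_kv_heads)
    return shape, shape  # (k_scale_cache shape, v_scale_cache shape)


k_shape, v_shape = scale_cache_shapes(num_blocks=1024, block_size=16, num_kv_heads=8)
print(k_shape)  # (1024, 16, 8)
```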
packed_head_size ¶
Storage head size after packing: `head_size // packing_factor`.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
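A quick worked example of the formula above, assuming the natural packing factors for the sub-byte modes (two 4-bit values or four 2-bit values per stored byte):

```python
# head_size // packing_factor: how many bytes each head occupies in storage.
def packed_head_size(head_size: int, packing_factor: int) -> int:
    return head_size // packing_factor


print(packed_head_size(128, 2))  # 64, e.g. INT4: two 4-bit values per byte
print(packed_head_size(128, 4))  # 32, e.g. INT2: four 2-bit values per byte
```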
reshape_and_cache abstractmethod ¶
reshape_and_cache(
key: Tensor,
value: Tensor,
key_cache: Tensor,
value_cache: Tensor,
slot_mapping: Tensor,
*,
k_scale_cache: Tensor | None = None,
v_scale_cache: Tensor | None = None,
) -> None
Write key/value into the paged cache for this mode.
Per-token-head modes also write into k_scale_cache / v_scale_cache.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
unified_attention ¶
unified_attention(
q: Tensor,
k_cache: Tensor,
v_cache: Tensor,
out: Tensor,
*,
cu_seqlens_q: Tensor,
max_seqlen_q: int,
seqused_k: Tensor,
max_seqlen_k: int,
softmax_scale: float,
window_size: tuple[int, int],
block_table: Tensor,
softcap: float,
sinks: Tensor | None,
alibi_slopes: Tensor | None,
use_alibi_sqrt: bool,
qq_bias: Tensor | None,
output_scale: Tensor | None,
mm_prefix_range: Tensor | None,
k_scale_cache: Tensor | None = None,
v_scale_cache: Tensor | None = None,
seq_threshold_3D: int | None = None,
num_par_softmax_segments: int | None = None,
softmax_segm_output: Tensor | None = None,
softmax_segm_max: Tensor | None = None,
softmax_segm_expsum: Tensor | None = None,
) -> None
Run paged attention with this mode's KV layout, writing into out.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
get_backend ¶
get_backend(mode: KVQuantMode) -> QuantKVBackend
Lazy-import and return the backend for mode.
Raises `ValueError` if no backend module is configured for `mode`.
Source code in vllm/v1/attention/ops/triton_quant_kv/__init__.py
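The lazy-dispatch behavior described above can be sketched as follows. This is a self-contained analogue, not the real `__init__.py`: the actual `get_backend` imports a module path from the `_MODULES` table via `importlib`, whereas here plain factory callables stand in for the module imports so the sketch runs anywhere.

```python
# Sketch of lazy backend lookup with caching: the backend for a mode is
# constructed on first use, so unused modes cost no import or compile time.
from enum import Enum, auto


class KVQuantMode(Enum):  # stand-in for the real enum
    NONE = auto()
    INT4_PER_TOKEN_HEAD = auto()


def _make_int4_backend() -> object:
    return object()  # stand-in for importing and instantiating the backend


_FACTORIES = {KVQuantMode.INT4_PER_TOKEN_HEAD: _make_int4_backend}
_CACHE: dict = {}


def get_backend(mode: KVQuantMode) -> object:
    if mode not in _FACTORIES:
        raise ValueError(f"no quant-KV backend configured for {mode}")
    if mode not in _CACHE:  # first call pays the construction cost
        _CACHE[mode] = _FACTORIES[mode]()
    return _CACHE[mode]


backend = get_backend(KVQuantMode.INT4_PER_TOKEN_HEAD)
assert get_backend(KVQuantMode.INT4_PER_TOKEN_HEAD) is backend  # cached
```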
has_backend ¶
has_backend(mode: KVQuantMode) -> bool