vllm.v1.attention.ops.triton_quant_kv.base ¶
Backend protocol for KV cache quantization modes.
A QuantKVBackend owns the cache write path (reshape_and_cache) and the attention read path (unified_attention) for one KVQuantMode. The core attention/reshape kernels stay quantization-agnostic; each mode that wants its own data layout, packing, or pre-rotation lives in a self-contained module under triton_quant_kv/ and registers an instance of this class on import.
QuantKVBackend ¶
Bases: ABC
Cache write + attention read for one KV quantization mode.
Subclasses implement reshape_and_cache and unified_attention for a single :class:KVQuantMode, and call :func:vllm.v1.attention.ops.triton_quant_kv.register at module import. The dispatcher in :mod:triton_unified_attention and :mod:triton_reshape_and_cache_flash looks up the backend lazily on first use, so unused modes pay zero import or compile cost.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
allocate_scale_caches ¶
allocate_scale_caches(
num_blocks: int,
block_size: int,
num_kv_heads: int,
device: device,
) -> tuple[Tensor | None, Tensor | None]
Allocate auxiliary per-(token, head) scale buffers.
Default: when needs_scale_caches is True, allocate one float32 per (block, slot, kv_head) for both K and V. This is the layout shared by every per-token-head mode: INT8 and FP8 store one absmax-derived scale; INT4 steganographically packs the zero-point into the low 4 mantissa bits of that scale; INT2 stores norm / d^1.5. Modes that need a different shape or dtype override this method; modes that don't need scale caches at all (needs_scale_caches = False) get (None, None).
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
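The default allocation amounts to two float32 buffers of shape (num_blocks, block_size, num_kv_heads). A runnable sketch, with NumPy standing in for torch so it executes without a GPU:

```python
import numpy as np


def allocate_scale_caches(num_blocks, block_size, num_kv_heads,
                          needs_scale_caches=True):
    # Default layout from the docstring above: one float32 scale per
    # (block, slot, kv_head), allocated for both K and V.
    if not needs_scale_caches:
        return None, None
    shape = (num_blocks, block_size, num_kv_heads)
    return (np.zeros(shape, dtype=np.float32),
            np.zeros(shape, dtype=np.float32))


k_scales, v_scales = allocate_scale_caches(num_blocks=4, block_size=16,
                                           num_kv_heads=8)
```

Modes needing a different shape or dtype would override this rather than post-process the default buffers.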
packed_head_size ¶
Storage head size after packing: head_size // packing_factor.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
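The packing arithmetic is simple integer division; the concrete packing factors below are illustrative assumptions, not values taken from the source.

```python
def packed_head_size(head_size: int, packing_factor: int) -> int:
    # packing_factor = how many logical values share one stored element,
    # e.g. a 4-bit mode packing two values per int8 element would use 2.
    return head_size // packing_factor


print(packed_head_size(128, 1))  # unpacked modes keep the full head size
print(packed_head_size(128, 2))  # a 2x-packed mode stores 64 elements
```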
reshape_and_cache abstractmethod ¶
reshape_and_cache(
key: Tensor,
value: Tensor,
key_cache: Tensor,
value_cache: Tensor,
slot_mapping: Tensor,
*,
k_scale_cache: Tensor | None = None,
v_scale_cache: Tensor | None = None,
) -> None
Write key/value into the paged cache for this mode.
Per-token-head modes also write into k_scale_cache / v_scale_cache.
Source code in vllm/v1/attention/ops/triton_quant_kv/base.py
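The paged write can be sketched as a pure-NumPy reference (the real implementation is a Triton kernel). The layouts assumed here, key/value as [num_tokens, num_kv_heads, head_size] and caches as [num_blocks, block_size, num_kv_heads, head_size], follow the common vLLM flash layout but are assumptions for this sketch.

```python
import numpy as np


def reshape_and_cache_ref(key, value, key_cache, value_cache,
                          slot_mapping, block_size):
    # Each token's flat slot decomposes into a block index and an
    # offset within that block; quantization is omitted in this sketch.
    for tok, slot in enumerate(slot_mapping):
        blk, off = divmod(int(slot), block_size)
        key_cache[blk, off] = key[tok]
        value_cache[blk, off] = value[tok]


# Two tokens landing in different blocks of a block_size-4 cache.
key = np.arange(2 * 1 * 8, dtype=np.float32).reshape(2, 1, 8)
value = -key
key_cache = np.zeros((3, 4, 1, 8), dtype=np.float32)
value_cache = np.zeros_like(key_cache)
reshape_and_cache_ref(key, value, key_cache, value_cache,
                      slot_mapping=[1, 6], block_size=4)
```

A per-token-head mode would quantize key/value here and write the derived scales into k_scale_cache / v_scale_cache at the same (block, slot, head) coordinates.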
unified_attention ¶
unified_attention(
q: Tensor,
k_cache: Tensor,
v_cache: Tensor,
out: Tensor,
*,
cu_seqlens_q: Tensor,
max_seqlen_q: int,
seqused_k: Tensor,
max_seqlen_k: int,
softmax_scale: float,
window_size: tuple[int, int],
block_table: Tensor,
softcap: float,
sinks: Tensor | None,
alibi_slopes: Tensor | None,
use_alibi_sqrt: bool,
qq_bias: Tensor | None,
output_scale: Tensor | None,
mm_prefix_range: Tensor | None,
k_scale_cache: Tensor | None = None,
v_scale_cache: Tensor | None = None,
seq_threshold_3D: int | None = None,
num_par_softmax_segments: int | None = None,
softmax_segm_output: Tensor | None = None,
softmax_segm_max: Tensor | None = None,
softmax_segm_expsum: Tensor | None = None,
) -> None
Run paged attention with this mode's KV layout, writing into out.
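The read path can be illustrated with a NumPy reference for a single query vector and one KV head: gather quantized K/V through the block table, dequantize with the per-(token, head) scales, then run standard softmax attention. The cache layouts ([num_blocks, block_size, head_size] values, [num_blocks, block_size] scales for one head) are assumptions of this sketch, not the kernel's actual layout.

```python
import numpy as np


def dequant_attention_ref(q, k_cache, v_cache, k_scale, v_scale,
                          block_table, seq_len, block_size, softmax_scale):
    # Gather and dequantize each cached position through the block table.
    ks, vs = [], []
    for pos in range(seq_len):
        blk = block_table[pos // block_size]
        off = pos % block_size
        ks.append(k_cache[blk, off].astype(np.float32) * k_scale[blk, off])
        vs.append(v_cache[blk, off].astype(np.float32) * v_scale[blk, off])
    k, v = np.stack(ks), np.stack(vs)
    # Numerically stable softmax attention over the dequantized keys.
    logits = (k @ q) * softmax_scale
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return p @ v


rng = np.random.default_rng(0)
head_size, block_size, seq_len = 8, 4, 6
kq = rng.integers(-8, 8, (2, block_size, head_size)).astype(np.int8)
vq = rng.integers(-8, 8, (2, block_size, head_size)).astype(np.int8)
scale = np.full((2, block_size), 0.5, dtype=np.float32)
q = rng.standard_normal(head_size).astype(np.float32)
out = dequant_attention_ref(q, kq, vq, scale, scale,
                            block_table=[0, 1], seq_len=seq_len,
                            block_size=block_size,
                            softmax_scale=head_size ** -0.5)
```

The real kernel fuses the gather, dequantization, and softmax into one pass over the paged cache rather than materializing dense K/V as this reference does.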