vllm.v1.attention.ops.triton_reshape_and_cache_flash ¶
Core paged-cache reshape kernels.
This module owns the canonical (mode NONE / FP8 per-tensor) reshape kernels and the diff-kv variant. All per-token-head and packed-int modes (INT8 / FP8 / INT4 / INT2) live in dedicated backend modules under :mod:`vllm.v1.attention.ops.triton_quant_kv`.
For backwards compatibility this module still exposes `triton_reshape_and_cache_flash_per_token_head_quant`, `fast_hadamard_transform`, and `_single_rht` as thin re-exports / dispatchers, so existing tests and benchmarks keep working.
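The re-export / dispatcher pattern described above can be sketched in miniature. This is an illustrative stand-in, not vLLM's actual code: the backend functions and the `legacy_entry_point` name are hypothetical, standing in for the per-mode modules under `vllm.v1.attention.ops.triton_quant_kv` and the legacy public name kept here.

```python
# Illustrative sketch of a thin re-export / dispatcher module: the old
# public name survives, but the work is forwarded to per-mode backends.
# All names below are hypothetical stand-ins, not vLLM's real API.

def _int8_backend(payload):
    # Stand-in for a dedicated INT8 backend module.
    return ("int8", payload)

def _fp8_backend(payload):
    # Stand-in for a dedicated FP8 backend module.
    return ("fp8", payload)

_BACKENDS = {"int8": _int8_backend, "fp8": _fp8_backend}

def legacy_entry_point(payload, mode="fp8"):
    """Old public name, kept so existing callers keep working; dispatches
    to the backend selected by `mode`."""
    return _BACKENDS[mode](payload)
```

Existing imports of the legacy name continue to work unchanged, while the actual kernel code lives in the backend modules.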
triton_reshape_and_cache_flash_per_token_head_quant ¶
triton_reshape_and_cache_flash_per_token_head_quant(
    key: Tensor,
    value: Tensor,
    key_cache: Tensor,
    value_cache: Tensor,
    k_scale_cache: Tensor,
    v_scale_cache: Tensor,
    slot_mapping: Tensor,
    kv_quant_mode: KVQuantMode | None = None,
)
Quantize key/value per (token, head) and write to the paged cache.
Dispatches to the appropriate backend in :mod:`vllm.v1.attention.ops.triton_quant_kv`. When `kv_quant_mode` is None (legacy callers), the mode is inferred from the cache dtype. This inference is deprecated and cannot distinguish INT4 from INT2, since both are stored as `torch.uint8`; pass `kv_quant_mode` explicitly.
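The INT4/INT2 ambiguity can be illustrated with a small dtype-inference sketch. This is a hypothetical, torch-free model of the inference step, not vLLM's implementation: the `infer_mode_from_dtype` helper and the dtype-name strings are assumptions made for the example.

```python
from enum import Enum, auto

class KVQuantMode(Enum):
    # Mirrors the modes named in this page; the enum itself is a stand-in.
    NONE = auto()
    FP8 = auto()
    INT8 = auto()
    INT4 = auto()
    INT2 = auto()

def infer_mode_from_dtype(dtype_name: str) -> KVQuantMode:
    """Hypothetical legacy inference from the cache's dtype name.

    Demonstrates the limitation described above: INT4 and INT2 caches are
    both packed into uint8 storage, so dtype alone cannot tell them apart.
    """
    if dtype_name in ("float16", "bfloat16"):
        return KVQuantMode.NONE
    if dtype_name == "float8_e4m3fn":
        return KVQuantMode.FP8
    if dtype_name == "int8":
        return KVQuantMode.INT8
    if dtype_name == "uint8":
        # Ambiguous packed storage: could be INT4 or INT2.
        raise ValueError(
            "uint8 cache is ambiguous (INT4 vs INT2); "
            "pass kv_quant_mode explicitly"
        )
    raise ValueError(f"unsupported cache dtype: {dtype_name}")
```

This is why the docstring deprecates dtype-based inference: every unambiguous branch still works, but a packed `uint8` cache forces the caller to state the mode.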