vllm.v1.attention.ops.triton_unified_attention ¶
_cast_kv_tile ¶
Cast a loaded KV tile to Q's dtype, dequantizing if needed.
Modes handled inside the core kernel:
- `KV_QUANT_MODE == 0` (NONE), `2` (INT8 per-token-head), and `3` (FP8 per-token-head): plain cast. Per-token-head modes apply their scales separately on S/P inside the loop.
- `KV_QUANT_MODE == 1` (FP8 per-tensor): dequantize using the tensor-wide scale.
Sub-byte packed modes (INT4 / INT2) are dispatched to their own backends in `vllm.v1.attention.ops.triton_quant_kv` and never reach this kernel.
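The dispatch above can be sketched outside the Triton kernel as plain NumPy. This is a hypothetical illustration of the mode handling described in the docstring, not the kernel's actual implementation; the function name, argument names, and use of NumPy are assumptions for clarity.

```python
import numpy as np

def cast_kv_tile(kv_tile, q_dtype, kv_quant_mode, kv_scale=None):
    """Sketch of the _cast_kv_tile mode dispatch (hypothetical helper).

    kv_quant_mode: 0 = NONE, 1 = FP8 per-tensor,
                   2 = INT8 per-token-head, 3 = FP8 per-token-head.
    """
    if kv_quant_mode == 1:
        # FP8 per-tensor: dequantize with the single tensor-wide scale.
        return kv_tile.astype(np.float32) * kv_scale
    # NONE and per-token-head modes: plain cast to Q's dtype.
    # Per-token-head scales are applied later, on S/P inside the
    # attention loop, so no scaling happens here.
    return kv_tile.astype(q_dtype)
```

Keeping the per-token-head scaling out of the cast lets the kernel fold those scales into the softmax inputs instead of rescaling every KV element.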
Source code in vllm/v1/attention/ops/triton_unified_attention.py
_get_tile_size ¶
Select tile size with Gemma3-specific optimization.
Source code in vllm/v1/attention/ops/triton_unified_attention.py
_is_gemma3_attention ¶
Detect Gemma3 models via unique (head_size, sliding_window) signature.
Gemma3 models are the only ones using sliding_window=1024 with head_size 128 (27B) or 256 (1B, 4B, 12B). Other SWA models use different window sizes (Mistral=4096, Phi-3=2047).
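The detection heuristic above reduces to a small predicate. A minimal sketch, assuming the function takes the head size and sliding-window length directly (the real helper's signature may differ):

```python
def is_gemma3_attention(head_size: int, sliding_window: int) -> bool:
    """Heuristic from the docstring: Gemma3 is the only model family
    combining sliding_window=1024 with head_size 128 (27B) or
    256 (1B, 4B, 12B). Other SWA models use different windows
    (e.g. Mistral=4096, Phi-3=2047), so this pair is a unique signature."""
    return sliding_window == 1024 and head_size in (128, 256)
```

Because the signature is unique among supported models, this check is safe to use for model-specific tile-size tuning without inspecting the model config.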