vllm.v1.attention.ops.triton_attention_helpers ¶
Shared @triton.jit helpers used by the unified attention kernel, reduce_segments, and the sub-byte packed KV backends.
These are plain attention-loop helpers — mask building, ALiBi / QQ-bias score post-processing, online-softmax bookkeeping, tile-loop bounds, sequence lookup — and have nothing to do with KV cache quantization. Keeping them in one place ensures a fix in one (e.g. a sliding-window edge case) lands in every kernel that uses them.
apply_alibi_to_score ¶
apply_alibi_to_score(
S,
alibi_slope,
seq_offset,
context_len,
query_pos,
USE_ALIBI_SQRT: constexpr,
)
Add the ALiBi positional bias (linear or sqrt variant) to S in-place.
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
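A host-side NumPy sketch of the bias math, assuming the standard ALiBi convention (bias = slope × signed key-query distance); the kernel's exact sign and distance convention may differ:

```python
import numpy as np

def apply_alibi_to_score(S, alibi_slope, seq_offset, context_len, query_pos,
                         use_alibi_sqrt=False):
    # Key positions relative to the absolute query position (<= 0 for causal keys).
    dist = seq_offset - (context_len + query_pos)
    if use_alibi_sqrt:
        # sqrt variant: bias magnitude grows with the square root of the distance.
        bias = alibi_slope * np.sign(dist) * np.sqrt(np.abs(dist))
    else:
        # linear variant: classic ALiBi.
        bias = alibi_slope * dist
    return S + bias
```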
apply_softcap ¶
Softcap (aka tanh-style clamp) used to bound attention scores.
Computes x * tanh(S / x), rewritten to avoid a direct tanh call.
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
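A minimal NumPy sketch of the identity, using exp in place of tanh to show the equivalence (the kernel's actual rewrite and its numerical-stability details may differ):

```python
import numpy as np

def apply_softcap(S, x):
    # x * tanh(S / x) with tanh expanded via exp:
    # tanh(y) = (e^{2y} - 1) / (e^{2y} + 1).
    # A production kernel would guard against overflow for large |S|.
    t = np.exp(2.0 * S / x)
    return x * (t - 1.0) / (t + 1.0)
```

As S grows, the result saturates at ±x, which is the "clamp" behavior the docstring describes.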
cdiv_fn ¶
Integer ceiling division, used for tile and block counts.
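The usual integer idiom, sketched in plain Python (the Triton version is the same arithmetic on tl values):

```python
def cdiv_fn(a, b):
    # Ceiling division without floats: number of size-b tiles covering a elements.
    return (a + b - 1) // b
```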
compute_kv_seq_mask ¶
compute_kv_seq_mask(
query_abs_pos,
seq_offset,
seq_idx,
mm_prefix_range_ptr,
SLIDING_WINDOW: constexpr,
USE_MM_PREFIX: constexpr,
MAX_MM_RANGES: constexpr,
)
Build the KV mask for one tile.
Causal (key <= query) by default; AND-ed with the sliding window when enabled; OR-ed with the bidirectional ranges from mm_prefix_range when PrefixLM / multimodal attention is active. The order matches FlexAttention: (causal AND sliding_window) OR mm_prefix.
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
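A NumPy sketch of the same composition on host-side arrays; `mm_ranges` is a hypothetical list of (start, end) pairs standing in for the ranges behind mm_prefix_range_ptr:

```python
import numpy as np

def compute_kv_seq_mask(query_abs_pos, seq_offset, sliding_window=0, mm_ranges=()):
    # Causal by default: key position <= query position.
    mask = seq_offset[None, :] <= query_abs_pos[:, None]
    if sliding_window > 0:
        # AND with the window: key must lie within the last `sliding_window` positions.
        in_window = seq_offset[None, :] > query_abs_pos[:, None] - sliding_window
        mask = mask & in_window
    for start, end in mm_ranges:
        # OR with each bidirectional range: (causal AND sliding_window) OR mm_prefix.
        both_in = ((query_abs_pos[:, None] >= start) & (query_abs_pos[:, None] < end) &
                   (seq_offset[None, :] >= start) & (seq_offset[None, :] < end))
        mask = mask | both_in
    return mask
```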
compute_tile_loop_bounds ¶
compute_tile_loop_bounds(
context_len,
seq_len,
cur_batch_query_len,
q_block_local_idx,
segm_idx_or_0,
tiles_per_segment_or_0,
TILE_SIZE: constexpr,
BLOCK_M: constexpr,
BLOCK_Q: constexpr,
num_queries_per_kv: constexpr,
SLIDING_WINDOW: constexpr,
USE_MM_PREFIX: constexpr,
IS_3D: constexpr,
)
Compute the tile-loop bounds (loop_lo, loop_hi) and the derived max_seq_prefix_len used for per-tile masking.
Combines three concerns into one helper:
- Longest prefix spanned by any query token in this q-block: clamped to seq_len (causal) or extended to it when mm_prefix is active (bidirectional ranges can reach past the causal prefix).
- Sliding-window pruning: narrows [tile_start, tile_end) to only tiles that can contain an allowed key under SWA.
- 3D scoping: when IS_3D is True, further narrows to the segment's slice via (segm_idx * tiles_per_segment, (segm_idx + 1) * tiles_per_segment).
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
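A simplified host-side sketch of the 2D case only (no segments, no mm_prefix); argument names beyond the signature above are illustrative, with `q_block_start`/`q_block_end` the query positions covered by this q-block:

```python
def compute_tile_loop_bounds(context_len, q_block_start, q_block_end,
                             TILE_SIZE, sliding_window=0):
    # Longest causal prefix any query row in this q-block can attend to:
    # the last row (position q_block_end - 1) sees keys [0, context_len + q_block_end).
    max_seq_prefix_len = context_len + q_block_end
    loop_hi = (max_seq_prefix_len + TILE_SIZE - 1) // TILE_SIZE  # cdiv
    loop_lo = 0
    if sliding_window > 0:
        # Sliding-window pruning: first key the earliest query row may still see.
        first_allowed = max(0, context_len + q_block_start - sliding_window + 1)
        loop_lo = first_allowed // TILE_SIZE
    return loop_lo, loop_hi
```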
find_seq_idx ¶
find_seq_idx(
query_start_len_ptr,
target_idx,
num_seqs,
BLOCK_Q: constexpr,
use_q_block_mode: constexpr,
)
Binary search over the cumulative query-length prefix.
When use_q_block_mode is True, the prefix values are reshaped into units of BLOCK_Q plus one entry per boundary — matching the q-block grid laid out by the attention kernels. When False we search the plain cumulative-length prefix (used by reduce_segments which iterates over raw query tokens).
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
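The plain (non-q-block) mode can be sketched as a standard lower-bound binary search over the cumulative prefix; names are illustrative:

```python
def find_seq_idx(query_start_loc, target_idx):
    # query_start_loc is the cumulative query-length prefix of length num_seqs + 1;
    # return the sequence whose [start, end) token range contains target_idx.
    lo, hi = 0, len(query_start_loc) - 2
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if query_start_loc[mid] <= target_idx:
            lo = mid      # target is at or past this sequence's start
        else:
            hi = mid - 1  # target is before this sequence's start
    return lo
```

Note how zero-length sequences (repeated prefix values) are skipped: the search lands on the last sequence whose start is <= target_idx.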
init_softmax_M ¶
init_softmax_M(
sink_ptr,
query_offset_1,
query_mask_1,
segm_idx_or_0,
BLOCK_M: constexpr,
USE_SINKS: constexpr,
IS_3D: constexpr,
)
Initial row-max M for the online softmax.
Without sinks: -inf. With sinks: load the per-head sink bias once. In 3D mode only segment 0 loads — reduce_segments adds the sink contribution exactly once across segments, so other segments must start from -inf.
segm_idx_or_0 is the 3D segment index or 0 for 2D (caller passes 0 when IS_3D is False).
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
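The sink trick can be seen as starting the online softmax as if one virtual key with logit equal to the sink bias had already been absorbed (M = sink, running sum L = 1). A scalar NumPy sketch of that equivalence, under the assumption that the sink joins the denominator but contributes no value:

```python
import numpy as np

def softmax_denominator_with_sink(scores, sink):
    M, L = sink, 1.0          # init_softmax_M with USE_SINKS: start at the sink bias
    for s in scores:          # one online-softmax step per score
        M_new = max(M, s)
        L = L * np.exp(M - M_new) + np.exp(s - M_new)
        M = M_new
    return L * np.exp(M)      # un-normalized denominator, invariant across steps
```

Because the sink is folded into the starting state, only one segment may do it in 3D mode; hence the segment-0 restriction above.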
load_qq_bias_tile ¶
Load the qq-bias slice for keys that correspond to query rows.
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
resolve_seq_and_query_len ¶
resolve_seq_and_query_len(
query_start_len_ptr,
seq_lens_ptr,
q_block_global_idx,
num_seqs,
BLOCK_Q: constexpr,
)
Resolve the (sequence, q-block-within-sequence) pair and load the per-sequence lengths.
Shared across every attention kernel — the q_block_global_idx program id indexes into the flattened (seq, q_block_in_seq) space, and a binary search over query_start_len_ptr recovers the (seq, local-q-block) pair.
Returns (seq_idx, q_block_local_idx, cur_batch_in_all_start_index, cur_batch_query_len, seq_len). Callers must still early-return when q_block_local_idx * BLOCK_Q >= cur_batch_query_len (Triton helpers cannot return from the caller).
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
softmax_step ¶
Online softmax update for one tile.
Returns (M_new, L_new, P, alpha). Caller is responsible for rescaling its accumulator(s) by alpha[:, None] — done outside because the number / shape of accumulators varies between kernels (1 in the core, 2 in INT4 split-dot, 4 in INT2 quartet-dot).
Source code in vllm/v1/attention/ops/triton_attention_helpers.py
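A NumPy sketch of one update, with the accumulator rescale left to the caller exactly as described (shapes and names are illustrative):

```python
import numpy as np

def softmax_step(S_tile, M, L):
    # One online-softmax update over a (rows, tile_keys) tile of scores.
    M_new = np.maximum(M, S_tile.max(axis=1))
    alpha = np.exp(M - M_new)               # rescale factor for the old accumulator
    P = np.exp(S_tile - M_new[:, None])     # un-normalized tile probabilities
    L_new = L * alpha + P.sum(axis=1)
    return M_new, L_new, P, alpha
```

A caller with a single accumulator would then run `acc = acc * alpha[:, None] + P @ V_tile` per tile and divide by `L[:, None]` at the end; kernels with 2 or 4 accumulators apply the same `alpha` to each.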
store_segm_reduce_scalars ¶
store_segm_reduce_scalars(
segm_max_ptr,
segm_expsum_ptr,
query_offset_0,
query_offset_1,
segm_idx,
M,
L,
query_mask_0,
query_mask_1,
num_query_heads: constexpr,
NUM_SEGMENTS_PER_SEQ: constexpr,
)
Store per-segment M and L for reduce_segments to combine into the final softmax.
Shared across every 3D attention epilogue; the per-token output stripes are mode-specific (flat / 2-stream split / 4-stream split) and stay inlined.
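The combine rule on the reduce_segments side is the standard log-sum-exp merge of per-segment (M, L) pairs; a NumPy sketch with illustrative names:

```python
import numpy as np

def combine_segments(segm_M, segm_L):
    # Merge per-segment row-max M and exp-sum L into global statistics:
    # each segment's L is rescaled into the global max's frame of reference.
    M = segm_M.max()
    L = np.sum(segm_L * np.exp(segm_M - M))
    return M, L
```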