vllm.utils.torch_utils ¶
LayerName ¶
Bases: OpaqueBase
Wraps a module name string for use as a torch opaque type.
When torch >= 2.11, this is registered as a hoisted value-type opaque object so that torch.compile lifts it as a graph input instead of baking it in as a constant. This avoids per-layer recompilation for custom ops that accept layer-name strings (attention, MoE, KV cache, etc.).
Source code in vllm/utils/torch_utils.py
_encode_layer_name ¶
_nvfp4_split_data_scale ¶
Split a single NVFP4 KV-side buffer into data and scale views.
The input is a 4D tensor for one KV side (K or V) whose last dimension is full_dim = data_dim + scale_dim. The physical layout within each side is [data | scale], both packed contiguously.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
kv_side | Tensor | 4D uint8 tensor whose last dimension is full_dim = data_dim + scale_dim, laid out as [data | scale]. | required
Returns:
| Type | Description |
|---|---|
| tuple[Tensor, Tensor] | The (data, scale) views of the input buffer, sharing its storage. |
Source code in vllm/utils/torch_utils.py
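The split itself can be sketched with plain slicing over the last dimension; a minimal illustration, assuming the [data | scale] layout described above (the function name, `data_dim` parameter, and the shapes below are illustrative, not the actual vLLM signature):

```python
import torch

# Minimal sketch of splitting a [data | scale] packed buffer into views.
# Slicing the last dimension returns views that share storage with the
# input, so no copy is made.
def split_data_scale(kv_side: torch.Tensor, data_dim: int):
    data = kv_side[..., :data_dim]
    scale = kv_side[..., data_dim:]
    return data, scale

# full_dim = 12, packed as 8 bytes of data followed by 4 bytes of scale
buf = torch.zeros(2, 4, 8, 12, dtype=torch.uint8)
data, scale = split_data_scale(buf, data_dim=8)
```

Because both results are views, writes to the underlying buffer are visible through them, which is what lets callers treat the packed cache and its (data, scale) halves interchangeably.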
_resolve_layer_name ¶
async_tensor_h2d ¶
async_tensor_h2d(
data: list,
dtype: dtype,
target_device: str | device,
pin_memory: bool,
) -> Tensor
Asynchronously create a tensor and copy it from host to device.
Source code in vllm/utils/torch_utils.py
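The host-to-device pattern can be sketched as follows; this mirrors the signature above but is an illustration, not the actual implementation. The copy only overlaps with compute when the source is pinned and the target is a CUDA device:

```python
import torch

# Sketch: stage the data in (optionally pinned) host memory, then issue a
# non-blocking copy to the target device.
def async_tensor_h2d(data, dtype, target_device, pin_memory):
    t = torch.tensor(data, dtype=dtype, device="cpu", pin_memory=pin_memory)
    return t.to(device=target_device, non_blocking=True)
```

With pin_memory=False or a CPU target, the call degenerates to an ordinary synchronous copy.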
aux_stream ¶
aux_stream() -> Stream | None
Ensures the auxiliary stream is initialized only once.
Source code in vllm/utils/torch_utils.py
common_broadcastable_dtype ¶
common_broadcastable_dtype(dtypes: Collection[dtype])
Get the common dtype to which all of the other dtypes can be cast without losing any information.
Source code in vllm/utils/torch_utils.py
current_stream ¶
current_stream() -> Stream
Replaces torch.cuda.current_stream() with vllm.utils.current_stream(). It turns out that torch.cuda.current_stream() is quite expensive, as it constructs a new stream object on each call. Here we patch torch.cuda.set_stream to track the current stream directly, so that calls to torch.cuda.current_stream() can be avoided.
The underlying assumption is that torch._C._cuda_setStream is never called from C/C++ code.
Source code in vllm/utils/torch_utils.py
direct_register_custom_op ¶
direct_register_custom_op(
op_name: str,
op_func: Callable,
mutates_args: list[str] | None = None,
fake_impl: Callable | None = None,
target_lib: Library | None = None,
dispatch_key: str | None = None,
tags: tuple[Tag, ...] = (),
)
torch.library.custom_op can have significant overhead because it needs to consider complicated dispatching logic. This function directly registers a custom op and dispatches it to the CUDA backend. See https://gist.github.com/youkaichao/ecbea9ec9fc79a45d2adce1784d7a9a5 for more details.
By default, the custom op is registered to the vLLM library. If you want to register it to a different library, you can pass the library object to the target_lib argument.
IMPORTANT: the lifetime of the operator is tied to the lifetime of the library object. If you want to bind the operator to a different library, make sure the library object is alive when the operator is used.
Source code in vllm/utils/torch_utils.py
get_accelerator_view_from_cpu_tensor ¶
Get an accelerator view of a CPU tensor using Unified Virtual Addressing (UVA).
Source code in vllm/utils/torch_utils.py
get_dtype_size ¶
get_kv_cache_quant_algo_dtype ¶
Get the KV cache quantization algorithm dtype from the quantization config.
Source code in vllm/utils/torch_utils.py
get_kv_cache_quant_algo_string ¶
Get the KV cache quantization algorithm string from the quantization config.
Maps various FP8 format names to vLLM's standard cache dtype strings. Returns None if no kv_cache_quant_algo is specified. Returns "auto" if the value is not recognized/supported.
Source code in vllm/utils/torch_utils.py
guard_cuda_initialization ¶
Avoid unexpected CUDA initialization.
Source code in vllm/utils/torch_utils.py
is_lossless_cast ¶
Test whether it is lossless to cast a tensor from src_dtype to tgt_dtype.
Source code in vllm/utils/torch_utils.py
is_strictly_contiguous ¶
Check if tensor is contiguous AND has no degenerate strides.
A degenerate stride occurs when a dimension has size 1 but the stride doesn't match the canonical contiguous layout. This can cause issues in some CUDA kernels that rely on stride values for memory access.
For a C-contiguous tensor of shape (d0, d1, ..., dn), the expected strides are: stride[i] = product(shape[i+1:]) for all i, with stride[-1]=1.
Example with torch.Size([16, 1, 8, 32]):
- Canonical strides: (256, 256, 32, 1)
- Degenerate strides: (256, 1, 32, 1) (dim 1 has size 1, so its stride may deviate from the canonical layout)
Source code in vllm/utils/torch_utils.py
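The check can be sketched by recomputing the canonical strides and comparing; in this illustration, as_strided is used to fabricate the degenerate case from the example above (torch's built-in is_contiguous() ignores the strides of size-1 dimensions, which is why the extra check is needed):

```python
import torch

# Sketch: a tensor is strictly contiguous iff it is contiguous AND its
# strides exactly match the canonical C-contiguous layout, including
# size-1 dimensions.
def is_strictly_contiguous(t: torch.Tensor) -> bool:
    expected, stride = [], 1
    for size in reversed(t.shape):
        expected.append(stride)
        stride *= size
    expected.reverse()
    return t.is_contiguous() and list(t.stride()) == expected

canonical = torch.empty(16, 1, 8, 32)            # strides (256, 256, 32, 1)
degenerate = torch.empty(16 * 8 * 32).as_strided(
    (16, 1, 8, 32), (256, 1, 32, 1)              # stride 1 in the size-1 dim
)
```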
is_torch_equal ¶
Check if the installed torch version is == the target version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
target | str | a version string, like "2.6.0". | required |
Returns:
| Type | Description |
|---|---|
bool | Whether the condition is met. |
Source code in vllm/utils/torch_utils.py
is_torch_equal_or_newer ¶
Check if the installed torch version is >= the target version.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
target | str | a version string, like "2.6.0". | required |
Returns:
| Type | Description |
|---|---|
bool | Whether the condition is met. |
Source code in vllm/utils/torch_utils.py
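A simplified sketch using the packaging library; the real implementation may handle dev/nightly builds differently, and base_version is used here to strip local suffixes such as "+cu121" before comparing:

```python
import torch
from packaging.version import Version

# Sketch: compare the installed torch version (local suffix stripped via
# base_version) against the target version string.
def is_torch_equal_or_newer(target: str) -> bool:
    installed = Version(Version(torch.__version__).base_version)
    return installed >= Version(target)
```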
kv_cache_uses_per_token_head_scales ¶
make_ndarray_with_pad ¶
make_ndarray_with_pad(
x: list[list[T]],
pad: T,
dtype: DTypeLike,
*,
max_len: int | None = None,
) -> NDArray
Make a padded array from 2D inputs.
The padding is applied to the end of each inner list until it reaches max_len.
Source code in vllm/utils/torch_utils.py
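A minimal sketch of the padding behavior, assuming max_len defaults to the longest inner list (illustrative, not the actual implementation):

```python
import numpy as np

# Sketch: allocate a rectangular array pre-filled with `pad`, then copy
# each inner list into the start of its row.
def make_ndarray_with_pad(x, pad, dtype, *, max_len=None):
    if max_len is None:
        max_len = max((len(row) for row in x), default=0)
    out = np.full((len(x), max_len), pad, dtype=dtype)
    for i, row in enumerate(x):
        out[i, : len(row)] = row
    return out
```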
make_tensor_with_pad ¶
make_tensor_with_pad(
x: list[list[T]],
pad: T,
dtype: dtype,
*,
max_len: int | None = None,
device: str | device | None = None,
pin_memory: bool = False,
) -> Tensor
Make a padded tensor from 2D inputs.
The padding is applied to the end of each inner list until it reaches max_len.
Source code in vllm/utils/torch_utils.py
nvfp4_kv_cache_full_dim ¶
nvfp4_kv_cache_split_views ¶
Split an NVFP4 KV cache tensor into data and scale views.
Accepts either a 5D tensor (num_pages, 2, dim_2, dim_3, full_dim) or a 4D single-side tensor (num_pages, dim_2, dim_3, full_dim).
Per-page layout: [K_data | K_scale | V_data | V_scale]. Each KV side is self-contained (data followed by its scale), so the 5D case simply splits each side independently.
The returned views are in the same dim order as the input (NHD or HND), so callers get views matching whichever order they passed in.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
kv_cache | Tensor | 5D or 4D uint8 tensor where the last dimension is full_dim = data_dim + scale_dim. | required
Returns:
| Type | Description |
|---|---|
tuple | For 5D input: data and scale views for both the K and V sides. |
tuple | For 4D input (single KV side): the (data, scale) views for that side. |
Source code in vllm/utils/torch_utils.py
resolve_kv_cache_dtype_string ¶
resolve_kv_cache_dtype_string(
kv_cache_dtype: str, model_config: ModelConfig
) -> str
Resolve 'auto' kv_cache_dtype to the actual string value from model config. Returns the resolved cache_dtype string.
Source code in vllm/utils/torch_utils.py
set_default_torch_dtype ¶
set_default_torch_dtype(dtype: dtype)
Sets the default torch dtype to the given dtype.
Source code in vllm/utils/torch_utils.py
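This is assumed to be usable as a context manager; a sketch that restores the previous default on exit, even if the body raises:

```python
import torch
from contextlib import contextmanager

# Sketch: temporarily set the default floating-point dtype, restoring the
# previous default on exit.
@contextmanager
def set_default_torch_dtype(dtype):
    old_dtype = torch.get_default_dtype()
    torch.set_default_dtype(dtype)
    try:
        yield
    finally:
        torch.set_default_dtype(old_dtype)
```

Typical usage wraps model construction so that freshly created parameters pick up the desired dtype, e.g. `with set_default_torch_dtype(torch.float64): weights = torch.empty(4)`.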
set_default_torch_num_threads ¶
set_default_torch_num_threads(
num_threads: int | None = None,
)
Sets the default number of threads for PyTorch to the given value.
None means the value of the OMP_NUM_THREADS environment variable is used (or 1 if it is not set).
Source code in vllm/utils/torch_utils.py
weak_ref_tensor ¶
Create a weak reference to a tensor. The new tensor will share the same data as the original tensor, but will not keep the original tensor alive. This ignores 0-size tensors as those don't allocate any memory.
Source code in vllm/utils/torch_utils.py
weak_ref_tensors ¶
weak_ref_tensors(
tensors: Tensor
| list[Tensor]
| tuple[Tensor]
| IntermediateTensors,
) -> Tensor | list[Any] | tuple[Any] | Any
Convenience function to create weak references to tensors, for a single tensor, a list of tensors, or a tuple of tensors.