Metadata-Version: 2.4
Name: nvidia-cusparselt-cu13
Version: 0.8.1
Summary: NVIDIA cuSPARSELt
Home-page: https://developer.nvidia.com/cusparselt
Author: NVIDIA Corporation
Author-email: cuda_installer@nvidia.com
License: NVIDIA Proprietary Software
Keywords: cuda,nvidia,machine learning,high-performance computing
Classifier: Topic :: Scientific/Engineering
Classifier: Environment :: GPU :: NVIDIA CUDA
Classifier: Environment :: GPU :: NVIDIA CUDA :: 13
Description-Content-Type: text/x-rst
Dynamic: author
Dynamic: author-email
Dynamic: classifier
Dynamic: description
Dynamic: description-content-type
Dynamic: home-page
Dynamic: keywords
Dynamic: license
Dynamic: summary

###################################################################################
cuSPARSELt: A High-Performance CUDA Library for Sparse Matrix-Matrix Multiplication
###################################################################################

**NVIDIA cuSPARSELt** is a high-performance CUDA library dedicated to general matrix-matrix operations in which at least one operand is a structured sparse matrix with a 50% sparsity ratio:

.. math::

   D = Activation(\alpha op(A) \cdot op(B) + \beta op(C) + bias)

where :math:`op(A)/op(B)` refers to in-place operations such as transpose/non-transpose, and :math:`\alpha, \beta` are scalars or vectors.

The *cuSPARSELt APIs* allow flexibility in the algorithm/operation selection, epilogue, and matrix characteristics, including memory layout, alignment, and data types.

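
The expression for D above can be made concrete with a small CPU-only reference. The sketch below is plain Python, not the cuSPARSELt API: the helper names (``prune_2_4``, ``reference_matmul``) are hypothetical, the 2:4 pattern (two non-zeros kept in every group of four) is an assumed reading of "50% structured sparsity", ReLU stands in for the activation, scalar :math:`\alpha, \beta` are used, and the bias is assumed to hold one value per output row.

```python
def prune_2_4(row):
    """Keep the 2 largest-magnitude values in each group of 4; zero the rest."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    out = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest magnitudes (ties go to the lower index).
        keep = sorted(range(4), key=lambda i: (-abs(group[i]), i))[:2]
        out.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return out


def matmul(a, b):
    """Dense row-major matrix product of two lists of lists."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(r, c)) for c in cols] for r in a]


def relu(x):
    return x if x > 0.0 else 0.0


def reference_matmul(a, b, c, bias, alpha=1.0, beta=0.0, activation=relu):
    """D = activation(alpha * A.B + beta * C + bias), bias applied per row (assumption)."""
    ab = matmul(a, b)
    return [
        [activation(alpha * ab[i][j] + beta * c[i][j] + bias[i])
         for j in range(len(ab[0]))]
        for i in range(len(ab))
    ]


# A is pruned to the assumed 2:4 pattern before the multiplication.
A = [prune_2_4(r) for r in [[1.0, -4.0, 2.0, 3.0], [0.5, 0.25, -2.0, 1.0]]]
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, -1.0]]
C = [[1.0, 1.0], [1.0, 1.0]]
D = reference_matmul(A, B, C, bias=[0.5, -0.5], alpha=2.0, beta=1.0)
print(A)  # [[0.0, -4.0, 0.0, 3.0], [0.0, 0.0, -2.0, 1.0]]
print(D)  # [[13.5, 0.0], [0.5, 0.0]]
```

A reference of this kind is mainly useful for checking results from the GPU path on small inputs; it says nothing about how the library stores or compresses the sparse operand.
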

**Download:** developer.nvidia.com/cusparselt/downloads

**Provide Feedback:** Math-Libs-Feedback@nvidia.com

**Examples**: cuSPARSELt Example 1, cuSPARSELt Example 2

**Blog posts**:

- Exploiting NVIDIA Ampere Structured Sparsity with cuSPARSELt
- Structured Sparsity in the NVIDIA Ampere Architecture and Applications in Search Engines
- Making the Most of Structured Sparsity in the NVIDIA Ampere Architecture

================================================================================
Key Features
================================================================================

* *NVIDIA Sparse MMA tensor core* support
* Mixed-precision computation support:

+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| Input A/B    | Input C        | Output D        | Compute     | Block scaled                    | Support SM arch    |
+==============+================+=================+=============+=================================+====================+
| `FP32`       | `FP32`         | `FP32`          | `FP32`      | No                              |                    |
+--------------+----------------+-----------------+-------------+                                 +                    |
| `BF16`       | `BF16`         | `BF16`          | `FP32`      |                                 | `8.0, 8.6, 8.7`    |
+--------------+----------------+-----------------+-------------+                                 + `9.0, 10.0, 10.1`  |
| `FP16`       | `FP16`         | `FP16`          | `FP32`      |                                 | `11.0, 12.0, 12.1` |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `FP16`       | `FP16`         | `FP16`          | `FP16`      | No                              | `9.0`              |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `INT8`       | `INT8`         | `INT8`          | `INT32`     | No                              |                    |
+              +----------------+-----------------+             +                                 + `8.0, 8.6, 8.7`    +
|              | `INT32`        | `INT32`         |             |                                 | `9.0, 10.0, 10.1`  |
+              +----------------+-----------------+             +                                 + `11.0, 12.0, 12.1` +
|              | `FP16`         | `FP16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `BF16`         | `BF16`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E4M3`       | `FP16`         | `E4M3`          | `FP32`      | No                              | `9.0, 10.0, 10.1`  |
+              +----------------+-----------------+             +                                 + `11.0, 12.0, 12.1` +
|              | `BF16`         | `E4M3`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `FP16`         | `FP16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `BF16`         | `BF16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E5M2`       | `FP16`         | `E5M2`          | `FP32`      | No                              | `9.0, 10.0, 10.1`  |
+              +----------------+-----------------+             +                                 + `11.0, 12.0, 12.1` +
|              | `BF16`         | `E5M2`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `FP16`         | `FP16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `BF16`         | `BF16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E4M3`       | `FP16`         | `E4M3`          | `FP32`      | A/B/D_OUT_SCALE = `VEC64_UE8M0` | `10.0, 10.1, 11.0` |
+              +----------------+-----------------+             +                                 + `12.0, 12.1`       +
|              | `BF16`         | `E4M3`          |             | D_SCALE = `32F`                 |                    |
+              +----------------+-----------------+             +---------------------------------+                    +
|              | `FP16`         | `FP16`          |             | A/B_SCALE = `VEC64_UE8M0`       |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `BF16`         | `BF16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+
| `E2M1`       | `FP16`         | `E2M1`          | `FP32`      | A/B/D_SCALE = `VEC32_UE4M3`     | `10.0, 10.1, 11.0` |
+              +----------------+-----------------+             +                                 + `12.0, 12.1`       +
|              | `BF16`         | `E2M1`          |             | D_SCALE = `32F`                 |                    |
+              +----------------+-----------------+             +---------------------------------+                    +
|              | `FP16`         | `FP16`          |             | A/B_SCALE = `VEC32_UE4M3`       |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `BF16`         | `BF16`          |             |                                 |                    |
+              +----------------+-----------------+             +                                 +                    +
|              | `FP32`         | `FP32`          |             |                                 |                    |
+--------------+----------------+-----------------+-------------+---------------------------------+--------------------+

* Matrix pruning and compression functionalities
* Activation functions, bias vector, and output scaling
* Batched computation (multiple matrices in a single run)
* GEMM Split-K mode
* Auto-tuning functionality (see `cusparseLtMatmulSearch()`)
* NVTX ranging and logging functionalities

================================================================================
Support
================================================================================

* *Supported SM Architectures*: `SM 8.0`, `SM 8.6`, `SM 8.7`, `SM 8.9`, `SM 9.0`, `SM 10.0`, `SM 10.1` (for CTK 12), `SM 11.0` (for CTK 13), `SM 12.0`, `SM 12.1`
* *Supported CPU architectures and operating systems*:

+------------+--------------------+
| OS         | CPU archs          |
+============+====================+
| `Windows`  | `x86_64`           |
+------------+--------------------+
| `Linux`    | `x86_64`, `Arm64`  |
+------------+--------------------+

================================================================================
Documentation
================================================================================

Please refer to https://docs.nvidia.com/cuda/cusparselt/index.html for the cuSPARSELt documentation.

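
The architecture list in the Support section can be encoded as a small lookup for scripts that need to decide whether a device is covered. This is only a sketch: the names ``COMMON_SM``, ``CTK_SPECIFIC_SM``, ``supported_sm`` and ``is_supported`` are hypothetical, and it assumes that "for CTK 12" and "for CTK 13" mean `SM 10.1` and `SM 11.0` are each available only with that toolkit major version. It does not capture the per-data-type restrictions from the mixed-precision table.

```python
# Transcribed from the "Supported SM Architectures" bullet of this README.
COMMON_SM = ("8.0", "8.6", "8.7", "8.9", "9.0", "10.0", "12.0", "12.1")

# Assumption: these entries apply only to the CUDA Toolkit major version shown.
CTK_SPECIFIC_SM = {12: ("10.1",), 13: ("11.0",)}


def supported_sm(ctk_major):
    """Return the set of SM versions listed for the given CUDA Toolkit major."""
    return frozenset(COMMON_SM + CTK_SPECIFIC_SM.get(ctk_major, ()))


def is_supported(compute_capability, ctk_major):
    """Check a compute capability string such as "9.0" against the list."""
    return compute_capability in supported_sm(ctk_major)


print(is_supported("11.0", 13))  # True
print(is_supported("11.0", 12))  # False
```
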

================================================================================
Installation
================================================================================

The cuSPARSELt wheel can be installed as follows:

.. code-block:: bash

   pip install nvidia-cusparselt-cuXX

where XX is the CUDA major version (for example, `nvidia-cusparselt-cu13` for CUDA 13).
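
After installation, the wheel can be looked up from Python using only the standard library. The snippet below is a minimal sketch: ``find_cusparselt_wheels`` is a hypothetical helper, and the candidate CUDA major versions (12 and 13) are an assumption based on the toolkit versions mentioned in the Support section.

```python
from importlib import metadata


def find_cusparselt_wheels(cuda_majors=("12", "13")):
    """Return {distribution name: version} for each cuSPARSELt wheel found."""
    found = {}
    for major in cuda_majors:
        name = f"nvidia-cusparselt-cu{major}"
        try:
            found[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            pass  # this variant is not installed in the current environment
    return found


wheels = find_cusparselt_wheels()
print(wheels or "no cuSPARSELt wheel found")
```

The check only reads the installed package metadata; it does not load the shared library or verify that a compatible GPU and driver are present.
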