Metadata-Version: 2.2 Name: dfloat11 Version: 0.5.0 Summary: DFloat11: Fast and memory-efficient GPU inference for losslessly compressed LLMs and diffusion models Author: Tianyi Zhang Requires-Python: >=3.9 Description-Content-Type: text/markdown License-File: LICENSE Requires-Dist: accelerate Requires-Dist: dahuffman==0.4.2 Requires-Dist: huggingface-hub Requires-Dist: safetensors Requires-Dist: transformers Requires-Dist: tqdm Provides-Extra: cuda11 Requires-Dist: cupy-cuda11x; extra == "cuda11" Provides-Extra: cuda12 Requires-Dist: cupy-cuda12x; extra == "cuda12" Dynamic: author Dynamic: description Dynamic: description-content-type Dynamic: provides-extra Dynamic: requires-dist Dynamic: requires-python Dynamic: summary # DFloat11: Lossless LLM Compression for Efficient GPU Inference [![PyPI version](https://img.shields.io/pypi/v/dfloat11.svg?color=blue)](https://pypi.org/project/dfloat11/) [![arXiv](https://img.shields.io/badge/arXiv-2504.11651-b31b1b.svg)](https://arxiv.org/abs/2504.11651) [![Hugging Face](https://img.shields.io/badge/Model-%F0%9F%A4%97-yellow.svg)](https://huggingface.co/DFloat11) **DFloat11** is a lossless compression framework that reduces the size of Large Language Models (LLMs) by approximately **30%** while preserving **bit-for-bit identical outputs** to the original model. It enables efficient GPU inference on resource-constrained hardware without sacrificing accuracy. ## πŸ“° News - [08/24/2025] Compression code released! * Reduce the size of any model by 30% with DFloat11 compression. * Get started here: [examples/compress_flux1](https://github.com/LeanModels/DFloat11/tree/master/examples/compress_flux1). - [07/29/2025] Efficient CPU Offloading Now Supported! * Our latest update enables highly memory-efficient inference by keeping only one transformer block in GPU memory at a time. For example, CPU offloading reduces peak GPU memory for inference of **FLUX.1-Krea-dev from 17.5 to 9.8 GB, Qwen3-8B from 12.4 to 2.3 GB, and HiDream-I1-Full from 26.4 to 9.6 GB**. * An example of using CPU offloading with FLUX.1-Krea-dev-DF11 can be found [here](https://huggingface.co/DFloat11/FLUX.1-Krea-dev-DF11). * To enable CPU offloading, simply set `cpu_offload=True` when calling `DFloat11Model.from_pretrained(...)`. - [05/23/2025] **Wan2.1** support is now live! [`DFloat11/Wan2.1-T2V-14B-Diffusers-DF11`](https://huggingface.co/DFloat11/Wan2.1-T2V-14B-Diffusers-DF11) * Text-to-video generation with DFloat11 *Wan2.1 14B* using only 24GB VRAM! * Get started here: [examples/wan2.1](https://github.com/LeanModels/DFloat11/tree/master/examples/wan2.1). - [05/06/2025] **DFloat11 now supports [`FLUX.1-dev`](https://huggingface.co/black-forest-labs/FLUX.1-dev)** * πŸ–ΌοΈ Generate stunning text-to-image results on GPUs with **less than 24GB VRAM** --- no quality lost! * πŸ“‚ Get started here: [examples/flux.1](https://github.com/LeanModels/DFloat11/tree/master/examples/flux.1). - [05/05/2025] The `dfloat11` pip package has been upgraded to `v0.2.0`! Run `pip install -U dfloat11[cuda12]` to upgrade to the latest version. We have made the following important changes: * We added support for Qwen 3, Gemma 3, and Phi 4! * The GPU decompression kernel is now 20-40% faster! We achieved it by improving thread occupancy and implementing tons of optimizations. * The DFloat11 models are now stored in safetensors format for better safety and loading performance. * When using a DFloat11 model, only the compressed model is downloaded, not the original model. ## πŸ“¦ Installation Requires a CUDA-compatible GPU and [PyTorch](https://pytorch.org/get-started/locally/) installed. ```bash pip install -U dfloat11[cuda12] # or if you have CUDA version 11: # pip install -U dfloat11[cuda11] ``` ## πŸ” How It Works DFloat11 compresses model weights using **Huffman coding** of BFloat16 exponent bits, combined with **hardware-aware algorithmic designs** that enable efficient on-the-fly decompression directly on the GPU. During inference, the weights remain compressed in GPU memory and are **decompressed just before matrix multiplications**, then **immediately discarded after use** to minimize memory footprint. Key benefits: * **No CPU decompression or host-device data transfer**: all operations are handled entirely on the GPU. * **Decompression overhead is constant** per forward pass and **independent of batch size**, making DFloat11 increasingly efficient at larger batch sizes. * DFloat11 is **much faster than CPU-offloading approaches**, enabling practical deployment in memory-constrained environments. * At batch size = 1, inference is approximately 2Γ— slower than the original BF16 model, but the performance gap narrows significantly with larger batches. * The compression is **fully lossless**, guaranteeing that the model’s outputs are **bit-for-bit identical** to those of the original model. ## πŸš€ Quick Start 1. Install the `dfloat11` pip package. See [Installation](#-installation). 2. Run the following code in Python, which automatically downloads the DFloat11 `Qwen3-8B` model and generates a response. ```python import torch from dfloat11 import DFloat11Model from transformers import AutoTokenizer model_id = "DFloat11/Qwen3-8B-DF11" model = DFloat11Model.from_pretrained(model_id, device_map="auto") tokenizer = AutoTokenizer.from_pretrained(model_id) tokenizer.pad_token = tokenizer.eos_token prompt = "Question: What is a binary tree and its applications? Answer:" inputs = tokenizer(prompt, return_tensors="pt", padding=True).to(model.device) with torch.no_grad(): output = model.generate( **inputs, max_new_tokens=256, do_sample=True, ) print(tokenizer.batch_decode(output, skip_special_tokens=True)) ``` 3. Replace the `model_id` in the script above with any pre-compressed model in the [Model Hub](#-model-hub). ## 🏎️ Benchmarking Performance To test the speed and memory consumption a DFloat11 LLM during inference: ```bash CUDA_VISIBLE_DEVICES=0 python inference.py \ --model_name_or_path DFloat11/Qwen3-8B-DF11 \ --prompt "Question: What is a binary tree and its applications? Answer:" \ --num_tokens 512 \ --batch_size 1 ``` > πŸ’‘ **Tip**: If you specify multiple CUDA devices (e.g., `CUDA_VISIBLE_DEVICES=0,1`), the model will be automatically distributed across them using πŸ€— Accelerate's `device_map="auto"`. ### Arguments - `--model_name_or_path`: HuggingFace name or local path of the DFloat11 model (e.g., `DFloat11/Qwen3-8B-DF11`). See the [Model Hub](#-model-hub) section for a list of available DFloat11 models. - `--bf16`: *(Optional)* Turn on this flag when passing a BFloat16 model to `--model_name_or_path` - `--prompt`: Input prompt string for text generation - `--num_tokens`: Number of new tokens to generate per sample - `--batch_size`: Number of prompts to process in parallel - `--seed`: *(Optional)* Random seed for reproducible results ### Output The script prints: - Generated responses - Total decoding latency - Tokens per second (throughput) - GPU memory usage (allocated and peak) ## πŸ“š Model Hub | Model | DFloat11 Link | |-------|---------------| | Wan2.1 T2V 14B (see [examples/wan2.1](https://github.com/LeanModels/DFloat11/tree/master/examples/wan2.1)) | [DFloat11/Wan2.1-T2V-14B-Diffusers-DF11](https://huggingface.co/DFloat11/Wan2.1-T2V-14B-Diffusers-DF11) | | FLUX.1 dev (see [examples/flux.1](https://github.com/LeanModels/DFloat11/tree/master/examples/flux.1)) | [DFloat11/FLUX.1-dev-DF11](https://huggingface.co/DFloat11/FLUX.1-dev-DF11) | | Qwen 3 32B | [DFloat11/Qwen3-32B-DF11](https://huggingface.co/DFloat11/Qwen3-32B-DF11) | | Qwen 3 14B | [DFloat11/Qwen3-14B-DF11](https://huggingface.co/DFloat11/Qwen3-14B-DF11) | | Qwen 3 8B | [DFloat11/Qwen3-8B-DF11](https://huggingface.co/DFloat11/Qwen3-8B-DF11) | | Qwen 3 4B | [DFloat11/Qwen3-4B-DF11](https://huggingface.co/DFloat11/Qwen3-4B-DF11) | | Phi 4 Reasoning Plus | [DFloat11/Phi-4-reasoning-plus-DF11](https://huggingface.co/DFloat11/Phi-4-reasoning-plus-DF11) | | Gemma 3 27B Instruct | [DFloat11/gemma-3-27b-it-DF11](https://huggingface.co/DFloat11/gemma-3-27b-it-DF11) | | Gemma 3 12B Instruct | [DFloat11/gemma-3-12b-it-DF11](https://huggingface.co/DFloat11/gemma-3-12b-it-DF11) | | Gemma 3 4B Instruct | [DFloat11/gemma-3-4b-it-DF11](https://huggingface.co/DFloat11/gemma-3-4b-it-DF11) | | Llama 3.1 8B Instruct | [DFloat11/Llama-3.1-8B-Instruct-DF11](https://huggingface.co/DFloat11/Llama-3.1-8B-Instruct-DF11) | | DeepSeek R1 Distill Qwen 32B | [DFloat11/DeepSeek-R1-Distill-Qwen-32B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Qwen-32B-DF11) | | DeepSeek R1 Distill Qwen 14B | [DFloat11/DeepSeek-R1-Distill-Qwen-14B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Qwen-14B-DF11) | | DeepSeek R1 Distill Qwen 7B | [DFloat11/DeepSeek-R1-Distill-Qwen-7B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Qwen-7B-DF11) | | DeepSeek R1 Distill Llama 8B | [DFloat11/DeepSeek-R1-Distill-Llama-8B-DF11](https://huggingface.co/DFloat11/DeepSeek-R1-Distill-Llama-8B-DF11) | | ... | [Discover more models on our HF page!](https://huggingface.co/DFloat11) | ### How to Use a DFloat11 Model 1. Download a model using the HuggingFace command line tool: ```bash huggingface-cli download \ DFloat11/Llama-3.1-8B-Instruct-DF11 \ # DFloat11 model name --local-dir ./Llama-3.1-8B-Instruct-DF11 # local path to download the DFloat11 model ``` 2. Run the following in Python to load the model and tokenizer: ```python from dfloat11 import DFloat11Model from transformers import AutoTokenizer model_path = "./Llama-3.1-8B-Instruct-DF11" model = DFloat11Model.from_pretrained(model_path, device_map="auto") tokenizer = AutoTokenizer.from_pretrained(model_path) ``` ## πŸ—œοΈ Compressing Models (BFloat16 β†’ DFloat11) The DFloat11 compression utility is exposed via the `compress_model` function. Check [examples/compress_flux1](https://github.com/LeanModels/DFloat11/tree/master/examples/compress_flux1) for a detailed example on compressing the FLUX.1 model. ## πŸ”— Links πŸ‘‰ Explore pre-compressed DFloat11 models ready to use on HuggingFace: **[https://huggingface.co/DFloat11](https://huggingface.co/DFloat11)** πŸ“‚ Official Code Repository: [https://github.com/LeanModels/DFloat11](https://github.com/LeanModels/DFloat11) ## 🧠 Contributions This work is brought to you by the team at Rice University and [xMAD.ai](https://xmad.ai/). The GPU kernel was designed and implemented by [Tianyi Zhang](https://github.com/tonyzhang617). ## πŸ“š Citation If you found our work useful or interesting, please consider citing our paper: ```bibtex @article{zhang2025dfloat11, title={70\% Size, 100\% Accuracy: Lossless LLM Compression for Efficient GPU Inference via Dynamic-Length Float}, author={Zhang, Tianyi and Sui, Yang and Zhong, Shaochen and Chaudhary, Vipin and Hu, Xia and Shrivastava, Anshumali}, journal={arXiv preprint arXiv:2504.11651}, year={2025} } ```