Metadata-Version: 2.4
Name: chardet
Version: 7.4.3
Summary: Universal character encoding detector
Project-URL: Homepage, https://github.com/chardet/chardet
Project-URL: Repository, https://github.com/chardet/chardet
Project-URL: Bug Tracker, https://github.com/chardet/chardet/issues
Project-URL: Documentation, https://chardet.readthedocs.io
Project-URL: Changelog, https://chardet.readthedocs.io/en/latest/changelog.html
Author-email: Dan Blanchard
Maintainer: Ian Cordasco
Maintainer-email: Dan Blanchard
License-Expression: 0BSD
License-File: LICENSE
Keywords: chardet,charset,detection,encoding,unicode
Classifier: Development Status :: 5 - Production/Stable
Classifier: Intended Audience :: Developers
Classifier: Operating System :: OS Independent
Classifier: Programming Language :: Python
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.10
Classifier: Programming Language :: Python :: 3.11
Classifier: Programming Language :: Python :: 3.12
Classifier: Programming Language :: Python :: 3.13
Classifier: Programming Language :: Python :: 3.14
Classifier: Programming Language :: Python :: Implementation :: CPython
Classifier: Programming Language :: Python :: Implementation :: PyPy
Classifier: Topic :: Software Development :: Libraries :: Python Modules
Classifier: Topic :: Text Processing :: Linguistic
Requires-Python: >=3.10
Description-Content-Type: text/markdown

# chardet

Universal character encoding detector.

[![License: 0BSD](https://img.shields.io/badge/License-0BSD-blue.svg)](LICENSE)
[![Documentation](https://readthedocs.org/projects/chardet/badge/?version=latest)](https://chardet.readthedocs.io)
[![codecov](https://codecov.io/github/chardet/chardet/branch/main/graph/badge.svg?token=m5ZQrMd3vk)](https://codecov.io/github/chardet/chardet)

chardet 7 is a ground-up, 0BSD-licensed rewrite of [chardet](https://github.com/chardet/chardet).
Same package name, same public API: a drop-in replacement for chardet 5.x/6.x, just much faster and more accurate.
Python 3.10+, zero runtime dependencies, works on PyPy.

[Read more details about the rewrite process.](https://dan-blanchard.github.io/blog/chardet-rewrite-controversy/)

## Why chardet 7?

**99.3% accuracy** on 2,517 test files. **47x faster** than chardet 6.0.0 and **1.5x faster** than charset-normalizer 3.4.6. **Language detection** for every result. **MIME type detection** for binary files. **0BSD licensed.**

|                        | chardet 7.4.0 (mypyc)  | chardet 6.0.0 | [charset-normalizer] 3.4.6  |
| ---------------------- | :--------------------: | :-----------: | :-------------------------: |
| Accuracy (2,517 files) |       **99.3%**        |     88.2%     |            85.4%            |
| Speed                  |    **551 files/s**     |  12 files/s   |         376 files/s         |
| Language detection     |       **95.7%**        |     40.0%     |            59.2%            |
| Peak memory            |      **52.9 MiB**      |   29.5 MiB    |          78.8 MiB           |
| Streaming detection    |        **yes**         |      yes      |             no              |
| Encoding era filtering |        **yes**         |      no       |             no              |
| Encoding filters       |        **yes**         |      no       |             yes             |
| MIME type detection    |        **yes**         |      no       |             no              |
| Supported encodings    |           99           |      84       |             99              |
| License                |          0BSD          |     LGPL      |             MIT             |

[charset-normalizer]: https://github.com/jawah/charset_normalizer

## Installation

```bash
pip install chardet
```

## Quick Start

```python
import chardet

chardet.detect(b"Python is a great programming language for beginners and experts alike.")
# {'encoding': 'ascii', 'confidence': 1.0, 'language': 'en', 'mime_type': 'text/plain'}

# UTF-8 English with accented characters
chardet.detect("The naïve approach doesn't always work in complex systems.".encode("utf-8"))
# {'encoding': 'utf-8', 'confidence': 0.84, 'language': 'en', 'mime_type': 'text/plain'}

# Japanese EUC-JP
chardet.detect("日本語の文字コード検出テストです。このテキストはEUC-JPでエンコードされています。正しく検出できるか確認します。".encode("euc-jp"))
# {'encoding': 'EUC-JP', 'confidence': 1.0, 'language': 'ja', 'mime_type': 'text/plain'}

# Get all candidate encodings ranked by confidence
text = "Le café est une boisson très populaire en France et dans le monde entier."
results = chardet.detect_all(text.encode("windows-1252"))
for r in results[:4]:
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1252 0.32
# iso8859-15 0.32
# ISO-8859-1 0.32
# MacRoman 0.31
```

### Streaming Detection

For large files or network streams, use `UniversalDetector` to feed data incrementally:

```python
from chardet import UniversalDetector

detector = UniversalDetector()
with open("unknown.txt", "rb") as f:
    for line in f:
        detector.feed(line)
        if detector.done:
            break
result = detector.close()
print(result)
```

### Encoding Era Filtering

Restrict detection to specific encoding eras to reduce false positives:

```python
from chardet import detect_all
from chardet.enums import EncodingEra

data = "Москва является столицей Российской Федерации и крупнейшим городом страны.".encode("windows-1251")

# All encoding eras are considered by default: 4 candidates across eras
for r in detect_all(data):
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1251 0.46
# MacCyrillic 0.42
# KZ1048 0.2
# ptcp154 0.2

# Restrict to modern web encodings: 1 confident result
for r in detect_all(data, encoding_era=EncodingEra.MODERN_WEB):
    print(r["encoding"], round(r["confidence"], 2))
# Windows-1251 0.46
```

### Encoding Filters

Restrict detection to specific encodings, or exclude encodings you don't want:

```python
import chardet

# Sample input; any bytes object works here
data = "Le café est une boisson très populaire en France.".encode("windows-1252")

# Only consider UTF-8 and Windows-1252
chardet.detect(data, include_encodings=["utf-8", "windows-1252"])

# Consider everything except EBCDIC
chardet.detect(data, exclude_encodings=["cp037", "cp500"])
```

## CLI

```bash
chardetect somefile.txt
# somefile.txt: utf-8 with confidence 0.99

chardetect --minimal somefile.txt
# utf-8

# Include detected language
chardetect -l somefile.txt
# somefile.txt: utf-8 en (English) with confidence 0.99

# Only consider specific encodings
chardetect -i utf-8,windows-1252 somefile.txt
# somefile.txt: utf-8 with confidence 0.99

# Pipe from stdin
cat somefile.txt | chardetect
# stdin: utf-8 with confidence 0.99
```

## What's New in chardet 7?

- **0BSD license** (previous versions were LGPL)
- **Ground-up rewrite:** 13-stage detection pipeline using BOM detection, magic number identification, structural probing, byte validity filtering, and bigram statistical models
- **47x faster** than chardet 6.0.0 with mypyc, **1.5x faster** than charset-normalizer 3.4.6
- **99.3% accuracy:** +11.1pp vs chardet 6.0.0, +13.9pp vs charset-normalizer 3.4.6
- **Language detection:** 95.7% accuracy across 49 languages, returned with every result
- **MIME type detection:** identifies 40+ binary file formats (images, audio/video, archives, documents, executables, fonts) via magic number signatures, plus `text/html`, `text/xml`, and `text/x-python` for markup
- **Encoding filters:** `include_encodings` and `exclude_encodings` parameters to restrict or exclude specific encodings from the candidate set
- **99 encodings:** full coverage including EBCDIC, Mac, DOS, and Baltic/Central European families
- **Optional mypyc compilation:** 1.67x additional speedup on CPython
- **Thread-safe:** `detect()` and `detect_all()` are safe to call concurrently; scales on free-threaded Python
- **Same API:** `detect()`, `detect_all()`, `UniversalDetector`, and the `chardetect` CLI all work as before

## Documentation

Full documentation is available at [chardet.readthedocs.io](https://chardet.readthedocs.io).

## Project History

chardet was originally created by [Mark Pilgrim](https://en.wikipedia.org/wiki/Mark_Pilgrim) in 2006 as a Python port of [Mozilla's universal charset detection library](https://www-archive.mozilla.org/projects/intl/chardet.html). He released versions 1.0 (2006) and 1.0.1 (2008) on PyPI, then developed an unreleased Python 3 port (2.0.1) on Google Code. After Mark [deleted his online accounts](https://en.wikipedia.org/wiki/Mark_Pilgrim#%22Infocide%22) in 2011, the project was continued by David Cramer, Erik Rose, Toshio Kuratomi, Ian Cordasco, and Dan Blanchard.

In 2026, Dan Blanchard rewrote chardet using Claude, releasing chardet 7.0 under a new license. Releases from 7.0 onward are not derived from the original chardet code, but they are published under the same name so that existing users can move over easily and benefit immediately from the speed and accuracy improvements.

For historical preservation, and to make comparison with the other releases easier, Dan has restored Mark's lost commits to this repository in the `history/pilgrim` branch. To see the full history from 2006 to present in `git log`, fetch the graft refs:

```
git fetch origin 'refs/replace/*:refs/replace/*'
```

## License

[0BSD](LICENSE)