chardet4j icon indicating copy to clipboard operation
chardet4j copied to clipboard

Simple, compact charset detection for Java 8+

CHARDET4J tests Maven Central

Introduction

The state-of-the-art character set detection library for Java is icu4j. However, the icu4j JAR file is about 13MB. This is a hefty price to pay for programs that only require charset detection! There should be a smaller option of the same quality.

The chardet4j library pulls the CharsetDetector feature from icu4j and repackages it as this standalone library. This allows programs to make good use of this important feature without bloating their JARs. At the time of this writing, the chardet4j JAR comes in around 75KB.

This library also implements some other important components of character set detection and decoding, namely byte order mark handling.

Getting Started

To decode an InputStream to a Reader by detecting its character set:

try (Reader chars=Chardet.decode(bytes, StandardCharsets.UTF_8)) {
    // Process chars here
}

Note that the UTF-8 encoding is used by default if the character set cannot be detected automatically.

Charset detection is important when dealing with content of unknown provenance, like content downloaded from the internet. In such cases, users often have a declared encoding, typically from a content type. The name of the declared encoding can be provided as a hint to charset detection:

try (Reader chars=Chardet.decode(bytes, declaredEncoding, StandardCharsets.UTF_8)) {
    // Process chars here
}

Byte arrays can be converted directly to Strings as well:

String chars=Chardet.toString(bytes, declaredEncoding, StandardCharsets.UTF_8);

Users can simply detect a character set, too:

Optional<Charset> maybeCharset=Chardet.detectCharset(bytes, declaredEncoding);

Licensing

The icu library is released under the ICU license. The chardet4j library is released under the Apache license.