Unishox: A hybrid encoder for Short Unicode Strings

General-purpose compression utilities such as zip and gzip do not compress short strings well and often expand them. They also use a lot of memory, which makes them unusable in constrained environments such as Arduino. Unishox was developed for compressing (and decompressing) short strings individually.

Note: The present byte-code version is 2 and it replaces Unishox 1. Unishox 1 is still available as unishox1.c, but it will have to be compiled manually if it is needed.

This is a C/C++ library. See here for the CPython version and here for the JavaScript version, which is interoperable with this library.

Applications

Compression for low memory devices such as Arduino and ESP8266
Compression of chat application text exchanges, including emojis
Storing compressed text in database
Faster retrieval speed when used as join keys
Bandwidth and storage cost reduction for Cloud

How it works

Unishox is a hybrid encoder (entropy, dictionary and delta coding). It works by assigning fixed prefix-free codes to each letter in the Character Set (entropy coding). It also encodes repeating letter sets separately (dictionary coding). For Unicode characters, delta coding is used.
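To illustrate what "fixed prefix-free codes" means, here is a minimal sketch with a toy table of four symbols. The table, the names toy_encode and toy_decode, and the use of '0'/'1' characters instead of packed bits are all assumptions made for readability; these are not the codes Unishox actually assigns.

```c
#include <stddef.h>
#include <string.h>

/* Toy prefix-free table (hypothetical, NOT Unishox's real codes).
 * No code is a prefix of another, so decoding needs no separators. */
struct toy_code { char symbol; const char *bits; };

static const struct toy_code TOY_TABLE[] = {
    { 'e', "0"   },
    { 't', "10"  },
    { 'a', "110" },
    { 's', "111" },
};
#define TOY_TABLE_LEN (sizeof TOY_TABLE / sizeof TOY_TABLE[0])

/* Encode text as a NUL-terminated string of '0'/'1' characters.
 * Returns the number of bits written, or -1 if a symbol is not in
 * the table or the output buffer is too small. */
int toy_encode(const char *text, char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return -1;
    for (; *text; text++) {
        const char *bits = NULL;
        for (size_t i = 0; i < TOY_TABLE_LEN; i++)
            if (TOY_TABLE[i].symbol == *text) bits = TOY_TABLE[i].bits;
        if (!bits) return -1;
        size_t n = strlen(bits);
        if (used + n + 1 > out_size) return -1;
        memcpy(out + used, bits, n);
        used += n;
    }
    out[used] = '\0';
    return (int)used;
}

/* Decode a '0'/'1' string back to text. Because the table is
 * prefix-free, at most one code can match at each position.
 * Returns the number of symbols written, or -1 on error. */
int toy_decode(const char *bits, char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return -1;
    while (*bits) {
        int matched = 0;
        for (size_t i = 0; i < TOY_TABLE_LEN && !matched; i++) {
            size_t n = strlen(TOY_TABLE[i].bits);
            if (strncmp(bits, TOY_TABLE[i].bits, n) == 0) {
                if (used + 2 > out_size) return -1;
                out[used++] = TOY_TABLE[i].symbol;
                bits += n;
                matched = 1;
            }
        }
        if (!matched) return -1;
    }
    out[used] = '\0';
    return (int)used;
}
```

With this toy table, toy_encode("tease", ...) produces the 10-bit string 1001101110, compared with 40 bits for five 8-bit characters, and toy_decode recovers "tease" from it. Frequent symbols get the shortest codes, which is the general idea behind entropy coding.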

The model used for arriving at the prefix-free code is shown below:

[Figure: model used for arriving at the prefix-free codes]

The complete specification can be found in this article: A hybrid encoder for compressing Short Unicode Strings. This can also be found at figshare here with DOI 10.6084/m9.figshare.17056334.v2.
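The dictionary coding idea mentioned in this section (encoding repeating letter sets separately) can be pictured with a small sketch that looks for text already seen earlier in the same string. The function name find_repeat, its parameters, and the rule that matches may not overlap the current position are assumptions for illustration only; the specification defines how Unishox actually detects and encodes repeats.

```c
#include <stddef.h>

/* Find the longest run starting at s[pos] that also occurs earlier
 * in the string (hypothetical helper, not part of the library).
 * On success, *distance is how far back the earlier copy starts and
 * the match length is returned. Returns 0 if no match of at least
 * min_len characters exists; *distance is then not meaningful. */
size_t find_repeat(const char *s, size_t len, size_t pos,
                   size_t min_len, size_t *distance) {
    size_t best = 0;
    for (size_t start = 0; start < pos; start++) {
        size_t n = 0;
        /* Extend the match while characters agree; keep the earlier
         * copy entirely before pos (a simplifying assumption). */
        while (pos + n < len && start + n < pos &&
               s[start + n] == s[pos + n])
            n++;
        if (n > best) {
            best = n;
            *distance = pos - start;
        }
    }
    return best >= min_len ? best : 0;
}
```

For the string "the cat and the dog" and pos = 12 (the second "the"), this returns a length of 4 with a distance of 12, so an encoder could emit a short (length, distance) pair instead of the four characters "the ".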

Compiling

To compile, just use make or use gcc as follows:

gcc -std=c99 -o test_unishox2 test_unishox2.c unishox2.c

Unit tests (automated)

For testing the compiled program, use:

./test_unishox2 -t

This invokes the run_unit_tests() function of test_unishox2.c, which tests all the features of Unishox2, including edge cases, using 159 strings covering several languages, emojis and binary data.

Further, the CI pipeline at .github/workflows/c-cpp.yml runs these tests for all presets and also tests file compression for the different types of files in the sample_texts folder. This happens whenever a commit is made to the repository.

API

int unishox2_compress_simple(const char *in, int len, char *out);
int unishox2_decompress_simple(const char *in, int len, char *out);

Usage

To see Unishox in action, simply try to compress a string:

./test_unishox2 "Hello World"

To compress and decompress a file, use:

./test_unishox2 -c <input_file> <compressed_file>
./test_unishox2 -d <compressed_file> <decompressed_file>

Note: Unishox is good for text content of up to a few kilobytes. It does not give good compression ratios for large files or binary files.

Character Set

Unishox supports the entire Unicode character set. As of now it supports UTF-8 as input and output encoding.
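Since input arrives as UTF-8, the delta coding mentioned under How it works can be pictured as working on decoded code points rather than raw bytes. The sketch below is an illustration under that assumption, not the library's implementation: utf8_next and codepoint_deltas are hypothetical names, and the decoder is simplified (it does not reject overlong forms or surrogates).

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from s (len bytes available).
 * Stores the code point in *cp and returns the bytes consumed,
 * or 0 if the sequence is malformed or truncated. */
static int utf8_next(const unsigned char *s, size_t len, int32_t *cp) {
    int n;
    if (len == 0) return 0;
    if (s[0] < 0x80) { *cp = s[0]; return 1; }
    else if ((s[0] & 0xE0) == 0xC0) { *cp = s[0] & 0x1F; n = 2; }
    else if ((s[0] & 0xF0) == 0xE0) { *cp = s[0] & 0x0F; n = 3; }
    else if ((s[0] & 0xF8) == 0xF0) { *cp = s[0] & 0x07; n = 4; }
    else return 0;
    if ((size_t)n > len) return 0;
    for (int i = 1; i < n; i++) {
        if ((s[i] & 0xC0) != 0x80) return 0;   /* not a continuation byte */
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return n;
}

/* Write the difference between each code point and the previous one
 * (the first is taken relative to 0). Returns the number of deltas,
 * or -1 on malformed input or if deltas[] is too small. */
int codepoint_deltas(const char *utf8, size_t len,
                     int32_t *deltas, size_t max_deltas) {
    const unsigned char *s = (const unsigned char *)utf8;
    int32_t prev = 0;
    size_t count = 0, pos = 0;
    while (pos < len) {
        int32_t cp;
        int n = utf8_next(s + pos, len - pos, &cp);
        if (n == 0 || count == max_deltas) return -1;
        deltas[count++] = cp - prev;
        prev = cp;
        pos += (size_t)n;
    }
    return (int)count;
}
```

For the Tamil word "நன்றி" (code points U+0BA8, U+0BA9, U+0BCD, U+0BB1, U+0BBF), the deltas are 2984, 1, 36, -28 and 14. After the first character, every delta is a small number because the characters come from the same script block, and small numbers can be stored in fewer bits than full code points.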

Interoperability with the JS Library

Strings that were compressed with this library can be decompressed with the JS Library and vice versa. However, please see this section in the documentation for usage.

Projects that use Unishox

Credits

Thanks to Jonathan Greenblatt for his port of Unishox2 that works on Particle Photon
Thanks to Chris Partridge for his port of Unishox2 to CPython and his comprehensive tests using Hypothesis and extensive performance tests
Thanks to Stephan Hadinger for his port of Unishox1 to Python for Tasmota
Thanks to Luis Díaz Más for his PRs to support MSVC and CMake setup
Thanks to James Z.M. Gao for his PRs on improving presets, unit tests, bug fixes and more
Thanks to Jm Casler and Shiv Kokroo for choosing and integrating Unishox into the Meshtastic project

Sponsor

If you like this work, you could buy me a coffee. However, don't feel pressured by this. Feel free to use this work as you like.

Issues

In case of any issues, please email the author (Arundale Ramanathan) at [email protected] or create a GitHub issue.
