decancer

A tiny package that removes common unicode confusables/homoglyphs from strings.
- Its core is written in Rust and uses a form of binary search to ensure speed!
- By default, it can filter 216,170 (19.40%) different Unicode codepoints, including:
  - All whitespace characters
  - All diacritics, which also eliminates all forms of Zalgo text
  - Most leetspeak characters
  - Most homoglyphs
  - Several emojis
- Unlike other packages, this package is Unicode bidi-aware: it interprets right-to-left characters the same way an application would render them!
- Its behavior is also highly customizable to your liking!
- And it's available in Rust, JavaScript (Node.js and browser), Java, and C/C++.
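To make the binary-search point above concrete, here is a minimal sketch of how a sorted codepoint table can be searched. It is an illustration only, not decancer's actual implementation: the `TABLE` contents and the `lookup` and `simplify` helpers are hypothetical names invented for this example, and the real data set is far larger.

```javascript
// Hypothetical, hand-picked table for illustration only (NOT decancer's data).
// Entries are sorted by code point so they can be binary-searched.
const TABLE = [
  [0x0147, 'n'], // LATIN CAPITAL LETTER N WITH CARON
  [0x0163, 't'], // LATIN SMALL LETTER T WITH CEDILLA
  [0x2115, 'n'], // DOUBLE-STRUCK CAPITAL N
  [0x24e1, 'r'], // CIRCLED LATIN SMALL LETTER R
  [0xff25, 'e'], // FULLWIDTH LATIN CAPITAL LETTER E
  [0xff59, 'y'], // FULLWIDTH LATIN SMALL LETTER Y
]

// Classic binary search: returns the replacement, or null if absent.
function lookup(codepoint) {
  let lo = 0
  let hi = TABLE.length - 1
  while (lo <= hi) {
    const mid = (lo + hi) >>> 1
    const [key, replacement] = TABLE[mid]
    if (key === codepoint) return replacement
    if (key < codepoint) lo = mid + 1
    else hi = mid - 1
  }
  return null
}

// Replace every code point found in the table; leave the rest untouched.
function simplify(text) {
  let out = ''
  for (const ch of text) {
    out += lookup(ch.codePointAt(0)) ?? ch
  }
  return out
}

// 'v' followed by fullwidth E, circled r and fullwidth y
console.log(simplify('v\uFF25\u24E1\uFF59')) // very
```

Each lookup costs O(log n) comparisons, which is why a sorted table stays fast even with hundreds of thousands of entries.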
Installation
Rust (v1.64 or later)
In your Cargo.toml:
decancer = "3.1.2"
JavaScript (Node.js)
In your shell:
$ npm install decancer
In your code (CommonJS):
const decancer = require('decancer')
In your code (ESM):
import decancer from 'decancer'
JavaScript (Browser)
In your code:
<script type="module">
import init from 'https://cdn.jsdelivr.net/gh/null8626/decancer@v3.1.2/bindings/wasm/bin/decancer.min.js'
const decancer = await init()
</script>
Java
As a dependency
In your build.gradle:
repositories {
mavenCentral()
maven { url 'https://jitpack.io' }
}
dependencies {
implementation 'com.github.null8626:decancer:v3.1.2'
}
In your pom.xml:
<repositories>
<repository>
<id>central</id>
<url>https://repo.maven.apache.org/maven2</url>
</repository>
<repository>
<id>jitpack.io</id>
<url>https://jitpack.io</url>
</repository>
</repositories>
<dependencies>
<dependency>
<groupId>com.github.null8626</groupId>
<artifactId>decancer</artifactId>
<version>v3.1.2</version>
</dependency>
</dependencies>
Building from source
$ git clone https://github.com/null8626/decancer.git --depth 1
$ cd ./decancer/bindings/java
$ unzip ./bin/bindings.zip -d ./bin
$ chmod +x ./gradlew
$ ./gradlew build --warning-mode all
C/C++
Download
- Download for ARM64 macOS (11.0+, Big Sur+)
- Download for ARM64 iOS
- Download for Apple iOS Simulator on ARM64
- Download for ARM64 Android
- Download for ARM64 Windows MSVC
- Download for ARM64 Linux (kernel 4.1, glibc 2.17+)
- Download for ARM64 Linux with MUSL
- Download for ARMv6 Linux (kernel 3.2, glibc 2.17)
- Download for ARMv5TE Linux (kernel 4.4, glibc 2.23)
- Download for ARMv7-A Android
- Download for ARMv7-A Linux (kernel 4.15, glibc 2.27)
- Download for ARMv7-A Linux, hardfloat (kernel 3.2, glibc 2.17)
- Download for 32-bit Linux w/o SSE (kernel 3.2, glibc 2.17)
- Download for 32-bit MSVC (Windows 7+)
- Download for 32-bit FreeBSD
- Download for 32-bit Linux (kernel 3.2+, glibc 2.17+)
- Download for PPC64LE Linux (kernel 3.10, glibc 2.17)
- Download for RISC-V Linux (kernel 4.20, glibc 2.29)
- Download for S390x Linux (kernel 3.2, glibc 2.17)
- Download for SPARC Solaris 11, illumos
- Download for Thumb2-mode ARMv7-A Linux with NEON (kernel 4.4, glibc 2.23)
- Download for 64-bit macOS (10.12+, Sierra+)
- Download for 64-bit iOS
- Download for 64-bit MSVC (Windows 7+)
- Download for 64-bit FreeBSD
- Download for 64-bit illumos
- Download for 64-bit Linux (kernel 3.2+, glibc 2.17+)
- Download for 64-bit Linux with MUSL
Building from source
Building from source requires Rust v1.64 or later.
$ git clone https://github.com/null8626/decancer.git --depth 1
$ cd decancer/bindings/native
$ cargo build --release
And the binary files should be generated in the target/release directory.
Examples
Rust
For more information, please read the documentation.
let mut cured = decancer::cure!("vＥⓡ𝔂 𝔽𝕌Ňℕｙ ţ乇𝕏𝓣").unwrap();
assert_eq!(cured, "very funny text");
assert!(cured.contains("funny"));
cured.censor("funny", '*');
assert_eq!(cured, "very ***** text");
cured.censor_multiple(["very", "text"], '-');
assert_eq!(cured, "---- ***** ----");
JavaScript (Node.js)
const assert = require('assert')
const decancer = require('decancer')
const cured = decancer('vＥⓡ𝔂 𝔽𝕌Ňℕｙ ţ乇𝕏𝓣')
assert(cured.equals('very funny text'))
assert(cured.contains('funny'))
console.log(cured.toString()) // very funny text
cured.censor('funny', '*')
console.log(cured.toString()) // very ***** text
cured.censorMultiple(['very', 'text'], '-')
console.log(cured.toString()) // ---- ***** ----
JavaScript (Browser)
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="utf-8" />
<title>Decancerer!!! (tm)</title>
<style>
textarea {
font-size: 30px;
}
#cure {
font-size: 20px;
padding: 5px 30px;
}
</style>
</head>
<body>
<h3>Input cancerous text here:</h3>
<textarea rows="10" cols="30"></textarea>
<br />
<button id="cure" onclick="cure()">cure!</button>
<script type="module">
import init from 'https://cdn.jsdelivr.net/gh/null8626/decancer@v3.1.2/bindings/wasm/bin/decancer.min.js'
const decancer = await init()
window.cure = function () {
const textarea = document.querySelector('textarea')
if (!textarea.value.length) {
return alert("There's no text!!!")
}
textarea.value = decancer(textarea.value).toString()
}
</script>
</body>
</html>
Java
import com.github.null8626.decancer.CuredString;
public class Program {
public static void main(String[] args) {
CuredString cured = new CuredString("vＥⓡ𝔂 𝔽𝕌Ňℕｙ ţ乇𝕏𝓣");
assert cured.equals("very funny text");
assert cured.contains("funny");
System.out.println(cured.toString()); // very funny text
cured.censor("funny", '*');
System.out.println(cured.toString()); // very ***** text
String[] keywords = { "very", "text" };
cured.censorMultiple(keywords, '-');
System.out.println(cured.toString()); // ---- ***** ----
cured.destroy();
}
}
C/C++
UTF-8 example:
#include <decancer.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
// global variable for assertion purposes only
decancer_cured_t cured;
static void assert(const bool expr, const char *message)
{
if (!expr)
{
fprintf(stderr, "assertion failed (%s)\n", message);
decancer_cured_free(cured);
exit(1);
}
}
static void print_error(decancer_error_t *error)
{
char message[90];
// rust strings are NOT null-terminated, so copy and terminate manually
memcpy(message, error->message, error->message_size);
message[error->message_size] = '\0';
fprintf(stderr, "error: %s\n", message);
}
int main(void) {
decancer_error_t error;
// UTF-8 bytes for "vＥⓡ𝔂 𝔽𝕌Ňℕｙ ţ乇𝕏𝓣"
uint8_t string[] = {0x76, 0xef, 0xbc, 0xa5, 0xe2, 0x93, 0xa1, 0xf0, 0x9d, 0x94, 0x82, 0x20, 0xf0, 0x9d,
0x94, 0xbd, 0xf0, 0x9d, 0x95, 0x8c, 0xc5, 0x87, 0xe2, 0x84, 0x95, 0xef, 0xbd, 0x99,
0x20, 0xc5, 0xa3, 0xe4, 0xb9, 0x87, 0xf0, 0x9d, 0x95, 0x8f, 0xf0, 0x9d, 0x93, 0xa3};
cured = decancer_cure(string, sizeof(string), DECANCER_OPTION_DEFAULT, &error);
if (cured == NULL)
{
print_error(&error);
return 1;
}
assert(decancer_equals(cured, (uint8_t *)("very funny text"), 15), "equals");
assert(decancer_contains(cured, (uint8_t *)("funny"), 5), "contains");
// coerce output as a raw UTF-8 pointer and retrieve its size (in bytes)
size_t output_size;
const uint8_t *output_raw = decancer_cured_raw(cured, &output_size);
assert(output_size == 15, "raw output size");
// UTF-8 bytes for "very funny text"
const uint8_t expected_raw[] = {0x76, 0x65, 0x72, 0x79, 0x20, 0x66, 0x75, 0x6e,
0x6e, 0x79, 0x20, 0x74, 0x65, 0x78, 0x74};
char assert_message[38];
for (uint32_t i = 0; i < sizeof(expected_raw); i++)
{
sprintf(assert_message, "mismatched utf-8 contents at index %u", i);
assert(output_raw[i] == expected_raw[i], assert_message);
}
decancer_cured_free(cured);
return 0;
}
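If you want to double-check the hard-coded UTF-8 byte array in the example above, the following sketch regenerates it in Node.js or a browser. It only uses the standard `TextEncoder` API; the variable names are arbitrary, and the input string is written with escapes so the snippet stays ASCII-only.

```javascript
// The obfuscated input from the examples, written as Unicode escapes.
const input =
  'v\uFF25\u24E1\u{1D502} \u{1D53D}\u{1D54C}\u0147\u2115\uFF59 \u0163\u4E47\u{1D54F}\u{1D4E3}'

// TextEncoder always encodes to UTF-8.
const bytes = new TextEncoder().encode(input)

// Format each byte as a C-style hex literal, ready to paste into an array.
const toHex = (b) => '0x' + b.toString(16).padStart(2, '0')
console.log(Array.from(bytes, toHex).join(', '))

// Total number of bytes, i.e. what sizeof(string) yields in the C example.
console.log(bytes.length) // 42
```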
UTF-16 example:
#include <decancer.h>
#include <string.h>
#include <stdlib.h>
#include <stdio.h>
// global variable for assertion purposes only
decancer_cured_t cured;
decancer_cured_raw_wide_t wide = NULL;
static void assert(const bool expr, const char *message)
{
if (!expr)
{
fprintf(stderr, "assertion failed (%s)\n", message);
if (wide != NULL)
{
decancer_cured_raw_wide_free(wide);
}
decancer_cured_free(cured);
exit(1);
}
}
static void print_error(decancer_error_t *error)
{
char message[90];
// rust strings are NOT null-terminated, so copy and terminate manually
memcpy(message, error->message, error->message_size);
message[error->message_size] = '\0';
fprintf(stderr, "error: %s\n", message);
}
int main(void) {
decancer_error_t error;
// UTF-16 code units for "vＥⓡ𝔂 𝔽𝕌Ňℕｙ ţ乇𝕏𝓣"
uint16_t string[] = {
0x0076, 0xff25, 0x24e1,
0xd835, 0xdd02, 0x0020,
0xd835, 0xdd3d, 0xd835,
0xdd4c, 0x0147, 0x2115,
0xff59, 0x0020, 0x0163,
0x4e47, 0xd835, 0xdd4f,
0xd835, 0xdce3
};
cured = decancer_cure_wide(string, sizeof(string), DECANCER_OPTION_DEFAULT, &error);
if (cured == NULL)
{
print_error(&error);
return 1;
}
assert(decancer_equals(cured, (uint8_t *)("very funny text"), 15), "equals");
assert(decancer_contains(cured, (uint8_t *)("funny"), 5), "contains");
// coerce output as a raw UTF-16 pointer and retrieve its size (in bytes)
uint16_t *output_ptr;
size_t utf16_output_size;
wide = decancer_cured_raw_wide(cured, &output_ptr, &utf16_output_size);
assert(utf16_output_size == (15 * sizeof(uint16_t)), "raw output size");
// UTF-16 code units for "very funny text"
const uint16_t expected_raw[] = {0x76, 0x65, 0x72, 0x79, 0x20, 0x66, 0x75, 0x6e,
0x6e, 0x79, 0x20, 0x74, 0x65, 0x78, 0x74};
char assert_message[39];
for (uint32_t i = 0; i < sizeof(expected_raw) / sizeof(uint16_t); i++)
{
sprintf(assert_message, "mismatched utf-16 contents at index %u", i);
assert(output_ptr[i] == expected_raw[i], assert_message);
}
decancer_cured_raw_wide_free(wide);
decancer_cured_free(cured);
return 0;
}
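The UTF-16 input array above contains pairs such as 0xd835, 0xdd02 because code points above U+FFFF are stored as two surrogate code units. As a rough sketch of where those values come from (the `toSurrogates` helper and the variable names are made up for this illustration and are not part of decancer):

```javascript
// Split a code point above U+FFFF into its UTF-16 surrogate pair.
function toSurrogates(codepoint) {
  const offset = codepoint - 0x10000
  const high = 0xd800 + (offset >> 10) // top 10 bits
  const low = 0xdc00 + (offset & 0x3ff) // bottom 10 bits
  return [high, low]
}

const hex = (n) => '0x' + n.toString(16).padStart(4, '0')
console.log(toSurrogates(0x1d502).map(hex).join(', ')) // 0xd835, 0xdd02

// JavaScript strings are UTF-16 internally, so charCodeAt() exposes the
// same code units that the C array spells out by hand.
const wideInput =
  'v\uFF25\u24E1\u{1D502} \u{1D53D}\u{1D54C}\u0147\u2115\uFF59 \u0163\u4E47\u{1D54F}\u{1D4E3}'
const units = Array.from({ length: wideInput.length }, (_, i) => wideInput.charCodeAt(i))
console.log(units.length) // 20 code units, i.e. 40 bytes
```

This is also why the C example compares the output size against `15 * sizeof(uint16_t)`: sizes are counted in bytes, and every character of "very funny text" fits in one 16-bit code unit.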
Compatibility
Decancer is supported on the following platforms:
Platform name | C/C++/Rust | Java | JavaScript |
---|---|---|---|
ARM64 macOS (11.0+, Big Sur+) | ✓ | ✓ | ✓ |
ARM64 iOS | ✓ | | |
Apple iOS Simulator on ARM64 | ✓ | | |
ARM64 Android | ✓ | ✓ | |
ARM64 Windows MSVC | ✓ | ✓ | ✓ |
ARM64 Linux (kernel 4.1, glibc 2.17+) | ✓ | ✓ | ✓ |
ARM64 Linux with MUSL | ✓ | ✓ | ✓ |
ARMv6 Linux (kernel 3.2, glibc 2.17) | ✓ | ✓ | |
ARMv5TE Linux (kernel 4.4, glibc 2.23) | ✓ | ✓ | |
ARMv7-A Android | ✓ | ✓ | |
ARMv7-A Linux (kernel 4.15, glibc 2.27) | ✓ | ✓ | |
ARMv7-A Linux, hardfloat (kernel 3.2, glibc 2.17) | ✓ | ✓ | ✓ |
32-bit Linux w/o SSE (kernel 3.2, glibc 2.17) | ✓ | | |
32-bit MSVC (Windows 7+) | ✓ | ✓ | ✓ |
32-bit FreeBSD | ✓ | ✓ | |
32-bit Linux (kernel 3.2+, glibc 2.17+) | ✓ | ✓ | |
PPC64LE Linux (kernel 3.10, glibc 2.17) | ✓ | | |
RISC-V Linux (kernel 4.20, glibc 2.29) | ✓ | ✓ | |
S390x Linux (kernel 3.2, glibc 2.17) | ✓ | | |
SPARC Solaris 11, illumos | ✓ | | |
Thumb2-mode ARMv7-A Linux with NEON (kernel 4.4, glibc 2.23) | ✓ | | |
64-bit macOS (10.12+, Sierra+) | ✓ | ✓ | ✓ |
64-bit iOS | ✓ | | |
64-bit MSVC (Windows 7+) | ✓ | ✓ | ✓ |
64-bit FreeBSD | ✓ | ✓ | |
64-bit illumos | ✓ | | |
64-bit Linux (kernel 3.2+, glibc 2.17+) | ✓ | ✓ | ✓ |
64-bit Linux with MUSL | ✓ | ✓ | ✓ |
Contributing
New to contributing? Please read CONTRIBUTING.md before you get started!