Lingo
Text encoding for modern C++
Lingo is an encoding aware string library for C++11 and up. It aims to be a drop in replacement for the standard library strings by defining new string classes that mirror the standard library as much as possible, while also extending them with new functionality made possible by its encoding and code page aware design.
Features
- Encoding and code page aware `lingo::string` and `lingo::string_view` classes, almost fully compatible with `std::string` and `std::string_view`.
- Conversion constructors between `lingo::string`s of different encodings and code pages.
- `lingo::encoding::*` for low level encoding and decoding of code points.
- `lingo::page::*` for additional code point information and conversion between different code pages.
- `lingo::error::*` for different error handling behaviours.
- `lingo::encoding::point_iterator` and `lingo::page::point_mapper` helpers to manually iterate or convert points individually.
- `lingo::string_converter` to manually convert entire strings.
- Null terminator aware `lingo::string_view`.
- `lingo::make_null_terminated` helper function for APIs that only support C strings, which ensures that a string is null terminated with minimal copying.
How it works
The string class in the C++ standard library is declared like this:
    namespace std
    {
        template <class CharT, class Traits, class Allocator>
        class basic_string;
    }
CharT is the code point type, and Traits contains all operations to work with the code units. This setup works fine for simple ASCII strings, but runs into problems when working with more complicated encodings.
- It assumes that every `CharT` is a code point, while in reality most strings use some kind of multibyte encoding. Encodings such as UTF-8 and UTF-16 can be difficult to work with.
- It has no information about the code page used. `char` could be ASCII, UTF-8, ISO 8859-1, or anything really. And while the standard is adding `char8_t`, `char16_t` and `char32_t` for Unicode, it really only knows that the data is a form of Unicode, but has no idea how to actually encode, decode or transform it.
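The first problem is easy to reproduce with a plain `std::string`, whose `size()` reports code units rather than code points. The sketch below uses only the standard library. `count_utf8_code_points` and `e_acute` are hypothetical helpers written for this example, not part of Lingo, and the counter assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper, not part of Lingo: counts the code points in a string
// that is assumed to hold valid UTF-8.
inline std::size_t count_utf8_code_points(const std::string& s)
{
    std::size_t count = 0;
    for (unsigned char unit : s)
    {
        // Continuation bytes look like 10xxxxxx; any other byte starts a new code point.
        if ((unit & 0xC0) != 0x80)
            ++count;
    }
    return count;
}

// U+00E9 (e with acute accent) spelled out as its two UTF-8 code units.
inline std::string e_acute()
{
    return std::string("\xC3\xA9");
}
```

Here `e_acute().size()` is 2, while `count_utf8_code_points(e_acute())` is 1. The standard string has no way to know that the two values should differ.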
To solve this problem, Lingo defines a new string type:
    namespace lingo
    {
        template <typename Encoding, typename Page, typename Allocator>
        class basic_string;
    }
Lingo splits the responsibility of managing the code points of a string between an Encoding type and a Page type.
The Encoding type defines how a code point can be encoded to and decoded from one or more code units. The Page type defines what every decoded code point actually means, and knows how to convert it to other Pages.
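To make the split more concrete, here is a minimal sketch of the idea. Every name in it (`toy_none_encoding`, `toy_latin1_page`, `decode_to_unicode`) is invented for this example and does not reflect Lingo's actual interfaces. It relies on the fact that ISO/IEC 8859-1 code points map directly onto the first 256 Unicode code points.

```cpp
#include <cstddef>
#include <vector>

// Toy encoding (hypothetical): every code unit is exactly one code point.
struct toy_none_encoding
{
    using unit_type = unsigned char;
    using point_type = unsigned char;

    static point_type decode(unit_type unit) { return unit; }
};

// Toy page (hypothetical): ISO/IEC 8859-1 points map one to one onto U+0000..U+00FF.
struct toy_latin1_page
{
    static char32_t to_unicode(unsigned char point)
    {
        return static_cast<char32_t>(point);
    }
};

// The encoding turns code units into code points, the page gives those points a meaning.
template <typename Encoding, typename Page>
std::vector<char32_t> decode_to_unicode(const typename Encoding::unit_type* units, std::size_t size)
{
    std::vector<char32_t> result;
    result.reserve(size);
    for (std::size_t i = 0; i < size; ++i)
        result.push_back(Page::to_unicode(Encoding::decode(units[i])));
    return result;
}
```

Swapping in a different encoding type changes how units are grouped into points, while swapping in a different page type changes what those points mean. Neither change affects the other.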
Here are some examples of what that actually looks like:
    using ascii_string = lingo::basic_string<
        lingo::encoding::none<char, char>,
        lingo::page::ascii>;
    using utf8_string = lingo::basic_string<
        lingo::encoding::utf8<char8_t, char32_t>,
        lingo::page::unicode>;
    using utf16_string = lingo::basic_string<
        lingo::encoding::utf16<char16_t, char32_t>,
        lingo::page::unicode>;
    using utf32_string = lingo::basic_string<
        lingo::encoding::utf32<char32_t, char32_t>,
        lingo::page::unicode>;
    using iso_8859_1_string = lingo::basic_string<
        lingo::encoding::none<unsigned char, unsigned char>,
        lingo::page::iso_8859_1>;
You may wonder why there is a lingo::encoding::utf32 encoding, since there is no difference between UTF-32 and decoded Unicode. It is indeed possible to use lingo::encoding::none instead, and still have a fully functional UTF-32 string. However, lingo::encoding::utf32 does add some extra validation, such as detecting surrogate code units, making it better at dealing with invalid inputs.
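As an illustration of the kind of validation meant here, the sketch below checks whether a single UTF-32 code unit is a valid Unicode scalar value. `is_valid_utf32_unit` is a hypothetical name and this is not Lingo's implementation; it only encodes the two Unicode rules that a scalar value is at most U+10FFFF and is not in the surrogate range U+D800 to U+DFFF.

```cpp
// Hypothetical check, not Lingo's implementation: a UTF-32 code unit is valid
// if it is at most U+10FFFF and is not a surrogate (U+D800..U+DFFF).
constexpr bool is_valid_utf32_unit(char32_t unit)
{
    return unit <= 0x10FFFF && !(unit >= 0xD800 && unit <= 0xDFFF);
}
```

A string built on a pass-through encoding would accept a value such as `0xD800` unchanged, while a validating encoding can reject it.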
Currently implemented
Encodings
- `lingo::encoding::none`
- `lingo::encoding::utf8`
- `lingo::encoding::utf16`
- `lingo::encoding::utf32`
- `lingo::encoding::base64`
Meta encodings
- `lingo::encoding::swap_endian`: Swaps the endianness of the code units.
- `lingo::encoding::join`: Chains multiple encodings together (e.g. `join<swap_endian, utf16>` to create `utf16_be`).
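The effect of chaining a byte swap after UTF-16 can be sketched with a single code unit. `swap_endian_16` is a hypothetical helper for this example, not Lingo's implementation. U+20AC encodes to the UTF-16 code unit `0x20AC`, which a little endian machine stores as the bytes `AC 20`. After swapping, the value `0xAC20` is stored as `20 AC`, which is the big endian byte order.

```cpp
// Hypothetical helper, not Lingo's implementation: swaps the two bytes of a
// 16 bit code unit.
constexpr char16_t swap_endian_16(char16_t unit)
{
    return static_cast<char16_t>(((unit & 0x00FF) << 8) | ((unit & 0xFF00) >> 8));
}
```

Applying the swap twice returns the original code unit, so the same step serves for both encoding and decoding.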
Code pages
- `lingo::page::ascii`
- `lingo::page::unicode`
- `lingo::page::iso_8859_n` with n from 1 to 16, except 12.
Error handlers
- `lingo::error::strict`: Throws an exception on error.
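A strict policy of this kind can be pictured as a type that the decoding code calls whenever it hits invalid input. The sketch below is only an illustration under that assumption: `toy_strict`, `on_error` and `decode_ascii` are hypothetical names, and the exception type Lingo actually throws is not specified here.

```cpp
#include <stdexcept>

// Hypothetical handler in the spirit of a strict policy: every error becomes an exception.
struct toy_strict
{
    static char32_t on_error(const char* message)
    {
        throw std::runtime_error(message);
    }
};

// Hypothetical decoder for one ASCII code unit. The handler type decides what
// happens when the input is invalid.
template <typename ErrorHandler>
char32_t decode_ascii(unsigned char unit)
{
    if (unit > 0x7F)
        return ErrorHandler::on_error("code unit is outside the ASCII range");
    return static_cast<char32_t>(unit);
}
```

Passing the handler as a template parameter keeps the decoding logic independent of the error behaviour, so a more lenient handler could be substituted without touching the decoder.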
Algorithms
Will be added in a future version.
How to build
Lingo is a header only library, but some of the header files do have to be generated first. You can check the latest releases for a package that has all headers generated for you.
If you want to build the library yourself, you will have to build the CMake project. All you need is CMake 3.12 or higher, Python 3 (for the code generation) and a C++11 compatible compiler. The tests are written using Catch and can be run with ctest.
How to include in your project
Since Lingo is a header only library, all you need to do is copy the header files and add their location as an include directory.
There is one thing that you do need to look out for, which is the execution character set. This library assumes by default that char is UTF-8, and that wchar_t is UTF-16 or UTF-32, depending on the size of wchar_t.
This matches the default settings of GCC and Clang, but not of Visual Studio. If your compiler's execution character set does not match these defaults, you have two options:
Configure your compiler
Configure the library
The following macros can be defined to override the default encodings for char and wchar_t:
- `LINGO_CHAR_ENCODING`
- `LINGO_WCHAR_ENCODING`
- `LINGO_CHAR_PAGE`
- `LINGO_WCHAR_PAGE`
So for example, if you want to use ISO/IEC 8859-1 for chars, you will have to define the following macros:
- `-DLINGO_CHAR_ENCODING=none`
- `-DLINGO_CHAR_PAGE=iso_8859_1`
This method is not recommended. Compiler flags are a much more reliable way to set the correct execution encoding.
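Whichever option you choose, it can help to verify what the compiler actually does with narrow string literals. The sketch below is a hypothetical check, not part of Lingo. It relies on the fact that U+00E9 is encoded in UTF-8 as the two code units `0xC3 0xA9`, so a UTF-8 execution character set produces a three byte literal including the null terminator.

```cpp
// Hypothetical compile time check, not part of Lingo: returns true if narrow
// string literals appear to be encoded as UTF-8.
constexpr bool narrow_literals_are_utf8()
{
    return sizeof("\u00e9") == 3
        && static_cast<unsigned char>("\u00e9"[0]) == 0xC3
        && static_cast<unsigned char>("\u00e9"[1]) == 0xA9;
}
```

Wrapping the call in a `static_assert` turns a mismatched execution character set into a build error instead of silently corrupted text.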
Other documentation
- Glossary
- Interfaces
- TODO (A very poorly written list of features to come)