Byte Strings

Open lfkeitel opened this issue 6 years ago • 0 comments

Strings are implemented as a slice of Unicode runes, meaning arbitrary byte sequences are not allowed, or at the very least not guaranteed. There needs to be a way to manipulate arbitrary byte data.

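Other languages handle this a little differently. Some, like Python 3 and Rust, have separate Unicode string and byte string types. Others, like PHP, have a single string type that is just bytes; JavaScript also has a single string type, though its strings are sequences of UTF-16 code units rather than bytes.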

The runtime could be modified to store a String as a byte slice instead of a rune slice. This would require conversions for indexing and for the string manipulation functions. However, it would allow a single type to serve both purposes.
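To make the cost of that conversion concrete, here is a minimal sketch of rune indexing on top of a byte-backed string. The `ByteBackedString` type and its method names are hypothetical and are not taken from the runtime; only the `unicode/utf8` calls are real.

```go
package main

import "unicode/utf8"

// ByteBackedString is a hypothetical string object that stores raw bytes
// instead of a rune slice.
type ByteBackedString struct {
	data []byte
}

// RuneAt returns the i-th rune (not the i-th byte). It decodes from the
// start of the data, so a lookup is O(n) instead of the O(1) access a
// rune slice provides. Invalid bytes decode as utf8.RuneError.
func (s ByteBackedString) RuneAt(i int) (rune, bool) {
	if i < 0 {
		return 0, false
	}
	b := s.data
	for len(b) > 0 {
		r, size := utf8.DecodeRune(b)
		if i == 0 {
			return r, true
		}
		b = b[size:]
		i--
	}
	return 0, false
}

// Len returns the length in runes, which also requires a full scan.
func (s ByteBackedString) Len() int {
	return utf8.RuneCount(s.data)
}
```

Every string function that works on character positions would need the same kind of decoding step, which is the main trade-off of the single-type approach.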

However, there's value in distinguishing between the two string types, as they serve different purposes. A normal string is guaranteed to be a valid UTF-8 string, while a byte string would be nothing more than bytes that may or may not mean anything. Keeping them separate would also ensure there's no accidental use of a byte string in place of a normal string. There would be conversion functions between the two if needed.

Syntax Notes

As for syntax, maybe borrow Python's convention of using the prefix b to denote that the following literal is a byte string. That should be easy to parse. Byte strings would not be allowed where valid strings are needed in the existing syntax, for example import statements, the isDefined function, etc.

Support for hex escape sequences inside quotes will also need to be added.

b"\xDE\xAD\xBE\xEF"
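One possible way for the lexer to decode the body of such a literal is sketched below. The function name `DecodeByteLiteral` is hypothetical, and the set of escapes it accepts (`\xNN`, `\\`, and `\"`) is an assumption, since the issue only mentions hex literals.

```go
package main

import (
	"fmt"
	"strconv"
)

// DecodeByteLiteral converts the text between the quotes of a byte string
// literal into raw bytes. Only \xNN, \\ and \" are handled in this sketch;
// every other byte is copied through unchanged.
func DecodeByteLiteral(body string) ([]byte, error) {
	out := make([]byte, 0, len(body))
	for i := 0; i < len(body); i++ {
		c := body[i]
		if c != '\\' {
			out = append(out, c)
			continue
		}
		if i+1 >= len(body) {
			return nil, fmt.Errorf("dangling backslash at offset %d", i)
		}
		switch body[i+1] {
		case 'x':
			// Exactly two hex digits must follow the x.
			if i+3 >= len(body) {
				return nil, fmt.Errorf("truncated \\x escape at offset %d", i)
			}
			v, err := strconv.ParseUint(body[i+2:i+4], 16, 8)
			if err != nil {
				return nil, fmt.Errorf("invalid \\x escape at offset %d", i)
			}
			out = append(out, byte(v))
			i += 3 // skip 'x' and both digits; the loop skips the backslash
		case '\\', '"':
			out = append(out, body[i+1])
			i++
		default:
			return nil, fmt.Errorf("unknown escape \\%c at offset %d", body[i+1], i)
		}
	}
	return out, nil
}
```

Called with the body of the literal above, this should yield the four bytes 0xDE, 0xAD, 0xBE, 0xEF.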

Implementation Notes

New token to distinguish a byte string from a regular string. New AST node using a byte slice instead of a rune slice. New Object type with the same change. A toBytes function to convert a UTF-8 string to a byte string. toString would be modified to allow the reverse. Byte strings can be concatenated together as well as indexed.
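The concatenation and indexing operations on the new object type might look roughly like this. `ByteString`, `Concat`, and `Index` are hypothetical names, and returning an error for an out-of-range index is an assumption about how the interpreter would report it.

```go
package main

import "fmt"

// ByteString is a hypothetical runtime object backed by a byte slice.
type ByteString struct {
	Value []byte
}

// Concat returns a new byte string; neither operand is modified.
func (b ByteString) Concat(other ByteString) ByteString {
	out := make([]byte, 0, len(b.Value)+len(other.Value))
	out = append(out, b.Value...)
	out = append(out, other.Value...)
	return ByteString{Value: out}
}

// Index returns the byte at position i, a value from 0 to 255.
func (b ByteString) Index(i int) (byte, error) {
	if i < 0 || i >= len(b.Value) {
		return 0, fmt.Errorf("index %d out of range for byte string of length %d", i, len(b.Value))
	}
	return b.Value[i], nil
}
```

Unlike the rune-based String, indexing here is a direct slice access, so no decoding is involved.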

I'm not sure about any utility functions like the string ones. Byte strings have a particular usage where replace, find, etc. would not be all that useful. Perhaps start without them and add them later if needed. Maybe allow toBytes to take an array of numbers and convert them to a byte string. That can make generation a bit easier for the programmer.
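The array form of toBytes could be backed by something like the sketch below. The name `BytesFromInts` and the choice of `int64` for the interpreter's number type are assumptions. Rejecting out-of-range values instead of truncating them is also a design choice made here, not something the issue specifies.

```go
package main

import "fmt"

// BytesFromInts converts a list of numbers into bytes. Values outside
// 0-255 are rejected so a typo does not silently wrap around.
func BytesFromInts(nums []int64) ([]byte, error) {
	out := make([]byte, len(nums))
	for i, n := range nums {
		if n < 0 || n > 255 {
			return nil, fmt.Errorf("element %d (%d) does not fit in a byte", i, n)
		}
		out[i] = byte(n)
	}
	return out, nil
}
```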

lfkeitel avatar Feb 16 '19 03:02 lfkeitel