UTF-8 support

Open programandala-net opened this issue 6 years ago • 6 comments

Why is only ASCII supported? It's a surprising limitation. At first I thought it was a mistake in the manual: I thought it meant only the identifiers. But indeed, UTF-8 or even Latin-1 strings are not accepted in BASIC sources (all non-ASCII characters are removed). And only ASCII characters are accepted by the command-line interpreter.

Is nuBASIC ASCII-only by design? Or is UTF-8 going to be supported (at least just to print strings, not to manipulate them) in a future version?

programandala-net avatar Nov 10 '19 13:11 programandala-net

Hi Marcos. You are right. This is a limitation I can remove; it is on my to-do list. Originally, when I started nuBASIC, the purpose was to provide an example for my programming courses. I had in mind a kind of 80s interpreter, so supporting just ASCII was enough for that purpose. I will need to improve the tokenizer, which is responsible for reading the input and transforming it into tokens. Such limitation was simplifying the implementation, but maybe now I can improve it. Thank you for your suggestion. Kind regards, Antonino

eantcal avatar Nov 10 '19 15:11 eantcal

I would prefer Latin-1 ( = ISO 8859-1 ), or a switch between ISO-8859-1 and UTF-8. The reason is: My editors assume that .bas-Files are encoded in ISO 8859-1.

martindecker avatar Nov 14 '19 11:11 martindecker

I would prefer Latin-1 ( = ISO 8859-1 ), or a switch between ISO-8859-1 and UTF-8.

I'm not sure what you mean by "switch".

The reason is: My editors assume that .bas-Files are encoded in ISO 8859-1.

That is a problem of your editors ;)

Unicode is the way to go, and UTF-8 is its most practical encoding at the moment. Of course, it raises the question of the BASIC string functions, but they could keep working on bytes as usual. The important thing is to accept and print UTF-8 strings.

But anyway ISO 8859-1 is better than nothing: it would make nuBASIC useful to write programs in a few European languages other than English.
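
A minimal sketch of what "working with bytes as usual" would mean in practice, written in Python rather than nuBASIC. The helpers `byte_len` and `byte_left` are hypothetical stand-ins for BASIC-style LEN and LEFT$ functions, not nuBASIC's actual implementation:

```python
# Illustration only: byte-oriented string functions applied to UTF-8 data.
text = "a\u00f1o"                  # "año": 3 characters
utf8 = text.encode("utf-8")        # b'a\xc3\xb1o' -> 4 bytes
latin1 = text.encode("latin-1")    # b'a\xf1o'     -> 3 bytes


def byte_len(s: bytes) -> int:
    """Hypothetical LEN: counts bytes, not characters."""
    return len(s)


def byte_left(s: bytes, n: int) -> bytes:
    """Hypothetical LEFT$: takes the first n bytes, not n characters."""
    return s[:n]


print(byte_len(utf8), byte_len(latin1))   # 4 3
print(byte_left(utf8, 2))                 # b'a\xc3' (cuts the 2-byte sequence for ñ in half)
print(utf8.decode("utf-8"))               # printing the untouched bytes still works
```

Under this assumption, accepting and printing UTF-8 strings works unchanged, while length and slicing results are in bytes and a slice can split a multi-byte character.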

programandala-net avatar Nov 14 '19 19:11 programandala-net

Such limitation was simplifying the implementation, but maybe now I can improve it.

Thanks. I understand ASCII was enough for your initial scope, but it makes the language pretty useless for more general use.

programandala-net avatar Nov 14 '19 19:11 programandala-net

Hello, here is my answer regarding .bas source files.

I'm not sure what you mean by "switch".

In Python, the source code encoding is specified on line 2 in the following way:

    #!/usr/bin/python
    # -*- coding: iso-8859-1 -*-

or

    #!/usr/bin/env python
    # -*- coding: utf-8 -*-

It could also be a new BASIC command; why put important information into comments?
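
As a rough sketch of how such a declaration can be read before the file is decoded, the following Python applies the pattern from PEP 263 to the first two lines of the raw bytes. The function name `declared_encoding` and the ASCII default are assumptions for illustration; a nuBASIC equivalent would need its own comment marker or command:

```python
import re

# Pattern taken from PEP 263; it is applied to raw bytes because the
# encoding is not yet known when the first lines are inspected.
CODING_RE = re.compile(rb"^[ \t\f]*#.*?coding[:=][ \t]*([-_.a-zA-Z0-9]+)")


def declared_encoding(raw: bytes, default: str = "ascii") -> str:
    """Return the encoding named on line 1 or 2, or `default` if none is found."""
    for line in raw.splitlines()[:2]:
        match = CODING_RE.match(line)
        if match:
            return match.group(1).decode("ascii")
    return default


source = b"#!/usr/bin/python\n# -*- coding: iso-8859-1 -*-\nprint('hello')\n"
print(declared_encoding(source))   # iso-8859-1
```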

Another possibility for a switch is the byte order mark (BOM): present = UTF-8, absent = Latin-1 (or see my following post). But does the BOM conflict with the shebang mechanism in Linux? In a VB.NET source file, I found the byte order mark (EF BB BF) at the beginning, but VB.NET has no shebang.
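
The BOM-based switch described above could look roughly like this in Python. The function names are hypothetical, and the rule "no BOM means Latin-1" is the proposal from this comment, not something nuBASIC implements:

```python
import codecs


def detect_source_encoding(raw: bytes) -> str:
    """Pick a codec for a source file: UTF-8 if a BOM is present, else Latin-1."""
    if raw.startswith(codecs.BOM_UTF8):   # the bytes EF BB BF
        return "utf-8-sig"                # this codec strips the BOM while decoding
    return "latin-1"


def decode_source(raw: bytes) -> str:
    """Decode raw source bytes using the BOM-based switch."""
    return raw.decode(detect_source_encoding(raw))


def starts_with_shebang(raw: bytes) -> bool:
    """True if the file begins with the two bytes '#!'."""
    return raw.startswith(b"#!")
```

Two consequences of this scheme: a UTF-8 file saved without a BOM would be misread as Latin-1, and a file that begins with a BOM no longer begins with the bytes `#!`, so `starts_with_shebang` returns False for it. That suggests the shebang concern is justified.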

Strings

In bigger projects, the language-specific string constants live in "resource" or external files. Thanks, Antonino, for writing the software. One thing at a time: perhaps first do something about the source code question. There is a "Unicode for C++23" proposal: https://www.youtube.com/watch?v=3utLG0Qm1Ek

Currently, strings are encoding-agnostic ("what comes in, goes out"); that is my experience with the nuBASIC command Input#. See https://stackoverflow.com/questions/30277095/whats-the-definition-of-encoding-agnostic for the term. This was useful for me.

Currently, nuBASIC strings are 8-bit sequences! Only the source code is treated as 7-bit.

martindecker avatar Nov 14 '19 23:11 martindecker

I assume that in and around Russia there is a huge number of cp1251-encoded files, and so on. On a more abstract level, only two kinds of encodings are relevant in today's world:

  1. Some 256-character encoding ("extended ASCII", 8 bits per character), determined by environment variables or locale settings for editors and the terminal.
  2. UTF-8.

So a switch could also be between those two possibilities. 7-bit ASCII is the common subset of both.
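
A small Python check of the "common subset" point: pure 7-bit ASCII bytes decode to the same text under UTF-8, Latin-1, and cp1251, while a byte above 0x7F means something different in each. The helper `decode_all` is a hypothetical name used only for this illustration:

```python
def decode_all(raw: bytes, encodings=("utf-8", "latin-1", "cp1251")) -> dict:
    """Decode `raw` under each encoding; None marks an invalid byte sequence."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except UnicodeDecodeError:
            results[enc] = None
    return results


# 7-bit ASCII: every encoding agrees.
print(decode_all(b'PRINT "HELLO"'))

# One byte above 0x7F: "ñ" in Latin-1, Cyrillic "с" in cp1251, invalid as UTF-8.
print(decode_all(b"\xf1"))
```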

martindecker avatar Nov 15 '19 11:11 martindecker