customasm Feature request: Ability to have strings encoded one byte per address

I'm implementing a CPU architecture where each address holds a 16-bit word, so I've been using #bits 16. That works fine for instructions, but strings are encoded two bytes per address (so "text" becomes 0x7465 0x7874), which is very inconvenient to work with. It would be nice to have a way to encode strings with only one byte per address (so "text" would be 0x0074 0x0065 0x0078 0x0074), without needing to manually insert a space between each character or write a program that spreads the data out into single bytes.
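For illustration, the two layouts can be modeled in a few lines of Python. This is only a sketch of the behavior described above, not customasm code: the helper names are made up, and zero-padding an odd-length string to a full word is an assumption.

```python
def pack_words(data: bytes, bytes_per_word: int = 2) -> list:
    """Pack consecutive bytes big-endian into words (the current behavior with #bits 16).

    Assumption: an incomplete final word is padded with zero bytes.
    """
    padded = data + b"\x00" * (-len(data) % bytes_per_word)
    return [
        int.from_bytes(padded[i:i + bytes_per_word], "big")
        for i in range(0, len(padded), bytes_per_word)
    ]


def one_byte_per_word(data: bytes) -> list:
    """Place each byte in its own word (the requested behavior)."""
    return list(data)


packed = pack_words("text".encode("utf-8"))
spread = one_byte_per_word("text".encode("utf-8"))

print([f"0x{w:04x}" for w in packed])  # ['0x7465', '0x7874']
print([f"0x{w:04x}" for w in spread])  # ['0x0074', '0x0065', '0x0078', '0x0074']
```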

Jun 27 '21 05:06 soweli-Luna

Hmm, would that be equivalent to the big-endian UTF-16 encoding? Would you be working with characters outside of ASCII?

Jun 28 '21 15:06 hlorenzi

I mean, I guess that would work for what I'm doing, but I'd imagine it should still work with non-ASCII characters, still just encoding one byte per address

I suggested in a comment of another feature request that allowing users to create their own data formats would solve that issue, and it would solve this one too

Regardless, to be clear, this should be optional; it should still be possible to encode strings normally. But I think a way to encode them this way is vital, especially considering the possibility of a system with an odd word size, like say 10-bit. My case is rather mild since exactly 2 bytes are encoded per address, which makes it little more than fairly inconvenient to work with, but just imagine if bytes were being cut in half and split across addresses.

Jun 28 '21 17:06 soweli-Luna

What's with the sudden silence?

..forgive me for asking

Jul 15 '21 23:07 soweli-Luna

I agree this would be a useful feature, with or without UTF-16.

Jul 17 '21 07:07 pol-rivero

Sorry for the lack of activity! I've been thinking about the best syntax to express string conversion like this. I was thinking something along the lines of:

#d pad16("Hey") ; UTF-8 padded with zeroes into 16-bit units (00 48 00 65 00 79)

#d "á" ; standard UTF-8 encoding (c3 a1)
#d pad16("á") ; breaks up UTF-8 bytes in a non-standard way (00 c3 00 a1)
#d utf16be("á") ; standard UTF-16 big-endian encoding (00 e1)
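
To make the difference between these encodings concrete, here is a rough Python model. Python is used only for illustration; pad16 and utf16be below are stand-in functions named after the proposal, not an existing customasm implementation.

```python
def pad16(text: str) -> bytes:
    """Encode as UTF-8, then put each byte in the low half of a 16-bit unit."""
    out = bytearray()
    for byte in text.encode("utf-8"):
        out += bytes([0x00, byte])
    return bytes(out)


def utf16be(text: str) -> bytes:
    """Standard UTF-16 big-endian encoding, via Python's built-in codec."""
    return text.encode("utf-16-be")


print(pad16("Hey").hex(" "))        # 00 48 00 65 00 79
print(pad16("\u00e1").hex(" "))     # 00 c3 00 a1
print(utf16be("\u00e1").hex(" "))   # 00 e1
```

For pure ASCII text the two functions produce the same bytes; they only diverge once a character needs more than one UTF-8 byte.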

And then, there could be a global directive so you could set the default encoding for all strings in your source file, without resorting to the conversion function every time:

#strformat pad16
#d "Hey" ; naturally gives pad16 encoding (00 48 00 65 00 79)

#d pad16("Hey") ; but watch out for double-encoding mistakes like this (00 00 00 48 00 00 00 65 00 00 00 79)
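
The double-encoding case can be reproduced with a small Python model. This is a sketch under the assumption that pad16 operates on whatever bytes the string already has, so the zero bytes from the first pass get padded too; pad16_bytes is a hypothetical helper, not a customasm function.

```python
def pad16_bytes(data: bytes) -> bytes:
    """Put every input byte in the low half of its own 16-bit unit."""
    out = bytearray()
    for byte in data:
        out += bytes([0x00, byte])
    return bytes(out)


once = pad16_bytes("Hey".encode("utf-8"))   # what #strformat pad16 would already give
twice = pad16_bytes(once)                   # what an extra pad16(...) call would add

print(once.hex(" "))    # 00 48 00 65 00 79
print(twice.hex(" "))   # 00 00 00 48 00 00 00 65 00 00 00 79
```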

What do y'all think?

Jul 22 '21 13:07 hlorenzi

I think that this format looks great!

Jul 22 '21 19:07 pol-rivero

Yeah, that looks good.

You should be able to use other word lengths besides 16 though, as the padding problem matters most for odd word lengths, like say 10-bit.
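
As a sketch of what padding to an arbitrary word length could mean (Python for illustration only; pad_bits is a hypothetical name, and rejecting widths below 8 is an assumption), each UTF-8 byte is zero-extended to one full word:

```python
def pad_bits(text: str, width: int) -> str:
    """Return a bit string with one UTF-8 byte per `width`-bit word, zero-extended."""
    if width < 8:
        raise ValueError("word width must be at least 8 bits to hold one byte")
    return "".join(format(byte, f"0{width}b") for byte in text.encode("utf-8"))


bits = pad_bits("Hey", 10)
# Split back into 10-bit words for display.
words = [bits[i:i + 10] for i in range(0, len(bits), 10)]
print(words)  # ['0001001000', '0001100101', '0001111001']
```

With a width of 16 this gives the same layout as the pad16 proposal above.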

I want to again bring attention to my idea of having user-defined data encoding methods, so users could define their own data format, which would fix this issue, as well as many other conceivable issues to do with unsupported or non-standard data formats

Jul 23 '21 04:07 soweli-Luna

What would these user-defined data encoding methods look like? Can you come up with an example for the syntax?

Jul 25 '21 14:07 hlorenzi

I really have no idea, lol

The most powerful version of it I can think of would be almost as powerful as a complete programming language, where users would manipulate data in much the same way they might in a full-fledged programming language, doing arithmetic and bitwise operations, using conditions, and even stuff like associating values to an index in a table or something

It could probably be made simpler, though I don't know what the syntax for something like that might look like

Jul 26 '21 01:07 soweli-Luna

Depending on how it's done, it might even make adding natively supported formats easier too, where they could be written in this more abstract form and included almost like libraries

Jul 26 '21 01:07 soweli-Luna

perhaps it would look like a function, something like:

#formatdef customformat(string, int)     ;random example format
{
     assert(isString(string))     ;example syntax, idk exactly how you would do this, but I think it's
     assert(isInt(int))           ;(cont.) important to be able to make sure the inputs are of the correct type
     
     a = (int & 0xff) + 4   ;random example arithmetic
     
     for [every character in string] 
     {
          ;maybe you could assemble data by pushing it onto a stack or something like:
          
          push(stringChar(string, i)`8)     ;again, stringChar() is just for the sake of example, I don't know how 
          for int { push(0`0) }             ;(cont.) you should actually do string manipulation like this
          
          ;or have an indexable table sort of like:

          stack[i] = stringChar(string, i)`8
          index = i
          for int { stack[index + i] = 0`0 }
     }
     
     return(stack)
}

Jul 26 '21 03:07 soweli-Luna

Is this issue still being worked on?

Aug 22 '21 19:08 soweli-Luna

Still being worked on, but I have not had much time to work on it recently, sorry! I can say we're probably only going for a simple solution like the one I've formulated here.

Aug 22 '21 22:08 hlorenzi

okay, cool

Aug 23 '21 10:08 soweli-Luna