Reconsider handling of bytes vs characters in length() and similar functions
Per @arnoldrobbins' comment at https://github.com/rmyorston/busybox-w32/issues/174#issuecomment-599194834, length() and similar functions should work with characters, not bytes. The POSIX spec says:
The index, length, match, and substr functions should not be confused with similar functions in the ISO C standard; the awk versions deal with characters, while the ISO C standard deals with bytes.
Given that, I wonder why (onetrueawk) awk's length() function counts bytes instead of characters, at least by default. Here's the output on my macOS Catalina machine:
$ echo 絵 | nawk '{print length}' # onetrueawk/awk 20200228
3
$ echo 絵 | gawk '{print length}' # gawk 5.0.1
1
$ echo 絵 | mawk '{print length}' # mawk 1.3.4 2020012
3
$ echo 絵 | goawk '{print length}' # goawk 1.6.0
3
# for reference:
$ locale
LANG="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_CTYPE="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_ALL=
So I feel good that GoAWK has the majority on its side :-), but as you point out, the nawk/mawk/goawk behavior does seem wrong according to POSIX. What do you recommend, @arnoldrobbins?
You ask:
I wonder why (onetrueawk) awk's length() function counts bytes instead of characters
The answer is that onetrueawk is not POSIX compliant, at least in this regard. Check out the last entry in the BUGS section of its man page, which states that it only supports eight-bit characters. A similar consideration applies to mawk.
I recommend that goawk be POSIX compliant and work with characters, not bytes. In the worst case, you may also wish to adopt gawk's -b option, although I don't know that I actively recommend that.
Hope this helps.
I've been thinking on and off about the best way to implement this. There will be a significant performance penalty for those functions (index, length, match, and substr): because Go/GoAWK stores AWK strings as UTF-8 encoded Go strings, those functions will have to do counting / UTF-8 decoding, rather than just returning a byte index or using a string slice operation in the case of substr. So several of those operations that are currently O(1) will become O(N) in the length of the string.
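To make that concrete, here's a minimal Go sketch (illustrative only, not GoAWK's actual code) of the difference:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "絵" // one character, three bytes in UTF-8

	// Byte-oriented length is O(1): Go strings carry their byte length.
	fmt.Println(len(s)) // 3

	// Character-oriented length has to decode the UTF-8, which is O(N).
	fmt.Println(utf8.RuneCountInString(s)) // 1

	// Likewise, a character-indexed substr can no longer be a byte slice
	// like s[m:n]; it has to find rune boundaries first, e.g. via
	// []rune(s), which is also O(N).
	fmt.Println(string([]rune("Céline")[1:3])) // "él"
}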
Given the performance penalty (and the fact that this will be a backwards-incompatible change), I think it should be opt-in. So I could make it a compile-time option (Go build tag) to switch from bytes to characters. Or it could be a Config.UseChars or similar runtime configuration option, and include the code to do it both ways. I think I'm leaning towards the latter, a runtime config option. (We'll probably want a way for the goawk binary to use both, and perhaps even use characters by default, with a goawk -b option as Arnold suggests.)
Hi. I am afraid that I disagree with you.
Firstly, POSIX compatibility in this day and age wins out over backwards compatibility. Users of Asian languages (for example) expect these functions to work on characters.
Secondly, I firmly believe that awk programs are I/O bound, and the "performance penalty" you are worried about in practice is down in the noise where it can barely be measured, if at all.
Gawk has been multibyte compatible for well over a decade, and no one has complained about performance issues related to these functions.
You do want to be careful, when looking for \n to separate records and whitespace to separate fields, to use byte operations and not wide characters.
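The reason byte operations stay safe there is that in UTF-8, every byte of a multi-byte character is 0x80 or above, so an ASCII byte such as \n or a space can never appear in the middle of a character. A quick Go illustration of the idea:

package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Every byte of a multi-byte UTF-8 character is >= 0x80, so scanning
	// for ASCII bytes like '\n' never lands inside a character.
	record := []byte("Céline 絵\nnext record")
	nl := bytes.IndexByte(record, '\n')
	fmt.Println(string(record[:nl])) // "Céline 絵" -- split on the real newline
}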
Feel free to discuss with me more in private email, or on Github if you really want.
@arnoldrobbins Yeah, on further thought I think that's fair enough. It shouldn't break many scripts, as you'd have to use constant indexes in substr() and have non-ASCII input for the parts you're substr()ing. My guess is that most usages either pass indexes from index/substr/match to substr (non-constant) or have known ASCII inputs for the parts they're substr()ing.
For Go usage, I'm going to make this a runtime config setting (Config.Bytes), which will default to false, i.e., default to character mode. I've also added a -b option to the goawk command, a la Gawk. Code at https://github.com/benhoyt/goawk/pull/83.
Reopening this as I've unfortunately had to revert the above fix, due to its problematic O(N) behavior, which made certain scripts O(N^2) and untenably slow. To fix properly, we'll need to come up with a more clever scheme than counting characters when the functions are called.
I wonder, does Gawk store strings as 32-bit runes (in default, non-bytes mode)?
Gawk goes to a lot of work to make string handling efficient. First, strings are reference counted, to avoid lots of copying. Next, gawk caches the value of MB_CUR_MAX since (at least on glibc) that macro hides a function call, but the value won't change within a single execution.
Strings themselves consist of the reference count, some flag bits, a (char *, size_t) pair for the original (likely encoded) data, and a (wchar_t *, size_t) pair for "runes". The size_t is the length, since gawk allows embedded zero characters in strings.
If in a single-byte environment (MB_CUR_MAX is 1) then gawk can be fat, dumb and happy and never convert to wchar_t. Otherwise, as soon as the wchar_t version of a string is needed (but not before), gawk does the conversion. From then on, it uses the wchar_t version if it's needed subsequently.
HTH.
@arnoldrobbins Thanks, that's very helpful. I figured Gawk would be a lot smarter about this. I could definitely do something similar in Go (though as you indicate, it does introduce a fair bit of complexity).
Does Gawk also store (alongside the string) whether a particular string has any non-ASCII characters in it? So that it can skip creating the wchar_t version in those cases?
In answer to your last question, gawk doesn't store a flag for a string being ASCII / non-ASCII. The decision is based exclusively on MB_CUR_MAX being == 1 or > 1. In the latter case, when it needs to, it does the conversion and then works on the wchar_t representation for length() et al.
One other issue with using characters instead of bytes is that it's not clear what the length of a string is if it's not a valid string in your locale. For instance, if we were using UTF-8:
BEGIN { a = "\000abc"; print length(a) }
The null byte shouldn't exist in a UTF-8 string.
Or even worse, what if we truncate a two-byte codepoint?
BEGIN { a = "\306\222"; print a; print length(a) }
This should clearly print the following, for the two-byte ƒ character:
ƒ
1
But if we chop it up?
BEGIN { a = "\306"; print a; print length(a) }
Gawk is polite enough to warn you "Invalid multibyte data detected. There may be a mismatch between your data and your locale." but it's not clear that there is a defined length here. Struggling with this in my own awk at the moment. @benhoyt or @arnoldrobbins if you find a satisfying answer please share (:
Yeah, this is interesting. I definitely don't have a satisfying answer -- GoAWK is still mostly bytes-oriented. Out of interest, is rawk your awk (Rust + AWK)?
@benhoyt yup that's it. Just updated the readme to give a better idea of what the project is now. I was working on a J(awk)IT compiled version for a while, but all the JIT compilers had long startup times or lacked modern backends, making them bad choices for awk. So now it's an interpreter again (:
I don't really think there is a good answer for the character question without a change to the spec. What gawk does seems reasonable; I'll probably end up implementing a warning as well. Keeping goawk bytes-oriented seems reasonable to me, but I can feel the pain of non-ASCII writers.
Hi. W.R.T. a spec, you should send an interpretation request to the POSIX people. The gawk code is in node.c, in str2wstr(). Basically, if gawk knows it's in a Unicode locale, it converts invalid bytes to 0xFFFD (or whatever is the "invalid character" code in Unicode). Otherwise it just drops the byte and keeps going. So for Unicode, at least, the length isn't off, even if the contents differ somewhat.
Almost any way to handle this kind of case will have different tradeoffs; I think what gawk does is reasonable, but maybe someone else would make a different decision.
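For what it's worth, Go's unicode/utf8 package makes much the same choice out of the box: each invalid byte decodes as U+FFFD and counts as one character. A small illustration:

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "\306" // a lone lead byte; not valid UTF-8

	// Each invalid byte counts as one character ...
	fmt.Println(utf8.RuneCountInString(s)) // 1

	// ... and decodes as the replacement character U+FFFD.
	for _, r := range s {
		fmt.Printf("%U\n", r) // U+FFFD
	}
}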
@n8ta, I took a quick look at your project. Be aware that mawk's regex matcher may be fast, but it's limited to byte matching; you'll need another option if you really want to support Unicode. Also, I hope that having ARGV be read-only is only temporary. In any case, good luck w/your project!
Hi, is there news about this issue?
print length("Céline")
still returns 7 instead of 6
I use GoAWK under MS Windows. The file is UTF-8 encoded (as far as I understand it, GoAWK only works with UTF-8 encoded files). The project is great (thanks for that), but unfortunately I can only use it to a very limited extent because of this problem.
Hi @datatraveller1, unfortunately no news -- at least in my own usage (and many other usages I've seen), this distinction has never been important, because you're almost always using the result of length or RSTART as an input to substr, rather than hard-coded lengths. As a result -- and because doing a fast implementation is hard! -- I've deprioritised this.
However, I'm very interested to see what your use case is in context. Are you able to share the context, or your full AWK script with some notes on what it's doing? We may be able to at least find a workaround.
@benhoyt I think you're making a premature optimization. Awk programs in general are I/O bound, and the cost of computing things like byte lengths vs. character lengths and conversion back and forth should be down in the noise. I suspect that if you profile, you'll find this to be true.
In general, users expect their programs to follow the standards and it's a multibyte world today. Go is pretty good about byte to character conversions, you shouldn't have to do a lot of the work manually the way gawk does.
@arnoldrobbins Two things:
- I/O used to be the bottleneck, but in the last decade or two that's no longer the case (more in my article here). Sequential read speed is fast, especially on SSD drives -- and normally the processing you're doing can't keep up with the disk.
I just ran my count-words program under GoAWK with the profiler turned on, and less than 2% of the time was spent in OS read calls (and that's after flushing the read cache using sudo /sbin/sysctl -w vm.drop_caches=3). Here's a screenshot of the profile graph -- you can see only 1.75% on the very left is in File.Read; all the rest is map accesses, interpreter stack handling, converting to lowercase, and parsing fields ... the typical stuff AWK programs need to do.
- If you go back on the thread above, I did actually implement this the simple way (in #83), but that meant length and substr and similar functions went from O(1) to O(N), which meant that scripts that were using those in a loop went from O(N) to O(N^2) -- in other words, it caused "accidentally quadratic" behaviour in real scripts, such as that described in #93 ... running time went from 1 second to 8 minutes in a test case that wasn't even that large! There's a sketch of the pattern just below.
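Here's the pattern in a nutshell, with a hypothetical character-based substr in Go (not the code from #83, just a sketch of the cost):

package main

import "fmt"

// charSubstr is a hypothetical character-indexed substr (ignoring the full
// POSIX edge cases): converting to []rune scans the whole string, so every
// call is O(len(s)).
func charSubstr(s string, m, n int) string {
	rs := []rune(s)
	if n < 0 {
		n = 0
	}
	if m < 1 {
		m = 1
	}
	if m > len(rs) {
		return ""
	}
	end := m - 1 + n
	if end > len(rs) {
		end = len(rs)
	}
	return string(rs[m-1 : end])
}

func main() {
	s := "Céline Dion"
	// An AWK loop like
	//     { for (i = 1; i <= length($0); i++) c = substr($0, i, 1) }
	// does this once per character -- O(N) work per call, O(N^2) overall.
	for i := 1; i <= len([]rune(s)); i++ {
		fmt.Print(charSubstr(s, i, 1))
	}
	fmt.Println()
}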
To keep length and substr O(1), I'd need to decode the UTF-8 input into codepoints (characters) up front, which is a cost I'm not sure I want everyone to pay. But I haven't benchmarked that approach yet.
So I think there are three possible ways forward for GoAWK:
- Go back to the approach above, and tell people whose scripts get the quadratic behaviour to get over it. I don't think this is reasonable, given the "1 second to 8 minutes" example above.
- Decode UTF-8 to codepoints always on the way in. This would mean I can no longer use Go string functions for most things, and would slow everything down a bunch, but at least length and substr and friends would remain O(1).
- Come up with some complicated string representation that makes things fast and O(1). For example, when reading input, mark whether all the bytes in a string are ASCII and use normal strings; otherwise mark the string as non-ASCII, and lazily decode it from UTF-8 to characters only if you need to. This is a lot of work.
Overall, I still would like to do this, but I'm not sure the trade-offs are worth it given how infrequently it's a problem in practice. As I mentioned above, I'm almost always using the result of length or RSTART as an input to substr, rather than hard-coded lengths, so as long as everything's UTF-8, Unicode handling works just fine. From that PoV, I'd still love to see @datatraveller1's use case or other non-contrived use cases where that's not true.
Hi @benhoyt and Arnold, thank you very much for your answers. It was difficult to create a simple example because the script is complex (the actual script uses fixed-length input files) and the extracted, simplified part doesn't seem to make much sense on its own, but here it is:
File input.txt, with one line of content (please note the typo): Céline Doin in concert
# script change.awk:
function replace_substring(string, substr_pos, substr_length, replace_with) {
return substr(string,1,substr_pos-1) replace_with substr(string,substr_pos+substr_length)
}
{
replacement = "Dion"
line = replace_substring($0, 8, 4, replacement)
print line > "output.txt"
}
With GAWK and SET LC_ALL=en_US.UTF-8 the result is correct:
gawk -f change.awk input.txt
=> output.txt with content
Céline Dion in concert
With GOAWK the result is wrong:
goawk -f change.awk input.txt
=> output.txt with content
CélineDionn in concert
Hi @datatraveller1, thanks for that. Maybe I need to see more context, but it's the hard-coded constants like 8 and 4 that I don't understand, and tend not to use in my own scripts. For example, if I wanted to replace Doin with Dion, I'd use sub or similar:
awk '{ sub("Doin", "Dion"); print }' input.txt
# or, if you just want to look at the second field ("last name")
awk '{ sub("Doin", "Dion", $2); print }' input.txt
That way there are no magic constants involved. Perhaps you can include a bit more context to show why your script requires those?
All that said, I would like to make this happen, it's just that it's tricky (as noted in my previous long comment).
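For the record, the wrong output comes from the two-byte é shifting all of the byte offsets by one. A quick Go illustration (purely to explain the arithmetic, not GoAWK internals):

package main

import (
	"fmt"
	"unicode/utf8"
)

func main() {
	s := "Céline Doin in concert"

	// "é" is two bytes in UTF-8, so byte offsets and character offsets
	// diverge after it: the "D" of "Doin" is character 8 but byte 9.
	fmt.Println(len("Céline "))                    // 8 bytes
	fmt.Println(utf8.RuneCountInString("Céline ")) // 7 characters

	// A byte-oriented substr(s, 8, 4) therefore covers " Doi" instead of
	// "Doin", which is how "CélineDionn in concert" comes about.
	fmt.Println(s[7:11]) // " Doi"
}

So with byte indexes, substr(s, 1, 7) keeps only "Céline" without the trailing space, and substr(s, 12) starts at the trailing "n" of "Doin" -- exactly the output you saw.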
Hi @benhoyt thank you very much.
The problem is that I often have fixed-length fields and use constructs like:
LINES[NR] = replace_substring(LINES[NR], 41, 120, sprintf("%-120.120s", title))
I know about sub and gsub and sure there may be workarounds, but it would be nicer if the user could rely on proper multibyte handling for length and substr etc.
I don't know Go, but I wonder if there are native Go UTF-8 string functions that could be used for GoAWK so that the multibyte tasks could be implemented a bit more easily?
@datatraveller1 Understood, and fair enough if you're using fixed-length fields.
There are straightforward UTF-8 string functions in Go to do all this, and I've done it before (see above), but the change in performance from O(1) to O(N) for these operations wasn't great, so I reverted it.
@benhoyt I think you have to go with a simplified version of option (3) from earlier, which is to keep both the []byte and string representations and a flag saying whether or not the string version is valid. You convert from []byte to string lazily, that is, only when you need it. This is how gawk works. At least Go does most of the conversion work for you.
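A minimal sketch of that idea in Go (names are purely illustrative, and since a Go string is already just UTF-8 bytes, the lazily built second representation here is a []rune slice):

package main

import (
	"fmt"
	"unicode/utf8"
)

// lazyStr keeps the raw UTF-8 string and only decodes to runes the first
// time a character-indexed operation needs it.
type lazyStr struct {
	raw   string // original UTF-8 bytes, used for I/O, regex, concat
	runes []rune // decoded lazily; nil until first needed
	ascii bool   // if true, byte ops and character ops are the same
}

func newLazyStr(s string) *lazyStr {
	ascii := true
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			ascii = false
			break
		}
	}
	return &lazyStr{raw: s, ascii: ascii}
}

// length returns the character count: O(1) for ASCII or already-decoded
// strings, one O(N) decode otherwise (amortised across later calls).
func (ls *lazyStr) length() int {
	if ls.ascii {
		return len(ls.raw)
	}
	if ls.runes == nil {
		ls.runes = []rune(ls.raw)
	}
	return len(ls.runes)
}

func main() {
	fmt.Println(newLazyStr("Céline").length()) // 6
	fmt.Println(newLazyStr("hello").length())  // 5
}

The point being that ASCII-only data (the common case) never pays for the decode, and non-ASCII strings pay it at most once.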
Although I understand the desire to be performant, I think that correctness should trump performance. My two cents, of course. And, as always, the gawk code is available for seeing how I did it and I'm happy to answer questions, privately or in a public forum.