sh icon indicating copy to clipboard operation
sh copied to clipboard

expand: Variable cannot represent indexed arrays with gaps

Open twilly opened this issue 4 years ago • 6 comments

A Bash arrays can start at any index, though it counts the number of set elements as part of the length.

For example, Bash says:

$ x=([1]=a); echo ${#x[@]}
1

gosh says:

$ x=([1]=a); echo ${#x[@]}
2

expand.Variable.List being a []string is not flexible enough to describe these bespoke arrays. A zero string is a valid expansion after all.

twilly avatar Feb 24 '21 02:02 twilly

Yikes, I wasn't aware that Bash indexed arrays are... not actually arrays. Here's another example:

$ empty=("" "foo") missing=([1]="bar")
$ echo "${empty[@]}" "${#empty[@]}"
 foo 2
$ echo "${missing[@]}" "${#missing[@]}"
bar 1

You can even have gaps anywhere in the array, not just the beginning:

$ middle=([1]="one" [3]="three")
$ echo "${middle[@]}" "${#middle[@]}"
one three 2

The question then becomes - how do we represent indexed arrays in an efficient way? We could give up entirely and use a map[int]string or the existing map[string]string directly, but as you can guess, maps have an inherent overhead when compared to a plain old slice.

Maybe the answer is just that indexed arrays in Bash are practically as costly as associative arrays, since both seem to behave like maps.

mvdan avatar Feb 24 '21 10:02 mvdan

I'm trying to think if we could possibly fix this in v3 without breaking programs. I don't think so, because the Variable.List field is exposed directly as a field instead of a method.

mvdan avatar Feb 24 '21 10:02 mvdan

There's one hack we could possibly do in v3. Fill the "gaps" with string values which are impossible to have in shell, such as a single null/zero byte:

$ a=$'\x00'
$ echo "${a} ${#a}"
 0

Then, the length of an indexed array means looping over List to count the non-null elements. Similarly, expansion would have to skip over such gaps.

This is a hack, but I think it should be backwards compatible - any direct users of expand.Variable would have their behavior changed for arrays with gaps, but those didn't work properly to begin with. So no breakages should happen in practice.

I also wonder if this hack is actually better in the long run than alternatives like map[int]string or []struct{idx int; val string}. Assuming that gaps are rare, and that one would rarely leave immense gaps like arr=([10000]=foo), then it makes sense to keep a contiguous slice representation.

mvdan avatar Feb 24 '21 10:02 mvdan

C strings strike again. Null as a sentinel value makes sense.

The first data structure that came to mind is a skip list. It would take less memory but lookup and insertion is log n. Perhaps v4 can hide this detail where gapless arrays are a straightforward slice and sparse arrays are a skip list/something else.

twilly avatar Feb 24 '21 20:02 twilly

With the assumption that gaps will be extremely rare, or at least not huge, I would rather avoid implementing a skip list entirely. Having two separate implementations would also add code, which I'd like to avoid.

With the assumption above, I think the "null" strings for the gaps should work okay. It could possibly cause memory issues for e.g. arr=([10000]=foo), but I doubt that actually happens in practice.

I also think that optimizing the storage of variables should be taken with a pinch of salt. Variable values in Bash are strings anyway, so there's always going to be an inherent overhead with variables and arrays.

mvdan avatar Feb 25 '21 13:02 mvdan

Totally fair. No real complaints from me with only a null sentinel implementation. :)

twilly avatar Feb 25 '21 17:02 twilly