[CI] System variable environment on windows does not use adequate encodings
Related to the failure of https://github.com/pharo-project/pharo/pull/11347
Error
Illegal leading byte for utf-8 encoding
Stacktrace
ZnInvalidUTF8
Illegal leading byte for utf-8 encoding
ZnUTF8Encoder>>error:
ZnUTF8Encoder>>errorIllegalLeadingByte
ZnUTF8Encoder>>nextCodePointFromStream:
[ :stream |
[ byteStream atEnd ] whileFalse: [ | codePoint |
codePoint := self nextCodePointFromStream: byteStream.
(codePoint > 255 and: [ stream originalContents isWideString not ])
ifTrue: [ | wideString position |
position := stream position.
wideString := WideString from: stream originalContents.
stream on: wideString; setFrom: position + 1 to: position ].
stream nextPut: (Character value: codePoint) ] ] in ZnUTF8Encoder(ZnUTFEncoder)>>decodeBytes:
String class(SequenceableCollection class)>>new:streamContents:
String class(SequenceableCollection class)>>streamContents:
ZnUTF8Encoder(ZnUTFEncoder)>>decodeBytes:
ByteArray>>decodeWith:
ByteArray>>utf8Decoded
ExternalAddress>>fromCString
[
nextString := environmentStrings fromCString.
nextString ifEmpty: [ ^ self ].
nextString first = $=
ifFalse: [ self keysAndValuesDo: aBlock withAssociationString: nextString ].
environmentStrings := environmentStrings + nextString size + 1 ] in Win32Environment>>keysAndValuesDo:
FullBlockClosure(BlockClosure)>>repeat
Win32Environment>>keysAndValuesDo:
Win32Environment(OSEnvironment)>>asDictionary
OSEnvironmentTest>>testAsDictionary
OSEnvironmentTest(TestCase)>>performTest
It looks like environmentStrings fromCString does not take the encoding used by the environment into account: it always decodes the bytes as UTF-8.
fromCString
| index aByte |
^ (ByteArray streamContents: [ :aStream |
index := 1.
[(aByte := self unsignedByteAt: index) = 0]
whileFalse: [
aStream nextPut: aByte.
index := index + 1]]) utf8Decoded
We see this for some PRs on the CI. Merging the PR then seems to not lead to any problems for the build, though.
example: https://github.com/pharo-project/pharo/pull/11662
Failing tests on Windows for those:
windows-64 / Tests-windows-64 / testAsDictionary – Windows64.System.OSEnvironments.Tests.OSEnvironmentTest
windows-64 / Tests-windows-64 / testAssociations – Windows64.System.OSEnvironments.Tests.OSEnvironmentTest
windows-64 / Tests-windows-64 / testKeys – Windows64.System.OSEnvironments.Tests.OSEnvironmentTest
windows-64 / Tests-windows-64 / testValues – Windows64.System.OSEnvironments.Tests.OSEnvironmentTest
Could be related to https://github.com/pharo-project/pharo/issues/11665
I propose to skip this test on the CI for now
In a very old discussion, it was explained why this is awkward: we decode the strings as UTF8, but they are never UTF8. They can be UTF16 or "the current defined encoding". http://forum.world.st/Environment-variables-encoding-td5074185.html
> primitiveGetenv returns values in the current locale's code page on Windows;
> a value bound to € returns a string with the single char 128 on MS1252 (western
> european) at least.
>
> On windows, there are three versions of each api call with string
> parameters/returns;
> xxx (depending on UNICODE being defined, either resolves to *A or *W)
> xxxA (Ascii, or, more accurately, current code page)
> xxxW (UTF-16)
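To make the quoted point concrete, here is a minimal Playground sketch (not taken from the thread) that feeds both byte sequences to the UTF-8 decoder. The variable names are only illustrative. The second expression should signal the same error as the stack trace at the top of this issue.

```smalltalk
| utf8Euro failure |
"The euro sign encoded as UTF-8 takes three bytes and decodes cleanly."
utf8Euro := #[226 130 172] utf8Decoded.

"In Windows-1252 the same character is the single byte 128. That byte cannot
start a UTF-8 sequence, so the decoder signals an error. We capture its text."
failure := [ #[128] utf8Decoded. nil ]
	on: ZnCharacterEncodingError
	do: [ :error | error messageText ].
```

So any environment variable whose value contains a non-ASCII character in the current code page can break a reader that assumes UTF-8.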
Possible solution:
- in Win32Environment>>#environmentStrings we can call explicitly GetEnvironmentStringsW
- decode the strings using the UTF16 decoder
Some more exploration:
- WinPlatform>>#getEnvironmentVariable:into:size: uses GetEnvironmentVariableW (the variant with the W suffix to force wide strings), so using the W suffix seems the way to go
- To encode and decode Windows widestring there is Win32WideString
So environmentStrings should use Win32WideString to decode instead of utf8StringFromCString, and it should work correctly.
(it looks like Win32WideString could be replaced by a (not yet existing) utf16StringFromCString these days, but that is another project)
I tried to apply those changes on my Windows machine, but it does not work.
Here is what I did:
Win32Environment>>environmentStrings
^ self ffiCall: #( void * GetEnvironmentStringsW () )
Win32Environment>>keysAndValuesDo: aBlock
"Under windows the environment variables are a single big String."
"Lines starting with an equal sign are invalid per
http://stackoverflow.com/questions/10431689/what-are-these-strange-environment-variables"
| environmentStrings nextString |
environmentStrings := self environmentStrings.
[
nextString := environmentStrings utf16StringFromCString.
nextString ifEmpty: [ ^ self ].
nextString first = $=
ifFalse: [ self keysAndValuesDo: aBlock withAssociationString: nextString ].
environmentStrings := environmentStrings + nextString size + 1 ] repeat
ExternalAddress>>utf16StringFromCString
"Assume that the receiver represents a C string and convert it to a byte array.
WARNING: the referenced data MUST end with a NULL character (byte 0).
"
self isNull ifTrue: [ ^ '' ].
^ self bytesFromCString ifNotNil: [ :bytes | bytes decodeWith: (ZnCharacterEncoder newForEncoding: 'utf16') ]
ByteArray>>utf16StringFromCString
^ (ExternalData fromHandle: self type: ExternalType string) utf16StringFromCString
ExternalData>>utf16StringFromCString
"Assume that the receiver represents a C string containing UTF16 characters and convert
it to a Smalltalk string."
^ self bytesFromCString ifNotNil: [ :bytes | bytes decodeWith: (ZnCharacterEncoder newForEncoding: 'utf16') ]
Then if I try to run the tests I get:
ZnCharacterEncodingError: Incomplete utf-16 encoding
ZnUTF16Encoder(ZnCharacterEncoder)>>error:
ZnUTF16Encoder>>errorIncomplete
ZnUTF16Encoder>>read16BitWordFromStream:
ZnUTF16Encoder>>nextCodePointFromStream:
[ :stream |
[ byteStream atEnd ] whileFalse: [ | codePoint |
codePoint := self nextCodePointFromStream: byteStream.
(codePoint > 255 and: [ stream originalContents isWideString not ])
ifTrue: [ | wideString position |
position := stream position.
wideString := WideString from: stream originalContents.
stream on: wideString; setFrom: position + 1 to: position ].
stream nextPut: (Character value: codePoint) ] ] in ZnUTF16Encoder(ZnUTFEncoder)>>decodeBytes: in Block: [ :stream |...
String class(SequenceableCollection class)>>new:streamContents:
String class(SequenceableCollection class)>>streamContents:
ZnUTF16Encoder(ZnUTFEncoder)>>decodeBytes:
ByteArray>>decodeWith:
ExternalAddress>>utf16StringFromCString
Win32Environment>>keysAndValuesDo:
Win32Environment(OSEnvironment)>>asDictionary
OSEnvironmentTest>>testAsDictionary
Ah, so maybe this is why there is Win32WideString ... can you try to use this?
Not really sure, but something like this:
nextString := (Win32WideString fromHandle: self environmentStrings) asString.
I also tried it and it also fails.
I'll get back to my Windows machine soon and can give you the stack if you want.
If you include (at least the original) byte sequence we can try to figure out what encoding it is in.
BTW, you can write:
#[0 65 0 66] decodeWith: #utf16
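Note that the byte order matters here. The expression above is big-endian (high byte first), whereas the Windows W functions lay out UTF-16 with the low byte first. A small sketch of the difference, assuming ZnUTF16Encoder understands beLittleEndian in your image (please check before relying on it):

```smalltalk
| bigEndian littleEndian |
"Zinc's default UTF-16 decoder reads big-endian words, as in the example above."
bigEndian := #[0 65 0 66] decodeWith: #utf16.

"The same two characters in little-endian order, low byte first.
beLittleEndian is assumed to exist on ZnUTF16Encoder."
littleEndian := (ZnUTF16Encoder new
	beLittleEndian;
	yourself) decodeBytes: #[65 0 66 0].
```

Both expressions should answer the string 'AB'. Decoding little-endian bytes with the default decoder would instead yield unrelated code points.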
I tried:
keysAndValuesDo: aBlock
"Under windows the environment variables are a single big String."
"Lines starting with an equal sign are invalid per
http://stackoverflow.com/questions/10431689/what-are-these-strange-environment-variables"
| environmentStrings nextString |
environmentStrings := self environmentStrings.
[
nextString := (Win32WideString fromHandle: environmentStrings) asString.
nextString ifEmpty: [ ^ self ].
nextString first = $= ifFalse: [ self keysAndValuesDo: aBlock withAssociationString: nextString ].
environmentStrings := environmentStrings + nextString size + 1 ] repeat
leading to:
WideString class(ProtoObject)>>primitiveFailed:
WideString class(ProtoObject)>>primitiveFailed
WideString class(Behavior)>>basicNew:
WideString class(String class)>>new:
WideString(SequenceableCollection)>>copyFrom:to:
WideString>>copyFrom:to:
WideString(SequenceableCollection)>>first:
Win32Environment(OSEnvironment)>>keysAndValuesDo:withAssociationString:
Win32Environment>>keysAndValuesDo:
Win32Environment(OSEnvironment)>>asDictionary
OSEnvironmentTest>>testAsDictionary
The first value of environmentStrings is #[176 94 206 11 0 0 0 0]
By the way, the tests pass on my machine if I do not change anything, so I cannot reproduce the original bug locally. Maybe that is because none of my environment variables contains a character that triggers the problem?
When I was still using #utf16StringFromCString, the decoder was failing on: #[240 214 206 11 0 0 0 0]
utf16StringFromCString
self isNull ifTrue: [ ^ '' ].
^ self bytesFromCString ifNotNil: [ :bytes | bytes decodeWith: #utf16 ]
#bytesFromCString is here returning #[61]
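One possible reading of this result, not confirmed in the thread: in little-endian UTF-16 the character = is the byte pair 61 0, and a byte-oriented C-string reader stops at that first zero byte. That leaves a single byte, which the UTF-16 decoder rejects as incomplete. The sketch below simulates this on a plain ByteArray; the data and the readWide helper are hypothetical. It also shows that stepping to the next entry takes (size + 1) * 2 bytes rather than size + 1.

```smalltalk
| block truncated readWide first second |
"Hypothetical memory contents in UTF-16LE: '=A', a 16-bit NUL, 'B', a 16-bit NUL."
block := #[61 0 65 0 0 0 66 0 0 0].

"Stopping at the first zero byte keeps one byte only."
truncated := block copyFrom: 1 to: (block indexOf: 0) - 1.

"Hypothetical helper: read 16-bit little-endian units until a unit equals zero.
Surrogate pairs are ignored to keep the sketch short."
readWide := [ :bytes :start |
	String streamContents: [ :out |
		| index unit |
		index := start.
		[ unit := (bytes at: index) + ((bytes at: index + 1) bitShift: 8).
		unit = 0 ] whileFalse: [
			out nextPut: (Character value: unit).
			index := index + 2 ] ] ].

first := readWide value: block value: 1.
"Each character and the terminator occupy two bytes, so skip (size + 1) * 2 bytes."
second := readWide value: block value: 1 + ((first size + 1) * 2).
```

If this reading is right, the loop in keysAndValuesDo: may also need to advance environmentStrings by (nextString size + 1) * 2 once the wide variant is used.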
If you are getting them from the Win32 API, they are not valid UTF16 strings: they are multibyte, but the encoding is different. We should use Win32WideString to decode them using the Win32 API.
-- Pablo Tesone
Thanks @tesonep
I tried to do that using (Win32WideString fromHandle: environmentStrings) asString, but it fails with
WideString class(ProtoObject)>>primitiveFailed:
WideString class(ProtoObject)>>primitiveFailed
WideString class(Behavior)>>basicNew:
WideString class(String class)>>new:
WideString(SequenceableCollection)>>copyFrom:to:
WideString>>copyFrom:to:
WideString(SequenceableCollection)>>first:
Win32Environment(OSEnvironment)>>keysAndValuesDo:withAssociationString:
Win32Environment>>keysAndValuesDo:
Win32Environment(OSEnvironment)>>asDictionary
OSEnvironmentTest>>testAsDictionary
Did I use it the wrong way?
I changed the access to the environment entries to use GetEnvironmentStringsW, so we can now be sure that all the strings are Win32WideStrings. If we use the ASCII version, the strings might contain OEM characters, so there is no easy way of determining the encoding.
I found a case where this error happens also on OSX:

I opened https://github.com/pharo-project/pharo/issues/13070 for the Unix case.