voc Issues when trying to run tests on Windows

(This is on Windows 8.1; I imagine similar issues appear on other Windows systems)

I've tried running the tests under Windows. The main failure I run into seems to be related to the default encoding under Windows (cp1252 rather than utf-8):

E AssertionError: 'utf-8' codec can't decode byte 0xff in position 10: invalid start byte

This seems to be an issue in both the Java and Python versions; see the attached voc_output.txt, which is a dump of the main_code for both runAsPython() and runAsJava().

Another, possibly related issue is that Windows uses a "charmap" encoding in its terminals (cmd and powershell), which doesn't handle some Unicode characters. I've also attached unicode_test.py.txt, which blows up with a similar error.

In the process of debugging, I've also found a couple of places where utf-8 encoding seems to have been missed - pull request here: https://github.com/pybee/voc/pull/236

Aug 16 '16 01:08 AnthonyBriggs

I've been (slowly) chasing this up this evening, and making some progress through judicious commenting-out of exception handlers :) This is more of an infodump than anything, but might be helpful.

Commenting out the first exception handler in assertCodeExecution (lines 345/346) reveals that it's runAsJava that's throwing the dodgy string, specifically

`line = self.jvm.stdout.readline().decode("utf-8")`

on line 505. Printing out the return string from self.jvm.stdout.readline() shows that it's returning the hovercraft-is-full-of-eels string as

`b'>>> x = "M\xff h\xf4v\xe8r\xe7r\xe0ft \xee\xdf f\xfb?l \xf6f \xe9\xeal?"\n'`

which I'm pretty sure is not right. \xff can never appear in valid UTF-8 (it's half of the UTF-16 byte-order mark), for a start, which is why it's exploding. (For the record, it's supposed to be "Mÿ hôvèrçràft îß fûłl öf éêlś".)
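
For what it's worth, those bytes are exactly what you get by encoding the intended string as cp1252 with unencodable characters replaced by "?", which suggests the JVM is writing in the Windows default code page rather than UTF-8. A minimal sketch (the variable names are illustrative, not taken from the voc test suite):

```python
# The string the test is supposed to print.
expected = "Mÿ hôvèrçràft îß fûłl öf éêlś"

# Encode as cp1252; "ł" and "ś" are not in cp1252, so they become "?".
as_cp1252 = expected.encode("cp1252", errors="replace")

# Rebuild the line as it would arrive from the JVM's stdout.
line = b'>>> x = "' + as_cp1252 + b'"\n'

# Decoding cp1252 bytes as UTF-8 fails on the first non-ASCII byte.
try:
    line.decode("utf-8")
    error = None
except UnicodeDecodeError as exc:
    error = exc

print(line)
print(error)
```

The failure lands on byte 0xff at position 10, the same position reported in the original assertion error.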

I've tried a few simple things to try and work out what's going on. Encoding the java output to cp1252 instead of utf-8 just turns it into

 `x = "M├┐ h├┤v├¿r├ºr├áft ├«├ƒ f├╗?l ├Âf ├®├¬l?"`

and various other combinations of encoding/decoding that string in a test script (without explicitly setting \xff) didn't replicate the exact error.
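
That garbled output looks like correctly encoded UTF-8 bytes being displayed through an OEM console code page. A small sketch, assuming the console is using cp850 (an assumption; the active code page wasn't checked in this thread):

```python
# Part of the intended string, leaving out the characters shown as "?".
original = "hôvèrçràft îß"

# The corresponding part of the garbled output quoted above.
observed = "h├┤v├¿r├ºr├áft ├«├ƒ"

# Encode as UTF-8, then misread those bytes as cp850 (assumed console code page).
garbled = original.encode("utf-8").decode("cp850")

print(garbled == observed)
```

If the comparison holds, the bytes themselves were fine in that experiment and only the display layer misread them.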

I've done some light googling for "windows java stdout encoding" and "windows check powershell encoding", which turns up some potentially helpful info:

Default character encoding for java console output
Powershell: Get default system encoding
UTF-8 output from PowerShell (long)

Based on these, I've tried a few things:

Setting `[Console]::OutputEncoding = [System.Text.Encoding]::UTF8`, to no avail.
Checking `[System.Text.Encoding]::Default.EncodingName`, which is set to "Western European (Windows)", i.e. cp1252/windows-1252/ANSI Latin 1.
Adding `"-Dfile.encoding", "UTF-8",` to the subprocess.Popen call, which didn't do anything.
Switching to decoding as UTF-16, which didn't help either.
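
One way to narrow this down from the Python side is to decode each line defensively and report which encoding actually worked. A rough sketch; `decode_line` and its candidate list are hypothetical helpers, not part of voc:

```python
def decode_line(raw, candidates=("utf-8", "cp1252")):
    """Return (text, encoding) for the first candidate that decodes cleanly."""
    for encoding in candidates:
        try:
            return raw.decode(encoding), encoding
        except UnicodeDecodeError:
            continue
    # Nothing matched: decode anyway, marking undecodable bytes with U+FFFD.
    return raw.decode(candidates[0], errors="replace"), None


# A shortened version of the line read from the JVM's stdout.
raw = b'>>> x = "M\xff h\xf4v\xe8r"\n'
text, encoding = decode_line(raw)
print(encoding)
```

This is only a diagnostic aid: it shows what the JVM is emitting, but the tests would still need both sides to agree on a single encoding.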

Anyway, not really sure what I'm doing here, but I'll keep trying things out until I figure out what's going on, or someone more knowledgeable can jump in :)

Aug 22 '16 13:08 AnthonyBriggs

Update: it seems like I can recreate that specific error by writing a file out as UTF-16 and then reading it back as UTF-8 (see attached script, test_encoding.py.txt). Perhaps Java does this by default on Windows?
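
The attached script isn't reproduced here, but the effect can be sketched in a few lines. This assumes Python's `utf-16` codec, which starts the file with a byte-order mark; the sample text is made up:

```python
import os
import tempfile

text = 'x = "hello"\n'

# Write the text as UTF-16; the codec puts a byte-order mark at the start.
with tempfile.NamedTemporaryFile(
    "w", encoding="utf-16", suffix=".txt", delete=False
) as handle:
    handle.write(text)
    path = handle.name

try:
    # Read the raw bytes back and try to interpret them as UTF-8.
    with open(path, "rb") as handle:
        raw = handle.read()
    try:
        raw.decode("utf-8")
        error = None
    except UnicodeDecodeError as exc:
        error = exc
finally:
    os.remove(path)

print(raw[:2])
print(error)
```

Note that this fails at position 0 (the byte-order mark), whereas the failure in the test suite is at position 10, where the first non-ASCII character sits.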

In any case, it's progress, and late here, so I'll pick this up later if it's still open.

Aug 22 '16 13:08 AnthonyBriggs

You might be on to something with the UTF-16/8 thing. Java strings are UTF-16 in memory, and class files and serialization use an odd format called MUTF-8 (Modified UTF-8). The key feature of MUTF-8 is an odd way of encoding nulls.

I'm not sure why this would be manifesting on console output, and only on Windows - but it's worth some investigation.
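
For reference, the null handling mentioned above can be sketched as follows. `encode_mutf8` is a hypothetical helper written for illustration, not a Java or voc API; it follows the two documented differences from standard UTF-8 (U+0000 becomes the two bytes C0 80, and characters above U+FFFF are written as a surrogate pair of three bytes each):

```python
def encode_mutf8(text):
    """Encode a str as Modified UTF-8 (illustrative sketch)."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp == 0:
            # NUL is written as an overlong two-byte sequence.
            out += b"\xc0\x80"
        elif cp < 0x10000:
            # Everything else in the BMP matches standard UTF-8.
            out += ch.encode("utf-8", errors="surrogatepass")
        else:
            # Supplementary characters: split into UTF-16 surrogates,
            # then encode each surrogate as a three-byte sequence.
            cp -= 0x10000
            for unit in (0xD800 | (cp >> 10), 0xDC00 | (cp & 0x3FF)):
                out.append(0xE0 | (unit >> 12))
                out.append(0x80 | ((unit >> 6) & 0x3F))
                out.append(0x80 | (unit & 0x3F))
    return bytes(out)


print(encode_mutf8("a\x00b"))
print(encode_mutf8("Mÿ") == "Mÿ".encode("utf-8"))
```

For characters like "ÿ" the result is identical to standard UTF-8, so MUTF-8 on its own would not produce a bare \xff byte.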

Aug 23 '16 01:08 freakboy3742
