cvise
cvise copied to clipboard
`IncludesPass` and `CommentsPass` have issues with files using "non-standard" encodings
For:
cat unit.c
/*
* für
*/
file unit.c
unit.c: ISO-8859 text
running cvise --start-with-pass IncludesPass gives:
00:00:00 INFO ===< 118085 >===
00:00:00 INFO running 24 interestingness tests in parallel
00:00:00 INFO INITIAL PASSES
00:00:00 INFO ===< IncludesPass >===
00:00:00 WARNING IncludesPass has encountered a non fatal bug: pass got stuck
00:00:00 INFO ===< UnIfDefPass >===
00:00:00 INFO ===< CommentsPass >===
00:00:00 WARNING CommentsPass has encountered a non fatal bug: pass got stuck
00:00:00 INFO ===< IfPass >===
Traceback (most recent call last):
File "/usr/bin/cvise", line 286, in <module>
reducer.reduce(pass_group, skip_initial=args.skip_initial_passes)
File "/usr/share/cvise/cvise.py", line 143, in reduce
self._run_additional_passes(pass_group['first'])
File "/usr/share/cvise/cvise.py", line 166, in _run_additional_passes
self.test_manager.run_pass(p)
File "/usr/share/cvise/utils/testing.py", line 503, in run_pass
self.state = self.current_pass.new(self.current_test_case, self.check_sanity)
File "/usr/share/cvise/passes/ifs.py", line 36, in new
bs = BinaryState.create(self.__count_instances(test_case))
File "/usr/share/cvise/passes/ifs.py", line 22, in __count_instances
for line in in_file.readlines():
File "/usr/lib64/python3.8/codecs.py", line 322, in decode
(result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xfc in position 6: invalid start byte
(this was after editing cvise/utils/testing.py to set GIVEUP_CONSTANT to 1)
and creates:
Package: cvise 2.3.0
Git version: unknown
LLVM version: 12.0.0
System: uname_result(system='Linux', node='vistrrdslin0001', release='5.12.13-1-default', version='#1 SMP Mon Jun 28 06:37:23 UTC 2021 (74bd8c0)', machine='x86_64', processor='x86_64')
***************************************************
IncludesPass has encountered a bug:
pass got stuck
state: 2
Please consider tarring up cvise_bug_0
and creating an issue at https://github.com/marxin/cvise/issues and we will try to fix the bug.
***************************************************
and
Package: cvise 2.3.0
Git version: unknown
LLVM version: 12.0.0
System: uname_result(system='Linux', node='vistrrdslin0001', release='5.12.13-1-default', version='#1 SMP Mon Jun 28 06:37:23 UTC 2021 (74bd8c0)', machine='x86_64', processor='x86_64')
***************************************************
CommentsPass has encountered a bug:
pass got stuck
state: -1
Please consider tarring up cvise_bug_1
and creating an issue at https://github.com/marxin/cvise/issues and we will try to fix the bug.
***************************************************
Well, it's technically a bug, but I don't see why would somebody use a non-utf8 encoding for files? Doctor it hurts. Then don't do it :D
I don't see why would somebody use a non-utf8 encoding for files?
The tool my company builds (and one of the reasons I use cvise) automatically builds data-driven unit test "wrappers" for C/C++ -- as part of this, we just "gobble up" whatever code our customers have.
Our customers 100% have code that isn't utf8/ascii (the above comes from a support case I was helping with).
I think that adding something like:
- https://github.com/VectorSoftwareRI/atg_support_scripts/blob/e62cfa5f3c2ac870bf84fe367a6df98d9b336623/atg_execution/strip_unchanged_attributes.py#L32
- https://github.com/VectorSoftwareRI/atg_support_scripts/blob/e62cfa5f3c2ac870bf84fe367a6df98d9b336623/atg_execution/strip_unchanged_attributes.py#L160
(code I wrote) would make cvise more robust here.
Would you be happy if I made some utility function like safe_open that wraps open along with a call to get_file_encoding)? I'm thinking that something like:
- https://github.com/marxin/cvise/blob/38b42d95a5280d890008489f1d31f297bd920e6d/cvise/passes/ifs.py#L21
could then use safe_open vs. "regular" open.
Thoughts?
Alternatively maybe cvise should check if a file is openable as utf8 before starting, and "fail early" (e.g., as part of the initial sanity check) if it isn't?
Thank you for your ideas. So you're using a Python package chardet which seems reasonable to me.
A question: if you have a non-utf8 test-case, what about converting it first to utf8 and then let cvise run? At the end you can convert it to a given encoding.
The problem I see with safe_open is that we'll need to use it at various places. Similarly for the writing of the output.
How about if cvise (as part of the "initial sanity check") validates if the file can be opened as utf8? If it can't, we have three options:
- Bail-out and tell the user "sorry, unsupported"
- Add a flag that says "convert to utf8 and try as normal, I understand if the initial tests now might fail and
cvisewill give up" - Add a flag that says "automatically skip any transformations that need utf8"
We could actually do all three of these; without 2 or 3, 1 happens. If you have 2, then 3 is redundant. If you have 3, 2 is redundant (and it won't run things like IncludesPass).
Thoughts?
Well, I like doing 2) using chardet. That's going to help in most cases (hopefully)..
Well, I like doing 2) using
chardet. That's going to help in most cases (hopefully)..
@andrewvaughanj I've just implemented the first version, can you please take a look?
I can try this next week; should args.to_utf8 have a "sanity check" afterwards?
Well, the conversion does happen before the normal sanity check. Thanks for testing.
Any update about the testing, please?
Any update about the testing, please?
PING :)
Any news about this @andrewvaughanj ?