Ensure get_contents() method of Node objects return 'bytes' and fix decoding errors in Value.get_csig()
Fix #3737
Using Value as a target node sometimes fails in the calculation of its content signature in get_csig().
Most existing get_contents() methods already return bytes/bytearrays, but a few are returning strs. This is causing problems in some functions (e.g. Value.get_text_contents()) as strs do not have a decode() method.
The current implementation of Value.get_csig() uses the return of Value.get_text_contents() as the csig. The method Value.get_text_contents() calculates its return value by converting the Value's value attribute to a string, and then appending its children's content. Child node content are calculated by calling their get_contents() to get their binary data, and then calling .decode() on that. This can cause a UnicodeDecodeError if the binary content of a child node is not valid utf-8.
To fix this, I have looked through the get_contents() method in each Node subclass (the ones I could find at least) and made appropriate changes to make sure they always return bytes. The Dir and Alias nodes construct their binary content entirely as a string, they now return the utf-8 encoding of their original return values.
I have also reimplemented Value.get_contents() similarly to how Value.get_text_contents() works. I don't think this changes the output of Value.get_contents().
I've also reimplemented Value.get_csig() to use Value.get_contents() and decode() that (with errors='backslashreplace') to use as the csig.
Alternatively, Value.get_text_contents() could be modified to work around decoding errors (using something like errors='backslashreplace'). I'm not sure if get_text_contents() is meant to always return something, even if the binary contents of the node is not encoded text. The implementation of File.get_text_contents() uses errors='backslashreplace' to handle decoding errors.
I've added two rough tests with duplicated code to test that Value.get_contents() and Value.get_csig() work even when it has a child node with non-utf8 binary contents.
I've also modified tests that were comparing get_contents() against strings for equality, and some mock objects that were returning strings in their get_contents() method. I don't think all the modifications to the test code were necessary to get my changes to pass the tests, but I've modified them anyway for completeness.
Contributor Checklist:
- [ ] I have created a new test or updated the unit tests to cover the new/changed functionality.
- [x] I have updated
CHANGES.txt(and read theREADME.rst) - [ ] I have updated the appropriate documentation
Please add a blurb to CHANGES.txt
Just a note this seems to be related to https://github.com/SCons/scons/pull/3384 as well.
I would love to have these fixes in
The matches variable in ActionCallerTestCase.test_get_contents() also looks unused after being redefined on line 2043:
https://github.com/SCons/scons/blob/9861322b2306f17fc9a9037791fa689427c7275e/SCons/ActionTests.py#L2043-L2055
Should I remove it?
Does this also cover issue #3093?
I'd say hold off on any more work on this until we determine if anything breaks if we remove including child nodes contents in get_contents().
Seeing this stalled out completely. I've been gradually sneaking in typing annotations in the codebase - we can do more now that 3.6 is the baseline Python versions, I could really only add return types before. One of the things that shows up as a consistency check problem - it would be nice to declare all get_contents as returning bytes, and all get_text_contents as returning str. Is there a way to move this PR forward? Or do we need to start over? And sneak up on it incrementally?
I believe the sticking point from last time was what Value.get_text_contents() should return: concatenation of all child contents (old behaviour, possibly large memory usage) or concatenation of child csigs (breaks documented behaviour).
Using concat of child csigs leads to some tests to fail, and I'm not sure if it was figured out what the rationale for using the concat of child contents was. I had run out of time to look into these two issues unfortunately.
On the topic of ensuring get_contents() returns bytes and get_text_contents() returns str, I can roll back the commits messing with Value.get_text_contents() and that should be good to go. Just need to make sure it integrates with the existing codebase now.
Deciding what Value.get_text_contents() should return can be discussed in a separate issue.
Honestly it should do neither. The node's children shouldn't be added into it's contents. The scanner/taskmaster should check that the children of the Value() are out of date or not.. I think I have a WIP somewhere for this. Just need to get back to it..