ast icon indicating copy to clipboard operation
ast copied to clipboard

Changing size of `nvflag` in `struct Namval` breaks things

Open krader1961 opened this issue 5 years ago • 12 comments

The discussion about support for JSON in issue #820 notes that symbols NV_JSON and NV_JSON_LAST are aliases for other symbols. Which is dangerous and shouldn't be necessary. So I opened PR #1037 to start work on eliminating such aliases. The next step in that cleanup is to change the definition of struct Namval member nvflag to be wider than an unsigned short. However turning it into a 32 bit unsigned int breaks unit tests. Even with no changes to the bit field symbols.

That should be impossible. That is, the size of that bit field should have zero effect on the behavior of the code as long as the size is wide enough to accommodate all bit values stored in that structure member. And converting from unsigned short (16 bits) to uint32_t (32 bits) is guaranteed to satisfy that constraint. This diff breaks 18 unit tests on macOS that pass without this change:

diff --git a/src/cmd/ksh93/include/nval.h b/src/cmd/ksh93/include/nval.h
index 66462dd9..ee293a5b 100644
--- a/src/cmd/ksh93/include/nval.h
+++ b/src/cmd/ksh93/include/nval.h
@@ -104,7 +104,7 @@ struct Namdecl {
 struct Namval {
     Dtlink_t nvlink;        // space for cdt links
     char *nvname;           // pointer to name of the node
-    unsigned short nvflag;  // attributes
+    unsigned int nvflag;  // attributes
 #if _ast_sizeof_pointer >= 8
     uint32_t nvsize;  // size or base
 #else

Something is seriously wrong with the code that allocates or uses a struct Namval. The aforementioned change should have zero effect on the behavior of the code other than the amount of memory used.

krader1961 avatar Dec 01 '18 05:12 krader1961

Running this on macOS results in the "double-free" ASAN failure shown below:

meson -DASAN=true
ninja >x 2>&1
meson test --setup=asan -t4 exit

The ast_realloc() call and subsequent attempt to free the same storage a second time is highly suggestive of a miscalculation of the size of a namval structure. Which is a problem we've seen in other contexts. And is why I opened issue #1002.

==26382==ERROR: AddressSanitizer: attempting double-free on 0x607000003350 in thread T0:
    #0 0x10451c1bd in wrap_free (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x551bd)
    #1 0x1042bd32b in ast_free vmbusy.c:50
    #2 0x103e5adbf in b_cd cd_pwd.c:298
    #3 0x1040f80fb in sh_exec xec.c:1102
    #4 0x104103965 in sh_exec xec.c:1792
    #5 0x10400bf0b in exfile main.c:515
    #6 0x104010082 in sh_main main.c:313
    #7 0x103e57525 in main pmain.c:41
    #8 0x7fff79e9508c in start (libdyld.dylib:x86_64+0x1708c)

0x607000003350 is located 0 bytes inside of 74-byte region [0x607000003350,0x60700000339a)
freed by thread T0 here:
    #0 0x10451c387 in wrap_realloc (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x55387)
    #1 0x1042bd233 in ast_realloc vmbusy.c:40
    #2 0x104031c63 in nv_putval name.c:1692
    #3 0x103e5ab65 in b_cd cd_pwd.c:292
    #4 0x1040f80fb in sh_exec xec.c:1102
    #5 0x104103965 in sh_exec xec.c:1792
    #6 0x10400bf0b in exfile main.c:515
    #7 0x104010082 in sh_main main.c:313
    #8 0x103e57525 in main pmain.c:41
    #9 0x7fff79e9508c in start (libdyld.dylib:x86_64+0x1708c)

previously allocated by thread T0 here:
    #0 0x10451609c in wrap_strdup (libclang_rt.asan_osx_dynamic.dylib:x86_64+0x4f09c)
    #1 0x1040db1a5 in sh_subshell subshell.c:526
    #2 0x103fedbff in comsubst macro.c:2005
    #3 0x103ff04fa in varsub macro.c:1126
    #4 0x103fe1536 in copyto macro.c:599
    #5 0x103fdde44 in sh_mactrim macro.c:186
    #6 0x104014dcb in sh_setlist name.c:284
    #7 0x1040f4117 in sh_exec xec.c:931
    #8 0x10400bf0b in exfile main.c:515
    #9 0x104010082 in sh_main main.c:313
    #10 0x103e57525 in main pmain.c:41
    #11 0x7fff79e9508c in start (libdyld.dylib:x86_64+0x1708c)

krader1961 avatar Dec 01 '18 06:12 krader1961

I verified that changing the size of the struct Namval member nvflag breaks the code even without the changes in PR #1037. Something is seriously FUBAR involving how this structure is used.

krader1961 avatar Dec 01 '18 06:12 krader1961

Very curious. Adding padding to either side of the nvflag member does not cause any problems. So it isn't the size of the struct or the offset of the nvflag member that is the issue. Looking at all the places nvflag is used or modified I don't see any obvious reason changing its size (i.e., width) would cause these failures. Especially since none of the values assigned to it have been changed.

krader1961 avatar Dec 02 '18 00:12 krader1961

Bingo! I added a strategic printf to line 292 in cd_pwd.c to display the pwdnod->nvflag value. When that struct member is an unsigned short it is either 0x3200 or 0x2200. When it is an unsigned int it is 0x103200 or 0x102200. WTF! The 0x100000 is this symbol:

#define NV_NOFAIL 0x100000   // return 0 on failure, no msg

Which is only meant to be passed as a flag to nv_open(). It is not meant to get merged into the nvflag member of a Namval_t. I wonder how many bugs of this nature exist in the code. I'll be surprised if this is the only example. Also, there has to be a related bug because the presence of that unexpected bit in the pwdnod->nvflag value should not have affected whether it was freed. That is, it appears some other portion of the code is incorrectly determining if the NV_NOFREE bit is set. Equivalently, since an otherwise undefined nvflag bit was set it should not have affected the behavior.

krader1961 avatar Dec 02 '18 03:12 krader1961

OMFG! See the setall() function in src/cmd/ksh93/bltins/typeset.c. The code deliberately conflates symbols that can be safely stored in a nvflag member of a Namval_t with symbols that cannot. It simply assumes the latter values won't be stored because they cannot be represented in the nvflag type (i.e., an unsigned short) and will thus be silently dropped by the compiler when it does the assignment. That is, truncating an unsigned int to an unsigned short. This is far too clever.

Worse is there are two non-nvflag symbols that are valid in an unsigned short:

#define NV_ADD 8             // add node if not found
#define NV_IDENT 0x80        // name must be identifier

The former corresponds to NV_UTOL while the latter is currently not used as a nvflag bit value.

There are likely other places in the code which assumes such value truncations on assignment will silently strip unwanted bits.

krader1961 avatar Dec 02 '18 03:12 krader1961

The former corresponds to NV_UTOL while the latter is currently not used as a nvflag bit value.

That turns out not to be true. There are at least three places, and probably more, where NV_IDENT is explicitly used as a bit in the nvflag member of Namval_t. See the src/cmd/ksh93/sh/subshell.c module. Those uses have nothing to do with the original intent of this symbol to require that the nv_open() function validate the name is a valid identifier. So once again we see a deliberate overloading of a symbol solely because it was convenient to do so despite it making the meaning of the code less clear.

krader1961 avatar Dec 03 '18 02:12 krader1961

This code is a white hot mess. Consider this definition src/cmd/ksh93/include/name.h:

#define NV_BLTIN (NV_NOPRINT | NV_EXPORT)

Yet its related symbol is in src/cmd/ksh93/include/nval.h:

#define NV_BLTINOPT NV_ZFILL  // mark builtins in libcmd

Those two symbols should be in the same header (nval.h). And that is true for all other NV_ symbols in name.h such as NV_NOPRINT. And none of them should be defined in terms of unrelated bit values.

krader1961 avatar Dec 03 '18 03:12 krader1961

Another OMFG bogosity. This symbol is meant to only be used as a flag to nv_open():

#define NV_NOREF NV_REF       // don't follow reference

See man src/cmd/ksh93/nval.3. It is not meant to be a valid nvflag value. But simply redefining it to #define NV_NOREF (1 << 24), so it is no longer a valid nvflag bit, breaks the code. I can't even begin to wrap my mind around why the person who wrote this code thought overloading NV_REF this way was a good idea. This is also true of #define NV_ASSIGN NV_NOFREE and probably other symbols.

This explains why so many bugs are being found by ASAN, Coverity Scan, and debug malloc implementations. The writers of this code were being far too clever. The situation wouldn't be nearly as bad had they used proper interfaces that validated their arguments rather than preprocessor macros.

krader1961 avatar Dec 03 '18 03:12 krader1961

FWIW, My intent is to widen nvflag from an unsigned short to an unsigned 32-bit int. And make every symbolic value valid for that struct member to have bit 31 set. Which will make it possible to distinguish between values valid as a Namval_t attribute from those meant to be used in calls to nv_open(), etc. but not stored directly in the nvflag struct member. This is far from perfect and will allow many aliasing bugs to slip through. But it would be a significant step towards properly separating these name spaces.

krader1961 avatar Dec 03 '18 05:12 krader1961

I confirmed that the only places that modify (struct Namval).nvflag are these three functions: nv_setattr(), nv_offattr(), nv_onattr(). So I instrumented those functions. The first two never attempt to merge illegal bits into the bit field. The latter, however, always does so. And when it attempts to turn on specific attrs the only one ever present is NV_EXPORT; and even that is often not present meaning the call is a no-op. The problematic calls are all from here:

https://github.com/att/ast/blob/1cf4ed06b5cfa06d71577799acdb0d966595a0b4/src/cmd/ksh93/sh/name.c#L1371

Note the masking with NV_ATTRIBUTES defined here:

https://github.com/att/ast/blob/1cf4ed06b5cfa06d71577799acdb0d966595a0b4/src/cmd/ksh93/include/name.h#L562-L563

That is the sole use of the symbol. The intent appears to be to ensure those specific attrs are not turned when the aforementioned nv_onattr() call occurs. Why? I have no idea and I'll bet if you asked the old AST team they probably couldn't give the correct answer. Also, note that NV_NOSCOPE, NV_NOARRAY, NV_ASSIGN, NV_VARNAME, and NV_STATIC all refer to bits that cannot be represented in an unsigned short. The only symbols in that list which can be present in the nvflag structure member is NV_ARRAY, NV_IDENT (an alias for NV_MISC) and NV_REF. So the correct definition is

#define NV_ATTRIBUTES ~(NV_ARRAY | NV_IDENT | NV_REF)

That only some of the NV_* symbols that are not valid in an nvflag are in that definition means it is broken in two ways: 1) those invalid symbols aren't necessary due to the implicit truncation to 16 bits, and 2) if those invalid symbols are necessary then quite a few more need to be included.

krader1961 avatar Apr 28 '19 01:04 krader1961

I pondered the code I mentioned in my previous comment some more. It is clear the only reason it also masks out the symbols that are otherwise legal in a nvflag is because they are aliased to symbols that only have meaning when calling nv_open(). For example, #define NV_NOREF NV_REF. This is why APIs like nv_open() need to be changed so as not to conflate the two sets of symbols.

krader1961 avatar Apr 28 '19 03:04 krader1961

Also, NV_ARRAY is never present in the flags value when nv_onattr(np, flags & NV_ATTRIBUTES) is executed. So it is almost a certainty that at some point in the past (but not now) NV_ARRAY was aliased to a symbol that was only meaningful for the behavior of nv_open() rather than being intended to be stored in nvflag. Yet another example of bit-rot in the code and why mixing (or conflating) namespaces in this fashion is a really bad idea.

krader1961 avatar Apr 28 '19 03:04 krader1961