deepstate
deepstate copied to clipboard
Investigate common usage patterns of strcmp, strlen, memcmp, and memcpy
Things to try.
String comparison
Make a symbolic-execution aware strcmp(a, b)
, but on the library side. What we're looking for is: is one or or both of a[0]
and b[0]
symbolic. In the case that one is, then I think what we want to try is produce two forks: one where the return value is 0
and we constrain the bytes of a
to be the bytes of b
. We probably need solver support here to ask questions like "can this range of memory be the same." The other fork will return a non-zero number in the range [CHAR_MIN, CHAR_MAX]
. We can do this via the following pattern:
if (DeepState_Bool()) {
int ret = DeepState_IntInRange(CHAR_MIN, CHAR_MAX);
DeepState_Assume(ret != 0);
// Assume they aren't different, but don't actually constraint the bytes
// We want to experiment here to see if it's reasonable that program
// behaviour, actually does the right thing. Perhaps we could also
// provide a deferred assertion mechanism, e.g. `DeepState_AssumeLater`.
return ret;
} else {
// constrain the bytes of one of the sides to be equal to the other
// then and provide a an interface to overwrite them in the backing
// memory store, e.g. `DeepState_WriteByte(addr, 0)` to write `0` to
// `addr`, ignoring any possible errors, and succeeding even if the
// memory is read-only.
return 0;
}
Recall
It is potentially worth it to implement strcmp
to have a record/recall mechanism, where we tell the runtime engine about each constant string, e.g. DeepState_RecordString(b)
if b
was non-symbolic above, then augment symbolic strcmp
s that forks by comparing against all prior recorded strings.
String lengths
Implement strlen(a)
in terms of whether or not a[0]
is symbolic. If it is, then provide an interface to get the minimum string length and the maximum string length, and then do a fork where we constrain the lengths, and use the DeepState_WriteByte
interface to force in a NUL
-terminator.
Lexers
Investigate the code patterns of lexers produced by lex/flex/ragel or yacc/bison/antlr, and perhaps also look into the lexers using by javascript or HTML parsing parsing engines. It's one thing to construct symbolic tokens and benefit from strcmp
-like optimizations described above, but it's another to materialize tokens out of thing air based on one's progress in a state machine.
Perhaps one benefit of using C/C++ is that we can do things like: __attribute__((weak)) yylex
, and use to give us possible access into the inner workings of parsers against which the program is linked.