jazzer
A bug when performing gep-related instrumentation.
https://github.com/CodeIntelligenceTesting/jazzer/blob/3e0e4f177fdb8ecff1c707fed83c16c094b181c9/agent/src/main/java/com/code_intelligence/jazzer/instrumentor/TraceDataFlowInstrumentor.kt#L117-L130
When reading the code above, I found that the gep instrumentations are performed only after a "constant integer push" (see Line 122 and Line 128), which means we instrument when accessing an array element via a constant index.
However, Clang's documentation says the opposite, i.e., we need to instrument every non-constant array index (https://clang.llvm.org/docs/SanitizerCoverage.html#tracing-data-flow). So, I wonder whether this is a bug or whether I have misunderstood something.
// Called before a GetElementPtr (GEP) instruction
// for every non-constant array index.
void __sanitizer_cov_trace_gep(uintptr_t Idx);
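To make the distinction between constant and non-constant indices concrete on the JVM side, here is a minimal sketch. The class and method names are hypothetical, and the bytecode in the comments is what javac typically emits for static methods like these (it can be checked with `javap -c`); it is not taken from Jazzer's instrumentor.

```java
// Hypothetical illustration of the two kinds of array index.
final class IndexKinds {
    // Constant index. Typical bytecode: aload_0, iconst_3, iaload, ireturn
    static int constantIndex(int[] a) {
        return a[3];
    }

    // Non-constant index. Typical bytecode: aload_0, iload_1, iaload, ireturn
    static int variableIndex(int[] a, int i) {
        return a[i];
    }
}
```

In the first method the index is pushed by a constant instruction (`iconst_3`), in the second it is loaded from a local variable (`iload_1`), which is the case the Clang documentation describes as a non-constant index.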
Thanks for taking a deeper look at the instrumentation.
In an internal precursor to Jazzer, we indeed instrumented all array index operations, in particular including those with variable indices. Since our bytecode analysis is not as powerful as that of the LLVM IR, we couldn't easily detect writes to arrays at constant indices, so these were also instrumented. I remember performing some small experiments (there is no Java FuzzBench) in which I found this to "spam" the table of recent compares a bit too much: loops over (parts of) the input array would quickly fill up the table with consecutive indices, which didn't seem very useful compared to the information inserted by other instrumentation types such as switches. At that point, I decided to deviate from the SanCov implementation and instead use these hooks to collect information about "magic" offsets.
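The flooding effect described above can be illustrated with a deliberately simplified model. The sketch below is not libFuzzer's or Jazzer's actual data structure: `RecentComparesSketch`, its ring-buffer eviction policy, and the capacity of 8 are all assumptions made for illustration. It only shows how one loop over consecutive indices can push a more informative value (here a hypothetical switch constant) out of a small bounded table.

```java
// Hypothetical, simplified stand-in for a bounded "table of recent compares".
// Assumption: a plain ring buffer; the real table differs in its details.
final class RecentComparesSketch {
    private final long[] slots;
    private final boolean[] used;
    private int next = 0;

    RecentComparesSketch(int capacity) {
        slots = new long[capacity];
        used = new boolean[capacity];
    }

    // Record a value, overwriting the oldest entry once the table is full.
    void record(long value) {
        slots[next] = value;
        used[next] = true;
        next = (next + 1) % slots.length;
    }

    boolean contains(long value) {
        for (int i = 0; i < slots.length; i++) {
            if (used[i] && slots[i] == value) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        RecentComparesSketch table = new RecentComparesSketch(8);
        table.record(0xCAFEL); // e.g. a constant observed in a switch statement
        byte[] input = new byte[100];
        for (int i = 0; i < input.length; i++) {
            table.record(i); // what instrumenting every input[i] access would record
        }
        // Only the last eight indices (92..99) remain in the table.
        System.out.println(table.contains(0xCAFEL)); // prints false
    }
}
```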
Given that we have no way yet to compare variations of Jazzer against each other on a larger scale, this may very well be suboptimal though. I would be very interested in hearing your thoughts on this. Do you happen to have more insights into the purpose and performance of the GEP instrumentation for native code? I have unsuccessfully tried to find more context or ideally FuzzBench data on it.
Hi, many thanks for your reply, which helped me learn more about Jazzer. Since both Jazzer and libFuzzer are new to me, I still have some questions.
- I am not sure what you mean by "we couldn't easily detect writes to arrays at constant indices". In my opinion, it should be easy to identify writes to arrays at constant indices, because we can check whether the array index is pushed onto the Java stack using bytecodes such as iconst and bipush instead of iload.
- I agree with you that instrumenting all array indexing operations may spam the table of recent compares too much. Perhaps we can do some sampling. For example, given a loop for (int i = 0; i < x; ++i) a[i], we may only instrument when i is an even number, i.e., i = 2, 4, 6, 8... This may alleviate the space issue.
- As for the purpose of GEP instrumentation for non-constant index array operations in native code, I think this is because non-constant indices are more likely to cause buffer overflows than constant indices. And I think the other reason is that a fuzzer aims to generate test inputs that may trigger bugs. If an index is a constant, it is not related to the test inputs and, thus, cannot give the fuzzer any guidance for producing new inputs. What do you think?
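The opcode check from the first point and the sampling idea from the second could be sketched as follows. This is a minimal illustration, not Jazzer's instrumentor: `IndexInstrumentationSketch` and its methods are hypothetical, the opcode values are the standard JVM ones, and looking only at the single instruction that pushes the index is an assumption that ignores indices computed by longer expressions.

```java
// Hypothetical helper, not part of Jazzer. Opcode values follow the JVM specification.
final class IndexInstrumentationSketch {
    static final int ICONST_M1 = 0x02; // iconst_m1 .. iconst_5 occupy 0x02 .. 0x08
    static final int ICONST_5 = 0x08;
    static final int BIPUSH = 0x10;
    static final int SIPUSH = 0x11;
    static final int ILOAD = 0x15;

    // True if the instruction that pushes the index is a constant int push.
    // Assumption: the index is produced by exactly one preceding instruction.
    static boolean isConstantIndexPush(int opcode) {
        return (opcode >= ICONST_M1 && opcode <= ICONST_5)
                || opcode == BIPUSH
                || opcode == SIPUSH;
    }

    // Sampling as proposed above: report only even index values.
    static boolean shouldReport(int index) {
        return (index & 1) == 0;
    }
}
```

With these definitions, `isConstantIndexPush(BIPUSH)` is true and `isConstantIndexPush(ILOAD)` is false. Sampling by parity halves the number of reported indices from a loop, but it does not change the fact that the reported values are still consecutive.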
I have a possibly-related question. Please let me know if this is the best place to discuss, or if I should create a separate issue. With that out of the way, here goes!
I've been using Jazzer + JUnit + ZXing to fuzz Okapi Barcode, a barcode encoding library (basically feed it a string, get back an image of a barcode). I've found a nice handful of bugs, thanks for the great tool!
This type of barcode library is full of lookup tables (i.e. final static int[] / char[] / String[] that are never modified). So for example instead of a long sequence of if (input == 'a') encode("1100111") else if (input == 'b') encode("00011000") else ... you'll instead have something like encode(TABLE[input]). As such, the LOC coverage information doesn't really guide the problem space exploration particularly well, because a significant portion of the real coverage information is really array index coverage, not LOC coverage (i.e. all entries in the lookup tables should be used). See here for a simple example of such a lookup loop.
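The two shapes being contrasted can be sketched side by side. The class, the method names, and the third table entry are hypothetical and are not taken from Okapi Barcode; the first two patterns are the ones quoted above.

```java
// Hypothetical illustration; real barcode tables are much larger.
final class LookupTableSketch {
    static final String[] TABLE = {"1100111", "00011000", "1010101"};

    // Branch-per-symbol version: every symbol has its own branch,
    // so line and edge coverage can tell the symbols apart.
    static String encodeWithBranches(char input) {
        if (input == 'a') {
            return "1100111";
        } else if (input == 'b') {
            return "00011000";
        } else if (input == 'c') {
            return "1010101";
        }
        throw new IllegalArgumentException("unsupported symbol: " + input);
    }

    // Table version: the same code path runs for every valid symbol,
    // so code coverage looks identical no matter which entry is read.
    static String encodeWithTable(char input) {
        int index = input - 'a';
        if (index < 0 || index >= TABLE.length) {
            throw new IllegalArgumentException("unsupported symbol: " + input);
        }
        return TABLE[index];
    }
}
```

Both methods return the same pattern for the same symbol, but only the first gives a coverage-guided fuzzer a distinct signal per symbol.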
For a while I wasn't sure if the GEP description of "constant array indexes" instrumentation referred to "constant arrays" or "constant indexes", but I'm now pretty sure that it's referring to "constant indexes". However, I'm particularly interested in variable indexes into constant arrays.
Which leads to my main question -- if instrumenting all array index operations is too noisy, is it possible that limiting it to all index operations on constant (final static) arrays would be both useful and not too noisy?
This problem might benefit from enabling "value profile" (see https://github.com/CodeIntelligenceTesting/jazzer/blob/main/docs/advanced.md#value-profile).
If not, you could also try using Jazzer's exploreState() function (see the MazeFuzzer example on how to use it: https://github.com/CodeIntelligenceTesting/jazzer/blob/main/examples/src/main/java/com/example/MazeFuzzer.java).
You might have to use it in combination with custom hooks (https://github.com/CodeIntelligenceTesting/jazzer/blob/main/docs/advanced.md#value-profile).
Thanks for the insights! It doesn't look to me like the "value profile" option would help much, but the exploreState + custom hooks option does sound promising. Unfortunately, it looks like I'd have to wrap all of the "interesting" array lookups in one-liner methods in order to be able to intercept them... or is there something similar to @MethodHook that would intercept IALOAD bytecodes instead of method invocations, with information about the index and array referenced?
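For reference, the wrapping approach mentioned above might look roughly like this. `TrackedTable` is a hypothetical class, not a Jazzer API; it only records which indices were read. Forwarding that information to the fuzzer (for example through a method hook on `get`) is left out, since the exact wiring is an assumption this sketch does not make.

```java
import java.util.BitSet;

// Hypothetical wrapper that turns an array read into a method call,
// so that a method-level hook would have something to intercept.
final class TrackedTable {
    private final int[] table;
    private final BitSet seen;

    TrackedTable(int[] table) {
        this.table = table;
        this.seen = new BitSet(table.length);
    }

    // One-liner replacement for table[index] that also records the index.
    int get(int index) {
        int value = table[index]; // read first so out-of-bounds access fails as before
        seen.set(index);
        return value;
    }

    // Number of table entries that have never been read.
    int unseenCount() {
        return table.length - seen.cardinality();
    }
}
```

A harness could use `unseenCount()` after a run to see how much of a lookup table the inputs have reached so far.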
Hi @gredler! Just following up on your question. We are not exactly sure how to solve it and would need more insight from you. Would you be willing to tell us more? You can reach me here to proceed: david[dot]merian [at] code-intelligence[dot]com