oss-fuzz-gen icon indicating copy to clipboard operation
oss-fuzz-gen copied to clipboard

Use introspector to retrieve context.

Open trashvisor opened this issue 11 months ago • 16 comments

Port context retrieval to use introspector APIs (moving from ASTs to DWARF debug information).

trashvisor avatar Mar 07 '24 13:03 trashvisor

There are a few issues preventing this from working at the moment.

Some include... Enums not being a part of the information we receive from FI. Enums are occasionally used in types. FI only returning a single element for a type, when multiple are typically required to be embedded to get a syntactically sound program (.e.g. typedef+struct). Fix is already in place from FIs end but not currently live (I think).

trashvisor avatar Mar 07 '24 23:03 trashvisor

Would it be smarter to wait for an API that gives types recursively in FI? I think we should be able to this ready for Monday 11th

DavidKorczynski avatar Mar 07 '24 23:03 DavidKorczynski

hey @DavidKorczynski that was something we were going to ask about on Monday actually, it would definitely be strongly preferred

trashvisor avatar Mar 07 '24 23:03 trashvisor

Sounds good -- I'll aim to have something ready on FIs side for then.

DavidKorczynski avatar Mar 07 '24 23:03 DavidKorczynski

GKE JOB (added --context to run_all_experiments.py in docker_run.sh): https://pantheon.corp.google.com/kubernetes/job/us-central1/llm-experiment-large/default/weekly-manual-mar-10/logs?project=oss-fuzz

REPORT: https://llm-exp.oss-fuzz.com/Result-reports/scheduled/2024-03-10-weekly-all/sort.html BUCKET: https://pantheon.corp.google.com/storage/browser/oss-fuzz-gcb-experiment-run-logs/Result-reports/scheduled/2024-03-10-weekly-all

DonggeLiu avatar Mar 09 '24 13:03 DonggeLiu

JOB: https://pantheon.corp.google.com/kubernetes/job/us-central1-c/llm-experiment/default/ofg-pr-144-dg REPORT: https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-03-12-144-dg-comparison/index.html BUCKET: https://pantheon.corp.google.com/storage/browser/oss-fuzz-gcb-experiment-run-logs/Result-reports/ofg-pr/2024-03-12-144-dg-comparison BUCKET GS: gs://oss-fuzz-gcb-experiment-run-logs/Result-reports/ofg-pr/2024-03-12-144-dg-comparison

DonggeLiu avatar Mar 12 '24 02:03 DonggeLiu

Looking at https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-03-12-150-dg-comparison/benchmark/output-avahi-avahi_string_list_add_pair,

I'm seeing :

typedef __uint8_t uint8_t;

typedef struct AvahiStringList {
    struct AvahiStringList *next; /**< Pointer to the next linked list element */
    size_t size;  /**< Size of text[] */
    uint8_t text[1]; /**< Character data */
};
typedef struct AvahiStringList {
    struct AvahiStringList *next; /**< Pointer to the next linked list element */
    size_t size;  /**< Size of text[] */
    uint8_t text[1]; /**< Character data */
};
typedef struct AvahiStringList {
    struct AvahiStringList *next; /**< Pointer to the next linked list element */
    size_t size;  /**< Size of text[] */
    uint8_t text[1]; /**< Character data */
};
AvahiStringList * avahi_string_list_add_pair(AvahiStringList *, const char *, const char *);

Is there some logic we need to add to avoid duplicating the types? @DavidKorczynski

oliverchang avatar Mar 12 '24 21:03 oliverchang

Looking at https://llm-exp.oss-fuzz.com/Result-reports/ofg-pr/2024-03-12-150-dg-comparison/benchmark/output-avahi-avahi_string_list_add_pair,

I'm seeing :

typedef __uint8_t uint8_t;

typedef struct AvahiStringList {
    struct AvahiStringList *next; /**< Pointer to the next linked list element */
    size_t size;  /**< Size of text[] */
    uint8_t text[1]; /**< Character data */
};
typedef struct AvahiStringList {
    struct AvahiStringList *next; /**< Pointer to the next linked list element */
    size_t size;  /**< Size of text[] */
    uint8_t text[1]; /**< Character data */
};
typedef struct AvahiStringList {
    struct AvahiStringList *next; /**< Pointer to the next linked list element */
    size_t size;  /**< Size of text[] */
    uint8_t text[1]; /**< Character data */
};
AvahiStringList * avahi_string_list_add_pair(AvahiStringList *, const char *, const char *);

Is there some logic we need to add to avoid duplicating the types? @DavidKorczynski

This is probably an issue on my end. The API we are going to use to query types is going to be changed very soon(there will be an endpoint which supports recursively retrieving type information). The reason this PR was proposed before that change was made was mostly because we initially wanted something out asap to do some fine-tune testing but I don't think that is required at the moment (if my understanding is correct).

trashvisor avatar Mar 12 '24 23:03 trashvisor

This is probably an issue on my end. The API we are going to use to query types is going to be changed very soon(there will be an endpoint which supports recursively retrieving type information). The reason this PR was proposed before that change was made was mostly because we initially wanted something out asap to do some fine-tune testing but I don't think that is required at the moment (if my understanding is correct).

Thanks for the explanation! Sorry I haven't been following this too closely lately. I have another question.

Is the rationale behind the recursive types essentially so we can include all the self contained definitions inline in the targets?

the current prompt structure:

You must insert this code before the function being tested:
<code>
... defs ...
</code>

seems to indicate so.

Should we instead see if we can just include the correct header files instead? It seems like trying to recursively expand/inline the necessary definitions can get hairy very quickly.

IIRC, the header file for a function prototype is difficult to get accurately, but the header file for types are not? Is this true @DavidKorczynski ?

If this is true, it may be better to instead do something like:

You must include these header files in the solution:
<code>
// (sourced from the types associated with the function signature
#include "foo.h"
#include "Bar.h" 
</code>

And also include the following function signature:
<code>
void foo(int *blah);
</code>

Even if the signature is already defined in one of the headers, I believe it's perfectly valid to repeat function signatures.

oliverchang avatar Mar 13 '24 05:03 oliverchang

Is the rationale behind the recursive types essentially so we can include all the self contained definitions inline in the targets?

Yup!

Should we instead see if we can just include the correct header files instead? It seems like trying to recursively expand/inline the necessary definitions can get hairy very quickly.

IIRC, the header file for a function prototype is difficult to get accurately, but the header file for types are not? Is this true @DavidKorczynski ?

If this is true, it may be better to instead do something like:

You must include these header files in the solution:
<code>
#include "foo.h"
#include "Bar.h"
</code>

And also include the following function signature:
<code>
void foo(int *blah);
</code>

Even if the signature is already defined in one of the headers, I believe it's perfectly valid to repeat function signatures.

Really good idea, can totally do that. In fact file information for a type is already exposed. I'll mess around with that idea.

trashvisor avatar Mar 13 '24 05:03 trashvisor

Is there some logic we need to add to avoid duplicating the types?

Upon review, it appears to be a problem with FI.

Multiple items are returned for the same type in a "duplicated" manner. E.g. Slight variations in source files which normalise to the same path.

  1 {"type": "typedef", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/./avahi-common/strlst.h", "source_line": "44"}},
  2 {"type": "struct", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/./avahi-common/strlst.h", "source_line": "40"},
  3  {"type": "typedef", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/avahi-common/./strlst.h", "source_line": "44"}},
  4 {"type": "struct", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/avahi-common/./strlst.h", "source_line": "40"}

trashvisor avatar Mar 13 '24 06:03 trashvisor

Is there some logic we need to add to avoid duplicating the types?

Upon review, it appears to be a problem with FI.

Multiple items are returned for the same type in a "duplicated" manner. E.g. Slight variations in source files which normalise to the same path.

  1 {"type": "typedef", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/./avahi-common/strlst.h", "source_line": "44"}},
  2 {"type": "struct", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/./avahi-common/strlst.h", "source_line": "40"},
  3  {"type": "typedef", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/avahi-common/./strlst.h", "source_line": "44"}},
  4 {"type": "struct", "name": "AvahiStringList", "source": {"source_file": "/src/avahi/avahi-common/./strlst.h", "source_line": "40"}

CC @DavidKorczynski . Seems like we can deduplicate the header paths pretty easily ourselves though.

To clarify my last comment: having the definitions for the structs is still useful context for the target to generate target code that can set up the inputs correctly. My main point is just that we shouldn't be relying on that to make things compile since that seems rather error prone and there are likely a lot of edge cases to consider.

I think we'd still want to have ultimately something like:

You must include these header files in the solution:
<code>
// (sourced from the types associated with the function signature
#include "foo.h"
#include "Bar.h" 
</code>

And also include the following function signature:
<code>
void foo(Type  *blah);
</code>

These are the definitions of the relevant types:
<code>
// We don't have to get this perfectly right, or include every single little dependency. 
struct Type {
  ...
}
</code>

These are examples of how the types are constructed and how the target function is called:
<code>
.. xrefs ...
</code>

Use these as inspiration for the solution.

oliverchang avatar Mar 13 '24 06:03 oliverchang

Should we instead see if we can just include the correct header files instead? It seems like trying to recursively expand/inline the necessary definitions can get hairy very quickly.

We've pushed some changes to make looking up type information easier in FI based on raw DWARF info in https://github.com/ossf/fuzz-introspector/pull/1471 -- there is also an API "addr-to-recursive-dwarf-info" that can be used to extract full type hierarchy for a given type, e.g. the types used in a function signature. There will still be some work on the users end, for example deciding how deep in the type tree one is interested in looking.

IIRC, the header file for a function prototype is difficult to get accurately, but the header file for types are not? Is this true @DavidKorczynski ?

Correct, we don't have the location for a given function prototype defined in a header file, but we have locations for all types (caveat: elements in an enum are not considered types, the enum is a type in and of itself, so we need to have heuristics here).

DavidKorczynski avatar Mar 13 '24 10:03 DavidKorczynski

@trashvisor I updated a new feature to main, so I rebased this PR, fixed the merge conflict, and made a minor naming modification, hope you would not mind : )

DonggeLiu avatar Mar 28 '24 12:03 DonggeLiu

BTW, I suppose we should merge this with the --context flag disabled.

It's better to merge it to avoid potential merge conflicts. Better to disable the flag as the performance is not 100% ready.

DonggeLiu avatar Apr 12 '24 05:04 DonggeLiu

BTW, I suppose we should merge this with the --context flag disabled.

It's better to merge it to avoid potential merge conflicts. Better to disable the flag as the performance is not 100% ready.

that sounds good, i'll clean the code up then

trashvisor avatar Apr 12 '24 05:04 trashvisor

@trashvisor I pushed a small change in https://github.com/google/oss-fuzz-gen/pull/144/commits/5ac4932e7924a049c0b6b6a48ff32c2577f0bb9f to:

  • use jinja2 for the template formatting. this enables conditional insertion of context only when they exist.
  • add a --dry-run flag to run_all_experiments.py which only generates prompts. this can be used to easily look at the results of context-enhanced prompts locally without actually running the full experiment.

oliverchang avatar May 06 '24 04:05 oliverchang

@trashvisor I pushed a small change in 5ac4932 to:

  • use jinja2 for the template formatting. this enables conditional insertion of context only when they exist.
  • add a --dry-run flag to run_all_experiments.py which only generates prompts. this can be used to easily look at the results of context-enhanced prompts locally without actually running the full experiment.

thanks for this. will be ultra ultra helpful

trashvisor avatar May 07 '24 04:05 trashvisor

/gcbrun request_pr_exp.py -n dg-context

DonggeLiu avatar May 07 '24 05:05 DonggeLiu