
[WIP] Synthetic examples

Open jacobmarks opened this issue 1 year ago • 0 comments

Start of a concept for generating synthetic examples for a dataset based on its specific fields and values. This is meant to partially address the problem that, currently, GPT doesn't always know the schema of your dataset - especially when it comes to non-standard fields like filepath and metadata, or non-label fields.

It is implemented via a FieldExampleGenerator class, which randomly generates field-type-specific examples from templates. The design is flexible enough that it can be used for any field type. The only things that need to be done to add a new field type are:

  1. Fill the self.patterns dictionary. The keys should be the patterns to fill, and the values should be the function objects that generate their replacements. These functions need to be implemented, but they are typically one line of code each.
  2. Set the self.example_templates attribute, which should be a list of dicts, each containing a query and a string-form list of view stages.
  3. Change the self.filters attribute if needed - this defines the conditions used when turning these examples into a pandas DataFrame that we can filter later on.
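
As a rough sketch of those three steps (the subclass name, template, and filter values here are hypothetical, not the actual implementation), a generator for a new field type might look something like this:

import random

from links.synthetic_example_generator import FieldExampleGenerator  ## assumed location of the base class

class BooleanFieldExampleGenerator(FieldExampleGenerator):
    def __init__(self, dataset, field_name):
        super().__init__(dataset, field_name)

        ## 1. patterns to fill -> functions that generate their replacements
        self.patterns = {
            "_FIELD_": lambda: field_name,
            "_BOOL_": lambda: random.choice(["True", "False"]),
        }

        ## 2. templates pairing a query with a string-form list of view stages
        self.example_templates = [
            {
                "query": "images where _FIELD_ is _BOOL_",
                "stages": "[match(F('_FIELD_') == _BOOL_)]",
            },
        ]

        ## 3. conditions used when turning these examples into a pandas DataFrame
        self.filters = {"media_type": "image"}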

This is what it looks like for string fields:

import fiftyone as fo
import fiftyone.zoo as foz
from fiftyone import ViewField as F

from links.synthetic_example_generator import StringFieldExampleGenerator

dataset = foz.load_zoo_dataset("quickstart")

## add a string field to the dataset, populated with the first detection label
dataset.add_sample_field('my_field', fo.StringField)
view = dataset.set_field('my_field', F('ground_truth.detections.label')[0])
view.save('my_field')

num_examples = 5  ## number of examples to generate
sfeg = StringFieldExampleGenerator(dataset, "my_field")
examples = sfeg.generate_examples(num_examples)

This results in the following:

[{'query': ' images where my_field is skis or traffic light',
  'stages': "[match(F('my_field').is_in(['skis', 'traffic light']))]"},
 {'query': 'Exclude the my_field field from all samples',
  'stages': "[exclude_fields('my_field')]"},
 {'query': 'Exclude the my_field field from all samples',
  'stages': "[exclude_fields('my_field')]"},
 {'query': ' images where my_field is sheep or tie',
  'stages': "[match(F('my_field').is_in(['sheep', 'tie']))]"},
 {'query': 'Only images that have my_field not equal to bed',
  'stages': "[match(F('my_field') != 'bed')]"}]

We will also want to keep only the unique examples, so that we don't feed in any duplicates (note the repeated 'Exclude the my_field field' example above).
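
For instance (a minimal sketch; it only assumes that generate_examples returns a list of dicts like the output above), deduplication could be as simple as:

## keep only unique (query, stages) pairs, preserving order
seen = set()
unique_examples = []
for example in examples:
    key = (example["query"], example["stages"])
    if key not in seen:
        seen.add(key)
        unique_examples.append(example)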

To fold this into the rest of the code, the workflow would look something like this:

  1. Given the dataset, generate field-type-specific examples for each field (obviously excluding the default ones)
  2. Compute the embeddings for these and store them separately
  3. In our example selection link, instead of selecting the top 40 examples from the generic candidates, take, e.g., the top 30 of those plus, e.g., the top 10 of these dataset-specific examples (see the sketch below)
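
A rough sketch of that selection step, assuming the generic and dataset-specific examples have precomputed embedding matrices from step 2 (all function and variable names here are hypothetical):

import numpy as np

def _top_k(query_emb, example_embs, examples, k):
    ## rank examples by cosine similarity to the query embedding
    sims = example_embs @ query_emb / (
        np.linalg.norm(example_embs, axis=1) * np.linalg.norm(query_emb) + 1e-9
    )
    return [examples[i] for i in np.argsort(-sims)[:k]]

def select_examples(query_emb, generic, generic_embs, specific, specific_embs):
    ## top 30 generic candidates plus top 10 dataset-specific candidates
    return _top_k(query_emb, generic_embs, generic, 30) + _top_k(
        query_emb, specific_embs, specific, 10
    )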

jacobmarks, Jun 01 '23 23:06