Generate custom dataset from few user samples

Open diegofiori opened this issue 1 year ago • 1 comments

Description

The first major difficulty in training an AI assistant is obtaining a dataset that is rich enough and large enough to start training at all.

ChatLLaMA needs three different types of data:

  • Instruction + human label for supervised fine-tuning of the Agent
  • Text example + human evaluation (score) for training the reward model
  • Unlabeled instructions to be used in RLHF
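As a rough sketch, the three kinds of data listed above could be modelled as three small record types. The class and field names here (SupervisedSample, RewardSample, RLHFSample, label, score) are hypothetical and chosen only for illustration; they are not taken from the ChatLLaMA code base.

```python
from dataclasses import dataclass


@dataclass
class SupervisedSample:
    """Instruction plus human-written label, for supervised fine-tuning."""
    instruction: str
    label: str


@dataclass
class RewardSample:
    """Text example plus human evaluation score, for the reward model."""
    text: str
    score: float


@dataclass
class RLHFSample:
    """Unlabeled instruction, used during RLHF."""
    instruction: str


# Hypothetical example records, one per dataset type.
sft_sample = SupervisedSample(
    instruction="Translate 'good morning' to Italian.",
    label="Buongiorno.",
)
reward_sample = RewardSample(text="Buongiorno.", score=4.5)
rlhf_sample = RLHFSample(instruction="Translate 'good night' to Italian.")
```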

In the case of a chatbot, the instruction should contain:

  • the Prompt for the model, describing the task it should perform
  • Previous chat interactions
  • User command
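One possible way to assemble such an instruction from its three parts is sketched below. The ChatInstruction class, its field names, and the newline-joined layout are assumptions made for illustration; the issue does not specify how the parts should be combined.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ChatInstruction:
    """Hypothetical container for the three parts of a chatbot instruction."""
    prompt: str                    # task description for the model
    user_command: str              # latest user message
    history: List[str] = field(default_factory=list)  # previous chat turns

    def render(self) -> str:
        # Assumed layout: prompt first, then earlier turns, then the new command.
        return "\n".join([self.prompt, *self.history, self.user_command])


example = ChatInstruction(
    prompt="You are a helpful assistant.",
    user_command="User: What is RLHF?",
    history=["User: Hi", "Assistant: Hello! How can I help?"],
)
instruction_text = example.render()
```

Keeping the parts separate until render() is called makes it easy to swap in a different layout later without touching the stored samples.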

Given a few examples from the user, we would like to generate synthetic data that is “aligned” with the user data.

TODO

  • [ ] Implement a function that analyses user data and produces the dataset needed for the Agent training
  • [ ] Implement a data-generator for the reward model taking as input the “Rules” to be used in the scoring functions. Rules must be written in a single txt-like file.
  • [ ] Integrate generated datasets with available open-source datasets.
  • [ ] Write unit tests for the data-generation function

diegofiori avatar Mar 08 '23 13:03 diegofiori

import pandas as pd


def analyze_user_data(user_data):
    # Define the columns of the dataset
    columns = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']

    # Initialize an empty list to store the data
    dataset = []

    # Loop through each user in the data
    for user in user_data:
        # Extract relevant information from the user data
        # (each user is expected to be a dict containing all of these keys)
        age = user['age']
        gender = user['gender']
        location = user['location']
        interests = user['interests']
        purchase_history = user['purchase_history']
        intent = user['intent']

        # Create a new row for the dataset
        row = [age, gender, location, interests, purchase_history, intent]

        # Append the row to the dataset
        dataset.append(row)

    # Return the dataset as a pandas DataFrame
    return pd.DataFrame(dataset, columns=columns)

This function, which I wrote, takes a list of user records as input and builds a dataset for agent training from each user's information. The columns of the dataset are defined in the columns variable, and an empty list called dataset is initialized to store the rows.

The function loops through each user in the user_data list and extracts relevant information such as age, gender, location, interests, purchase history, and intent. A new row is created for each user, and the row is appended to the dataset list.

Finally, the function returns the dataset list as a pandas DataFrame with the columns defined in the columns variable.

robertmalisa avatar Mar 16 '23 18:03 robertmalisa