nebuly
Generate custom dataset from few user samples
Description
The first major difficulty in training an AI assistant is obtaining a dataset that is rich enough and large enough to start training at all.
ChatLLaMA needs three different types of data:
- Instruction + human label for supervised fine-tuning of the Agent
- Text example + human evaluation (score) for training the reward model
- Unlabeled instructions to be used in RLHF
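The three kinds of data could be represented roughly as follows. This is only a minimal sketch: the class and field names (`SFTExample`, `RewardExample`, `RLHFExample`, `user_input`, `completion`, `score`) are assumptions made for illustration, not ChatLLaMA's actual schema.

```python
from dataclasses import dataclass, asdict
import json


@dataclass
class SFTExample:
    """Instruction plus human-written label, for supervised fine-tuning."""
    user_input: str
    completion: str


@dataclass
class RewardExample:
    """Text example plus human evaluation score, for the reward model."""
    user_input: str
    completion: str
    score: float


@dataclass
class RLHFExample:
    """Unlabeled instruction, used during RLHF."""
    user_input: str


def to_json(examples):
    """Serialize a list of examples (of any of the three kinds) to JSON."""
    return json.dumps([asdict(example) for example in examples], indent=2)


print(to_json([RewardExample("Say hello.", "Hello!", 4.0)]))
```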
In the case of a chatbot, the instruction should contain:
- the Prompt for the model, describing the task it should perform
- Previous chat interactions
- User command
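As an illustration, the three parts listed above could be joined into a single instruction string like this. The `build_instruction` helper and the "Speaker: text" layout are assumptions for the sake of the example, not a format prescribed by ChatLLaMA.

```python
def build_instruction(prompt, history, user_command):
    """Join the task prompt, previous chat turns, and the new user command.

    `history` is a list of (speaker, text) pairs. The "Speaker: text"
    layout is an assumed convention, not one required by ChatLLaMA.
    """
    lines = [prompt]
    for speaker, text in history:
        lines.append(f"{speaker}: {text}")
    lines.append(f"User: {user_command}")
    lines.append("Assistant:")
    return "\n".join(lines)


instruction = build_instruction(
    prompt="You are a helpful assistant that answers cooking questions.",
    history=[("User", "Hi!"), ("Assistant", "Hello, how can I help?")],
    user_command="How long should I boil an egg?",
)
print(instruction)
```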
Given a few examples from the user, we would like to generate synthetic data that is “aligned” with the user data.
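One possible way to approach this is few-shot prompting: show a text-generation model a handful of the user's samples and ask for one more in the same style. The sketch below is hedged accordingly. `build_generation_prompt` and `generate_synthetic` are hypothetical helpers, and `generate_fn` stands for any callable that maps a prompt string to generated text (no specific model API is assumed).

```python
import random


def build_generation_prompt(samples, n_shots=3, seed=0):
    """Build a few-shot prompt asking for one more instruction in the same style."""
    rng = random.Random(seed)
    shots = rng.sample(samples, k=min(n_shots, len(samples)))
    lines = ["Write one new instruction in the same style as these examples:"]
    lines += [f"- {shot}" for shot in shots]
    lines.append("- ")  # left open for the model to complete
    return "\n".join(lines)


def generate_synthetic(samples, generate_fn, n_new=5):
    """Call `generate_fn` n_new times, dropping empty and duplicate outputs."""
    seen = set(samples)
    synthetic = []
    for i in range(n_new):
        # A different seed per call varies which samples are shown.
        prompt = build_generation_prompt(samples, seed=i)
        candidate = generate_fn(prompt).strip()
        if candidate and candidate not in seen:
            seen.add(candidate)
            synthetic.append(candidate)
    return synthetic
```

Deduplicating against both the user samples and earlier outputs keeps the synthetic set from simply echoing its inputs; a real pipeline would probably also need a similarity or quality filter.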
TODO
- [ ] Implement a function for analysing user data and producing the dataset needed for the Agent training
- [ ] Implement a data-generator for the reward model taking as input the “Rules” to be used in the scoring functions. Rules must be written in a single txt-like file.
- [ ] Integrate generated datasets with available open-source datasets.
- [ ] Write unit tests for the data-generation function
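For the reward-model item above, a rules file could be loaded and turned into a scoring prompt along these lines. This is a sketch under stated assumptions: one rule per line, `#` marking a comment, a 0 to 5 score range, and the helper names `load_rules` and `build_scoring_prompt` are all invented for illustration.

```python
import tempfile
from pathlib import Path


def load_rules(path):
    """Read one rule per line, skipping blank lines and '#' comments."""
    rules = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            rules.append(line)
    return rules


def build_scoring_prompt(rules, user_input, completion, max_score=5):
    """Assemble a prompt asking a judge model for a 0..max_score rating."""
    rule_block = "\n".join(f"{i}. {rule}" for i, rule in enumerate(rules, start=1))
    return (
        f"Rate the response from 0 to {max_score} according to these rules:\n"
        f"{rule_block}\n\n"
        f"Instruction: {user_input}\n"
        f"Response: {completion}\n"
        "Score:"
    )


# Example with a throwaway rules file.
with tempfile.TemporaryDirectory() as tmp:
    rules_path = Path(tmp) / "rules.txt"
    rules_path.write_text(
        "# scoring rules\nBe polite.\n\nAnswer the question asked.\n",
        encoding="utf-8",
    )
    rules = load_rules(rules_path)

print(build_scoring_prompt(rules, "What is 2 + 2?", "4"))
```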
```python
import pandas as pd


def analyze_user_data(user_data):
    # Define the columns of the dataset
    columns = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']
    # Initialize an empty list to store the data
    dataset = []
    # Loop through each user in the data
    for user in user_data:
        # Extract relevant information from the user data
        age = user['age']
        gender = user['gender']
        location = user['location']
        interests = user['interests']
        purchase_history = user['purchase_history']
        intent = user['intent']
        # Create a new row for the dataset
        row = [age, gender, location, interests, purchase_history, intent]
        # Append the row to the dataset
        dataset.append(row)
    # Return the dataset as a pandas DataFrame
    return pd.DataFrame(dataset, columns=columns)
```
This function I created takes a list of user records as input and builds a dataset for agent training from each user's information. The columns of the dataset are defined in the `columns` variable, and an empty list called `dataset` is initialized to store the rows.
The function loops through each user in the user_data list and extracts relevant information such as age, gender, location, interests, purchase history, and intent. A new row is created for each user, and the row is appended to the dataset list.
Finally, the function returns the dataset list as a pandas DataFrame with the columns defined in the columns variable.
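Because the function indexes each record with `user[...]`, a record that lacks any of the six keys raises a `KeyError`. A small pre-check along these lines could report incomplete records before the DataFrame is built. This is only a sketch: `find_incomplete_records` is a hypothetical helper and the sample records are made up.

```python
REQUIRED_FIELDS = ['age', 'gender', 'location', 'interests', 'purchase_history', 'intent']


def find_incomplete_records(user_data, required=REQUIRED_FIELDS):
    """Return {record index: [missing field names]} for incomplete records."""
    problems = {}
    for index, user in enumerate(user_data):
        missing = [field for field in required if field not in user]
        if missing:
            problems[index] = missing
    return problems


sample = [
    {'age': 31, 'gender': 'F', 'location': 'Rome', 'interests': ['ai'],
     'purchase_history': [], 'intent': 'browse'},
    {'age': 45, 'location': 'Oslo'},  # incomplete on purpose
]
print(find_incomplete_records(sample))
# {1: ['gender', 'interests', 'purchase_history', 'intent']}
```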