Open-Assistant
Create script to fill database with mock-data
Write a Python script that fills the backend PostgreSQL DB with random message trees and users. Insert 500k messages (each on average 500 characters long) from 1k randomly selected mock users (all numbers should be configurable parameters). See the seed-data generation script for inspiration. Report the query time of the current leaderboard stats query on your system.
Background: We discussed some possible strategies for dealing with DB queries, e.g. querying user stats live/online vs. incrementally updating persistent views (e.g. for user stats). We would like to test the performance of some naive count queries, e.g. as in get_user_leaderboard() of the PromptRepository class.
- This test code could be extended in the future to include message reactions, deleted messages and more.
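To make the "naive count query" concrete, here is a minimal, dependency-free sketch of its shape using stdlib sqlite3. This is only an illustration: the real get_user_leaderboard() runs against PostgreSQL through the PromptRepository, and the table and column names below are made up for the example.

```python
import random
import sqlite3
import time

# In-memory toy database standing in for the real PostgreSQL backend.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (user_id INTEGER, text TEXT)")

# Insert mock messages for 1000 hypothetical users.
conn.executemany(
    "INSERT INTO message (user_id, text) VALUES (?, ?)",
    [(random.randrange(1000), "x" * 500) for _ in range(50_000)],
)
conn.commit()

# The naive leaderboard query: count messages per user, highest first.
start = time.time()
leaderboard = conn.execute(
    "SELECT user_id, COUNT(*) AS n FROM message "
    "GROUP BY user_id ORDER BY n DESC LIMIT 10"
).fetchall()
elapsed = time.time() - start
print(f"top user has {leaderboard[0][1]} messages, query took {elapsed:.3f}s")
```

The interesting question for the issue is how this GROUP BY scan scales with table size, which is what the 500k-row load test is meant to measure.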
Hello! Where should we get the sample messages from? Should they all be created on the fly each time, or created once and stored in a JSON file, so the database is filled with the same messages every time?
I think for creating this many messages it's OK if we either reuse the same few messages over and over, or use something like faker to create fake data on the fly. It makes no sense to store thousands and thousands of messages in JSON just to do load tests.
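As a dependency-free alternative to faker, on-the-fly message text can be generated from a small word pool. A minimal sketch (the function name and word list are made up for illustration):

```python
import random

# Tiny word pool; real fake data (e.g. via faker) would be more varied.
WORDS = "the quick brown fox jumps over a lazy dog and runs far away".split()

def fake_message(target_len=500, rng=None):
    """Build a roughly target_len-character pseudo-sentence."""
    rng = rng or random.Random()
    parts, length = [], 0
    while length < target_len:
        word = rng.choice(WORDS)
        parts.append(word)
        length += len(word) + 1  # +1 for the joining space
    return " ".join(parts)

msg = fake_message()
```

Generating text like this is fast enough that persisting it to JSON first would gain nothing for a load test.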
I've created an initial version here: #349. It fills the DB with 1000 users and has some basic setup for message tree generation. If anyone wants to join, feel free to jump in!
Okay, cool, I'd like to contribute! I'll take a look at that, @onegunsamurai.
I am working on this PR, which wants to:
- Store api clients
- Store users & messages with PromptRepository
Related: https://github.com/LAION-AI/Open-Assistant/issues/322
@andreaskoepf This is the reworked version, though still an initial one: #637. It is built with the DummyMessage class flow as a template, but adds message trees, inheritance, and recursive generation.
So far it has been successfully tested on creating users, messages, message_tree_states, and tasks.
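The recursive tree generation mentioned above could look roughly like this. All names and parameters here are illustrative stand-ins, not the actual PR code or ORM models:

```python
import itertools
import random
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class MockMessage:
    """Illustrative stand-in for a message row (not the real DB model)."""
    id: int
    parent_id: Optional[int]
    depth: int
    children: list = field(default_factory=list)

def generate_tree(rng, ids, parent_id=None, depth=0, max_depth=4, max_children=3):
    """Recursively build a random message tree with bounded depth and branching."""
    node = MockMessage(id=next(ids), parent_id=parent_id, depth=depth)
    if depth < max_depth:
        # Each node gets 0..max_children replies, chosen at random.
        for _ in range(rng.randrange(max_children + 1)):
            node.children.append(
                generate_tree(rng, ids, node.id, depth + 1, max_depth, max_children)
            )
    return node

def count_nodes(node):
    return 1 + sum(count_nodes(c) for c in node.children)

tree = generate_tree(random.Random(42), itertools.count())
```

Bounding both depth and branching keeps the expected tree size finite, so the script can generate trees until the target message count is reached.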
This is what I got from ChatGPT:
```python
import psycopg2
import random
import string
import time

def random_string(string_length=500):
    """Generate a random string of fixed length."""
    letters = string.ascii_letters + string.digits
    return ''.join(random.choice(letters) for _ in range(string_length))

def insert_messages(conn, cursor):
    """Insert 500k messages into the database."""
    user_ids = list(range(1, 1001))  # 1k mock users
    message_count = 500000
    for _ in range(message_count):
        user_id = random.choice(user_ids)
        message = random_string()
        cursor.execute(
            "INSERT INTO messages (user_id, message) VALUES (%s, %s)",
            (user_id, message),
        )
    conn.commit()

def report_query_time(conn, cursor):
    """Report the query time of the current leaderboard stats query."""
    start_time = time.time()
    cursor.execute(
        "SELECT user_id, COUNT(*) FROM messages "
        "GROUP BY user_id ORDER BY COUNT(*) DESC"
    )
    query_time = time.time() - start_time
    print("Query time: {:.2f} seconds".format(query_time))

if __name__ == '__main__':
    # Connect to the database
    conn = psycopg2.connect(
        host="localhost",
        database="postgres",
        user="postgres",
        password="password",
    )
    cursor = conn.cursor()

    # Insert messages into the database
    insert_messages(conn, cursor)

    # Report the query time of the leaderboard stats query
    report_query_time(conn, cursor)

    # Close the database connection
    cursor.close()
    conn.close()
```
I'm closing this because we are live and already have a production database with >100k messages and >20k users.