Open-Assistant icon indicating copy to clipboard operation
Open-Assistant copied to clipboard

Create script to fill database with mock-data

Open andreaskoepf opened this issue 2 years ago • 8 comments
trafficstars

Write a python script that fills the backend postgresql DB with random message-trees and users. Insert 500k messages (each on averge 500 characters long) of 1k randomly selected mock-users (all numbers parameters). See the seed data generation script for inspiration. Report the query time of the current leaderboard stats query on your system.

Background: We discussed some possible strategies for dealing with DB queries. e.g. querying user stats live/online vs. incrementally updating peristent views (e.g. for user stats). We would like to test the performance of some naive count-queries, e.g. as in the get_user_leaderboard() of the PromptRepository class.

  • This test code could be extended in the future to include message reactions, deleted messages and more.

andreaskoepf avatar Jan 03 '23 06:01 andreaskoepf

Hello! Where do we have to extract the messages of the sample? All those have to be created on the fly each time, or maybe created once and store in a json. So when we fill the database, it is always the same messages?

nil-andreu avatar Jan 03 '23 18:01 nil-andreu

Where do we have to extract the messages of the sample? All those have to be created on the fly each time, or maybe created once and store in a json. So when we fill the database, it is always the same messages?

I think for creating this many messages, it's ok if we just either use the same few messages over and over again, or we use something like faker to create fake data on the fly. makes no sense to store thousands and thousands of messages in json just to do load tests.

yk avatar Jan 03 '23 22:01 yk

I've created initial version here #349. It fills DB with 1000 users, and has some basic setup for message tree generation. If anyone wants to join, feel free to jump in!

onegunsamurai avatar Jan 03 '23 23:01 onegunsamurai

Okay cool would like to contribute! Will take a look into that @onegunsamurai.

nil-andreu avatar Jan 04 '23 07:01 nil-andreu

I am working on this PR, which wants to:

  • Store api clients
  • Store users & messages with PromptRepository

nil-andreu avatar Jan 05 '23 20:01 nil-andreu

related https://github.com/LAION-AI/Open-Assistant/issues/322

andreaskoepf avatar Jan 06 '23 18:01 andreaskoepf

@andreaskoepf This is reworked version, even though - initial. #637 Built with DummyMessage class flow as a template, but added message trees and inheritance + recursive generation.

Right now successfully tested on creating users and messages, message_tree_states and tasks.

onegunsamurai avatar Jan 11 '23 21:01 onegunsamurai

This is what i got from Chat GPT:

import psycopg2 import random import string import time

def random_string(string_length=500): """Generate a random string of fixed length.""" letters = string.ascii_letters + string.digits return ''.join(random.choice(letters) for i in range(string_length))

def insert_messages(conn, cursor): """Insert 500k messages into the database.""" user_ids = list(range(1, 1001)) # 1k mock users message_count = 500000

for i in range(message_count):
    user_id = random.choice(user_ids)
    message = random_string()
    cursor.execute("INSERT INTO messages (user_id, message) VALUES (%s, %s)", (user_id, message))

conn.commit()

def report_query_time(conn, cursor): """Report the query time of the current leaderboard stats query.""" start_time = time.time() cursor.execute("SELECT user_id, COUNT() FROM messages GROUP BY user_id ORDER BY COUNT() DESC") query_time = time.time() - start_time print("Query time: {:.2f} seconds".format(query_time))

if name == 'main': # Connect to the database conn = psycopg2.connect( host="localhost", database="postgres", user="postgres", password="password" ) cursor = conn.cursor()

# Insert messages into the database
insert_messages(conn, cursor)

# Report the query time of the leaderboard stats query
report_query_time(conn, cursor)

# Close the database connection
cursor.close()
conn.close()

nolofi avatar Feb 05 '23 07:02 nolofi

I close this because we are live and we already have a production database with >100k messages and >20k users now.

andreaskoepf avatar Feb 20 '23 12:02 andreaskoepf