Open-Assistant
Fill DB with Mock Data
Script to fill the database with mock data.
Just noticed this is still in Draft stage. Sorry for the premature review.
No worries! Thanks for the feedback!
It is now ready for code review. I also tested it and it works fine. @yk @andreaskoepf, I'm waiting for your feedback; please merge when you think it is ready.
@danielpatrickhug does this interfere with what you did in #613?
No. In fact, it reuses the realistic data and the work already done for creating the messages, and adds that to the creation of the other mock data.
@yk Hi, no, it doesn't look like it directly interferes with #613. It basically moves the seed_data logic into a FillDB class, and the message-insertion logic looks the same except for a couple of changes to how the test users and API client are created. On startup, the seed_data function is still called in main.py, and the realistic data should be inserted the same way as in #613.
From what I remember about the MockDB ticket, it was meant for adding 500,000+ messages and 1,000+ users and testing the leaderboard query time. FillDB should also be usable with a much larger dataset to accomplish that. However, the path to the JSON file storing the messages is currently hard coded; you should be able to pass it in when instantiating FillDB.
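To make the suggestion concrete, here is a minimal sketch of passing the dataset path in at instantiation. It is not the code from this PR: the class name `FillDBSketch`, the `messages_path` parameter, and the assumption that the file holds a JSON array of message objects are all hypothetical.

```python
import json
from pathlib import Path


class FillDBSketch:
    """Hypothetical stand-in for FillDB; only illustrates the configurable path."""

    def __init__(self, messages_path):
        # The caller supplies the dataset location instead of it being hard coded.
        self.messages_path = Path(messages_path)

    def load_messages(self):
        # Assumption: the file contains a JSON array of message objects.
        with self.messages_path.open(encoding="utf-8") as f:
            return json.load(f)
```

With something like this, the startup seeding could keep pointing at the small realistic dataset, while a load test could instantiate the class with a much larger file.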
Okay, I can stop hardcoding the path to the dataset, so that we can pass in a different, larger dataset.
For the 500k+ dataset, should I use something similar to Anthropic's helpful test dataset that you used? If so, could you share the script you used to prepare it?
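If purely synthetic data is enough for timing the leaderboard query, a generator along these lines might be a starting point. This is only a sketch and not the script used for #613: the function names and the `message_id`, `username`, and `text` fields are hypothetical and would need to match whatever schema the messages JSON actually uses.

```python
import json
import random


def generate_mock_messages(num_messages, num_users, seed=0):
    """Build a list of synthetic messages spread across num_users mock users."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same data
    return [
        {
            "message_id": i,  # hypothetical field name
            "username": f"mock_user_{rng.randrange(num_users)}",  # hypothetical
            "text": f"mock message {i}",  # hypothetical
        }
        for i in range(num_messages)
    ]


def write_mock_dataset(path, num_messages=500_000, num_users=1_000):
    """Write the synthetic messages to a JSON file at the given path."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(generate_mock_messages(num_messages, num_users), f)
```

The defaults mirror the 500,000+ messages and 1,000+ users mentioned for the MockDB ticket; smaller values are handy for a quick local check.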
Hey @andreaskoepf @yk, has filling the database already been accomplished? If so, we can close this PR; otherwise I can keep working on it.