Fill DB with Mock Data

Open nil-andreu opened this issue 2 years ago • 8 comments

Script to fill the database with mock data.

nil-andreu avatar Jan 05 '23 19:01 nil-andreu

Just noticed this is still in draft stage. Sorry for the premature review.

andreaskoepf avatar Jan 06 '23 19:01 andreaskoepf

No worries! Thanks for the feedback!

nil-andreu avatar Jan 07 '23 09:01 nil-andreu

It is now ready for code review. I also tested it and it works fine. @yk @andreaskoepf, I'm waiting for your feedback; please merge when you think it is ready.

nil-andreu avatar Jan 12 '23 07:01 nil-andreu

@danielpatrickhug does this interfere with what you did in #613 ?

yk avatar Jan 12 '23 07:01 yk

No. In fact, it uses the realistic data and the work done there for creating the messages, and adds that to the creation of the other mock data.

nil-andreu avatar Jan 12 '23 08:01 nil-andreu

@yk Hi, no, it doesn't look like it directly interferes with #613. It basically moves the seed_data logic into FillDB, and the message-insertion logic looks the same except for a couple of changes to how the test users and API client are created. On startup, the seed_data function is still called in main.py, and the realistic data should be inserted the same way as it was in #613.

From what I remember about the MockDB ticket, it was meant for adding 500,000+ messages and 1,000+ users and testing the leaderboard query time. FillDB should also work with a much larger dataset to accomplish that. However, you should be able to pass in the path to the JSON file storing the messages when instantiating FillDB; currently, it is hard-coded.
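
To illustrate the suggestion about the path, here is a minimal sketch of a constructor that takes the JSON path as an argument. The class body, the `messages_path` parameter, the `load_messages` method, and the default file name are all assumptions for illustration; the actual FillDB code and its hard-coded path are not shown in this thread.

```python
import json
from pathlib import Path

# Hypothetical default; the real hard-coded path in the PR is not shown here.
DEFAULT_MESSAGES_PATH = Path("realistic_seed_data.json")


class FillDB:
    """Illustrative stand-in for the PR's FillDB, not the actual implementation."""

    def __init__(self, messages_path=DEFAULT_MESSAGES_PATH):
        # Accept str or Path so callers can point at a larger dataset.
        self.messages_path = Path(messages_path)

    def load_messages(self):
        # Assumes the file holds a JSON list of message objects.
        with self.messages_path.open(encoding="utf-8") as f:
            messages = json.load(f)
        if not isinstance(messages, list):
            raise ValueError(f"{self.messages_path} must contain a JSON list")
        return messages
```

With something like this, the startup call could keep using the default, while a load test could pass a different file, e.g. `FillDB("big_dataset.json")` (file name made up).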

danielpatrickhug avatar Jan 12 '23 14:01 danielpatrickhug

Okay, I can stop hard-coding the path to the dataset, so that we can then pass in another, bigger dataset. For the 500k+ dataset, should I use something similar to Anthropic's helpful test dataset that you used? If so, could you share the script you used to prepare the dataset?
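
In case a prepared dataset of that size is not available, one possible fallback for pure load testing would be synthetic messages. This is a rough sketch only, and a different technique from the realistic data discussed above; the field names (`id`, `username`, `text`) and function names are made up and would need to match whatever schema FillDB actually expects.

```python
import json
import random


def generate_mock_messages(n_messages, n_users, seed=0):
    """Return a list of synthetic message dicts spread across n_users."""
    rng = random.Random(seed)  # seeded so runs are reproducible
    users = [f"mock_user_{i}" for i in range(n_users)]
    return [
        {
            "id": i,
            "username": rng.choice(users),  # hypothetical field name
            "text": f"mock message {i}",
        }
        for i in range(n_messages)
    ]


def write_mock_dataset(path, n_messages=500_000, n_users=1_000, seed=0):
    """Write the synthetic messages to a JSON file and return the count."""
    messages = generate_mock_messages(n_messages, n_users, seed)
    with open(path, "w", encoding="utf-8") as f:
        json.dump(messages, f)
    return len(messages)
```

Text like this would not be realistic, so it would only help with measuring query time, not with checking how the data looks in the UI.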

nil-andreu avatar Jan 13 '23 12:01 nil-andreu

Hey @andreaskoepf @yk, has the filling of the database already been accomplished? If so, we can close this PR; otherwise I can keep working on it.

nil-andreu avatar Feb 21 '23 08:02 nil-andreu