
[QA] Vizro-ai base dashboard tests

l0uden opened this pull request 1 year ago

Description

This PR adds very simple checks for dashboard creation and a test for errors.

Notice

  • [x] I acknowledge and agree that, by checking this box and clicking "Submit Pull Request":

    • I submit this contribution under the Apache 2.0 license and represent that I am entitled to do so on behalf of myself, my employer, or relevant third parties, as applicable.
    • I certify that (a) this contribution is my original creation and / or (b) to the extent it is not my original creation, I am authorized to submit this contribution on behalf of the original creator(s) or their licensees.
    • I certify that the use of this contribution as authorized by the Apache 2.0 license does not violate the intellectual property rights of anyone else.
    • I have not referenced individuals, products or companies in any commits, directly or indirectly.
    • I have not added data or restricted code in any commits, directly or indirectly.

l0uden · Jul 31 '24 14:07

Thanks for starting to write these - it is obvious that this is super hard to do.

One thing I am wondering is if we could potentially test our dashboard planner alongside e2e tests here. I think an e2e test of the dashboard functionality is also great, and we should keep it and refine it, but maybe it would make sense to test the planner as well, because it could potentially be more predictable.

To me it seems like the planner is the heart of this tool, so maybe it makes sense to run requests of different complexity against this planning function and assess what level of model performance we are happy with.

E.g.

  • simple dashboard request with a card, table, and graph, plus one filter should be handled by all models
  • medium dashboard with more pages and controls should be handled by XYZ models
  • hard dashboard with layout and complex filtering maybe only by the very best models

What we test against could be the whole dashboard, or maybe even just the planner as mentioned above. We could for example check if there were three components requested of the correct type, plus a control of the correct type.
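A planner-level check like that could be sketched as follows. This is only a sketch: the list-of-dicts `plan` shape and the `kind`/`type` keys are illustrative stand-ins, not the actual vizro-ai planner output schema.

```python
from collections import Counter

def check_plan(plan, expected_components, expected_controls):
    """Check that a planner output requested the expected component and control types.

    `plan` is assumed to be a list of dicts like {"kind": "component", "type": "Graph"};
    this shape is illustrative, not the real vizro-ai planner schema.
    """
    components = Counter(item["type"] for item in plan if item["kind"] == "component")
    controls = Counter(item["type"] for item in plan if item["kind"] == "control")
    return components == Counter(expected_components) and controls == Counter(expected_controls)

# Simple request: a card, a table, and a graph, plus one filter.
plan = [
    {"kind": "component", "type": "Card"},
    {"kind": "component", "type": "AgGrid"},
    {"kind": "component", "type": "Graph"},
    {"kind": "control", "type": "Filter"},
]
assert check_plan(plan, ["Card", "AgGrid", "Graph"], ["Filter"])
```

Comparing `Counter`s rather than lists makes the check order-independent, which matters because the planner has no reason to emit components in the order they were requested.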

What I would really like to achieve is easily updatable tests of varying complexity where, as models get updated and improve, we can exchange the models, and get a good feel for what models we want to be able to achieve what things.
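Such an easily updatable matrix could be as simple as a cartesian product of models and request complexities, so exchanging a model only touches one list. A sketch, where the model names and prompt wording are placeholders:

```python
from itertools import product

# Models under evaluation; swap or add entries here as models improve.
MODELS = ["gpt-3.5", "gpt-4o", "gpt-4-turbo", "mistral-large-latest"]

# Requests of increasing complexity, roughly following the tiers above.
REQUESTS = {
    "simple": "A dashboard with a card, a table, and a graph, plus one filter.",
    "medium": "A dashboard with more pages and controls.",
    "hard": "A dashboard with a custom layout and complex filtering.",
}

# One test case per (model, complexity) pair, e.g. to feed pytest.mark.parametrize.
CASES = [(model, complexity) for model, complexity in product(MODELS, REQUESTS)]
```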

Big thanks for your thoughts! I think we should definitely have a meeting with @lingyielia to discuss all these things. For now we wanted this PR to be super simple and stable, just for a start, and we will improve the tests over time. Also, I forgot to open it as a draft; it is not ready for review yet)

l0uden · Aug 01 '24 09:08

Love the idea of checking the dashboard plan, and a comprehensive integration test could be crucial for ensuring that performance stays stable, especially when we introduce bigger features, refactor, or add LLMs from other vendors.

For example, one test set could potentially look like this:

| | Best scenario | baseline | gpt-3.5 | gpt-4o | gpt-4-turbo | mistral-large-latest |
| --- | --- | --- | --- | --- | --- | --- |
| | the goal | should always be true | result could vary | result could vary | result could vary | result could vary |
| What to check | Create a 3-page dashboard.<br>Page 1: 2 charts + 1 table + 3 filters<br>Page 2: 3 charts + 3 filters<br>Page 3: 2 cards + 1 chart + 1 filter | - a dashboard object is created<br>- this dashboard object can be launched without error<br>- 3 pages are created<br>- Page 1: 3 components are present (might be 3 cards instead of charts and a table, but as a baseline that's OK)<br>- Page 2: 3 components present<br>- Page 3: 3 components present | - ALL "baseline criteria" met<br>- a performance score (based on how many Vizro models are created correctly) | - ALL "baseline criteria" met<br>- a performance score (based on how many Vizro models are created correctly) | - ALL "baseline criteria" met<br>- a performance score (based on how many Vizro models are created correctly) | - ALL "baseline criteria" met<br>- a performance score (based on how many Vizro models are created correctly) |
| Passing criteria | 1 Dashboard<br>3 Page<br>6 Chart<br>1 AgGrid<br>2 Card<br>7 Filter (1 RangeSlider, 1 DatePicker, 1 Checklist) | 1 Dashboard<br>3 Page<br>9 Component | Score > 6.0 (score range 0-10) | Score > 9.5 (score range 0-10) | Score > 9.5 (score range 0-10) | Score > 9.6 (score range 0-10) |
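The baseline checks and the 0-10 performance score could be computed by counting the Vizro model types that end up in the generated dashboard. A sketch: `count_models` walks a plain nested-dict spec with a `type` key, which is an illustrative stand-in for real Vizro model objects, and the scoring formula is an assumption, not the agreed metric.

```python
from collections import Counter

def count_models(node, counts=None):
    """Recursively count "type" entries in a nested dict/list dashboard spec."""
    if counts is None:
        counts = Counter()
    if isinstance(node, dict):
        if "type" in node:
            counts[node["type"]] += 1
        for value in node.values():
            count_models(value, counts)
    elif isinstance(node, list):
        for item in node:
            count_models(item, counts)
    return counts

def score(actual, expected):
    """Score 0-10: share of expected model instances that were created correctly."""
    matched = sum(min(actual[kind], n) for kind, n in expected.items())
    return 10 * matched / sum(expected.values())

# Passing criteria for the "best scenario" column, expressed as expected counts.
expected = Counter({"Dashboard": 1, "Page": 3, "Chart": 6, "AgGrid": 1, "Card": 2, "Filter": 7})
```

A run that produced every requested model would score 10.0 against `expected`, and the per-model thresholds in the table then become simple comparisons against that number.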

Then, for testing the dashboard plan itself, we would need a similar scoring system 🤔

lingyielia · Aug 02 '24 13:08

> Amazing! My only feedback is that I would have made it even easier. But we can create our "matrix of complexities" in later PRs!

Agreed. For this entry-point test, it would be considered passed as long as a dashboard is up and running with the minimum requirements.

lingyielia · Aug 17 '24 02:08