Create unified fake datasets

Open · norberttech opened this issue 11 months ago · 3 comments

I think we can create some very generic fake extractors and define a schema for each of them, for example (see the sketch after this list):

- `df()->read(from_flow_orders(limit: 1_000))`
- `flow_orders_schema() : Schema`
- do the same for products
- do the same for customers
- do the same for inventory
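
A minimal sketch of how the orders schema helper could look, assuming Flow's schema DSL helpers (`schema()`, `int_schema()`, and friends) and that `Schema` lives at `Flow\ETL\Row\Schema`; the field names are purely illustrative:

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\{bool_schema, datetime_schema, float_schema, int_schema, schema, str_schema};
use Flow\ETL\Row\Schema;

// Hypothetical helper from this proposal: a frozen, versioned schema
// for the fake "orders" dataset. Field names are illustrative; the point
// is that the schema stays stable across releases.
function flow_orders_schema() : Schema
{
    return schema(
        int_schema('order_id'),
        str_schema('customer_id'),
        datetime_schema('created_at'),
        float_schema('total'),
        bool_schema('cancelled'),
    );
}
```

The `products`, `customers`, and `inventory` variants would follow the same pattern.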

We should make sure all of those datasets keep a consistent schema and that, between them, they use all possible entry types. Those virtual datasets would need to follow a very strict backward compatibility policy and support proper schema evolution.

This would make it much easier and more realistic to test not only the entire pipeline but also stand-alone scalar functions, since we could also create helpers that return just one row (and reuse them inside the fake extractors).
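
A hypothetical single-row helper in that spirit, assuming Flow's row/entry DSL functions (`row()`, `int_entry()`, etc.); the name `fake_order_row()` is invented here:

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\{bool_entry, datetime_entry, float_entry, int_entry, row, str_entry};
use Flow\ETL\Row;

// Hypothetical helper: one fully predictable order row, handy for
// unit-testing scalar functions without spinning up a whole pipeline.
function fake_order_row(int $id = 1) : Row
{
    return row(
        int_entry('order_id', $id),
        str_entry('customer_id', \sprintf('customer-%05d', $id)),
        datetime_entry('created_at', new \DateTimeImmutable('2025-01-01 00:00:00')),
        float_entry('total', 99.99),
        bool_entry('cancelled', false),
    );
}
```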

I would put those into `src/core/etl/tests/Flow/ETL/Tests/Double/Fake/Dataset`.

norberttech · Jan 04 '25 17:01

The important part here is that those datasets can't be totally random; they need to be fully predictable.

For example, Orders can't start from a random point in time, and it should be possible to configure at the extractor level things like the number of orders per day, the covered time period, the percentage of cancelled orders, etc.
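
One possible shape for that configuration, sketched as a custom extractor implementing Flow's `Extractor` interface; the class name and all constructor options are hypothetical:

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\{bool_entry, datetime_entry, float_entry, int_entry, row, rows, str_entry};
use Flow\ETL\{Extractor, FlowContext};

// Hypothetical deterministic orders extractor: a fixed start date,
// a fixed orders-per-day rate, and a fixed cancellation ratio, so every
// run produces exactly the same rows.
final class FakeOrdersExtractor implements Extractor
{
    public function __construct(
        private readonly int $limit = 1_000,
        private readonly int $ordersPerDay = 100,
        private readonly float $cancelledRatio = 0.05,
        private readonly \DateTimeImmutable $startDate = new \DateTimeImmutable('2025-01-01'),
    ) {
    }

    public function extract(FlowContext $context) : \Generator
    {
        // Cancel every Nth order so the ratio holds without any randomness.
        $cancelEvery = $this->cancelledRatio > 0 ? (int) \round(1 / $this->cancelledRatio) : 0;

        for ($i = 0; $i < $this->limit; $i++) {
            yield rows(row(
                int_entry('order_id', $i + 1),
                str_entry('customer_id', \sprintf('customer-%05d', $i % 250)),
                datetime_entry('created_at', $this->startDate->modify(\sprintf('+%d days', \intdiv($i, $this->ordersPerDay)))),
                float_entry('total', \round(10.0 + ($i % 490), 2)),
                bool_entry('cancelled', $cancelEvery > 0 && $i % $cancelEvery === 0),
            ));
        }
    }
}
```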

norberttech · Jan 04 '25 17:01

Do we want the datasets to be mainly static files that we manipulate, or could we use libraries such as https://fakerphp.org/ to bring some controllable randomness into play?

For example, we could provide a "schema" to Faker and let Faker fill in the data for us.
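
FakerPHP already supports that kind of controllable randomness through seeding: with a fixed seed, repeated runs produce identical values. A minimal example with the real FakerPHP API (the order fields are illustrative):

```php
<?php

declare(strict_types=1);

use Faker\Factory;

$faker = Factory::create();
// Seeding makes the generator deterministic: the same seed always yields
// the same sequence of "random" values across runs.
$faker->seed(42);

// Illustrative "schema-driven" fill for a single order record.
$order = [
    'order_id' => $faker->numberBetween(1, 1_000),
    'customer_id' => $faker->uuid(),
    'created_at' => $faker->dateTimeBetween('2025-01-01', '2025-12-31'),
    'total' => $faker->randomFloat(2, 10, 500),
    'cancelled' => $faker->boolean(5), // ~5% chance of true
];
```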

What do you think, @norberttech?

Bellangelo · Jan 04 '25 17:01

Great question! IMO the data should be 100% generated by Faker, but we should expose some options, as explained above, to make those datasets more predictable.

Tests using those datasets should not rely on the exact values but rather on the shape and size of the data.
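
So a test might assert on shape and size only, for example. This is a sketch that assumes the hypothetical `from_flow_orders()` / `flow_orders_schema()` helpers from above, plus Flow's `df()` / `fetch()` / `toArray()` API:

```php
<?php

declare(strict_types=1);

use function Flow\ETL\DSL\df;

// Assert on the contract (row count + column names), never on concrete
// values, so the generator can evolve without breaking every test.
$rows = df()
    ->read(from_flow_orders(limit: 1_000)) // hypothetical fake extractor
    ->fetch()
    ->toArray();

\assert(\count($rows) === 1_000);
\assert(\array_keys($rows[0]) === ['order_id', 'customer_id', 'created_at', 'total', 'cancelled']);
```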

norberttech · Jan 04 '25 17:01