Create unified fake datasets
I think we can create some generic fake extractors and define a schema for each of them, for example:
- `df()->read(from_flow_orders(limit: 1_000))`
- `flow_orders_schema(): Schema`
- do the same for products
- do the same for customers
- do the same for inventory
We should make sure all of those datasets keep a consistent schema and use all possible entry types. Those virtual datasets would need to follow a very strict backward-compatibility policy and evolve their schemas properly.
This would make it much easier and more realistic to test not only the entire pipeline but also stand-alone scalar functions, since we could also create helpers that return just one row (and reuse them inside the fake extractors).
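To make the one-row idea concrete, here is a minimal sketch in plain PHP; `fake_order_row()` and the entry names (`order_id`, `created_at`, `total`, `status`) are only illustrative assumptions, and the real helper would live next to the fake extractors and build a proper Flow row instead of a plain array.

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical helper: returns one fully deterministic "order" row.
 * The $index parameter makes every value reproducible, so assertions
 * never depend on randomness. A fake extractor could reuse this helper
 * inside a generator to produce as many rows as requested.
 */
function fake_order_row(int $index = 0) : array
{
    $createdAt = (new \DateTimeImmutable('2023-01-01 00:00:00 UTC'))
        ->add(new \DateInterval('PT' . ($index * 3_600) . 'S'));

    return [
        'order_id' => \sprintf('order-%05d', $index),
        'created_at' => $createdAt,
        'total' => \round(10.0 + ($index % 50) * 1.25, 2),
        'status' => ($index % 10 === 0) ? 'cancelled' : 'completed',
    ];
}
```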
I would put those into `src/core/etl/tests/Flow/ETL/Tests/Double/Fake/Dataset`.
The important part here is that those datasets can't be totally random; they need to be fully predictable.
For example, orders can't start from a random point in time, and it should be possible to configure at the extractor level things like the number of orders per day, the time period, the percentage of cancelled orders, etc.
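A minimal sketch of what such extractor-level configuration could look like; the option names (`startDate`, `ordersPerDay`, `cancelledRate`) and the deterministic cancellation rule are assumptions, not an agreed API:

```php
<?php

declare(strict_types=1);

/**
 * Hypothetical configuration for the fake orders extractor.
 * Everything is deterministic: a fixed start date, a fixed number of
 * orders per day, and a cancellation rate applied by index, not by chance.
 */
final class FakeOrdersConfig
{
    public function __construct(
        public readonly \DateTimeImmutable $startDate = new \DateTimeImmutable('2023-01-01 00:00:00 UTC'),
        public readonly int $ordersPerDay = 100,
        public readonly float $cancelledRate = 0.05,
    ) {
    }
}

/** Yields predictable order rows honoring the configuration above. */
function generate_fake_orders(FakeOrdersConfig $config, int $limit) : \Generator
{
    $cancelEvery = $config->cancelledRate > 0.0
        ? (int) \floor(1 / $config->cancelledRate)
        : \PHP_INT_MAX;

    for ($i = 0; $i < $limit; $i++) {
        $day = \intdiv($i, $config->ordersPerDay);
        $secondsIntoDay = \intdiv(86_400, $config->ordersPerDay) * ($i % $config->ordersPerDay);

        yield [
            'order_id' => \sprintf('order-%06d', $i),
            'created_at' => $config->startDate->add(new \DateInterval('PT' . ($day * 86_400 + $secondsIntoDay) . 'S')),
            'status' => ($i % $cancelEvery === 0) ? 'cancelled' : 'completed',
        ];
    }
}
```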
Do we want the datasets to be mainly static files that we manipulate, or could we use a library such as https://fakerphp.org/ to bring some controllable randomness into play?
For example, we could provide a "schema" to Faker and let Faker fill in the data for us.
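A minimal sketch of the seeded-Faker approach (assuming `fakerphp/faker` is installed); treating the "schema" as a map of entry name to generator closure is an assumption about how such a definition could look:

```php
<?php

declare(strict_types=1);

use Faker\Factory;
use Faker\Generator;

// Assumes a standard Composer autoloader.
require __DIR__ . '/vendor/autoload.php';

// Seeding Faker makes the "random" data reproducible across test runs.
$faker = Factory::create();
$faker->seed(1234);

// Hypothetical "schema": entry name => closure producing a value.
$orderSchema = [
    'order_id' => static fn (Generator $f) : string => $f->uuid(),
    'created_at' => static fn (Generator $f) : \DateTime => $f->dateTimeBetween('2023-01-01', '2023-12-31'),
    'total' => static fn (Generator $f) : float => $f->randomFloat(2, 1, 500),
    'cancelled' => static fn (Generator $f) : bool => $f->boolean(5), // ~5% chance of true
];

// Fill the rows by walking the "schema".
$rows = [];
for ($i = 0; $i < 10; $i++) {
    $row = [];
    foreach ($orderSchema as $entry => $generate) {
        $row[$entry] = $generate($faker);
    }
    $rows[] = $row;
}
```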
What do you think, @norberttech?
Great question! IMO the data should be 100% generated by Faker, but we should add some options (as I explained above) to make those datasets more predictable.
Tests using those datasets should not rely on specific values but rather on the shape and size of the data.
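For example, a test could assert only the size and the entry names of the generated dataset; a PHPUnit-style sketch reusing the hypothetical `generate_fake_orders()` helper from the sketch above:

```php
<?php

declare(strict_types=1);

use PHPUnit\Framework\TestCase;

final class FakeOrdersDatasetTest extends TestCase
{
    public function test_orders_dataset_shape_and_size() : void
    {
        // generate_fake_orders() and FakeOrdersConfig come from the earlier sketch.
        $rows = \iterator_to_array(generate_fake_orders(new FakeOrdersConfig(), limit: 100));

        // Assert on size and shape, not on concrete values.
        self::assertCount(100, $rows);
        self::assertSame(
            ['order_id', 'created_at', 'status'],
            \array_keys($rows[0]),
        );
    }
}
```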