CosmosDB : GQL Schema Generation with Sampling

Open sourabh1007 opened this issue 1 year ago • 6 comments

Refer to this document for details: Schema Inference Design.docx

What is this change?

Add a utility that generates a GQL schema for a NoSQL database. Generation is best-effort.

PR Code Change Summary

Main Code

Samplers:

  • [ ] src/Core/Generator/Sampler/PartitionBasedSampler.cs
  • [ ] src/Core/Generator/Sampler/TimeBasedSampler.cs
  • [ ] src/Core/Generator/Sampler/TopNSampler.cs
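As a rough illustration of what the three sampler names suggest, here is a minimal in-memory sketch in Python. It is not the code in this PR: the function names, the parameters (`n`, `n_per_partition`, `group_count`, `n_per_group`) and the exact selection rules are assumptions made for illustration, and the real samplers run Cosmos DB queries rather than filtering a list.

```python
from collections import defaultdict


def top_n_sample(docs, n):
    """Newest n documents, ordered by the _ts timestamp (assumed rule)."""
    return sorted(docs, key=lambda d: d["_ts"], reverse=True)[:n]


def partition_based_sample(docs, partition_key, n_per_partition):
    """Newest n documents from every distinct partition key value (assumed rule)."""
    by_partition = defaultdict(list)
    for doc in docs:
        by_partition[doc[partition_key]].append(doc)
    sample = []
    for group in by_partition.values():
        sample.extend(top_n_sample(group, n_per_partition))
    return sample


def time_based_sample(docs, group_count, n_per_group):
    """Split the [min _ts, max _ts] range into equal windows and take the
    newest n documents from each window (assumed rule)."""
    if not docs:
        return []
    start = min(d["_ts"] for d in docs)
    end = max(d["_ts"] for d in docs)
    # Ceiling division so that group_count windows cover the whole range.
    width = max(1, -(-(end - start + 1) // group_count))
    windows = defaultdict(list)
    for doc in docs:
        windows[(doc["_ts"] - start) // width].append(doc)
    sample = []
    for index in sorted(windows):
        sample.extend(top_n_sample(windows[index], n_per_group))
    return sample


# Hypothetical sample data: six documents, two partitions, one second apart.
docs = [{"id": i, "pk": "a" if i % 2 else "b", "_ts": 100 + i} for i in range(6)]
newest = top_n_sample(docs, 2)
per_partition = partition_based_sample(docs, "pk", 1)
per_window = time_based_sample(docs, 3, 1)
```

The trade-off the three modes appear to address: a plain top-N read is cheapest but only sees recent shapes of the data, while sampling per partition or per time window spreads the sample across documents that may carry different fields.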

Schema Generator

  • [ ] src/Core/Generator/SchemaGenerator.cs : Generates GQL from a given JSON array. It a) adds alias names, b) tags the container entity, or an entity with an alias, as @model, and c) marks whether each attribute is nullable or non-null (i.e. !).
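To make the idea of best-effort inference concrete, the sketch below derives a GQL type definition from a handful of sampled JSON documents. It is a simplified Python illustration, not the logic in SchemaGenerator.cs: the function name `infer_schema`, the rule that a field is non-null only when every sampled object carries a non-null value for it, and the naming of nested types after their capitalised field name are all assumptions. Alias handling is left out.

```python
def infer_schema(root, docs):
    """Build GQL type definitions from sampled JSON documents (best effort)."""
    # type name -> {"seen": objects visited, "fields": {field: [gql type, non-null count]}}
    types = {}

    def gql_type(key, value):
        if isinstance(value, bool):  # bool must be checked before int
            return "Boolean"
        if isinstance(value, int):
            return "Int"
        if isinstance(value, float):
            return "Float"
        if isinstance(value, str):
            return "String"
        if isinstance(value, dict):
            name = key[:1].upper() + key[1:]  # nested type named after the field
            visit(name, value)
            return name
        if isinstance(value, list):
            inner = [gql_type(key, item) for item in value if item is not None]
            return f"[{inner[0]}]" if inner else None
        return None  # null carries no type information

    def visit(name, obj):
        entry = types.setdefault(name, {"seen": 0, "fields": {}})
        entry["seen"] += 1
        for key, value in obj.items():
            inferred = gql_type(key, value)
            if inferred is None:
                continue  # null or empty list: the field stays nullable
            slot = entry["fields"].setdefault(key, [inferred, 0])
            slot[1] += 1

    for doc in docs:
        visit(root, doc)

    blocks = []
    for name, entry in types.items():
        directive = " @model" if name == root else ""  # tag the container entity
        lines = [f"type {name}{directive} {{"]
        for field, (inferred, count) in entry["fields"].items():
            bang = "!" if count == entry["seen"] else ""
            lines.append(f"  {field}: {inferred}{bang}")
        lines.append("}")
        blocks.append("\n".join(lines))
    return "\n\n".join(blocks)


# Hypothetical sample: the second document has a null "moons" and no "star".
schema = infer_schema("Planet", [
    {"id": "1", "name": "Earth", "moons": 1, "star": {"name": "Sun"}},
    {"id": "2", "name": "Mars", "moons": None},
])
print(schema)
```

Because the result depends entirely on which documents were sampled, a field that happens to be populated in every sampled document is marked non-null even if other documents in the container omit it, which is why the generated schema should be reviewed before use.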

Utility Classes

  • [ ] src/Core/Generator/Sampler/CosmosExecutor.cs : Responsible for running a Cosmos DB query
  • [ ] src/Core/Generator/SchemaGeneratorFactory.cs : a) Creates the connection to Cosmos DB b) Runs the required sampler c) Generates the schema

Export Command

  • [ ] src/Cli/Commands/ExportOptions.cs
  • [ ] src/Cli/Exporter.cs

Test Coverage

  • [ ] Samplers: src/Service.Tests/CosmosTests/SamplerTests.cs a) You might notice that I create items with a gap of 1 second. The sampler queries use _ts, which Cosmos DB generates automatically (there is no way to control it), so the wait is added to get different values for this column.
  • [ ] SchemaGenerator: src/Service.Tests/CosmosTests/SchemaGeneratorTest.cs
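The reason for the 1-second gap can be shown with a small Python sketch. Cosmos DB stores _ts as whole epoch seconds, so writes that land within the same second share a value. The `stamp` helper and the timestamps below are hypothetical and only mimic that truncation; they do not call Cosmos DB.

```python
def stamp(writes):
    """Mimic a server that stamps each write with whole epoch seconds, like _ts."""
    return [{"id": item_id, "_ts": int(written_at)} for item_id, written_at in writes]


# Three writes 10 ms apart collapse onto a single _ts value...
burst = stamp([("a", 1720000000.00), ("b", 1720000000.01), ("c", 1720000000.02)])
# ...while writes one second apart each get their own value.
spaced = stamp([("a", 1720000000.0), ("b", 1720000001.0), ("c", 1720000002.0)])

distinct_burst = len({doc["_ts"] for doc in burst})
distinct_spaced = len({doc["_ts"] for doc in spaced})
```

With identical _ts values, any query that orders or filters on _ts cannot tell the items apart, so assertions about which items a sampler returns would be unreliable.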

Other Changes

  • [ ] src/Cli/CustomLoggerProvider.cs : Set the minimum logger level to Debug, as I added a few debug logs to give more insight (if required) into what is happening in the sampler.

How can you test it?

  1. Go to this location, where you will find the data builder executable:

image

  2. --help lists all the available options

image

  3. Below is the minimal command required to run this feature

TopNSampler image

PartitionBasedSampler image

TimeBasedSampler image

How was this tested?

  • [x] Integration Tests
  • [x] Unit Tests

Jul 08 '24 11:07 sourabh1007

Please update the PR description

Jul 10 '24 07:07 abhishekkumams

just curious since i haven't fully looked through: make sure this feature is feature flagged and doesn't just execute by default

Jul 15 '24 16:07 seantleonard

> just curious since i haven't fully looked through: make sure this feature is feature flagged and doesn't just execute by default

This feature is not part of the DAB flow. The customer has to run it explicitly to generate the GQL.

Jul 23 '24 15:07 sourabh1007

/azp run

Aug 01 '24 07:08 sourabh1007

@sourabh1007 can you add samples for all 3 modes? I have some feedback on the CLI command variables and also on the PR; I will add it once the samples are in the PR.

Aug 08 '24 12:08 sajeetharan

/azp run

Aug 10 '24 00:08 sourabh1007

@sourabh1007 What is the sequence of commands here? All the sampling methods require a config file as input; however, dab init, the initial command that generates the config, cannot be executed without a schema file.

Aug 12 '24 17:08 sajeetharan

> @sourabh1007 What is the sequence of commands here? All the sampling methods require a config file as input; however, dab init, the initial command that generates the config, cannot be executed without a schema file.

Generating the schema file is the very first command (if required); after that, it is the normal flow.

Aug 13 '24 09:08 sourabh1007

/azp run

Aug 17 '24 00:08 sourabh1007

> @sourabh1007 What is the sequence of commands here? All the sampling methods require a config file as input; however, dab init, the initial command that generates the config, cannot be executed without a schema file.

> Generating the schema file is the very first command (if required); after that, it is the normal flow.

What is the experience if customers don't want to use the schema auto-generation part?

Aug 19 '24 15:08 sajeetharan

Also, can you please create separate PRs for the CLI changes and the DAB engine changes?

Aug 19 '24 15:08 sajeetharan

> Also, can you please create separate PRs for the CLI changes and the DAB engine changes?

I’ve already provided an explanation of the files included in the PR. Could you please let me know if there's anything specific that’s unclear or if there's an issue you'd like me to address? I'm happy to help.

Aug 21 '24 02:08 sourabh1007

> Also, can you please create separate PRs for the CLI changes and the DAB engine changes?

> I’ve already provided an explanation of the files included in the PR. Could you please let me know if there's anything specific that’s unclear or if there's an issue you'd like me to address? I'm happy to help.

Please separate the CLI behavior changes and the DAB engine changes for Cosmos DB into different PRs. This will make them easier to manage. I’ll leave the decision up to you and @seantleonard. Also, could you address the comment on the first question?

Aug 21 '24 04:08 sajeetharan

PR is fine as is, no need to break it up at this point. I just need time to go in and review the latest changes. In future, big changes need to be broken up, even with change descriptions.

Aug 21 '24 15:08 seantleonard

/azp run

Aug 22 '24 08:08 sourabh1007

@sourabh1007 can we rename the param names as below for schema extraction?

  1. TopNExtractor
  2. TimePartitionedSampler
  3. EligibleDataSampler

Aug 29 '24 05:08 sajeetharan

/azp run

Sep 02 '24 06:09 sourabh1007

/azp run

Sep 03 '24 02:09 sourabh1007

\azp run

Sep 04 '24 02:09 sourabh1007

/azp run

Sep 04 '24 16:09 sourabh1007

/azp run

Sep 05 '24 02:09 sourabh1007