Is there a way to perform batch operations across databases and containers (Cosmos DB)?

Open bhattacharyyasom opened this issue 2 years ago • 2 comments

Query/Question I am looking to perform operations across databases and containers to process a large data dump. Here is the situation:

  1. I receive a large data dump (millions of records) that I import into a database/container (say, a) that I own.
  2. I need to read the records one by one, and for each record in the feed I need to:
    • Check for a value from the record in another container (say, b) in another database
    • If a match is found, read the matching record from container b
    • Create a new document in a new container in database a with those values

As you can see, the whole flow above is one operation per record. Since the data dump is huge, I am looking for the most efficient way of handling this.
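The per-record flow above can be sketched without any SDK calls, just to pin down the unit of work. This is a minimal illustration, not Cosmos DB code: the in-memory collections stand in for containers a and b, and the field names `matchKey` and `detail` are hypothetical placeholders for whatever values are actually compared and copied.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical names throughout: "matchKey", "detail", and the in-memory
// collections stand in for real containers and are not part of any SDK.
class EnrichmentSketch {

    /**
     * For each imported record, look up its match key in the reference data
     * (container b) and, when a match exists, build the new output document.
     */
    static List<Map<String, Object>> enrich(
            List<Map<String, Object>> importedRecords,          // container a
            Map<String, Map<String, Object>> referenceByKey) {  // container b, indexed by match key
        List<Map<String, Object>> output = new ArrayList<>();
        for (Map<String, Object> record : importedRecords) {
            Object key = record.get("matchKey");
            if (key == null) {
                continue; // record carries no value to check
            }
            Map<String, Object> match = referenceByKey.get(key);
            if (match == null) {
                continue; // no match in container b: nothing to write
            }
            Map<String, Object> doc = new HashMap<>();
            doc.put("id", record.get("id"));
            doc.put("matchKey", key);
            doc.put("detail", match.get("detail")); // value read from the matching record
            output.add(doc);
        }
        return output;
    }

    public static void main(String[] args) {
        List<Map<String, Object>> feed = List.of(
                Map.of("id", "1", "matchKey", "k1"),
                Map.of("id", "2", "matchKey", "no-such-key"));
        Map<String, Map<String, Object>> reference =
                Map.of("k1", Map.of("detail", "from-b"));
        // Only the first record has a match, so one document is produced.
        System.out.println(enrich(feed, reference));
    }
}
```

Against real containers, the lookup would become a point read or a query on container b, and the cost of that lookup per record is likely what dominates the overall run time.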

Why is this not a Bug or a feature Request? I am not sure whether this is feasible and/or whether other methods exist within the SDK.

Setup (please complete the following information if applicable):

  • OS: PCF deployment
  • IDE: IntelliJ
  • Library/Libraries: Any Java library, preferably spring-data-cosmos

Information Checklist Kindly make sure that you have added all of the information above and checked off the required fields; otherwise we will treat the issue as an incomplete report

  • [x] Query Added
  • [x] Setup information Added

bhattacharyyasom avatar Aug 11 '22 12:08 bhattacharyyasom

@kushagraThapar is there a way to use the change feed processor to address the above use case? Also, is the change feed processor supported in spring-data-cosmos? I appreciate your input. Thanks.

bhattacharyyasom avatar Aug 11 '22 17:08 bhattacharyyasom

@xinlian12 can you please take a look at this?

kushagraThapar avatar Aug 11 '22 19:08 kushagraThapar

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @kushagraThapar, @TheovanKraay

ghost avatar Aug 17 '22 17:08 ghost

@bhattacharyyasom - change feed processor support is not present in spring-data-cosmos. It is worth looking into our Spark connector for Cosmos DB, which supports heavy data loading as well as computation and processing. The Spark connector supports change feed as well. You can find information on it here:

https://docs.microsoft.com/en-us/azure/cosmos-db/sql/sql-api-sdk-java-spark-v3

kushagraThapar avatar Aug 17 '22 20:08 kushagraThapar

@bhattacharyyasom change feed processor is definitely a good approach for this. Spark Connector is a great approach, as Kushagra mentioned. However, if you find that working with DataFrames does not give you the level of programmability you need for the "unit of work" you outlined above (or you simply prefer plain Java), then I recommend using the change feed processor with multiple delegates to process the change feed from "container a" in parallel, custom code in each delegate to handle the matching logic against container b, and the bulk API to saturate throughput when writing back to container a. Hope it helps.
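As a rough, SDK-free sketch of the shape of that suggestion: several handlers each process one batch of changes in parallel, apply the matching logic, and the resulting documents are grouped into bulk-sized write batches. Everything here is an assumption for illustration. The class and method names are hypothetical, the thread pool merely stands in for the change feed processor's delegates, and the final chunks stand in for the batches that would be handed to the bulk API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical sketch: no Cosmos DB SDK types are used.
class ParallelFeedSketch {

    /** Splits a list into consecutive chunks of at most chunkSize items. */
    static <T> List<List<T>> chunk(List<T> items, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int i = 0; i < items.size(); i += chunkSize) {
            int end = Math.min(i + chunkSize, items.size());
            chunks.add(new ArrayList<>(items.subList(i, end)));
        }
        return chunks;
    }

    /**
     * Runs one handler task per feed batch on a fixed pool, then groups every
     * produced document into bulk-sized write batches.
     */
    static <I, O> List<List<O>> process(
            List<List<I>> feedBatches,
            Function<I, O> matchAndBuild, // returns null when there is no match
            int workers,
            int bulkSize) {
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            List<Future<List<O>>> pending = new ArrayList<>();
            for (List<I> batch : feedBatches) {
                Callable<List<O>> handler = () -> {
                    List<O> built = new ArrayList<>();
                    for (I change : batch) {
                        O doc = matchAndBuild.apply(change);
                        if (doc != null) {
                            built.add(doc);
                        }
                    }
                    return built;
                };
                pending.add(pool.submit(handler));
            }
            List<O> all = new ArrayList<>();
            for (Future<List<O>> result : pending) {
                try {
                    all.addAll(result.get()); // collected in submission order
                } catch (InterruptedException | ExecutionException e) {
                    throw new IllegalStateException("handler failed", e);
                }
            }
            return chunk(all, bulkSize);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<List<Integer>> batches = List.of(List.of(1, 2, 3), List.of(4, 5));
        // Odd numbers "match"; even numbers are skipped.
        List<List<String>> bulk =
                process(batches, x -> x % 2 == 0 ? null : "doc-" + x, 2, 2);
        System.out.println(bulk);
    }
}
```

In an actual implementation the change feed processor would deliver the batches and manage the parallelism itself, so the pool and the in-memory collection step would not be needed; the sketch only shows how the matching logic and the bulk grouping fit together.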

TheovanKraay avatar Aug 18 '22 15:08 TheovanKraay

@TheovanKraay Thank you for the suggestions. I will do a short POC to try out these ideas for future use. This really helps. Much appreciated.

bhattacharyyasom avatar Aug 24 '22 11:08 bhattacharyyasom