azure-cosmosdb-spark

Very slow writing performance

Open SandyChapman opened this issue 4 years ago • 6 comments

Issue #318 was closed prematurely. This ticket is to have slow dataframe writes addressed. See #318 for more details.

SandyChapman avatar Jun 03 '20 12:06 SandyChapman

Would it be possible to share a sample dataset, the Cosmos DB config, and the code snippet being used so we can look into this?

revinjchalil avatar Jun 03 '20 12:06 revinjchalil

Any updates here?

This should be an important issue, since we can't use this library in production with such poor performance.

I can't share my dataset, but it has fewer than 200 items, and it is taking minutes to upsert them.

Here is the write configuration:


writeConfig = {
    "Endpoint": "",
    "Masterkey": "",
    "Database": "",
    "Collection": "",
    "Upsert": "true"
}

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("append").options(**writeConfig).save()

CosmosDB config:

  • Throughput: 1000 RU/s
  • Partition key: /date (yyyyMMdd)
  • Consistency level: Session

dbalduini avatar Jun 23 '20 14:06 dbalduini

Something that helped me a bit was to .repartition() the data manually to the number of workers. I'm writing around 1.5M data points with WritingBatchSize = 1000 and ConnectionMaxPoolSize = 100. It reduced the writing time from 13 minutes to 9 minutes, and the number of RUs consumed is also much more constant; without the .repartition() I'm seeing spikes from time to time.
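A minimal sketch of that setup, assuming the same connector and write options mentioned above; the endpoint, key, database, collection, and worker count are placeholders, not values from this thread:

num_workers = 8  # assumption: set this to the number of Spark workers/executors

writeConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",  # placeholder
    "Masterkey": "<key>",                                       # placeholder
    "Database": "<database>",                                   # placeholder
    "Collection": "<collection>",                               # placeholder
    "Upsert": "true",
    "WritingBatchSize": "1000",       # batch size mentioned in this comment
    "ConnectionMaxPoolSize": "100"    # connection pool size mentioned in this comment
}

(df.repartition(num_workers)          # spread rows evenly across workers before writing
   .write.format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**writeConfig)
   .save())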

moredatapls avatar Jul 28 '20 14:07 moredatapls

I ended up just using the Cosmos Python library and did a foreachPartition to leverage parallel execution on the cluster.
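A minimal sketch of that approach, assuming the azure-cosmos Python SDK; the endpoint, key, database, and container names are placeholders, and each row is assumed to already carry the "id" field that Cosmos DB requires:

from azure.cosmos import CosmosClient

COSMOS_ENDPOINT = "https://<account>.documents.azure.com:443/"  # placeholder
COSMOS_KEY = "<key>"                                            # placeholder

def upsert_partition(rows):
    # One client per Spark partition, so each task writes to Cosmos DB independently.
    client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)
    container = client.get_database_client("<database>").get_container_client("<collection>")
    for row in rows:
        container.upsert_item(row.asDict())  # assumes each row already has an "id" field

df.rdd.foreachPartition(upsert_partition)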

SandyChapman avatar Jul 28 '20 15:07 SandyChapman

  1. Spark is not a good tool for handling small data. A few minutes for a request, independent of the data size, is something to expect. It should scale better if you have larger data, but a few minutes for writing is still something you should expect.
  2. For others, I recommend looking at the max RU/s setting, as this defaults to 4000 (I think) and could become the bottleneck. To understand whether this is the bottleneck, look at the Throttled Requests (429s) and Normalized RU Consumption (max) metrics in the Azure portal.

AndresNamm avatar Apr 25 '22 13:04 AndresNamm