azure-cosmosdb-spark

Very slow writing performance

Open SandyChapman opened this issue 4 years ago • 6 comments

Issue #318 was closed prematurely. This ticket is to have slow dataframe writes addressed. See #318 for more details.

SandyChapman avatar Jun 03 '20 12:06 SandyChapman

Would it be possible to share a sample dataset, the Cosmos DB config, and the code snippet being used so we can look into this?

revinjchalil avatar Jun 03 '20 12:06 revinjchalil

Any updates here?

This should be an important issue, since we can't use this library in production with such poor performance.

I can't share my dataset, but it has fewer than 200 items, and it is taking minutes to upsert them.

Here is the write configuration:


writeConfig = {
    "Endpoint": "",
    "Masterkey": "",
    "Database": "",
    "Collection": "",
    "Upsert": "true"
}

df.write.format("com.microsoft.azure.cosmosdb.spark").mode("append").options(**writeConfig).save()

CosmosDB config:

  • Throughput: 1000 RU/s
  • Partition key: /date (yyyyMMdd)
  • Consistency level: Session

dbalduini avatar Jun 23 '20 14:06 dbalduini

Something that helped me a bit was to .repartition() the data manually to the number of workers. I'm writing around 1.5M data points with WritingBatchSize = 1000 and ConnectionMaxPoolSize = 100. It reduced the writing time from 13 minutes to 9 minutes, and the number of RUs consumed is also much more constant; without the .repartition() I'm seeing spikes from time to time.
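A minimal sketch of that setup, assuming the same connector and write options mentioned above; the endpoint, key, database, collection, and worker count are placeholders, not values from this thread:

num_workers = 8  # assumption: set this to the number of Spark workers/executors

writeConfig = {
    "Endpoint": "https://<account>.documents.azure.com:443/",  # placeholder
    "Masterkey": "<key>",                                       # placeholder
    "Database": "<database>",                                   # placeholder
    "Collection": "<collection>",                               # placeholder
    "Upsert": "true",
    "WritingBatchSize": "1000",       # batch size mentioned in this comment
    "ConnectionMaxPoolSize": "100"    # connection pool size mentioned in this comment
}

(df.repartition(num_workers)          # spread rows evenly across workers before writing
   .write.format("com.microsoft.azure.cosmosdb.spark")
   .mode("append")
   .options(**writeConfig)
   .save())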

moredatapls avatar Jul 28 '20 14:07 moredatapls

I ended up just using the Cosmos Python library and did a foreachPartition to leverage parallel execution on the cluster.
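A minimal sketch of that approach, assuming the azure-cosmos Python SDK; the endpoint, key, database, and container names are placeholders, and each row is assumed to already carry the "id" field that Cosmos DB requires:

from azure.cosmos import CosmosClient

COSMOS_ENDPOINT = "https://<account>.documents.azure.com:443/"  # placeholder
COSMOS_KEY = "<key>"                                            # placeholder

def upsert_partition(rows):
    # One client per Spark partition, so each task writes to Cosmos DB independently.
    client = CosmosClient(COSMOS_ENDPOINT, credential=COSMOS_KEY)
    container = client.get_database_client("<database>").get_container_client("<collection>")
    for row in rows:
        container.upsert_item(row.asDict())  # assumes each row already has an "id" field

df.rdd.foreachPartition(upsert_partition)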

SandyChapman avatar Jul 28 '20 15:07 SandyChapman

  1. Spark is not a good tool for handling small data. A few minutes for a request, independent of the data size, is something to expect. It should scale better if you have larger data, but a few minutes for writing is still something you should expect.
  2. For others, I recommend looking at the max RU/s setting, as this defaults to 4000 (I think) and could become the bottleneck. To understand whether this is the bottleneck, look at the Throttled Requests (429s) and Normalized RU Consumption (max) metrics in the Azure portal.

AndresNamm avatar Apr 25 '22 13:04 AndresNamm