
Importing CSV to Table is slow

Open Ricky-G opened this issue 3 years ago • 4 comments

Preflight Checklist

Storage Explorer Version

1.25.1

Regression From

1.20.0

Architecture

x64

Storage Explorer Build Number

20220809.7

Platform

Windows

OS Version

Windows 11

Bug Description

I have a simple CSV with 3 columns and 20,000 rows (PartitionKey,RowKey,Integer)

When I import using version 1.24.3, it imports about 2-3 rows per second. When I import using version 1.20.0, it imports about 500 rows per second.

Azure details:

  • Performance: Standard
  • Replication: Read-access geo-redundant storage (RA-GRS)
  • Account kind: StorageV2 (General purpose v2)

same bug as https://github.com/microsoft/AzureStorageExplorer/issues/5961

This was marked as done in 1.25, but the issue still exists.

Steps to Reproduce

Importing with version 1.24.3 proceeds at about 2-3 rows per second; with version 1.20.0, at about 500 rows per second.

Actual Experience

Importing with version 1.24.3 proceeds at about 2-3 rows per second; with version 1.20.0, at about 500 rows per second.

Expected Experience

No response

Additional Context

No response

Ricky-G avatar Sep 15 '22 05:09 Ricky-G

@Ricky-G Do you know if Storage Explorer is generating lots of network requests? Can you tell me more about the data you're importing?

We use table transactions to send entities in batches to reduce the operation time and the number of requests that need to be sent. Transactions come with a few limitations, namely:

  • All entities in the transaction must have the same partition key.
  • Each entity can appear only once in the transaction.
  • Each transaction is limited to at most 100 entities.

Storage Explorer makes no assumptions about the order of entities in the import CSV, so it processes each row in the order it appears. How your CSV file is written can affect the speed of the import operation in a number of ways:

  • If most entities don't have any partition keys in common (lots of different partitions), Storage Explorer has to generate smaller, more numerous batches, which slows the operation down.
  • Similarly, because rows are processed in order, if the entities in the input CSV file aren't roughly sorted by partition key, Storage Explorer will make smaller, more numerous batches.
  • Having a lot of repeating entities close together in your CSV file (rows with the same partition and row key) can also cause batch fragmentation.
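The transaction rules above can be sketched as a small, order-preserving batcher (hypothetical helper and field names, not Storage Explorer's actual code). It shows how the same rows need more batches when partition keys are interleaved:

```javascript
// Sketch of order-preserving transaction batching (illustrative only).
// A batch must close when: the partition key changes, a duplicate
// (PartitionKey, RowKey) pair appears, or it reaches 100 entities.
const MAX_BATCH_SIZE = 100;

function batchRows(rows) {
  const batches = [];
  let current = [];
  let seenKeys = new Set(); // (PartitionKey, RowKey) pairs in the current batch

  for (const row of rows) {
    const entityKey = `${row.partitionKey}|${row.rowKey}`;
    const partitionChanged =
      current.length > 0 && current[0].partitionKey !== row.partitionKey;

    if (partitionChanged || seenKeys.has(entityKey) || current.length === MAX_BATCH_SIZE) {
      batches.push(current);
      current = [];
      seenKeys = new Set();
    }
    current.push(row);
    seenKeys.add(entityKey);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}

// Rows grouped by partition key pack into 2 batches...
const sorted = [
  { partitionKey: 'A', rowKey: '1' },
  { partitionKey: 'A', rowKey: '2' },
  { partitionKey: 'B', rowKey: '1' },
  { partitionKey: 'B', rowKey: '2' },
];
// ...while the same rows interleaved need 4 batches.
const interleaved = [sorted[0], sorted[2], sorted[1], sorted[3]];
console.log(batchRows(sorted).length);      // 2
console.log(batchRows(interleaved).length); // 4
```

With real data the effect compounds: 20,000 rows under one partition key could fit in 200 full batches, while fully interleaved keys degrade to one entity per request.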

craxal avatar Sep 20 '22 01:09 craxal

@Ricky-G Have you had a chance to answer any of the questions I asked a few days ago about your data? We are planning on releasing a Storage Explorer update soon, and we'd like to include an improvement to table imports if possible. We can't do that without more information from you.

craxal avatar Sep 23 '22 21:09 craxal

@Ricky-G Something I've also noticed is that after about 500 rows, the upload speed increases from about 2-3 rows per second to 20-30 rows per second. Do you also observe that change?

craxal avatar Sep 27 '22 19:09 craxal

We have merged some changes that should help speed up the import operation during the column scan phase. We are still looking into performance improvements for the upload phase.

The largest factor in import speed still remains the structure of the data you are uploading. As mentioned above, minimizing the number of batches that need to be sent by minimizing the number of partitions and duplicate entities should speed things up considerably.
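If you control the input file, one way to apply that advice is to pre-sort it by PartitionKey and collapse duplicate (PartitionKey, RowKey) rows before importing. A minimal sketch, assuming a simple header-plus-rows CSV with no quoted or escaped commas (`prepareCsv` is a hypothetical helper, not part of Storage Explorer):

```javascript
// Pre-sort a simple CSV by PartitionKey and collapse duplicate
// (PartitionKey, RowKey) rows. Keeping the last occurrence of each
// pair matches the final table state an in-order upsert import produces.
// Assumes a header row and no quoted/escaped commas.
function prepareCsv(csvText) {
  const [header, ...lines] = csvText.trim().split('\n');
  const byKey = new Map(); // last occurrence wins
  for (const line of lines) {
    const [pk, rk] = line.split(',');
    byKey.set(`${pk}|${rk}`, { pk, line });
  }
  const rows = [...byKey.values()];
  rows.sort((a, b) => a.pk.localeCompare(b.pk)); // group partitions together
  return [header, ...rows.map((r) => r.line)].join('\n');
}

const input = 'PartitionKey,RowKey,Value\nB,1,10\nA,1,20\nB,1,30\nA,2,40';
console.log(prepareCsv(input));
// PartitionKey,RowKey,Value
// A,1,20
// A,2,40
// B,1,30
```

Note that collapsing duplicates changes behavior if you rely on every version of a row being uploaded in order; only do this when the last value per key is the one you want.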

Improving performance from there is a challenge. We do a lot to make import reliable, and that comes with trade-offs. One such problem we face is what to do when encountering the same entity (a row with the same partition and row keys as another row) more than once in a CSV file. We determined the safest approach was to break them apart into sequentially processed batches; all versions are uploaded in the order they are found. While more reliable, it means batches cannot be submitted in rapid-fire requests; we have to wait for success or failure before submitting the next batch.
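A rough sketch of that sequencing constraint (`submitBatch` is a hypothetical stand-in for the real transaction call, not Storage Explorer's API):

```javascript
// Sequential batch submission. Because two batches may contain different
// versions of the same entity, each must finish before the next is sent;
// total time grows with (number of batches) x (round-trip latency).
async function uploadSequential(batches, submitBatch) {
  const results = [];
  for (const batch of batches) {
    results.push(await submitBatch(batch)); // wait: no rapid-fire requests
  }
  return results;
}

// Demo with a mock submitBatch that records the order batches arrive in.
const order = [];
uploadSequential([['A/1 v1'], ['A/1 v2']], async (batch) => {
  order.push(batch[0]);
  return 'ok';
}).then(() => console.log(order)); // [ 'A/1 v1', 'A/1 v2' ]
```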

craxal avatar Sep 28 '22 17:09 craxal

@Ricky-G Since it's been some time since we've heard from you, we are going to close this item. As mentioned earlier, we've made some changes that should help improve performance. If you are still experiencing issues after considering our suggestions here, please respond with answers to our previous questions, and we can reopen the investigation.

craxal avatar Nov 16 '22 22:11 craxal

I experience the same issue. I have quite a big CSV file of about 40k records, which will take about 44 hours (a rough estimate based on 15 records a minute). Not sure if this tool is meant for that, though.

I will probably do this some other way, or by coding my own solution. But I was hoping this could be achieved pretty easily by this tool.

I'm playing around with how many records share a partition key: one key per 100 records, or one key for all records. When all records share one key, the tool does nothing, stays at 0 imported, and seems to crash eventually. With one key per 100 records I don't see any progress either.

Edit: Sharing a partition key across every 10 records speeds things up a lot: 220 records in 1-2 minutes. After a while (around 1,500 records) it seems to speed up further.

bacpac avatar Dec 19 '22 22:12 bacpac

@bacpac If you have a sample file that demonstrates the performance issues, can you share that here? That can help us dig deeper and find out where we might improve. It could be that our own samples aren't sufficient to capture the issue.

craxal avatar Dec 20 '22 02:12 craxal

Reopening this issue, because import performance is something we want to improve on.

Prevailing thoughts on the problem:

  • The current implementation is largely synchronous. Components can't process characters, lines, CSV rows, table entities, or network requests independently of each other, so the component at the end of the pipeline always has to kick the earlier stages into gear and wait for them to finish.
  • Current implementation appears to bottleneck at processing lines and CSV rows.

Avenues for investigation:

  • Switch to streaming APIs to allow for more asynchronous processing.
  • Process in a non-JavaScript environment (.NET or native Node).

craxal avatar Mar 24 '23 22:03 craxal

After some in-depth analysis, I can share a few findings:

  • The CSV parsing code used in 1.20.0 is roughly 10 times faster than the code in recent versions. Fortunately, this is relatively easy to swap out.
  • Using Node streams seems to run about twice as fast as the 1.20.0 implementation.
  • Other processes in table import (entity transformations, batching, network requests) do not appear to contribute as significantly to the overall time. After subtracting the time spent parsing, there's little performance difference between 1.20.0 and recent versions.

From these findings, here's what we'll do:

  1. Swap out the current CSV parsing code for the Node streams implementation.
  2. Reevaluate performance. If further improvements are necessary, we will dive deeper into the other import processes.

craxal avatar Jun 20 '23 22:06 craxal