AzureStorageExplorer
Importing CSV to Table is slow
Preflight Checklist
- [X] I have installed the latest version of Storage Explorer.
- [X] I have checked existing resources, including the troubleshooting guide and the release notes.
- [X] I have searched for similar issues.
Storage Explorer Version
1.25.1
Regression From
1.20.0
Architecture
x64
Storage Explorer Build Number
20220809.7
Platform
Windows
OS Version
Windows 11
Bug Description
I have a simple CSV with 3 columns (PartitionKey, RowKey, Integer) and 20,000 rows.
When I import using version 1.24.3, it imports about 2-3 rows per second. When I import using version 1.20.0, it imports about 500 rows per second.
Azure details:
- Performance: Standard
- Replication: Read-access geo-redundant storage (RA-GRS)
- Account kind: StorageV2 (General purpose v2)
Same bug as https://github.com/microsoft/AzureStorageExplorer/issues/5961.
That issue was marked as done in 1.25, but the problem still exists.
Steps to Reproduce
Import a CSV file using version 1.24.3: it imports about 2-3 rows per second. Import the same file using version 1.20.0: it imports about 500 rows per second.
Actual Experience
Using version 1.24.3, the import runs at about 2-3 rows per second; using version 1.20.0, it runs at about 500 rows per second.
Expected Experience
No response
Additional Context
No response
@Ricky-G Do you know if Storage Explorer is generating lots of network requests? Can you tell me more about the data you're importing?
We use table transactions to send entities in batches to reduce the operation time and the number of requests that need to be sent. Transactions come with a few limitations, namely:
- All entities in a transaction must have the same partition key.
- Each entity can appear only once per transaction.
- A transaction can contain at most 100 entities.
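To make those constraints concrete, here is a minimal sketch (not Storage Explorer's actual code) of how a sequence of rows might be split into valid transaction batches. `Entity` and `toBatches` are hypothetical names introduced for illustration:

```typescript
type Entity = { partitionKey: string; rowKey: string };

// Split a sequence of entities into valid transaction batches: every batch
// has a single partition key, contains no duplicate (partition, row) key
// pair, and holds at most 100 entities.
function toBatches(entities: Entity[], maxBatch = 100): Entity[][] {
  const batches: Entity[][] = [];
  let current: Entity[] = [];
  const seen = new Set<string>();
  for (const e of entities) {
    const id = `${e.partitionKey}|${e.rowKey}`;
    const mustFlush =
      current.length > 0 &&
      (current[0].partitionKey !== e.partitionKey || // partition changed
        current.length >= maxBatch ||                // batch is full
        seen.has(id));                               // duplicate entity
    if (mustFlush) {
      batches.push(current);
      current = [];
      seen.clear();
    }
    current.push(e);
    seen.add(id);
  }
  if (current.length > 0) batches.push(current);
  return batches;
}
```

Note how a change in partition key, or a repeated row key, forces the current batch to be flushed early, which is exactly how fragmented input ends up producing many small batches.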
Storage Explorer makes no assumptions about the order of entities in the import CSV, so it processes each row in the order it appears in the file. How your CSV file is structured can affect the speed of the import operation in a number of ways:
- If most entities don't have any partition keys in common (lots of different partitions), Storage Explorer has to generate smaller, more numerous batches, which slows the operation down.
- Similarly, because rows are processed in order, if the entities in the input CSV file aren't roughly sorted by partition key, Storage Explorer will make smaller, more numerous batches.
- Having a lot of repeating entities close together in your CSV file (rows with the same partition and row key) can also cause batch fragmentation.
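To illustrate the ordering effect in isolation, here is a toy calculation (not Storage Explorer code) that counts batches when flushing only on partition-key changes, ignoring the 100-entity and duplicate-entity limits:

```typescript
// Count how many batches the same rows need in a given order, assuming a
// batch is flushed whenever the partition key changes between adjacent rows.
function countBatches(partitionKeys: string[]): number {
  let count = partitionKeys.length > 0 ? 1 : 0;
  for (let i = 1; i < partitionKeys.length; i++) {
    if (partitionKeys[i] !== partitionKeys[i - 1]) count++;
  }
  return count;
}

countBatches(["A", "B", "A", "B", "A", "B"]); // 6 batches: every row flushes
countBatches(["A", "A", "A", "B", "B", "B"]); // 2 batches: one per partition
```

The same six rows cost six requests unsorted but only two when grouped by partition key, which is why roughly sorting the CSV by partition key speeds imports up.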
@Ricky-G Have you had a chance to answer any of the questions I asked a few days ago about your data? We are planning on releasing a Storage Explorer update soon, and we'd like to include an improvement to table imports if possible. We can't do that without more information from you.
@Ricky-G Something I've also noticed is that after about 500 rows, the upload speed increases from about 2-3 rows per second to 20-30 rows per second. Do you also observe that change?
We have merged some changes that should help speed up the import operation during the column scan phase. We are still looking into performance improvements for the upload phase.
The largest factor in import speed still remains the structure of the data you are uploading. As mentioned above, minimizing the number of batches that need to be sent by minimizing the number of partitions and duplicate entities should speed things up considerably.
Improving performance from there is a challenge. We do a lot to make import reliable, and that comes with trade-offs. One such problem we face is what to do when encountering the same entity (a row with the same partition and row keys as another row) more than once in a CSV file. We determined the safest approach was to break them apart into sequentially processed batches; all versions are uploaded in the order they are found. While more reliable, it means batches cannot be submitted in rapid-fire requests; we have to wait for success or failure before submitting the next batch.
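The sequential requirement described above can be sketched as follows. This is a hypothetical illustration, not Storage Explorer's actual upload code; `Entity` and `submitSequentially` are names introduced for the example:

```typescript
type Entity = { partitionKey: string; rowKey: string };

// Batches may contain different versions of the same entity, and those
// versions must be applied in the order they appeared in the CSV. So each
// transaction is awaited before the next is sent, rather than firing all
// batches in parallel.
async function submitSequentially(
  batches: Entity[][],
  submit: (batch: Entity[]) => Promise<void>
): Promise<void> {
  for (const batch of batches) {
    await submit(batch); // wait for success or failure before the next batch
  }
}
```

Each round trip therefore adds its full latency to the total import time, which is the trade-off between reliability and throughput described above.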
@Ricky-G Since it's been some time since we've heard from you, we are going to close this item. As mentioned earlier, we've made some changes that should help improve performance. If you are still experiencing issues after considering our suggestions here, please respond with answers to our previous questions, and we can reopen the investigation.
I experience the same issue; I have quite a big CSV file of about 40k records, which will take about 44 hours (a rough estimate based on 15 records a minute). Not sure if this tool is meant for that, though.
I will probably do this some other way, or by coding my own solution. But I was hoping this could be achieved pretty easily with this tool.
I'm playing around with the partition key: giving every 100 records the same key, or giving all records the same key. When all records share one key, the tool does not do anything, keeps loading at 0 imported, and seems to crash eventually. With groups of 100 I don't see any progress either.
Edit: Setting the same partition key for every 10 items speeds things up a lot: 220 records in 1-2 minutes. After a while (around 1,500 records) it seems to speed up further as well.
@bacpac If you have a sample file that demonstrates the performance issues, can you share that here? That can help us dig deeper and find out where we might improve. It could be that our own samples aren't sufficient to capture the issue.
Reopening this issue, because import performance is something we want to improve on.
Pervading thoughts on the problem:
- Current implementation is relatively synchronous. Components can't process characters, lines, CSV rows, table entities, or network requests independently of each other, meaning the stage at the end of the pipeline always has to kick the others into gear and wait for them to finish.
- Current implementation appears to bottleneck at processing lines and CSV rows.
Avenues for investigation:
- Switch to streaming APIs to allow for more asynchronous processing.
- Process in a non-JavaScript environment (.NET or native Node).
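As a rough sketch of what the streaming avenue could look like (this is illustrative only, not the implementation under investigation; `parseCsvLine` and `csvRows` are hypothetical names), rows can be yielded as each line arrives rather than after the whole file is read:

```typescript
import { createInterface } from "node:readline";
import { Readable } from "node:stream";

// Minimal CSV field splitter that handles double-quoted fields
// ('a,"b,c",d' -> ["a", "b,c", "d"]) and "" as an escaped quote.
// A real parser would also handle newlines inside quoted fields.
function parseCsvLine(line: string): string[] {
  const fields: string[] = [];
  let field = "";
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line[i];
    if (inQuotes) {
      if (ch === '"' && line[i + 1] === '"') { field += '"'; i++; }
      else if (ch === '"') inQuotes = false;
      else field += ch;
    } else if (ch === '"') inQuotes = true;
    else if (ch === ",") { fields.push(field); field = ""; }
    else field += ch;
  }
  fields.push(field);
  return fields;
}

// Yield rows as soon as each line arrives, so downstream stages (entity
// transformation, batching, upload) can overlap with parsing instead of
// waiting for the whole file to be read into memory.
async function* csvRows(input: Readable): AsyncGenerator<string[]> {
  const lines = createInterface({ input, crlfDelay: Infinity });
  for await (const line of lines) {
    if (line.length > 0) yield parseCsvLine(line);
  }
}
```

The generator can be fed from `createReadStream(path)` for a file on disk, or from any other `Readable`.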
After some in-depth analysis, I can share a few findings:
- The CSV parsing code used in 1.20.0 is roughly 10 times faster than it is in recent versions. Fortunately, this is a relatively easy thing to swap out.
- Using Node streams seems to run about twice as fast as the 1.20.0 implementation.
- Other processes in table import (entity transformations, batching, network requests) do not appear to contribute as significantly to the performance of the operation. After subtracting the time spent parsing, there's little difference in performance between 1.20.0 and recent versions.
From these findings, here's what we'll do:
- Swap out the current CSV parsing code for the Node streams implementation.
- Reevaluate performance. If further improvements are necessary, we will dive deeper into the other import processes.