Optimizing data upload transfer size
Hi, in my project I usually upload very large tables from my local machine, and they all share one characteristic: every column is numeric (double).
I saw that with upload_frame the data is first converted to CSV and then the CSV file is uploaded to CAS.
I was thinking about the fact that in plain CSV every character takes 1 byte, so for instance the number 123456.123456 uses 13 bytes. If that number were converted to its binary floating-point representation instead, it would take only 8 bytes. That small gain of 5 bytes, repeated for every row and every column, could add up to a huge reduction in transfer size, bandwidth, and time.
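For illustration, a minimal sketch of that size arithmetic, assuming Python's struct module packs a double into the same 8-byte IEEE 754 representation a binary protocol would use:

```python
import struct

value = 123456.123456
text_size = len(str(value).encode('ascii'))    # 13 bytes as CSV text
binary_size = len(struct.pack('<d', value))    # 8 bytes as a binary double

print(text_size, binary_size)  # -> 13 8
```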
Or perhaps the plain CSV could be gzipped to greatly reduce the file size, since it contains only a few distinct characters (i.e., the digits).
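As a rough check of how well numeric-only CSV compresses, a sketch using the standard gzip module (the DataFrame here is a made-up stand-in for the real table):

```python
import gzip

import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 10))  # hypothetical all-double table

csv_bytes = df.to_csv(index=False).encode('ascii')
gz_bytes = gzip.compress(csv_bytes)

# Compare raw vs. compressed sizes and the compression ratio
print(len(csv_bytes), len(gz_bytes), f'{len(gz_bytes) / len(csv_bytes):.0%}')
```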
Any ideas or hints on how to optimize the data upload transfer size in a smart way?
Thank you.
If you are using the binary protocol, there is a way to upload the data without going through CSV. However, this requires looping through all of the rows of data and constructing the packets. Since looping in Python is a relatively slow operation, it may not gain you anything. When uploading larger amounts of data, I typically recommend using scp / sftp to upload the data to the server and using the loadtable action to load the server-side file. That is almost always a faster way to do it.
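A hedged sketch of that server-side path: here big_table.csv is assumed to have already been copied (e.g. via scp) into a location visible to a caslib, conn is an existing swat.CAS connection, and the caslib and table names are placeholders:

```python
# Load a file that already sits server-side; no client-side transfer happens here
out = conn.loadtable('big_table.csv', caslib='casuser',
                     casout={'name': 'big_table', 'replace': True})
tbl = out.casTable  # CASTable object referencing the loaded table
```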
Yes, I'm using the binary protocol; I'm just curious about how a packet is built.
So you wouldn't recommend transforming the CSV into something else, maybe an XLSX (which is compressed) or a (compressed) sas7bdat? Or is there some way to enable gzip compression during the CSV transfer?
Constructing the binary packets and sending them to the server is done using the table.addtable action, but it's fairly tricky to set up. There is a class called CASDataMsgHandler in swat/cas/datamsghandlers.py that contains all of the details. It requires back-and-forth communication with the server to send row buffers.
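A sketch of that path, assuming the PandasDataFrame handler (a CASDataMsgHandler subclass shipped in that module) and an existing swat.CAS connection conn; the DataFrame and table name are placeholders:

```python
import pandas as pd
from swat.cas import datamsghandlers as dmh

df = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})  # stand-in data

# The handler wraps the DataFrame and serves row buffers to the server
# as addtable requests them during the back-and-forth exchange.
handler = dmh.PandasDataFrame(df)
out = conn.addtable(table='mytable', replace=True, **handler.args.addtable)
tbl = out.casTable  # CASTable referencing the newly created table
```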
As far as compressed data files go, uploading a gzipped CSV file is probably still the fastest way to go. XLSX may be compressed, but with all of the extra XML formatting in there, I would expect it to be bigger than compressed CSV. If you could construct a compressed sas7bdat, that may be an option to try.
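One way to produce the compressed CSV for that route, assuming pandas; the file and table names are placeholders, and whether the server can load the .gz directly or needs it decompressed first depends on the deployment:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.rand(100_000, 10))  # stand-in numeric table

# Write a gzip-compressed CSV; transfer it to the server (scp/sftp)
# and load it there, as suggested above.
df.to_csv('big_table.csv.gz', index=False, compression='gzip')
```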