
Optimizing data upload transfer size

Open gygabyte017 opened this issue 4 years ago • 3 comments

Hi, in my project I usually upload very large tables from my local machine, and they all share one characteristic: every column is numeric (double).

I saw that with upload_frame the data is first converted to CSV, and then the CSV is uploaded to CAS.

I was thinking about the fact that in plain CSV every character takes 1 byte, so a number like 123456.123456 uses 13 bytes. If that number were converted to its binary floating-point representation instead, it would take only 8 bytes. This small gain of 5 bytes, repeated for every row and every column, could add up to a large reduction in transfer size, bandwidth, and time.
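As a rough illustration of the text-vs-binary size difference being described (a minimal sketch only, not how SWAT actually encodes data):

```python
import struct

value = 123456.123456

# Text representation, as the number would appear in a CSV cell.
text_bytes = len(str(value).encode("ascii"))

# IEEE 754 double-precision binary representation.
binary_bytes = len(struct.pack("<d", value))

print(text_bytes, binary_bytes)  # 13 vs 8 bytes per value
```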

Alternatively, the plain CSV could be gzipped to greatly reduce the file size, since it contains only a few distinct characters (the digits).
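The gzip idea can be checked quickly on synthetic data. The sketch below builds a numeric-only CSV in memory (a stand-in for a real table) and measures how much gzip shrinks it:

```python
import gzip
import random

# Generate a numeric-only CSV in memory; the small character alphabet
# (digits, '.', ',', '\n') is what makes it compress well.
random.seed(0)
rows = ("%f,%f,%f" % (random.random(), random.random(), random.random())
        for _ in range(10000))
csv_data = "\n".join(rows).encode("ascii")

compressed = gzip.compress(csv_data)
print(f"plain: {len(csv_data)} bytes, gzipped: {len(compressed)} bytes")
```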

Any ideas or hints on how to optimize the data upload transfer size in a smart way?

Thank you.

gygabyte017 avatar Jul 03 '20 13:07 gygabyte017

If you are using the binary protocol, there is a way to upload the data without going through CSV. However, this requires looping through all of the rows of data and constructing the packets. Since looping in Python is a relatively slow operation, it may not gain you anything. When uploading larger amounts of data, I typically recommend using scp / sftp to upload the data to the server and then using the loadtable action to load the server-side file. That is almost always the faster way to do it.
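A hedged sketch of that server-side pattern (the host name, remote path, caslib, and table names are all hypothetical placeholders, and the loadtable call assumes an already-open SWAT connection):

```python
import subprocess

def upload_and_load(local_csv, remote_path, host="cas-host"):
    """Copy a file to the CAS host; a server-side load then follows."""
    # Transfer with scp; sftp or rsync would work the same way.
    subprocess.run(["scp", local_csv, f"{host}:{remote_path}"], check=True)
    # With an open swat.CAS connection `conn`, the server-side file can
    # then be loaded with the loadtable action, along the lines of:
    #   conn.table.loadtable(path=remote_path, caslib="casuser",
    #                        casout={"name": "bigtable"})
```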

kesmit13 avatar Jul 03 '20 17:07 kesmit13

Yes, I'm using the binary protocol; I'm just curious about how a packet is built.

So you wouldn't recommend transforming the CSV into something else, such as an XLSX (which is compressed) or a compressed sas7bdat? Or is there some way to enable gzip compression during the CSV transfer?

gygabyte017 avatar Jul 06 '20 08:07 gygabyte017

Constructing the binary packets and sending them to the server is done using the table.addtable action, but it's fairly tricky to set up. There is a class called CASDataMsgHandler in swat/cas/datamsghandlers.py that contains all of the details. It requires back-and-forth communication with the server to send row buffers.
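For reference, a hedged sketch of that pattern using SWAT's bundled pandas data message handler (the host, port, and table name are placeholders; this only does real work against a reachable CAS server, so the whole attempt is guarded):

```python
sketch_ran = True
try:
    import pandas as pd
    import swat
    from swat.cas import datamsghandlers as dmh

    conn = swat.CAS("cas-host", 5570)  # binary-protocol connection
    df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]})

    # The handler feeds row buffers to addtable on demand, so the
    # DataFrame is streamed without a CSV intermediate.
    handler = dmh.PandasDataFrame(df)
    out = conn.addtable(table="mytable", **handler.args.addtable)
    conn.terminate()
except Exception:
    # swat/pandas not installed, or no CAS server reachable.
    sketch_ran = False
```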

As far as compressed data files go, uploading a gzipped CSV file is probably still going to be the fastest option. XLSX may be compressed, but with all of the extra XML formatting in there, I would expect it to be bigger than compressed CSV. If you can construct a compressed sas7bdat, that may be an option to try.

kesmit13 avatar Jul 06 '20 13:07 kesmit13