Memory Usage
This is a combined messytables/xypath issue.
We need to be cautious about the amount of memory we're using:
http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip
This is a 1.5 MB zip (15 MB of CSV). Running:
fh = dl.grab(url)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)
uses around 3 gigabytes of RAM.
Given that people could trivially upload files this big through the "upload a spreadsheet" tool, we'll need to think about memory consumption.
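A cheap way to watch this from inside the process (a sketch, not from the original code; note ru_maxrss is in kilobytes on Linux but bytes on OS X):

import resource

def peak_rss_mb():
    # Peak resident set size so far; units are KB on Linux, bytes on OS X,
    # so this divisor would need adjusting per platform.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss / 1024.0

# e.g. print this before and after Table.from_messy() to see which step balloons
print('peak RSS: %.0f MB' % peak_rss_mb())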
Top tip: dictionaries are horrific for memory.
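To put a rough number on that (my measurements, not from the original thread; exact sizes vary by CPython version and platform):

import sys

print(sys.getsizeof({}))                # a few hundred bytes for an *empty* dict on 64-bit CPython 2
print(sys.getsizeof({'x': 0, 'y': 0}))  # small dicts don't stay small

Multiply that by one dict per cell of a 15 MB CSV and the 3 GB figure stops being surprising.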
Dave.
Not significantly better with the new changes :( (40%+ of RAM locally; estimated ~2 GB)
import StringIO
import requests
import xypath
import messytables

url = 'http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip'
z = requests.get(url).content
fh = StringIO.StringIO(z)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)
It's not ZIP-specific.
When making large numbers of instances of objects that only have a couple of per-instance variables, you can save a ton of memory by defining __slots__.
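A minimal sketch of where the saving comes from, using hypothetical cell classes rather than xypath's actual ones:

import sys

class DictCell(object):
    """Without __slots__: each instance carries its own attribute dict."""
    def __init__(self, x, y, value):
        self.x = x
        self.y = y
        self.value = value

class SlotCell(object):
    """Same data, but attributes live in fixed slots instead of a dict."""
    __slots__ = ('x', 'y', 'value')
    def __init__(self, x, y, value):
        self.x = x
        self.y = y
        self.value = value

d = DictCell(0, 0, 'spam')
s = SlotCell(0, 0, 'spam')

print(sys.getsizeof(d.__dict__))  # the hidden per-instance dict, paid once per cell
print(hasattr(s, '__dict__'))     # False: no dict, no per-instance dict overhead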
__slots__ has been implemented; performance hasn't been measured yet.
Now 33% RAM. Better, but not a vast improvement.
More improvements, driven by a change in this file; mostly ditching the double-index.
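As an illustration of why a double-index hurts (a guess at the shape of the problem, not xypath's real internals): maintaining two dict indexes over the same cells roughly doubles the index overhead, while a single position-keyed dict can answer both kinds of lookup by filtering on demand.

# Hypothetical sketch -- not xypath's actual data structures.
cells = [(x, y, 'value') for x in range(100) for y in range(100)]

# Double-indexed: every cell is entered into two dict-of-list indexes.
by_row = {}
by_col = {}
for cell in cells:
    by_row.setdefault(cell[1], []).append(cell)
    by_col.setdefault(cell[0], []).append(cell)

# Single-indexed: one dict keyed by position; rows and columns are
# recovered by filtering when needed, trading a little CPU for memory.
by_pos = dict(((c[0], c[1]), c) for c in cells)
row_3 = [c for c in by_pos.values() if c[1] == 3]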
This remains a problem.
Checking with the same code above (just tidied for ease of copy-pasting):
import StringIO
import requests
import xypath
import messytables

url = 'http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip'
z = requests.get(url).content                        # pull the whole zip into memory
fh = StringIO.StringIO(z)                            # wrap the bytes as a file-like object
mt, = list(messytables.zip.ZIPTableSet(fh).tables)   # expect exactly one table
xy = xypath.Table.from_messy(mt)                     # build the xypath table
and running it with /usr/bin/time -v python faostat.py
results in:
Maximum resident set size (kbytes): 3375120
That's a peak of about 3.2 GB, in line with the ~3 GB observed originally.