
Memory Usage

Open scraperdragon opened this issue 12 years ago • 6 comments

This is a combined messytables/xypath issue

We need to be cautious about the amount of memory we're using:

http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip

a 1.5MB zip (15MB csv)

with

fh = dl.grab(url)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)

uses around 3 gigabytes of RAM.

Given that people could trivially upload files this big via the "upload a spreadsheet" tool, we'll need to think about memory consumption.

Top tip: dictionaries are horrific.
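A hypothetical illustration of the overhead being alluded to: storing a single cell's data in a dict costs far more than storing the same data in a tuple, and that difference is multiplied by millions of cells (field names here are made up for the example):

```python
import sys

# One spreadsheet cell stored two ways (field names hypothetical).
as_dict = {'x': 0, 'y': 0, 'value': 'spam'}
as_tuple = (0, 0, 'spam')

# The dict's hash table dwarfs the tuple of the same data;
# multiply by millions of cells and it dominates memory use.
print(sys.getsizeof(as_dict), sys.getsizeof(as_tuple))
```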

Dave.

scraperdragon avatar Jul 22 '13 15:07 scraperdragon

Not significantly better with the new changes :( (40%+ RAM locally; estimate ~2G)

import StringIO
import requests
import xypath
import messytables

url = 'http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip'
z = requests.get(url).content
fh = StringIO.StringIO(z)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)

It's not ZIP specific.

scraperdragon avatar Sep 06 '13 13:09 scraperdragon

When making large numbers of instances of objects which only have a couple of per-instance variables, you can save a ton of memory by defining __slots__.
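A minimal sketch of the suggestion (class and field names are hypothetical, not xypath's actual classes): declaring `__slots__` removes the per-instance `__dict__`, so each instance stores its fields in fixed slots instead of a hash table.

```python
import sys

# Two versions of a cell class: one ordinary, one with __slots__
# so instances carry no per-instance __dict__.
class PlainCell:
    def __init__(self, x, y, value):
        self.x, self.y, self.value = x, y, value

class SlottedCell:
    __slots__ = ('x', 'y', 'value')

    def __init__(self, x, y, value):
        self.x, self.y, self.value = x, y, value

plain = PlainCell(0, 0, 'spam')
slotted = SlottedCell(0, 0, 'spam')

# The slotted instance is smaller than the plain instance plus its dict.
print(sys.getsizeof(slotted))
print(sys.getsizeof(plain) + sys.getsizeof(plain.__dict__))
```

The trade-off is that slotted instances can no longer grow arbitrary attributes at runtime, which is usually acceptable for simple value objects like cells.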

pwaller avatar Sep 06 '13 14:09 pwaller

__slots__ has been implemented; performance hasn't been tested yet.

scraperdragon avatar Mar 06 '14 17:03 scraperdragon

Now 33% RAM. Better, but not a vast improvement.

scraperdragon avatar Mar 20 '14 11:03 scraperdragon

More improvements, driven by a change in this file. Mostly ditching the double-index.

scraperdragon avatar Jul 09 '14 15:07 scraperdragon

This remains a problem.

Checking with the same code above (just tidied for ease of copy-pasting):

import StringIO
import requests
import xypath
import messytables

url = 'http://faostat.fao.org/Portals/_Faostat/Downloads/zip_files/FoodSupply_Crops_E_Africa_1.zip'
z = requests.get(url).content
fh = StringIO.StringIO(z)
mt, = list(messytables.zip.ZIPTableSet(fh).tables)
xy = xypath.Table.from_messy(mt)

and running it with /usr/bin/time -v python faostat.py

results in:

Maximum resident set size (kbytes): 3375120
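The same high-water mark can also be read from inside the process with the standard library, which is handy for checking figures like the one above without an external tool (note `ru_maxrss` is reported in kilobytes on Linux, matching the `/usr/bin/time -v` output, but in bytes on macOS):

```python
import resource

def peak_rss():
    # Maximum resident set size so far for this process.
    # Kilobytes on Linux, bytes on macOS.
    return resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# Allocate something measurable, then read the high-water mark.
data = [str(i) for i in range(100000)]
print(peak_rss())
```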

StevenMaude avatar Sep 26 '16 10:09 StevenMaude