
# Collections with many items: saving time issue

**Open** · santilland opened this issue 2 years ago · 14 comments

Using `pystac[validation]` 1.8.3

I am creating collections with a large number of items and was surprised by how long it took to save them. Some very preliminary tests suggest that the save time grows superlinearly with the number of items in a collection. For example, saving a catalog with one collection takes, depending on the item count:

| Items | Time |
| ----: | ---: |
| 200 | 0.225s |
| 2000 | 5.439s |
| 10000 | 105.975s |

If I create 5 collections with 2000 items each, the saving time is 25s. So the same total number of items is being saved, but it takes about a quarter of the time when the items are split across multiple collections.

Any ideas why this could be happening?
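As a rough check on the growth rate, the exponent k in t ≈ c·n^k can be estimated from consecutive rows of the table above (a stdlib-only sketch using the reported numbers; values approaching 2 would suggest roughly quadratic rather than truly exponential growth):

```python
import math

# Reported (items, seconds) timings for a single collection, from the table above
timings = [(200, 0.225), (2000, 5.439), (10000, 105.975)]

# Estimate the exponent k in t ~ c * n**k between consecutive rows:
# k = log(t2 / t1) / log(n2 / n1)
exponents = []
for (n1, t1), (n2, t2) in zip(timings, timings[1:]):
    k = math.log(t2 / t1) / math.log(n2 / n1)
    exponents.append(k)
    print(f"{n1} -> {n2} items: k = {k:.2f}")
```

For these numbers the estimated exponent rises toward 2 as the item count grows, which is consistent with quadratic behavior.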

Here is a very rough testing script:


```python
import time
from datetime import datetime, timedelta

from pystac import (
    Item,
    Catalog,
    CatalogType,
    Collection,
    Extent,
    SpatialExtent,
    TemporalExtent,
)
from pystac.layout import TemplateLayoutStrategy

numdays = 10000
number_of_collections = 1
base = datetime.today()
times = [base - timedelta(days=x) for x in range(numdays)]

catalog = Catalog(
    id="test",
    description="catalog to test performance",
    title="performance test catalog",
    catalog_type=CatalogType.RELATIVE_PUBLISHED,
)

spatial_extent = SpatialExtent([[-180.0, -90.0, 180.0, 90.0]])
# A temporal interval needs a start and an end (the end may be None)
temporal_extent = TemporalExtent([[datetime.now(), None]])
extent = Extent(spatial=spatial_extent, temporal=temporal_extent)

for idx in range(number_of_collections):
    collection = Collection(
        id=f"big_collection{idx}",
        title="collection for items",
        description="some desc",
        extent=extent,
    )
    for t in times:
        item = Item(
            id=t.isoformat(),
            bbox=[-180.0, -90.0, 180.0, 90.0],
            properties={},
            geometry=None,
            datetime=t,
        )
        collection.add_item(item)
    collection.update_extent_from_items()

    catalog.add_child(collection)

strategy = TemplateLayoutStrategy(item_template="${collection}/${year}")
catalog.normalize_hrefs("https://exampleurl.com/", strategy=strategy)

start_time = time.perf_counter()
catalog.save(dest_href="../test_build/")
end_time = time.perf_counter()
print(f"Saving time: {end_time - start_time:0.6f}s")
```

santilland avatar Aug 17 '23 13:08 santilland

My timings on pystac (`master` branch) are as follows:

| Collections | Items | Time |
| ----------: | ----: | ---: |
| 1 | 200 | 1.247s |
| 1 | 2000 | 12.880s |
| 1 | 10000 | 70.173s |
| 5 | 2000 | 61.693s |

It's weird that the times look reasonable (nearly linear) in my tests, but not in yours. The differences: a different machine, and you installed from PyPI with the validation requirements, while I'm on master.

m-mohr avatar Aug 17 '23 14:08 m-mohr

Strange, I was getting the same slow saving in the GitHub Actions workflow (which uses ubuntu-latest). My machine is also running Ubuntu. What OS are you on? Maybe some file-handler issue?

santilland avatar Aug 17 '23 14:08 santilland

I'm running Ubuntu 22.04.2 LTS through WSL.

I also just ran pystac 1.8.3 installed from PyPI with the validation requirements, and the timings are similar to the ones I reported above. Weird...

m-mohr avatar Aug 17 '23 14:08 m-mohr

Thanks for the report, and thanks @m-mohr for also taking a look. My timings (macos) working from main w/ Python 3.11:

| Collections | Items | Time |
| ----------: | ----: | ---: |
| 1 | 200 | 0.67s |
| 1 | 2000 | 9.36s |
| 1 | 10000 | 102.65s |
| 5 | 2000 | 45.48s |

So I'm seeing performance similar to @santilland.

I will note that there are some known issues around serializing a STACObject to a dictionary, specifically around link resolution and the network requests it can fire off: https://github.com/stac-utils/pystac/issues/960. This could be part of the issue, though I don't quite know how yet.

gadomski avatar Aug 17 '23 14:08 gadomski
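One way to check whether dictionary serialization (as opposed to the actual file writes) dominates is to time the two phases separately. A minimal stdlib-only sketch of the pattern (the `fake_to_dict` stand-in is hypothetical; on a real catalog you would time each item's `to_dict()` call instead):

```python
import json
import time

# Hypothetical stand-in for pystac's Item.to_dict(); on a real catalog,
# time item.to_dict() for each item here instead.
def fake_to_dict(i):
    return {"type": "Feature", "id": str(i), "properties": {"index": i}}

n_items = 2000

# Phase 1: object -> dict conversion
start = time.perf_counter()
dicts = [fake_to_dict(i) for i in range(n_items)]
serialize_time = time.perf_counter() - start

# Phase 2: dict -> JSON text (the part that would hit the disk)
start = time.perf_counter()
payload = "\n".join(json.dumps(d) for d in dicts)
dump_time = time.perf_counter() - start

print(f"serialization: {serialize_time:.4f}s, JSON dump: {dump_time:.4f}s")
```

If the first phase grows much faster than the second as the item count increases, that would point at link resolution during serialization rather than I/O.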

Weird. We were just wondering whether it might be an issue with memory consumption / swapping. My machine is not at its limit (one of the CPU cores is close, but memory is not an issue at all thanks to 128GB), but @santilland is maxing out both memory and one of the CPU cores.

My timings are from Python 3.10.6

m-mohr avatar Aug 17 '23 14:08 m-mohr

Yeah, I'm seeing my run get CPU bound, but not memory bound. 🤔

gadomski avatar Aug 17 '23 14:08 gadomski

I think @m-mohr mentioned a possible swap issue because my machine is generally close to the edge already, but I did not see a notable increase in memory use. One core is always at 100% while saving, though.

santilland avatar Aug 17 '23 14:08 santilland

Yeah, maybe it's something else. I'm on WSL rather than native Ubuntu, so maybe some kind of I/O thing? Or something completely different :-) I guess profiling on an affected machine may give some insights...

m-mohr avatar Aug 17 '23 14:08 m-mohr
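For reference, profiling the save on an affected machine could look like the following sketch. The `save_stand_in` function is a placeholder so the snippet runs standalone; in practice its body would be replaced with the `catalog.save(dest_href="../test_build/")` call from the script above:

```python
import cProfile
import io
import pstats

def save_stand_in():
    # Placeholder workload; replace with catalog.save(dest_href="../test_build/")
    return sum(i * i for i in range(200_000))

profiler = cProfile.Profile()
profiler.enable()
save_stand_in()
profiler.disable()

# Print the 15 most expensive functions by cumulative time
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumtime").print_stats(15)
print(buf.getvalue())
```

Sorting by cumulative time tends to surface hot spots like repeated link resolution quickly, since the offending call chain accumulates at the top of the listing.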