
Add support for non-default unnamed storages

vdusek opened this issue 8 months ago

Problem

The Apify platform supports non-default unnamed storages. This functionality is also available in the Apify Python client, where you can do the following (example for dataset):

await client.datasets().get_or_create()  # via DatasetCollectionClientAsync

Each call creates a new, unnamed dataset with a unique ID.
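
A minimal sketch of this behavior, assuming a configured ApifyClientAsync (token handling elided):

from apify_client import ApifyClientAsync

client = ApifyClientAsync(token='...')

# Each call without a name creates a fresh unnamed dataset with its own ID.
first = await client.datasets().get_or_create()
second = await client.datasets().get_or_create()
assert first['id'] != second['id']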

In contrast, Crawlee does not support this (in any storage client). For example, repeated calls to:

await Dataset.open()

always return the same default unnamed storage.
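
A sketch of the current behavior, using the existing Crawlee API:

from crawlee.storages import Dataset

# Both calls resolve to the same cached default dataset instance.
first = await Dataset.open()
second = await Dataset.open()
assert first is second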

Goal state

Achieve feature parity between Crawlee storages (all storage clients, including the ApifyStorageClient) and the Apify platform (API client) by adding support for non-default unnamed storages.

Possible solution

Introduce a new scope argument to the storages' open() constructor:

@classmethod
async def open(
    cls,
    name: str | None = None,
    id: str | None = None,
    scope: Literal['run', 'global'] = 'global',
) -> Dataset | KeyValueStore | RequestQueue:
    ...
  • scope='run' indicates a non-default unnamed storage.
  • scope='global' refers to globally named storages.
  • The name parameter cannot be entirely removed for run-scope storages, as it is needed:
    • For the filesystem storage client: as a directory name.
    • For the Apify platform storage client: to store the name -> ID mapping in the default key-value store (see the sketch below).
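
A hypothetical sketch of that mapping (the record key '__RUN_SCOPE_STORAGES' and the shape of the value are illustrative, not an actual convention):

from crawlee.storages import KeyValueStore

# Persist the name -> ID mapping for run-scope storages in the default KVS.
kvs = await KeyValueStore.open()
mapping = await kvs.get_value('__RUN_SCOPE_STORAGES', {})
mapping['debug'] = 'hypothetical-dataset-id'
await kvs.set_value('__RUN_SCOPE_STORAGES', mapping)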

Behavior matrix

Open storage by ID and name

  • Raise an exception.
  • Scope argument is ignored.

Open storage by ID

  • Opens an existing storage by ID.
  • Scope: open question; see the discussion below.

Open storage by name

  • Scope run:
    • Opens or creates a run-scope (non-default unnamed) storage.
      • name is used internally to reference the storage but is not the storage's actual "name".
  • Scope global:
    • Opens or creates a global named storage.

Open storage without args

  • Opens the default unnamed storage.
  • Scope argument is ignored.
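
A minimal sketch of the dispatch logic implied by this matrix, shown for Dataset (the _open_* helpers are hypothetical placeholders, not actual Crawlee internals):

@classmethod
async def open(
    cls,
    name: str | None = None,
    id: str | None = None,
    scope: Literal['run', 'global'] = 'global',
) -> Dataset:
    if id is not None and name is not None:
        raise ValueError('Only one of "id" and "name" may be provided.')
    if id is not None:
        # Opens an existing storage by ID; scope is ignored (or an open question).
        return await cls._open_by_id(id)
    if name is not None:
        if scope == 'run':
            # Run scope: a non-default unnamed storage, tracked internally under `name`.
            return await cls._open_run_scope(name)
        # Global scope: a named storage shared across runs.
        return await cls._open_named(name)
    # No args: the default unnamed storage; scope is ignored.
    return await cls._open_default()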

vdusek · Apr 25 '25 16:04

When opening the storage by ID, the scope does not make sense. I think an exception would be appropriate.

janbuchar · Apr 25 '25 21:04

We should ask for more feedback, e.g. on Slack.

B4nan · May 12 '25 09:05

So to create a new persisted unnamed dataset, you would call Dataset.open(name='debug', scope='run') and then every time you call this (even after migration), it would return the same dataset, right?

Before releasing, I would have a short sync with the platform/output schema team. There is e.g. this proposal, so let's make sure we don't use completely different terms: https://github.com/apify/actor-whitepaper/pull/25

metalwarrior665 · Jun 09 '25 11:06

> So to create a new persisted unnamed dataset, you would call Dataset.open(name='debug', scope='run') and then every time you call this (even after migration), it would return the same dataset, right?

Yes, spot on. With the caveat that Dataset.open(name='debug') will open a different, global dataset. Perhaps we could just throw if multiple open calls share the same name but use different scopes.
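
A hypothetical sketch of such a guard (the per-process _scope_by_name cache is illustrative):

# Hypothetical cache: storage name -> scope used on first open.
_scope_by_name: dict[str, str] = {}

def _check_scope_conflict(name: str, scope: str) -> None:
    previous = _scope_by_name.setdefault(name, scope)
    if previous != scope:
        raise ValueError(
            f'Storage {name!r} was already opened with scope={previous!r}; '
            f'cannot reopen it with scope={scope!r}.'
        )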

> Before releasing, I would have a short sync with the platform/output schema team. There is e.g. this proposal, so let's make sure we don't use completely different terms: apify/actor-whitepaper#25

The suggested implementation won't need any support from the platform side, but it's always a good idea to sync on terminology.

janbuchar · Jun 09 '25 12:06

Overview of the variants from the user-experience perspective

1) Scope version

  • Scope global remains the default option.

Direct usage

  • Same for all types of storages.
# Default dataset
default_dataset = await Dataset.open()

# Run scope dataset
dataset_run_scope = await Dataset.open(name='dataset_run_scope', scope='run')

# Global scope dataset
dataset_global_scope = await Dataset.open(name='dataset_global_scope', scope='global')

# And then all the methods remain the same…
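
Presumably, repeated opens with the same name and scope would return the same storage, mirroring how the default one is cached (a sketch of the expected semantics, not an implemented guarantee):

d1 = await Dataset.open(name='debug', scope='run')
d2 = await Dataset.open(name='debug', scope='run')
assert d1 is d2  # the same run-scope dataset within one run, even after migration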

Context helpers

  • Helpers on the crawling context.

Push data

# Default dataset
await context.push_data(data)

# Global scope dataset
await context.push_data(data, dataset_name='dataset_global_scope', scope='global')

# Run scope dataset
await context.push_data(data, dataset_name='dataset_run_scope', scope='run')

Add requests

  • Currently there is no option to specify the destination (requests always go to the default RQ), but we can add one.
# Default RQ
await context.add_requests(requests)

# Global scope RQ
await context.add_requests(requests, rq_name='rq_global_scope', scope='global')

# Run scope RQ
await context.add_requests(requests, rq_name='rq_run_scope', scope='run')

Enqueue links

  • Currently there is no option to specify the destination (requests always go to the default RQ), but we can add one.
# Default RQ
await context.enqueue_links()

# Global scope RQ
await context.enqueue_links(rq_name='rq_global_scope', scope='global')

# Run scope RQ
await context.enqueue_links(rq_name='rq_run_scope', scope='run')

Get KVS

# Default KVS
kvs = await context.get_key_value_store()

# Global scope KVS
kvs_global = await context.get_key_value_store(name='kvs_global_scope', scope='global')

# Run scope KVS
kvs_run = await context.get_key_value_store(name='kvs_run_scope', scope='run')

Use state

  • use_state should always use the default key-value store (sketch below).
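
For reference, a sketch with the current context helper (assuming use_state keeps its present signature):

# use_state always reads from and persists to the default key-value store.
state = await context.use_state(default_value={'counter': 0})
state['counter'] += 1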

Crawler helpers

  • Helpers on the crawlers.

Export data

# Default dataset
data = await crawler.export_data()

# Global scope dataset
data = await crawler.export_data(dataset_name='dataset_global_scope', scope='global')

# Run scope dataset
data = await crawler.export_data(dataset_name='dataset_run_scope', scope='run')

Get dataset

# Default dataset
dataset = await crawler.get_dataset()

# Global scope dataset
dataset = await crawler.get_dataset(name='dataset_global_scope', scope='global')

# Run scope dataset
dataset = await crawler.get_dataset(name='dataset_run_scope', scope='run')

Get key-value store

# Default KVS
kvs = await crawler.get_key_value_store()

# Global scope KVS
kvs = await crawler.get_key_value_store(name='kvs_global_scope', scope='global')

# Run scope KVS
kvs = await crawler.get_key_value_store(name='kvs_run_scope', scope='run')

Get request manager

  • It returns the configured request manager, so it is not affected.

Add requests

  • It uses the underlying request manager, so it is not affected.

2) Alias version

Direct usage

  • Same for all types of storages.
# Default dataset
default_dataset = await Dataset.open()

# Global scope dataset (name)
dataset_global_scope = await Dataset.open(name='dataset_global_scope')

# Run scope dataset (alias)
dataset_run_scope = await Dataset.open(alias='dataset_run_scope')

# And then all the methods remain the same…
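
Note that alias and name would address two different storages even when the strings match (a sketch of the intended semantics):

d1 = await Dataset.open(alias='debug')  # run-scope, unnamed on the platform
d2 = await Dataset.open(name='debug')   # global named storage
assert d1 is not d2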

Context helpers

  • Helpers on the crawling context.

Push data

# Default dataset
await context.push_data(data)

# Global scope dataset (name)
await context.push_data(data, dataset_name='dataset_global_scope')

# Run scope dataset (alias)
await context.push_data(data, dataset_alias='dataset_run_scope')

Add requests

Currently there is no option to specify the destination (requests always go to the default RQ), but we can add one.

# Default RQ
await context.add_requests(requests)

# Global scope RQ (name)
await context.add_requests(requests, rq_name='rq_global_scope')

# Run scope RQ (alias)
await context.add_requests(requests, rq_alias='rq_run_scope')

Enqueue links

Currently there is no option to specify the destination (requests always go to the default RQ), but we can add one.

# Default RQ
await context.enqueue_links()

# Global scope RQ (name)
await context.enqueue_links(rq_name='rq_global_scope')

# Run scope RQ (alias)
await context.enqueue_links(rq_alias='rq_run_scope')

Get KVS

# Default KVS
kvs = await context.get_key_value_store()

# Global scope KVS (name)
kvs_global = await context.get_key_value_store(name='kvs_global_scope')

# Run scope KVS (alias)
kvs_run = await context.get_key_value_store(alias='kvs_run_scope')

Use state

  • use_state should always use the default key-value store.

Crawler helpers

  • Helpers on the crawlers.

Export data

# Default dataset
data = await crawler.export_data()

# Global scope dataset (name)
data = await crawler.export_data(dataset_name='dataset_global_scope')

# Run scope dataset (alias)
data = await crawler.export_data(dataset_alias='dataset_run_scope')

Get dataset

# Default dataset
dataset = await crawler.get_dataset()

# Global scope dataset (name)
dataset = await crawler.get_dataset(name='dataset_global_scope')

# Run scope dataset (alias)
dataset = await crawler.get_dataset(alias='dataset_run_scope')

Get key-value store

# Default KVS
kvs = await crawler.get_key_value_store()

# Global scope KVS (name)
kvs = await crawler.get_key_value_store(name='kvs_global_scope')

# Run scope KVS (alias)
kvs = await crawler.get_key_value_store(alias='kvs_run_scope')

Get request manager

  • It returns the configured request manager, so it is not affected.

Add requests

  • It uses the underlying request manager, so it is not affected.

vdusek · Sep 02 '25 10:09