
refactor!: Introduce new storage client system

Open · vdusek opened this issue 7 months ago

Description

  • I consolidated all commits from https://github.com/apify/crawlee-python/pull/1107 into this new PR.
  • The previous storage-client implementation was completely replaced with redesigned clients, including:
    • new storage-client interface,
    • in-memory storage client,
    • file-system storage client,
    • Apify storage client (implemented in the SDK; see https://github.com/apify/apify-sdk-python/pull/470),
    • and various adjustments elsewhere in the codebase.
  • The old "memory plus persist" client has been split into separate memory and file-system implementations.
    • The Configuration.persist_storage and Configuration.persist_metadata options were removed.
  • All old collection clients have been removed; they're no longer needed.
  • Each storage client now prints warnings if you pass method arguments it does not support.
  • The creation management modules in the storage clients and storages were removed.
  • Storage client parameters (e.g. purge_on_start, or token and base_api_url for the Apify client) are configured via the Configuration.
  • Every storage, and its corresponding client, now provides both a purge method (which clears all items but preserves the storage and metadata) and a drop method (which removes the entire storage, metadata included); see the sketch after this list.
  • All unused types, models, and helper utilities have been removed.
  • The detailed, per-storage/client changes are listed below...
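
A minimal sketch of the purge vs. drop distinction described above (the dataset name is illustrative; any storage type behaves the same way):

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    dataset = await Dataset.open(name='my_dataset')
    await dataset.push_data({'name': 'John'})

    # purge: removes all items but keeps the storage and its metadata
    await dataset.purge()

    # drop: removes the entire storage, metadata included
    await dataset.drop()


if __name__ == '__main__':
    asyncio.run(main())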

Dataset

  • Properties:
    • id
    • name
    • metadata
  • Methods:
    • open
    • purge (new method)
    • drop
    • push_data
    • get_data
    • iterate_items
    • list_items (new method)
    • export_to
  • Breaking changes:
    • from_storage_object method has been removed - Use the open method with name or id instead.
    • get_info -> metadata property
    • storage_object -> metadata property
    • set_metadata method has been removed (it wasn't propagated to clients)
      • Do we want to support it (e.g. for renaming)?
    • write_to_json method has been removed - use export_to instead (see the migration sketch after the example below)
    • write_to_csv method has been removed - use export_to instead
import asyncio

from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import Dataset


async def main() -> None:
    dataset = await Dataset.open(storage_client=FileSystemStorageClient())
    print(f'default dataset - ID: {dataset.id}, name: {dataset.name}')

    await dataset.push_data({'name': 'John'})
    await dataset.push_data({'name': 'John', 'age': 20})
    await dataset.push_data({})

    dataset_with_name = await Dataset.open(
        name='my_dataset',
        storage_client=FileSystemStorageClient(),
    )
    print(f'named dataset - ID: {dataset_with_name.id}, name: {dataset_with_name.name}')

    await dataset_with_name.push_data([{'age': 30}, {'age': 25}])

    print('Default dataset items:')
    async for item in dataset.iterate_items(skip_empty=True):
        print(item)

    print('Named dataset items:')
    async for item in dataset_with_name.iterate_items():
        print(item)

    items = await dataset.get_data()
    print(items)

    dataset_by_id = await Dataset.open(id=dataset_with_name.id)
    print(f'dataset by ID - ID: {dataset_by_id.id}, name: {dataset_by_id.name}')


if __name__ == '__main__':
    asyncio.run(main())
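
A hedged migration sketch for the Dataset breaking changes listed above; the export_to parameter names are assumptions based on the notes, not confirmed by this PR description:

import asyncio

from crawlee.storages import Dataset


async def main() -> None:
    dataset = await Dataset.open()

    # Before: dataset.get_info() / dataset.storage_object
    metadata = dataset.metadata  # now a property
    print(f'ID: {metadata.id}, name: {metadata.name}')

    # Before: dataset.write_to_json(...) / dataset.write_to_csv(...)
    await dataset.export_to('output.json', content_type='json')
    await dataset.export_to('output.csv', content_type='csv')


if __name__ == '__main__':
    asyncio.run(main())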

Key-value store

  • Properties:
    • id
    • name
    • metadata
  • Methods:
    • open
    • purge (new method)
    • drop
    • get_value
    • set_value
    • delete_value (new method; the Apify platform's set_value supports setting an empty value for a key, so a separate method for deleting is useful)
    • iterate_keys
    • list_keys (new method)
    • get_public_url
    • get_auto_saved_value
    • persist_autosaved_values (both illustrated in the sketch after the example below)
  • Breaking changes:
    • from_storage_object method has been removed - Use the open method with name or id instead.
    • get_info -> metadata property
    • storage_object -> metadata property
    • set_metadata method has been removed (it wasn't propagated to clients)
      • Do we want to support it (e.g. for renaming)?
import asyncio

import requests

from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import KeyValueStore


async def main() -> None:
    print('Opening key-value store "my_kvs"...')
    storage_client = FileSystemStorageClient()
    kvs = await KeyValueStore.open(name='my_kvs', storage_client=storage_client)

    print('Setting value to "file.json"...')
    await kvs.set_value('file.json', {'key': 'value'})

    print('Setting value to "file.jpg"...')
    response = requests.get('https://avatars.githubusercontent.com/u/25082181')
    await kvs.set_value('file.jpg', response.content)

    print('Iterating over keys:')
    async for key in kvs.iterate_keys():
        print(f'Key: {key}')

    print('Listing keys:')
    keys = [key.key for key in await kvs.list_keys()]
    print(f'Keys: {keys}')

    for key in keys:
        print(f'Getting value of {key}...')
        value = await kvs.get_value(key)
        print(f'Value: {str(value)[:100]}')

    print('Deleting value of "file.json"...')
    await kvs.delete_value('file.json')

    kvs_default = await KeyValueStore.open(storage_client=storage_client)

    special_key = 'key with spaces/and/slashes!@#$%^&*()'
    test_value = 'Special key value'

    await kvs_default.set_value(key=special_key, value=test_value)

    record = await kvs_default.get_value(key=special_key)
    assert record is not None
    assert record == test_value

    result = await kvs_default.list_keys()
    print(f'kvs_default list keys = {result}')

    kvs_2 = await KeyValueStore.open()
    result = await kvs_2.list_keys()
    print(f'kvs_2 list keys = {result}')


if __name__ == '__main__':
    asyncio.run(main())
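
The auto-saved value helpers are not exercised above; a minimal hedged sketch (the default_value parameter and dict-based state are assumptions):

import asyncio

from crawlee.storages import KeyValueStore


async def main() -> None:
    kvs = await KeyValueStore.open()

    # The returned dict is cached and periodically persisted back to the store.
    state = await kvs.get_auto_saved_value('crawl_state', default_value={'processed': 0})
    state['processed'] += 1

    # Force an immediate write of all auto-saved values.
    await kvs.persist_autosaved_values()


if __name__ == '__main__':
    asyncio.run(main())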

Request queue

  • Properties:
    • id
    • name
    • metadata
  • Methods:
    • open
    • purge (new method)
    • drop
    • add_request
    • add_requests_batched -> add_requests
    • fetch_next_request
    • get_request
    • mark_request_as_handled
    • reclaim_request
    • is_empty
    • is_finished
  • Breaking changes:
    • from_storage_object method has been removed - Use the open method with name or id instead.
    • get_info -> metadata property
    • storage_object -> metadata property
    • set_metadata method has been removed (it wasn't propagated to clients)
      • Do we want to support it (e.g. for renaming)?
    • get_handled_count method has been removed - Use metadata.handled_request_count instead.
    • get_total_count method has been removed - Use metadata.total_request_count instead.
    • resource_directory from the RequestQueueMetadata was removed; use the path_to... property instead.
    • RequestQueueHead model has been removed - Use RequestQueueHeadWithLocks instead.
  • Notes:
    • The new RQ add_requests contains a forefront arg (the Apify API supports it); a consumer-loop sketch follows the example below.
import asyncio

from crawlee import Request
from crawlee.configuration import Configuration
from crawlee.storage_clients import FileSystemStorageClient
from crawlee.storages import RequestQueue


async def main() -> None:
    rq = await RequestQueue.open(
        name='my-queue',
        storage_client=FileSystemStorageClient(),
        configuration=Configuration(purge_on_start=True),
    )

    print(f'RequestQueue: {rq}')
    print(f'RequestQueue client: {rq._client}')

    await rq.add_requests(
        requests=[
            Request.from_url('https://example.com', use_extended_unique_key=True),
            Request.from_url('https://crawlee.dev', use_extended_unique_key=True),
            Request.from_url('https://apify.com', use_extended_unique_key=True),
        ],
    )

    print('Requests were added to the queue')

    is_empty = await rq.is_empty()
    is_finished = await rq.is_finished()

    print(f'Is empty: {is_empty}')
    print(f'Is finished: {is_finished}')

    request = await rq.fetch_next_request()
    print(f'Fetched request: {request}')

    await rq.add_request('https://facebook.com', forefront=True)

    request = await rq.fetch_next_request()
    print(f'Fetched request: {request}')

    rq_default = await RequestQueue.open(
        storage_client=FileSystemStorageClient(),
        configuration=Configuration(purge_on_start=True),
    )

    await rq_default.add_request('https://example.com/1')
    await rq_default.add_requests(
        [
            'https://example.com/priority-1',
            'https://example.com/priority-2',
            'https://example.com/priority-3',
        ]
    )
    await rq_default.add_request('https://example.com/2')


if __name__ == '__main__':
    asyncio.run(main())
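
The handling methods are not exercised above; a hedged sketch of a typical consumer loop (the processing step is a placeholder):

import asyncio

from crawlee.storages import RequestQueue


async def main() -> None:
    rq = await RequestQueue.open()
    await rq.add_requests(['https://example.com', 'https://crawlee.dev'])

    # Fetch requests until the queue is finished; each fetched request must be
    # either marked as handled or reclaimed for a retry.
    while not await rq.is_finished():
        request = await rq.fetch_next_request()
        if request is None:
            continue
        try:
            print(f'Processing {request.url}...')
        except Exception:
            # Return the request to the queue (optionally to the forefront).
            await rq.reclaim_request(request, forefront=True)
        else:
            await rq.mark_request_as_handled(request)


if __name__ == '__main__':
    asyncio.run(main())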

BaseDatasetClient

  • Properties:
    • metadata
  • Methods:
    • open
    • purge
    • drop
    • push_data
    • get_data
    • iterate_items

BaseKeyValueStoreClient

  • Properties:
    • metadata
  • Methods:
    • open
    • purge
    • drop
    • get_value
    • set_value
    • delete_value
    • iterate_keys
    • get_public_url

BaseRequestQueueClient

  • Properties:
    • metadata
  • Methods:
    • open
    • purge
    • drop
    • add_requests_batch -> add_batch_of_requests (one backend method for two frontend methods; see the sketch after this section)
    • get_request
    • fetch_next_request
    • mark_request_as_handled
    • reclaim_request
    • is_empty
  • Models
    • RequestQueueHeadWithLocks -> RequestQueueHead
    • BatchRequestsOperationResponse -> AddRequestsResponse
  • Notes:
    • The old file-system (memory) version didn't persist in-progress requests
    • The old file-system (memory) version didn't persist forefront values (there is now an FS-specific _sequence field in the FS Request)
    • The methods for manipulating locks and listing heads are now internal methods of the Apify RQ client.
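
A hedged sketch of the frontend-to-backend mapping noted above: both storage-level methods can funnel into the single client method. The class and signatures here are illustrative, not the actual implementation:

from __future__ import annotations

from collections.abc import Sequence

from crawlee import Request


class RequestQueueFacadeSketch:
    """Illustrative only: add_request and add_requests share one backend call."""

    def __init__(self, client) -> None:
        self._client = client  # a BaseRequestQueueClient implementation

    async def add_request(self, request: str | Request, *, forefront: bool = False):
        # A single request is just a batch of one.
        return await self.add_requests([request], forefront=forefront)

    async def add_requests(self, requests: Sequence[str | Request], *, forefront: bool = False):
        # Normalize plain URLs into Request objects, then delegate to the backend.
        prepared = [r if isinstance(r, Request) else Request.from_url(r) for r in requests]
        return await self._client.add_batch_of_requests(prepared, forefront=forefront)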

Issues

  • Closes: #92
  • Closes: #147
  • Closes: #783
  • Relates: #1175
  • Relates: #1191

Testing

  • The original tests were mostly removed and replaced with new ones.
  • Each storage-client implementation now has its own dedicated tests at the client level (more targeted/edge-case coverage).
  • On top of that, there are storage-level tests that use a parametrized fixture for each storage client (file-system and memory), ensuring every storage test runs against every client implementation.
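
A minimal hedged sketch of the parametrized-fixture pattern described above (fixture and test names are illustrative; assumes pytest-asyncio or a similar plugin is configured for async tests):

import pytest

from crawlee.storage_clients import FileSystemStorageClient, MemoryStorageClient
from crawlee.storages import Dataset


@pytest.fixture(params=['memory', 'file_system'])
def storage_client(request: pytest.FixtureRequest):
    # Every storage-level test runs once per client implementation.
    if request.param == 'memory':
        return MemoryStorageClient()
    return FileSystemStorageClient()


async def test_push_and_get_data(storage_client) -> None:
    dataset = await Dataset.open(storage_client=storage_client)
    await dataset.push_data({'name': 'John'})
    items = await dataset.get_data()
    assert items.count == 1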

Checklist

  • [x] CI passed

vdusek avatar May 10 '25 08:05 vdusek

That's excellent work!

Mantisus avatar May 16 '25 21:05 Mantisus

All the feedback was addressed, including the upgrading guide. Could you guys please take a second look?

(FYI: this https://github.com/apify/crawlee-python/pull/1194/commits/9f10b955c6d8c0d82940ad1c6bec0be6d5274565 broke the docs build; @barjin will take a look at it later.)

vdusek avatar Jun 11 '25 12:06 vdusek

Awesome work @vdusek! Did you get a chance to test the SDK integration tests (with these changes https://github.com/apify/apify-sdk-python/pull/470) with updated Crawlee?

I'm asking because I'd love to avoid the situation where we need to make hotfix releases after we discover that we can't make SDK work with these changes.

janbuchar avatar Jun 16 '25 16:06 janbuchar

Benchmark

  • 1000 requests to a local HTTP server.
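
The benchmark harness is not included here; a hedged sketch of how such a run might be timed (the crawler class, server URL, and handler are assumptions):

import asyncio
import time

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext


async def main() -> None:
    crawler = HttpCrawler()

    @crawler.router.default_handler
    async def handler(context: HttpCrawlingContext) -> None:
        pass  # the benchmark measures crawling overhead, not handler work

    start = time.perf_counter()
    await crawler.run([f'http://localhost:8080/{i}' for i in range(1000)])
    print(f'Crawler runtime: {time.perf_counter() - start:.6f}s')


if __name__ == '__main__':
    asyncio.run(main())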

Crawlee Py - Old memory client

  • Old memory client = memory with persistence false

All runtimes:

  • Run 1: 4.280095s
  • Run 2: 4.388267s
  • Run 3: 4.322075s
  • Run 4: 4.520383s
  • Run 5: 4.222991s
  • Run 6: 4.359523s
  • Run 7: 4.182811s
  • Run 8: 4.322229s
  • Run 9: 4.072315s
  • Run 10: 4.032425s

Average crawler runtime: 4.270311s


Crawlee Py - Old file-system client

  • Old file-system client = memory with persistence true

All runtimes:

  • Run 1: 5.351646s
  • Run 2: 5.934761s
  • Run 3: 5.218038s
  • Run 4: 4.867312s
  • Run 5: 4.890084s
  • Run 6: 4.935311s
  • Run 7: 4.923271s
  • Run 8: 4.752518s
  • Run 9: 4.724725s
  • Run 10: 4.865203s

Average crawler runtime: 5.046287s


Crawlee Py - New memory client

All runtimes:

  • Run 1: 1.582967s
  • Run 2: 1.723083s
  • Run 3: 1.539048s
  • Run 4: 1.622284s
  • Run 5: 1.802081s
  • Run 6: 1.556861s
  • Run 7: 1.436224s
  • Run 8: 1.635982s
  • Run 9: 1.633467s
  • Run 10: 1.727041s

Average crawler runtime: 1.625904s


Crawlee Py - New file-system client

All runtimes:

  • Run 1: 4.299179s
  • Run 2: 4.576746s
  • Run 3: 4.359626s
  • Run 4: 4.305971s
  • Run 5: 4.480797s
  • Run 6: 4.511054s
  • Run 7: 4.316566s
  • Run 8: 4.503595s
  • Run 9: 4.378998s
  • Run 10: 4.427795s

Average crawler runtime: 4.416033s


Crawlee TS - Memory client

  • Memory client = memory with persistence false

All runtimes:

  • Run 1: 2.268000s
  • Run 2: 2.220000s
  • Run 3: 2.254000s
  • Run 4: 2.297000s
  • Run 5: 2.283000s
  • Run 6: 2.279000s
  • Run 7: 2.212000s
  • Run 8: 2.307000s
  • Run 9: 2.127000s
  • Run 10: 2.256000s

Average crawler runtime: 2.2503s


Crawlee TS - File-system client

  • File-system client = memory with persistence true

All runtimes:

  • Run 1: 3.446000s
  • Run 2: 3.186000s
  • Run 3: 3.390000s
  • Run 4: 3.145000s
  • Run 5: 3.102000s
  • Run 6: 3.260000s
  • Run 7: 3.178000s
  • Run 8: 3.160000s
  • Run 9: 3.369000s
  • Run 10: 3.247000s

Average crawler runtime: 3.2483s


Scrapy - memory*

  • Scrapy provides only in-memory storage for requests.

All runtimes:

  • Run 1: 1.476767s
  • Run 2: 1.451156s
  • Run 3: 1.463033s
  • Run 4: 1.521124s
  • Run 5: 1.498111s
  • Run 6: 1.494774s
  • Run 7: 1.487637s
  • Run 8: 1.479461s
  • Run 9: 1.459569s
  • Run 10: 1.447364s

Average crawler runtime: 1.477900s


Summary

Configuration                          Average Runtime (s)
-----------------------------------    -------------------
Crawlee Py - Old memory client         4.270311
Crawlee Py - Old file-system client    5.046287
Crawlee Py - New memory client         1.625904
Crawlee Py - New file-system client    4.416033
Crawlee TS - Memory client             2.2503
Crawlee TS - File-system client        3.2483
Scrapy - memory*                       1.477900

vdusek avatar Jun 20 '25 13:06 vdusek

end of an era 🎉

B4nan avatar Jul 01 '25 14:07 B4nan