archivy Run long tasks in background

Open clemux opened this issue 3 years ago • 11 comments

Some of the work archivy does should run asynchronously from the web server:

- importing links from external services (currently Pocket, more in the future)
- retrieving a web page's content

Some possible solutions:

Celery

This would add a dependency on either RabbitMQ or Redis, which might not match your vision for archivy as a simple app. On the other hand, Redis might be useful for other things (like a cache for a text-search system that is easier to install than ES).

Python-RQ

(much) More minimalist design, using redis: official website

Aug 26 '20 16:08 clemux

Addendum: with Celery, it would be easy to make the RabbitMQ/Redis dependency optional and run tasks in the Flask process when it is not available.
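As a rough illustration of that fallback idea, here is a minimal sketch that does not use Celery itself. `TaskRunner`, `import_links`, and `fake_enqueue` are hypothetical names invented for this example; `enqueue` stands in for whatever a real queue backend would provide. (Celery also has a `task_always_eager` setting that runs tasks locally, which may cover part of this.)

```python
from typing import Any, Callable, Optional


class TaskRunner:
    """Hypothetical dispatcher: hands work to a queue backend if one is
    configured, otherwise runs the task inline in the current process."""

    def __init__(self, enqueue: Optional[Callable[..., Any]] = None) -> None:
        # `enqueue` is an assumption standing in for a real backend call;
        # None means no broker is available.
        self._enqueue = enqueue

    def run(self, func: Callable[..., Any], *args: Any) -> Any:
        if self._enqueue is not None:
            # Backend available: queue the task and return immediately.
            return self._enqueue(func, *args)
        # No backend: execute synchronously, blocking the caller.
        return func(*args)


def import_links(source: str) -> str:
    """Hypothetical long-running job."""
    return f"imported links from {source}"


queued = []


def fake_enqueue(func: Callable[..., Any], *args: Any) -> None:
    """Stand-in for a real broker: just records what was queued."""
    queued.append((func, args))


inline_runner = TaskRunner()
inline_result = inline_runner.run(import_links, "pocket")

queued_runner = TaskRunner(enqueue=fake_enqueue)
queued_runner.run(import_links, "pocket")
```

The trade-off is that without a backend the request still blocks until the task finishes; the fallback only keeps the app usable, it does not make the work asynchronous.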

Aug 26 '20 18:08 clemux

I'd rather not have to use redis, do you know of any more lightweight alternatives? Maybe we could use threads... :thinking:

Aug 26 '20 19:08 Uzay-G

Agree with keeping it simple for now. We could also use Python 3's built-in asyncio, e.g.:

https://guillotina.readthedocs.io/en/latest/training/asyncio.html#long-running-tasks
https://faculty.ai/blog/a-guide-to-using-asyncio/

Aug 26 '20 21:08 cktang88

Yes I think using asyncio is a good idea

Aug 26 '20 22:08 Uzay-G

I agree!

Aug 26 '20 22:08 clemux

It seems that flask and werkzeug don't play nice with asyncio, because werkzeug is blocking by design.

https://pgjones.dev/blog/flask-async-quart-sync-2019/
https://github.com/pallets/werkzeug/issues/1322

That doesn't mean we cannot use custom coroutines/threading/multiprocessing, though.

Aug 27 '20 14:08 clemux

https://docs.python.org/3/library/concurrent.futures.html#module-concurrent.futures seems to provide what we need here, any opinions?

https://flask-executor.readthedocs.io/en/latest/ shows how it could be implemented (not necessarily by using that extension directly)
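For illustration, a minimal sketch of the `concurrent.futures` approach using only the standard library. `fetch_page_content` and `submit_fetch` are hypothetical names, not archivy functions, and the pool size is an arbitrary choice.

```python
from concurrent.futures import Future, ThreadPoolExecutor

# One shared pool for the whole process; max_workers=2 is an arbitrary choice.
executor = ThreadPoolExecutor(max_workers=2)


def fetch_page_content(url: str) -> str:
    """Hypothetical stand-in for a slow job such as downloading a page."""
    # Real code would perform the network request here.
    return f"content of {url}"


def submit_fetch(url: str) -> Future:
    """Queue the job and return at once, so a request handler is not blocked."""
    return executor.submit(fetch_page_content, url)


future = submit_fetch("https://example.com")
# A handler would normally return right away; result() is called here only
# to show how the outcome can be collected later.
result = future.result(timeout=5)
print(result)
```

A thread pool like this lives inside the web server process, so queued jobs are lost if the process restarts. That is the main thing a persistent queue would add.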

Aug 27 '20 15:08 clemux

We're about to do this refactor on ArchiveBox too, we're looking at these 2 queue systems primarily:

https://github.com/coleifer/huey (supports SQLite as the backing store)
https://github.com/Bogdanp/dramatiq (requires Redis/RabbitMQ)

There are also adapters that link them to Flask I think (I know there are adapters for Django, which should be easy to adapt if there are no Flask-specific ones). The reason we didn't end up going with asyncio is that it's still single-threaded, and there's a decent amount of blocking Python that still needs to be run while archiving each link. Archivy's architecture / access patterns may be different though, idk.
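To make the single-threaded point concrete, here is a small sketch using only the standard library. A blocking function called directly inside a coroutine would stall every other coroutine, so it is handed to the event loop's default thread pool instead. `blocking_archive` and `heartbeat` are hypothetical names for this example.

```python
import asyncio
import time


def blocking_archive(url: str) -> str:
    """Hypothetical blocking step (e.g. a synchronous HTTP call or parsing)."""
    time.sleep(0.2)
    return f"archived {url}"


async def heartbeat(ticks: list) -> None:
    """Appends a timestamp every 50 ms while the event loop is free to run."""
    for _ in range(4):
        await asyncio.sleep(0.05)
        ticks.append(time.monotonic())


async def main() -> tuple:
    loop = asyncio.get_running_loop()
    ticks: list = []
    # run_in_executor(None, ...) uses the loop's default thread pool, so the
    # blocking call runs in a worker thread and other coroutines keep going.
    result, _ = await asyncio.gather(
        loop.run_in_executor(None, blocking_archive, "https://example.com"),
        heartbeat(ticks),
    )
    return result, ticks


result, ticks = asyncio.run(main())
print(result, len(ticks))
```

This works around the problem for I/O-bound blocking code, but it is threading underneath, which is part of why a dedicated task queue can end up being the simpler choice.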

I'm rooting hard for Archivy, it looks like you've managed to avoid a lot of the early mistakes that plagued the ArchiveBox codebase, the UI is gorgeous, and your plugin system is awesome. I'd love to share notes/lessons we learned from ours so that you can avoid those pitfalls.

Feb 02 '21 16:02 pirate

Thanks for the suggestions!

> I'm rooting hard for Archivy, it looks like you've managed to avoid a lot of the early mistakes that plagued the ArchiveBox codebase, the UI is gorgeous, and your plugin system is awesome. I'd love to share notes/lessons we learned from ours so that you can avoid those pitfalls.

Yes, I'd definitely love to collaborate, and I remember your comments on the post I made about Archivy on Hacker News back in August.

Your work with ArchiveBox is really cool :)

Feb 02 '21 17:02 Uzay-G

Ah yeah sorry I forgot to follow up after I initially commented on HN, got swamped with work. I'll join your discord and we can continue the convo there :)

Feb 02 '21 19:02 pirate

Cool!

Feb 02 '21 20:02 Uzay-G
