
cli does not recognize Python project created by crawlee

Open Pijukatel opened this issue 10 months ago • 3 comments

Currently, if we use crawlee to create a Python project

crawlee create

and then, within this project, call

apify init

we get a surprising warning: Warning: The current directory does not look like a Node.js or Python project.

It would be nice if the CLI could recognize crawlee projects created from templates.

For example, a project named bla will have the following structure: [Image]

Pijukatel, Feb 12 '25 10:02

This is kind of a blocker for my work on the Web scraping basics for Python devs, specifically https://github.com/apify/apify-docs/pull/1424, where I'm trying to show people how they can use the platform for their benefit.

I cannot figure out how a Python project should be structured so that apify init is happy. From reading the source code, it seems that apify init doesn't take the existence of Python Crawlee into account at all. And its detection of Python projects is very naive:

  • A project is considered a valid Python project only if it contains src/__main__.py, or if it's detected as a Scrapy project https://github.com/apify/apify-cli/blob/c3cfac2fb18f5890b5eb019f5ae46d3fe32f8b00/src/lib/utils.ts#L727
  • A project is considered a valid Scrapy project only if it contains scrapy.cfg https://github.com/apify/apify-cli/blob/c3cfac2fb18f5890b5eb019f5ae46d3fe32f8b00/src/lib/projects/scrapy/ScrapyProjectAnalyzer.ts#L21

Especially the first part is insufficient:

  • There are many ways a Python project can be laid out, and most projects I've seen in my career don't include src/__main__.py, so that won't work in most cases. I'm not even sure such a structure can work at all, as there's no package directory 🤔 If this structure is expected for some reason, it should at least be documented.
  • There are many ways to configure a Scrapy project. Although scrapy.cfg is a strong indicator that the project is made with Scrapy, it's not the only way to do it. In this case though, I wouldn't change the detection, as most people will indeed have the scrapy.cfg file. If someone doesn't, I'd let them file an issue here and only then think about how we can do better.
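
To make the two rules concrete, here is a minimal Python sketch of the detection as described in the bullets above. It is a paraphrase for illustration, not the CLI's actual code (which is TypeScript). The function names, the make_project helper, and the third "package layout" (pyproject.toml plus a bla package directory) are assumptions made for this example; the third layout is not necessarily what crawlee create generates.

```python
import tempfile
from pathlib import Path


def looks_like_scrapy_project(root: Path) -> bool:
    # Rule as described above: a scrapy.cfg file marks a Scrapy project.
    return (root / "scrapy.cfg").is_file()


def looks_like_python_project(root: Path) -> bool:
    # Rule as described above: src/__main__.py exists, or it is a Scrapy project.
    return (root / "src" / "__main__.py").is_file() or looks_like_scrapy_project(root)


def make_project(root: Path, files: list[str]) -> Path:
    # Hypothetical helper: create empty files to mimic a project layout.
    for name in files:
        path = root / name
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    return root


with tempfile.TemporaryDirectory() as tmp:
    base = Path(tmp)
    results = {
        "src/__main__.py layout": looks_like_python_project(
            make_project(base / "a", ["src/__main__.py"])
        ),
        "scrapy.cfg layout": looks_like_python_project(
            make_project(base / "b", ["scrapy.cfg"])
        ),
        # Assumed layout: a package directory next to pyproject.toml.
        "package layout": looks_like_python_project(
            make_project(base / "c", ["pyproject.toml", "bla/__init__.py", "bla/__main__.py"])
        ),
    }

for layout, detected in results.items():
    print(f"{layout}: {detected}")
```

Under these rules only the first two layouts are detected; a project organized as a package directory is rejected even though it is a perfectly ordinary Python project.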

The error message is also insufficient. As a Python dev, I do everything as usual and correctly, then run apify init, as advised by various docs within the Apify universe. I'm presented with The current directory does not look like a Node.js or Python project, without any context or guidance. The message should contain a URL to docs where I can read about how my project should be structured and why, what is expected to be detected, and what isn't supported.
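
As a rough illustration of what a more helpful message could look like, here is a small Python sketch. The build_warning function is hypothetical, and DOCS_URL is a placeholder rather than a real Apify docs page; only the first line of the message and the two checked paths come from this thread.

```python
# Placeholder URL for illustration only; not a real Apify documentation page.
DOCS_URL = "https://example.com/docs/python-project-structure"


def build_warning(checked: list[str], docs_url: str = DOCS_URL) -> str:
    # Hypothetical helper: keep the existing warning text, then add context.
    lines = [
        "Warning: The current directory does not look like a Node.js or Python project.",
        "Checked for: " + ", ".join(checked),
        "See " + docs_url + " for the supported project layouts.",
    ]
    return "\n".join(lines)


message = build_warning(["src/__main__.py", "scrapy.cfg"])
print(message)
```

Listing what was checked tells the user why detection failed, and the link gives them somewhere to read about the expected structure.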

honzajavorek, Mar 11 '25 10:03

I just learned there are many ways to init a project:

  • apify init
  • apify create
  • crawlee create

Each does something different. That totally wasn't clear to me previously, but I got it explained over Slack. I think apify init is still the one most suitable for my needs, as I have an existing Python project from previous lessons, which I want to actorize in the last lesson. But at least I have some options now, so maybe I can find a workaround.

honzajavorek, Mar 11 '25 10:03

So the expected structure probably comes from the templates, such as python-crawlee-beautifulsoup. Calling apify create --template=python-crawlee-beautifulsoup generates the following structure:

Image

honzajavorek, Mar 11 '25 12:03

Parts of this issue have been fixed in recent versions, and there are plans to clean it up further in the near future! (tracked mostly in #766)

vladfrangu, May 13 '25 12:05