crawlee icon indicating copy to clipboard operation
crawlee copied to clipboard

fix: Update click-elements.ts - fix uncaught exception issue

Open Gilc83 opened this issue 11 months ago • 6 comments

If the history is empty because of a navigation (for example: clicking a link - < a href=...>), then this prevents an empty state to cause an exception on line 549 in the same function.

Gilc83 avatar Jan 06 '25 21:01 Gilc83

Hi @Gilc83 and thank you for your contribution! Could you please run yarn format to make sure that the code passes the lint check?

janbuchar avatar Jan 08 '25 08:01 janbuchar

few comments from my end:

  • same issue is likely present in puppeteer, lets fix that too https://github.com/apify/crawlee/blob/master/packages/puppeteer-crawler/src/internals/enqueue-links/click-elements.ts#L557-L561
  • some tests would be nice
  • PR title will end up in changelog so it needs to be something more meaningful

B4nan avatar Jan 08 '25 09:01 B4nan

Happy to help @janbuchar :) I uses npm so if you could take care of the linting that would be great. It is also important to note that this issue is a sign of a bigger problem @B4nan: When the enqueueLinksByClickingElements if called, your code is calling the internal function clickElements which go through all elements and tries to click it in the loop - for (const handle of elementHandles)...

If a navigation occurs due to one of the clicks - a link, form submission etc., the loop will fail for all remaining element handlers, saying that the context has been changed. From what I see there are 2 possible options:

  1. Identify a navigation event and stop the loop and the whole process of enqueueLinksByClickingElements in general
  2. Disable the navigation and continue clicking on other elements. ChatGPT consulted doing something like this: await page.evaluate(() => { document.querySelectorAll('a[href]').forEach((el) => { el.addEventListener('click', (event) => { event.preventDefault(); // Stop the normal navigation // Possibly open in a new tab, or handle the link differently // window.open(el.href, '_blank'); }); }); });

If you ask me for my opionion, I'm using the enqueuLinks function and then immediately the enqueueLinksByClickingElements, in order to get the maximum amounts of links in the website, so I believe the second option is better. However, the temporary fix I was adding, that prevents the whole function from crashing and the current Crawlee request to fail, is essentially the first option - the clicking is ceased. So I think my fix can be applied for the short term, but navigation disabling of any kind should be considered in the future.

PS If I got your attention I'll be happy if you could also have a look at my feature request when you have the time. https://github.com/apify/crawlee/issues/2786 I'm writing a web spider based on your awsome library and it would really help me (I did an ugly workaround for now). Thank you very much and love your work :) 🥇 💯

Gilc83 avatar Jan 08 '25 20:01 Gilc83

Happy to help @janbuchar :) I uses npm so if you could take care of the linting that would be great.

Don't worry, npm run format will work just as fine. Also, is there a chance you could write a test that fails before the fix and succeeds after it? Plus we need a better description of the issue that you fixed for the title (and changelog) :slightly_smiling_face:

It is also important to note that this issue is a sign of a bigger problem @B4nan: When the enqueueLinksByClickingElements if called, your code is calling the internal function clickElements which go through all elements and tries to click it in the loop - for (const handle of elementHandles)...

If you ask me for my opionion, I'm using the enqueuLinks function and then immediately the enqueueLinksByClickingElements, in order to get the maximum amounts of links in the website, so I believe the second option is better. However, the temporary fix I was adding, that prevents the whole function from crashing and the current Crawlee request to fail, is essentially the first option - the clicking is ceased. So I think my fix can be applied for the short term, but navigation disabling of any kind should be considered in the future.

Right, would you mind opening an issue that describes the problem and the possible solution so that we can come back to it later?

PS If I got your attention I'll be happy if you could also have a look at my feature request when you have the time. #2786 I'm writing a web spider based on your awsome library and it would really help me (I did an ugly workaround for now). Thank you very much and love your work :) 🥇 💯

Don't worry, we go through all new issues and we will eventually resolve this one, too :slightly_smiling_face:

janbuchar avatar Jan 17 '25 08:01 janbuchar

@janbuchar janbuchar I updated the pull request. Please merge :)

Gilc83 avatar Mar 26 '25 21:03 Gilc83

Please address at least the points I made here.

B4nan avatar Mar 31 '25 07:03 B4nan