Cannot scrape threads: Node is detached from document
Description
When trying to scrape a Slack channel or conversation with lots of threads, the scraper will sometimes throw the error `Node is detached from document`. After researching some, it seems this happens because Puppeteer auto-scrolls down to the thread's button when you call `.click()`. When it scrolls down, this sometimes causes Slack to rebuild some DOM elements. As a result, the nodes that were retrieved earlier are no longer valid and cannot be clicked, which causes the `Node is detached from document` error.
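For context, the failure shows up in the usual handle-then-click pattern. Here is a minimal sketch of the kind of loop that hits it; `page` and `postsSelector` are assumed to be in scope, and the reply-button selector is a placeholder, not the scraper's real one:

```js
const postHandles = await page.$$(postsSelector)

for (const post of postHandles) {
  // Placeholder selector for the "_ replies" button.
  const repliesButton = await post.$('[data-qa="reply_bar_count"]')
  if (!repliesButton) continue

  // Puppeteer scrolls the element into view before clicking. If Slack
  // rebuilds its virtualized message list during that scroll, this handle
  // now points at a removed node and the click throws
  // "Node is detached from document".
  await repliesButton.click()
}
```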
How to reproduce
I was able to reproduce this error consistently by going into a DM with Slackbot (or myself) and doing the following:
- First, send 8 messages, each containing 5 paragraphs of Lorem Ipsum.
  a) Reply to each of these messages in a thread so that the `_ replies` button shows up.
- Second, send 2 messages (each still with 5 paragraphs of Lorem Ipsum).
  a) DO NOT reply to any of these messages.
- Finally, send one final message with 5 paragraphs of Lorem Ipsum.
  a) Reply to this message in a thread.
When you run the scraper, it should produce the `Node is detached from document` error, since calling `.click()` on all the reply buttons makes Puppeteer scroll the page, which eventually rebuilds the DOM and causes the bug.
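If you want to confirm that the handles really are detached (rather than the click failing for some other reason), you can check the standard DOM property `Node.isConnected` on the handle before clicking. A small sketch, assuming `page` and `repliesButton` from the loop above:

```js
// A detached handle's underlying node reports isConnected === false.
const attached = await page.evaluate((el) => el.isConnected, repliesButton)
if (!attached) {
  console.warn('Reply button handle is detached; Slack rebuilt the DOM')
}
```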
Potential Fix
I was able to fix this issue in a very, very hacky way. Essentially, we don't want to use Puppeteer's `.click()`, since according to the docs "This method scrolls element into view if needed", and that scrolling is what causes the DOM to rebuild. To fix this, I replaced the usages of Puppeteer's native `.click()` with the following:
```js
await page.evaluate((element) => {
  element.click()
}, repliesButton)
await page.waitForTimeout(1000)
```
This executes some JS in the browser to call `.click()` outside of Puppeteer. The browser's native `.click()` does not scroll the page, so it doesn't cause the rebuild. The timeout is just to wait for the click's effects to fully settle.
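If this lands in a PR, the snippet could be wrapped in a small helper so every call site doesn't repeat the boilerplate. A sketch; the helper name is mine, not something in the codebase:

```js
// Click an element inside the page context instead of through Puppeteer,
// so the page is never scrolled and Slack doesn't rebuild the DOM.
async function clickInPage(page, elementHandle, settleMs = 1000) {
  await page.evaluate((element) => element.click(), elementHandle)
  // Give Slack time to react (e.g. open the thread view) before continuing.
  await page.waitForTimeout(settleMs)
}

// Usage at the existing call sites:
// await clickInPage(page, repliesButton)
```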
However, this does not fully fix the problem: when you click on a reply, it opens the thread view on the side. At the default viewport size, this alone is sometimes enough to detach the node, since the screen isn't wide enough, the node goes out of view, and Slack disposes of it. To fix this, I also bumped the default viewport width to 10,000, like so: `defaultViewport: { height: 6000, width: 10000 }`. In headless mode, this allows the script to scrape without encountering the error. With headless mode turned off, you need to manually make the window wider so that the thread side view doesn't push the node out of view.
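For reference, here is where that setting would go. This is a sketch of a standard `puppeteer.launch` call; I'm assuming the scraper launches the browser in roughly this way:

```js
const puppeteer = require('puppeteer')

async function launchBrowser() {
  // A tall, very wide viewport keeps both the message list and the thread
  // side panel in view, so Slack doesn't dispose of nodes we hold handles to.
  return puppeteer.launch({
    headless: true,
    defaultViewport: { height: 6000, width: 10000 },
  })
}
```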
Disclaimer: so far this fix does seem to work, but it hasn't been fully verified. I don't know whether making the viewport wider might interfere with the scrolling logic in the Slack scraper and cause it to accidentally skip messages.
Alternative Fixes
I had also looked at whether there was a way to just refresh the list of posts whenever the scraper scrolled due to `.click()`. However, I suspect that simply re-running `postHandles = await page.$$(postsSelector)` would mess up the index in the for loop that iterates over the posts. I also tried only ever grabbing the first post and marking each post as done using attributes. I noticed that the scraper currently uses `.isScraped`, so I tried that first, but then saw online that data attributes (e.g. `.dataset.isScraped`) are more reliable, so I used those to track which elements had already been visited. However, scrolling down breaks this behavior too: the nodes get rebuilt and the attributes are removed with them. So this solution doesn't seem workable either.
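For concreteness, this is roughly the marking approach that failed (selectors are placeholders, and `page`/`postsSelector` are assumed to be in scope). It works until a scroll makes Slack rebuild the nodes, at which point the `data-is-scraped` markers vanish with them:

```js
// Repeatedly grab the first post that hasn't been marked yet.
// ':not([data-is-scraped])' relies on the marker surviving in the DOM.
let post = await page.$(`${postsSelector}:not([data-is-scraped])`)
while (post) {
  // ...scrape the post and its thread here...

  // Mark the node so it's never picked up again. When Slack rebuilds the
  // list after a scroll, this attribute is lost and the post is revisited.
  await page.evaluate((el) => { el.dataset.isScraped = 'true' }, post)
  post = await page.$(`${postsSelector}:not([data-is-scraped])`)
}
```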
Summary
While I got a fix, it is super hacky, and I was curious what others thought about it or whether they had better solutions to the issue. If everyone is fine with the fix, I can open a PR with it included, but I wanted to get feedback on this approach first, or see if someone had a better idea of how to fix it.
It would be great if you could open the PR as a draft, so that we can also help commit some changes if needed.