exotic-amazon
A complete solution for crawling Amazon at scale, completely and accurately.
java.nio.file.FileSystemNotFoundException: null
    at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.getFileSystem(ZipFileSystemProvider.java:169)
    at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.getPath(ZipFileSystemProvider.java:155)
    at java.base/java.nio.file.Path.of(Path.java:208)
    at java.base/java.nio.file.Paths.get(Paths.java:97)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonGenerator.getPeriodicalSeedDirectories(AmazonGenerator.kt:61)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonGenerator.generateLoadingTasks(AmazonGenerator.kt:111)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonGenerator.generateStartupTasks(AmazonGenerator.kt:85)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonCrawler.generate(AmazonCrawler.kt:53)
    at ai.platon.scent.crawl.AbstractRunnableCrawler.run0(AbstractRunnableCrawler.kt:49)
    at ai.platon.scent.crawl.AbstractRunnableCrawler.run$suspendImpl(AbstractRunnableCrawler.kt:29)
    at ai.platon.scent.crawl.AbstractRunnableCrawler.run(AbstractRunnableCrawler.kt)
    at ai.platon.scent.crawl.AbstractRunnableStreamingCrawler.run$suspendImpl(AbstractRunnableStreamingCrawler.kt:24)
    at ai.platon.scent.crawl.AbstractRunnableStreamingCrawler.run(AbstractRunnableStreamingCrawler.kt)
    at ai.platon.scent.crawl.AbstractRunnableCrawler$run$1$1.invokeSuspend(AbstractRunnableCrawler.kt:22)
    ...
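The stack trace shows `AmazonGenerator.getPeriodicalSeedDirectories` resolving a classpath resource with `Paths.get`. When the application runs from a packaged jar, that resource URI uses the `jar:` scheme, and `Paths.get` throws `FileSystemNotFoundException` unless a zip file system has already been opened for the jar. Below is a minimal, generic java.nio sketch of one way to handle this; the helper name and resource lookup are illustrative assumptions, not the project's actual code.

```kotlin
import java.net.URI
import java.nio.file.FileSystemNotFoundException
import java.nio.file.FileSystems
import java.nio.file.Path
import java.nio.file.Paths

// Hypothetical helper: resolve a classpath resource directory to a Path,
// opening the backing zip file system first when the resource is packaged
// inside the application jar.
fun resourceDirectory(resource: String): Path {
    val uri: URI = Thread.currentThread().contextClassLoader.getResource(resource)?.toURI()
        ?: throw IllegalArgumentException("Resource not found: $resource")
    return if (uri.scheme == "jar") {
        // Paths.get(uri) fails with FileSystemNotFoundException for a jar: URI
        // unless the zip file system for that jar has already been created.
        val fs = try {
            FileSystems.getFileSystem(uri)
        } catch (e: FileSystemNotFoundException) {
            FileSystems.newFileSystem(uri, emptyMap<String, Any>())
        }
        // The entry path inside the jar is the part after the "!" separator.
        fs.getPath(uri.toString().substringAfter("!"))
    } else {
        Paths.get(uri)
    }
}
```

An alternative is to read the seed lists through `ClassLoader.getResourceAsStream` instead of the file-system API, which sidesteps the jar/zip file-system issue entirely.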
Without a proxy, the `main` branch code runs fine. With a proxy enabled, pages are never fetched correctly (the crawl runs for a long time without retrieving any page properly), and the crawl log keeps printing lines like: 💯 🔃 S for RR got 200 2.64 KiB
If a page is redirected after it is fetched, is there any way to configure the pattern in extract-config.json so that it matches the redirected URL?

Here is the error log:
```plain
/usr/bin/google-chrome-stable --proxy-server=1.84.252.243:4231 --headless --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true --user-data-dir=/tmp/pulsar-root/context/browser/br.66b305
21:17:11.427...
```
I am crawling the /prudct-reviews/... pages in a loop. The first page is fetched normally, but things go wrong from the second page onward; the content of the fetched file is as follows. How should this be resolved (no proxy is used)?
If I want to search for "iPad" on Amazon and crawl all of the search results, what should I do?
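One possible approach, sketched with the underlying PulsarRPA session API: load the search-result portal page for the keyword, then load the product pages it links to. The package path, CSS selector, and load arguments below are assumptions and may differ between versions; this is a sketch, not the project's documented workflow.

```kotlin
import ai.platon.pulsar.context.PulsarContexts

fun main() {
    // Minimal sketch: crawl the "iPad" search results and the product pages they link to.
    val session = PulsarContexts.createSession()
    val portalUrl = "https://www.amazon.com/s?k=iPad"

    // "-outLink a[href~=/dp/]" selects the product detail links on the result page;
    // "-expires 1d" re-fetches a page only if the cached copy is older than one day.
    val pages = session.loadOutPages(portalUrl, "-outLink a[href~=/dp/] -expires 1d")
    println("Loaded ${pages.size} product pages from the first result page")

    // Covering all results would also require following the pagination ("next page")
    // links and submitting those URLs back to the session or crawl queue.
}
```

For the full pipeline, the extraction rules for the search and product URL patterns would still need to be present in the project's extract configuration.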
The download count stops growing at around 2k, but Amazon has hundreds of millions of products.
Can this crawler crawl all consumer reviews? I only see the top-review folder, not the folder with all the reviews. 
How do I know how long the download will take?