exotic-amazon

A complete solution for crawling Amazon at scale, completely and accurately.

Results 30 exotic-amazon issues

java.nio.file.FileSystemNotFoundException: null
    at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.getFileSystem(ZipFileSystemProvider.java:169)
    at jdk.zipfs/jdk.nio.zipfs.ZipFileSystemProvider.getPath(ZipFileSystemProvider.java:155)
    at java.base/java.nio.file.Path.of(Path.java:208)
    at java.base/java.nio.file.Paths.get(Paths.java:97)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonGenerator.getPeriodicalSeedDirectories(AmazonGenerator.kt:61)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonGenerator.generateLoadingTasks(AmazonGenerator.kt:111)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonGenerator.generateStartupTasks(AmazonGenerator.kt:85)
    at ai.platon.exotic.amazon.crawl.boot.component.AmazonCrawler.generate(AmazonCrawler.kt:53)
    at ai.platon.scent.crawl.AbstractRunnableCrawler.run0(AbstractRunnableCrawler.kt:49)
    at ai.platon.scent.crawl.AbstractRunnableCrawler.run$suspendImpl(AbstractRunnableCrawler.kt:29)
    at ai.platon.scent.crawl.AbstractRunnableCrawler.run(AbstractRunnableCrawler.kt)
    at ai.platon.scent.crawl.AbstractRunnableStreamingCrawler.run$suspendImpl(AbstractRunnableStreamingCrawler.kt:24)
    at ai.platon.scent.crawl.AbstractRunnableStreamingCrawler.run(AbstractRunnableStreamingCrawler.kt)
    at ai.platon.scent.crawl.AbstractRunnableCrawler$run$1$1.invokeSuspend(AbstractRunnableCrawler.kt:22)...
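The trace above shows `Paths.get` being handed a URI that the zip file system provider handles, which typically happens when a resource directory is resolved from inside a packaged JAR before any zip file system has been opened for it. The following is a minimal sketch of the usual JDK-level workaround, not the project's actual code: the class `JarPathResolver` and the method `resolve` are hypothetical names, and the source of `AmazonGenerator` is not shown here, so whether this matches its real cause is an assumption.

```java
import java.io.IOException;
import java.net.URI;
import java.nio.file.FileSystemNotFoundException;
import java.nio.file.FileSystems;
import java.nio.file.Path;
import java.util.Map;

// Hypothetical helper, not part of exotic-amazon.
public class JarPathResolver {
    /**
     * Resolves a URI to a Path. For jar: URIs the zip file system must be
     * open before Path.of(uri) is called, otherwise the provider throws
     * FileSystemNotFoundException.
     */
    public static Path resolve(URI uri) throws IOException {
        if (!"jar".equals(uri.getScheme())) {
            // Plain file: URIs use the default file system, which is always open.
            return Path.of(uri);
        }
        try {
            // Reuse the file system if something else already opened it.
            FileSystems.getFileSystem(uri);
        } catch (FileSystemNotFoundException e) {
            // Not open yet: open it. It is deliberately left open here so the
            // returned Path stays usable; the caller decides when to close it.
            FileSystems.newFileSystem(uri, Map.of());
        }
        return Path.of(uri);
    }
}
```

A caller would pass something like `getClass().getResource("/some/dir").toURI()` to `resolve`. Under concurrent use, `newFileSystem` can still throw `FileSystemAlreadyExistsException`, so a real implementation would likely need to synchronize or retry.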

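Without a proxy, the code on the `main` branch runs normally. With a proxy, pages are never fetched correctly (even after a long time, no page is crawled correctly). The crawl log always shows ( 💯 🔃 S for RR got 200 2.64 KiB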

If a page is redirected after it has been fetched, is there a way to configure the pattern in extract-config.json so that it matches the redirected URL?

good first issue

![image](https://user-images.githubusercontent.com/72730341/204227819-e7ab3868-cb6c-49b0-9bc7-252f094d09c0.png)

good first issue

The error log is as follows: ```plain /usr/bin/google-chrome-stable --proxy-server=1.84.252.243:4231 --headless --disable-gpu --hide-scrollbars --remote-debugging-port=0 --no-default-browser-check --no-first-run --no-startup-window --mute-audio --disable-background-networking --disable-background-timer-throttling --disable-client-side-phishing-detection --disable-hang-monitor --disable-popup-blocking --disable-prompt-on-repost --disable-sync --disable-translate --disable-blink-features=AutomationControlled --metrics-recording-only --safebrowsing-disable-auto-update --no-sandbox --ignore-certificate-errors --window-size=1920,1080 --pageLoadStrategy=none --throwExceptionOnScriptError=true --user-data-dir=/tmp/pulsar-root/context/browser/br.66b305 21:17:11.427...

good first issue
wontfix

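When crawling /product-reviews/... pages in a loop, the first page is crawled normally, but a problem occurs on the second page. The content of the crawled file is as follows: How should this be resolved (no proxy is used)?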

good first issue
wontfix

If I want to search for "iPad" on Amazon and crawl all the search results, what should I do?

![de757c3e453c8aa75e360c311909869](https://user-images.githubusercontent.com/39584730/221144545-066f8543-b5e8-4a56-828c-4b15f8f4c3e3.png) The download count has stopped changing at 2k, but Amazon has hundreds of millions of products.

good first issue
wontfix

Can this crawler crawl all consumer reviews? I only see the top-review folder, not the folder with all the reviews. ![063ac01d790830a7746e97a49823519](https://user-images.githubusercontent.com/39584730/221102001-7c11b1fb-4e57-4821-a353-8dfb0e7637f0.png)

good first issue
wontfix

How do I know how long the download will take? ![8c37ea662ba94790df95e4bf6e91273](https://user-images.githubusercontent.com/39584730/221091238-4b4ec177-5a2b-45c1-adc9-ef3cc7f50950.png)

good first issue
wontfix