PlatonAI

Results 89 comments of PlatonAI

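A redirect in the browser does not affect WebPage.url; it only sets WebPage.location to the link finally shown in the browser. So the situation you described above should not occur. You should normalize the link before fetching the page, and the normalized link should become the only valid "uniform resource locator" for that page. For example, a product page may show up in any of the following forms:
```
1. https://www.amazon.co.uk/dp/B0BS3ZRCCW?th=1
2. https://www.amazon.co.uk/4pcs-Wheel-Centre-Caps-Replacement/dp/B0BS3ZRCCW/ref=zg-bsnr_automotive_sccl_3/258-4903014-4534368?pd_rd_w=BHv9M&content-id=amzn1.sym.401f1a3a-5fa9-46fb-9ed2-7c7d241a11cd&pf_rd_p=401f1a3a-5fa9-46fb-9ed2-7c7d241a11cd&pf_rd_r=2YQWPCKBZ3AQNX97MTH2&pd_rd_wg=moQWE&pd_rd_r=a80bc1d3-fc5b-4b1b-93ce-4f4f77532037&pd_rd_i=B0BS3ZRCCW&psc=1
3. https://www.amazon.co.uk/4pcs-Wheel-Centre-Caps-Replacement/dp/B0BS3ZRCCW/ref=zg-bsnr_automotive_sccl_3
```
But you should unify all non-standard forms into the standard form: `https://www.amazon.co.uk/dp/B0BS3ZRCCW`. However the page redirects afterwards, or whatever parameters get appended, the standard form is the only valid URL in your system; the other forms serve only as references. Deciding whether to apply a rule based on WebPage.location is not recommended.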
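The normalization step described above can be sketched in a few lines. This is only an illustration in Python, not PulsarR code: the helper name `normalize_product_url` is hypothetical, and the assumption that every product link carries a 10-character ID after `/dp/` is inferred from the three example URLs rather than from any Amazon specification.

```python
import re
from urllib.parse import urlsplit

# Assumption: a product ID is 10 uppercase alphanumerics following "/dp/",
# as seen in the example URLs; adjust the pattern for other link shapes.
PRODUCT_ID = re.compile(r"/dp/([A-Z0-9]{10})(?:/|$)")


def normalize_product_url(url: str) -> str:
    """Return the canonical https://<host>/dp/<id> form, or the URL unchanged."""
    parts = urlsplit(url)
    # Search only the path, so query parameters and "ref=" segments are dropped.
    match = PRODUCT_ID.search(parts.path)
    if match is None:
        return url  # not a product link; leave it alone
    return f"{parts.scheme}://{parts.netloc}/dp/{match.group(1)}"


if __name__ == "__main__":
    print(normalize_product_url("https://www.amazon.co.uk/dp/B0BS3ZRCCW?th=1"))
```

Running the normalizer before a link is submitted means every variant is stored under the same key, so later redirects or tracking parameters cannot create duplicate records.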

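There is no logic that matches against the URL after a redirect.
1. If every input link of a given pattern redirects to a link of another fixed pattern, just match on the input link directly.
2. If 1 does not hold, you need to write some code yourself to handle this. You can create the link to be fetched as a ListenableHyperlink, register an event handler for onHTMLDocumentParsed, and run X-SQL inside that handler to extract the fields.
Related links:
1. [PulsarR Course Series 7 - Event Handling](https://zhuanlan.zhihu.com/p/576071511)
2. [EventHandler Example](https://github.com/platonai/pulsarr/blob/master/pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/_6_EventHandler.kt)
3. [Combining event handlers with X-SQL - PulsarR...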

MongoDB has to be started; it is used to store the metadata, and it's really simple to use. JdbCommitter is used to save the extracted results to a JDBC-compatible...

The JDBC config file config/jdbc-sink-config.json has been deprecated; you should set up JdbCommitter programmatically. You can use schema-dump.sql to create your MySQL tables. The config file extract-config.json is a mapping from...

Running with the standalone jar is not supported currently. Please try running the program as described in the README.

The log says it very clearly: not enough memory. `01:15:14.239 [r-worker-1] INFO a.p.p.crawl.impl.StreamingCrawler - 321. numRunning: 0, availableMemory: 93.46 MiB, memoryToReserve: 1.00 GiB, shortage: -975736832 B`
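The numbers in that log line are consistent with each other, which a quick check makes visible. The sketch below assumes that `shortage` is computed as `availableMemory - memoryToReserve`; that relation is inferred from the logged values, not taken from the StreamingCrawler source, and the function name is hypothetical.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def available_from_shortage(memory_to_reserve: int, shortage: int) -> int:
    """Recover available bytes, assuming shortage = available - reserve."""
    return memory_to_reserve + shortage


# Values copied from the log line: memoryToReserve 1.00 GiB, shortage -975736832 B.
available = available_from_shortage(1 * GIB, -975_736_832)
print(f"{available / MIB:.2f} MiB")  # prints "93.46 MiB", matching availableMemory
```

Under that assumption, a negative shortage means the machine is about 930 MiB short of the 1 GiB the crawler wants to keep free.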

Exotic-amazon is not a toy program but a real-world solution for crawling one of the biggest websites in the world, completely and accurately, so you'd better use a better...

It is very strange to see the following log if your computer really has 32 GB of memory that is not being used by other programs. > 02:40:31.221 [r-worker-1] INFO a.p.p.crawl.impl.StreamingCrawler - 1. numRunning:...

What about cleaning the project and building it again?