PlatonAI comments

Results: 89 comments of PlatonAI

InaccessibleObjectException: Unable to make field private final long java.time.Duration.seconds accessible

Solution:

1. allow the user to specify the JAVA_HOME
2. fix the Gson / Java 8 date-time serialization problem with JDK-17
3. run all tests with JDK-17

See also: https://bugs.chromium.org/p/gerrit/issues/detail?id=15502
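For context, a minimal sketch of one common way around this error. It assumes the exception arises because a serializer reads the private fields of `java.time.Duration` through reflection, which JDK 17 blocks by default. The helper names `encodeDuration`, `decodeDuration`, and `isDurationFieldBlocked` are hypothetical and are not part of Gson or PulsarRPA. The idea is to write the value as ISO-8601 text through the public API instead.

```kotlin
import java.time.Duration

// Hypothetical helper: encode a Duration through its public API (ISO-8601 text)
// instead of reading its private fields reflectively.
fun encodeDuration(d: Duration): String = d.toString()

// Hypothetical helper: the inverse of encodeDuration.
fun decodeDuration(text: String): Duration = Duration.parse(text)

// Probes whether reflective access to Duration.seconds is blocked.
// The exception class is matched by name so the code also compiles on JDK 8.
fun isDurationFieldBlocked(): Boolean = try {
    Duration::class.java.getDeclaredField("seconds").isAccessible = true
    false
} catch (e: RuntimeException) {
    e.javaClass.name == "java.lang.reflect.InaccessibleObjectException"
}

fun main() {
    val original = Duration.ofSeconds(90)
    val text = encodeDuration(original)
    println(text)                               // PT1M30S
    println(decodeDuration(text) == original)   // true
    println("reflection blocked: ${isDurationFieldBlocked()}")
}
```

With Gson, helpers like these could be called from a custom `TypeAdapter<Duration>` registered on the `GsonBuilder`. Another option is to start the JVM with `--add-opens java.base/java.time=ALL-UNNAMED`, which re-enables the reflective access rather than avoiding it.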

Newbie question: I get the error "Google Chrome is not found in your system". Do I need chromedriver.exe, and which folder should it go in?

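PulsarRPA does not depend on chromedriver. We are not sure whether MINGW is supported, nor whether Chrome can be installed successfully under MINGW. PulsarRPA supports Windows and WSL, so we do not recommend trying PulsarRPA under MINGW. More information is available on the project home page, and there is also a concise tutorial: [PulsarRPA course series - Table of contents](https://zhuanlan.zhihu.com/p/576130585)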

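If a website only supports login by phone number plus verification code, or by scanning a QR code, how can the login state be handled?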

In this case the only option is GUI mode: wait on the login page and log in manually.

How do I use headless Chrome for crawling?

Deployment on a Linux server for crawling is supported. Browser installation:

```
git clone https://github.com/platonai/pulsar.git
cd pulsar && bin/build-run.sh
```

Browser settings: configure them with BrowserSettings, for example: `BrowserSettings.privacy(3).maxTabs(10).headless()`

This code tells the system to:

1. Start 3 privacy-isolated browsers at the same time, none of which interferes with the others
2. Open at most 10 tabs at a time in each browser
3. Use headless mode

[Chinese tutorial](https://blog.csdn.net/weixin_48738961/article/details/127534381) [Code example](https://github.com/platonai/pulsarr/blob/master/pulsar-app/pulsar-examples/src/main/kotlin/ai/platon/pulsar/examples/_9_MassiveCrawler.kt).

Doesn't work with chrome v111

We have the same issue: https://github.com/platonai/exotic-amazon/issues/16 . We have developed a project to scrape web data at scale completely and accurately with high performance, distributed RPA, and the browser layer...

Doesn't work with chrome v111

Fixed my problem with @karlvr's solution:

1. change the HTTP method to PUT when creating, activating, or closing a tab
2. add the Chrome launch parameter `--remote-allow-origins=*`
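The two changes can be sketched as follows. This is an illustration under stated assumptions, not the project's actual patch: the function names `chromeCommand` and `newTabRequest`, the binary name, and the port 9222 are placeholders, and the sketch assumes the tab is created through the DevTools HTTP endpoint `/json/new`. It only builds the command line and the request; it does not launch Chrome or send anything.

```kotlin
import java.net.URI
import java.net.http.HttpRequest

// Hypothetical helper: builds a Chrome command line with the extra flag.
// The binary path and port are placeholders chosen for illustration.
fun chromeCommand(chromeBinary: String, port: Int): List<String> = listOf(
    chromeBinary,
    "--remote-debugging-port=$port",
    "--remote-allow-origins=*"
)

// Hypothetical helper: builds the request that opens a new tab, using PUT
// instead of GET as the fix above describes.
fun newTabRequest(port: Int, url: String): HttpRequest =
    HttpRequest.newBuilder(URI.create("http://localhost:$port/json/new?$url"))
        .PUT(HttpRequest.BodyPublishers.noBody())
        .build()

fun main() {
    println(chromeCommand("google-chrome", 9222).joinToString(" "))
    val request = newTabRequest(9222, "about:blank")
    println("${request.method()} ${request.uri()}")
}
```

To actually start the browser, the command list could be passed to `ProcessBuilder`, and the request could be sent with `java.net.http.HttpClient` once the debugging port is open.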

How to extract the news detail page?

```
val url = "https://www.eeo.com.cn/2024/0330/648712.shtml"
val session = ScentContexts.createSession()
val document = session.harvestArticle(url, session.options())
println(document.contentTitle)
println(document.textContent)
```

[eeo.com.cn crawler](https://github.com/platonai/PulsarRPAPro/blob/a896725327482bf8cf2fc1b6372b2e2067436e42/exotic-app/exotic-examples/src/main/kotlin/ai/platon/exotic/examples/sites/news/eeo/EEO.kt)

![image](https://github.com/platonai/PulsarRPAPro/assets/37785921/1c7f068e-2d0b-4f84-928f-1e6baf268ffe)

How to extract the news detail page?

If you need an open-source solution, use the code below:

```
fun harvestArticle(page: WebPage): TextDocument {
    return SAXInput().parse(page.baseUrl, page.contentAsSaxInputSource).also { ChineseNewsExtractor().process(it) }
}
```

`ChineseNewsExtractor` is implemented in PulsarRPA.

How to extract the news detail page?

> > If you need an open-source solution, use the code below:
> >
> > ```
> > fun harvestArticle(page: WebPage): TextDocument {
> >     return SAXInput().parse(page.baseUrl, page.contentAsSaxInputSource).also { ChineseNewsExtractor().process(it) }
> > }
> > ```
> >
> > >...

How to extract the news detail page?

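> Different websites have different element structures, so separate logic has to be written for each company's site, for example amazon, zhihu, jd, and so on. The README on the project home page covers this.
>
> More information: https://www.bilibili.com/video/BV1qV411R7Xq/ This video shows how our AI technology accurately understands every field on a web page and turns the page into structured data or an Excel spreadsheet. By combining unsupervised and supervised learning for web data extraction, we have raised the productivity of the people doing extraction by more than 1000 times, improved extraction accuracy, lowered the skill requirements for staff, and removed the need to maintain extraction rules frequently.
>
> http://platonic.fun/i/ai?url=aHR0cHM6Ly93d3cuaHVhLmNvbS9tZWlndWkv This is a live demo of the AI technology accurately understanding and extracting web page fields.
>
> https://www.bilibili.com/video/BV1Zi4y1h7aq/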