trafilatura
Collected links as metadata field?
I need to analyze all HTTP links in a scraped article. For that I need a list of URLs. But Trafilatura does not provide any way to get one:
- it doesn't return the article node that was detected, so I can't parse this node myself
- it doesn't collect the article's links in a separate list
For example, look at Goose3 - https://goose3.readthedocs.io/en/latest/code.html#article It provides a links result field.
Or look at go-readability (a rewrite of Mozilla's Readability in Go) - https://pkg.go.dev/github.com/go-shiori/go-readability?utm_source=godoc#Article It provides the HTML node of the article, so that I can interact with this node.
At the moment I have come up with this solution:
- parse a first time to get the article text
- parse one more time, but with include_links=True, then extract all links in the text using a regular expression, i.e. links in the article will be represented as [text](https://test.com), which can then be matched with a regular expression
The first pass is needed in order to have clean text, because trying to manually clean the text from the second pass is error-prone.
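A minimal sketch of the link-matching step in the second pass, assuming the links really do appear as [text](https://test.com) in the text output. The names MD_LINK and markdown_links are made up for illustration, and the pattern is deliberately simple (it does not handle nested brackets or parentheses inside URLs):

```python
import re

# Hypothetical pattern for links written as [text](http(s)://...).
# Assumption: link text contains no "]" and the URL contains no ")" or spaces.
MD_LINK = re.compile(r"\[([^\]]*)\]\((https?://[^\s)]+)\)")


def markdown_links(text):
    """Return (link text, URL) pairs found in the given text."""
    return [(m.group(1), m.group(2)) for m in MD_LINK.finditer(text)]


sample = "See [text](https://test.com) and [more](http://a.b/c)."
print(markdown_links(sample))
```

Here the input would be the text returned by the second parse; the first parse still supplies the clean text.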
Hi @Amaimersion, I believe you can do it without adding a new feature.
You can work on cleaner text and on nodes by using XML as output format:
- extract(your_document, include_links=True, output_format="xml")
- regex on the ref elements in the result, or a parser on the document
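The parser variant of the second step might look roughly like this. It is only a sketch: it assumes the XML result marks links as ref elements carrying the URL in a target attribute, and it uses the standard library parser on a hand-written sample string rather than real Trafilatura output. The name ref_links is hypothetical.

```python
import xml.etree.ElementTree as ET


def ref_links(xml_string):
    """Collect (link text, URL) pairs from all ref elements in an XML string.

    Assumption: each link is a <ref> element with a "target" attribute.
    """
    root = ET.fromstring(xml_string)
    links = []
    for ref in root.iter("ref"):
        target = ref.get("target")
        if target:  # skip ref elements without a URL
            text = "".join(ref.itertext()).strip()
            links.append((text, target))
    return links


# Hand-written sample standing in for the XML output
sample_xml = (
    '<doc><main><p>Read <ref target="https://test.com">this</ref> '
    "first.</p></main></doc>"
)
print(ref_links(sample_xml))
```

In practice xml_string would be the result of the extract(...) call above, and the plain text can be taken from the same tree, so a single parse may be enough.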
Shortcut in "expert mode": bare_extraction + direct operation on the main element node, which is an lxml etree element.
Does that answer your question?
Thank you @adbar! Yes, that answers my question, and with this I'm able to extract article-only links.
But I still wonder whether this could be added to Trafilatura itself to make the process easier. Trafilatura has access to the article node at parsing time anyway, so it could traverse all article-only links and return them along with the other article-related data, without requiring the developer to perform any extra manipulations.
Or does this logic conflict with something?
The links could indeed be added as a metadata field, at the cost of a bit of rewiring in the code.