Collected links as metadata field?

Open Amaimersion opened this issue 2 years ago • 3 comments

I need to analyze all HTTP links in a scraped article, so I need a list of its URLs. But Trafilatura does not provide any way to get one:

  • it doesn't return the article node that was detected, which would let me parse that node myself
  • it doesn't collect the article's links in a separate list

For example, look at Goose3 - https://goose3.readthedocs.io/en/latest/code.html#article It provides a links result field. Or look at go-readability (a rewrite of Mozilla's Readability in Go) - https://pkg.go.dev/github.com/go-shiori/go-readability?utm_source=godoc#Article It provides the HTML node of the article, so that I can interact with this node.

At the moment I came up with this solution:

  1. parse a first time to get the clean article text
  2. parse a second time with include_links = True, then extract all links from the text with a regular expression. Links in the article are represented as [text](https://test.com), so they can be matched with a regex. Step 1 is needed to get clean text, because manually stripping the link markup from the output of step 2 is error-prone
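
The second step of this workaround could be sketched roughly as follows. This is a minimal illustration, not part of Trafilatura: the helper name extract_markdown_links and the sample string are hypothetical, and the pattern assumes links really appear as [text](URL) with URLs that contain no whitespace or closing parenthesis.

```python
import re

# Matches Markdown-style inline links: [text](http://...) or [text](https://...).
# Simplification (assumption): URLs containing ")" and link texts containing "]"
# are not handled by this pattern.
MD_LINK = re.compile(r"\[([^\]]*)\]\((https?://[^\s)]+)\)")


def extract_markdown_links(text):
    """Return the URLs of all Markdown-style links in text, in order, without duplicates."""
    seen = set()
    urls = []
    for match in MD_LINK.finditer(text):
        url = match.group(2)
        if url not in seen:
            seen.add(url)
            urls.append(url)
    return urls


# In practice the input would be the second extraction result, e.g.
#   text_with_links = trafilatura.extract(html, include_links=True)
# A hand-written sample is used here so the sketch runs on its own.
sample = "See [the docs](https://test.com) and [more](https://example.org/a?b=1)."
links = extract_markdown_links(sample)
```

Duplicates are dropped here on the assumption that only the set of distinct URLs matters for the analysis; remove the seen check if every occurrence is needed.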

Amaimersion, Jan 26 '23 09:01

Hi @Amaimersion, I believe you can do it without adding a new feature.

You can work on cleaner text and on nodes by using XML as output format:

  1. extract(your_document, include_links=True, output_format="xml")
  2. run a regex over the ref elements in the result, or run a parser on the document
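
The parser variant of these two steps might look like the sketch below. It assumes that the XML output marks links as ref elements carrying the URL in a target attribute; the helper name collect_ref_targets and the sample document are hypothetical, and the standard-library parser stands in for whatever XML parser is preferred.

```python
import xml.etree.ElementTree as ET


def collect_ref_targets(xml_output):
    """Return the target attribute of every ref element, in document order."""
    root = ET.fromstring(xml_output)
    targets = []
    for ref in root.iter("ref"):
        target = ref.get("target")
        if target:  # skip ref elements without a usable target
            targets.append(target)
    return targets


# In practice the input would come from step 1, e.g.
#   xml_output = extract(your_document, include_links=True, output_format="xml")
# A hand-written sample (assumed shape) is used here so the sketch runs on its own.
sample_xml = (
    '<doc title="Example">'
    "<main>"
    '<p>Read <ref target="https://test.com">this page</ref> and '
    '<ref target="https://example.org/a">that one</ref>.</p>'
    "</main>"
    "</doc>"
)
ref_targets = collect_ref_targets(sample_xml)
```

Parsing the tree rather than matching with a regex avoids problems with attribute order or escaped characters in the serialized XML.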

Shortcut in "expert mode": use bare_extraction and operate directly on the main element node, which is an lxml etree element.

Does that answer your question?

adbar, Jan 26 '23 13:01

Thank you @adbar! Yes, that answers my question, and with this I'm able to extract article-only links.

But I still wonder whether this could be added to Trafilatura itself to make the process easier. Trafilatura already has access to the article node at parsing time, so it could traverse all article-only links and return them along with the other article-related data, without requiring the developer to perform any extra manipulation.

Or does this logic conflict with something?

Amaimersion, Jan 26 '23 13:01

The links could indeed be added as a metadata field, at the cost of a bit of rewiring in the code.

adbar, Jan 26 '23 14:01