trafilatura
Is it possible to get the metadata together with Markdown-formatted output?
The json output format includes useful information such as title, author, and date. However, it looks like json only provides raw_text as the content format.
The workaround is to extract twice, once as json and once as txt with include_formatting, but I think we can do better.
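For reference, the two-pass workaround could be sketched roughly as follows. This is only an illustration, not how trafilatura handles it internally: the helper names and the `markdown` key are made up for this example, and it assumes `trafilatura.extract` accepts `output_format="json"` and `output_format="txt"` with `include_formatting=True`, as described above.

```python
import json
from typing import Optional


def merge_metadata_and_markdown(json_output: str, formatted_text: str) -> dict:
    """Attach the formatted text to the metadata record parsed from the JSON output."""
    record = json.loads(json_output)
    # "markdown" is a hypothetical key chosen for this sketch; existing keys are kept as-is.
    record["markdown"] = formatted_text
    return record


def extract_with_metadata(html: str) -> Optional[dict]:
    """Run the extraction twice and merge both results (hypothetical helper)."""
    import trafilatura  # third-party; imported here so the merge helper works without it

    # First pass: JSON output, which carries title, author, date, and so on.
    json_output = trafilatura.extract(html, output_format="json")
    # Second pass: plain text with formatting markers preserved.
    formatted_text = trafilatura.extract(
        html, output_format="txt", include_formatting=True
    )
    # extract() returns None when nothing could be extracted.
    if json_output is None or formatted_text is None:
        return None
    return merge_metadata_and_markdown(json_output, formatted_text)
```

The downside is that every document gets parsed and extracted twice, which is the overhead a combined output would avoid.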
Good point, the code here could definitely be improved to add further metadata:
https://github.com/adbar/trafilatura/blob/123414cae5f927e743f5eced2cd43b81a65fc43c/trafilatura/xml.py#L41