Corrupted Markdown output when TXT+formatting

Open clach04 opened this issue 10 months ago • 2 comments

I wrote a fairly complicated test case, then realized I could use the command-line tool instead :-D

The docs indicate that Markdown is an option: https://github.com/adbar/trafilatura/blob/d78fbb5e0d88566cb1326f04210a93b46db8ac87/docs/usage-python.rst?plain=1#L71

  • The plain text output (no Markdown) looks good.
  • In the examples I've tried so far the Markdown output is not usable. It appears to have the same content as the plain text output, BUT the formatting is incorrect: paragraph (line) breaks appear at odd places (e.g. at the 2nd character on a line).

Demo

Session 1 - server test data

Get the test data (once) and serve it locally to avoid repeatedly hitting the web site (I could not see a way to pass a file to trafilatura).

wget -O wget_output.html http://www.pcgamer.com/2012/08/09/an-illusionist-in-skyrim-part-1/
echo http://localhost:1234/wget_output.html
python3 -m http.server 1234
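
As an alternative to serving the saved page over HTTP, the Python API can be handed the HTML directly. The following is a minimal sketch, not a tested reproduction of this issue: `extract_local` is a hypothetical helper name, and it assumes that `trafilatura.extract` accepts the `include_formatting`, `include_links` and `include_images` keyword arguments, which correspond to the `--formatting`, `--links` and `--images` flags used in Session 2.

```python
from pathlib import Path
from typing import Callable, Optional


def extract_local(
    path: str,
    extractor: Optional[Callable[..., Optional[str]]] = None,
) -> Optional[str]:
    """Read a saved HTML file and run a content extractor on it.

    `extract_local` is a hypothetical helper, not part of trafilatura.
    """
    html = Path(path).read_text(encoding="utf-8", errors="replace")
    if extractor is None:
        # Imported lazily so the helper can be exercised with a stub
        # extractor when trafilatura is not installed.
        from trafilatura import extract as extractor
    # Assumed to mirror the CLI flags --formatting, --links and --images.
    return extractor(
        html,
        include_formatting=True,
        include_links=True,
        include_images=True,
    )
```

With trafilatura installed, `print(extract_local("wget_output.html"))` should produce the same kind of output as the `--formatting --links --images` command, without needing the local web server.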

Session 2 - scrape data

cd /tmp
mkdir trafilatura_demo
cd trafilatura_demo/

python3 -m venv py3venv
. py3venv/bin/activate
python -m pip install trafilatura

trafilatura --version

Then:

# good text output, without formatting
trafilatura -u http://localhost:1234/wget_output.html 


# not great - some new lines show up
trafilatura --links -u http://localhost:1234/wget_output.html 
trafilatura --links --images -u http://localhost:1234/wget_output.html 

# messed up paragraphs and newlines in markdown
trafilatura --formatting --links --images -u http://localhost:1234/wget_output.html 
trafilatura --formatting -u http://localhost:1234/wget_output.html 

Partial extract showing problem:

In
[Skyrim]...
....
"
*Legends ....

There are others in the same document but I'm reluctant to include too much of the content. Hopefully the test case above is enough to reproduce for other people.
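
To locate the other occurrences without pasting the article text, a small check can flag suspiciously short lines in the Markdown output. This is only a heuristic sketch: `find_stray_breaks` and its `max_len` threshold are hypothetical names chosen for illustration, and a short line is not always a bug (a short heading or list item would also be flagged).

```python
from typing import List, Tuple


def find_stray_breaks(markdown: str, max_len: int = 2) -> List[Tuple[int, str]]:
    """Return (line_number, text) for non-empty lines of at most max_len characters."""
    hits = []
    for number, line in enumerate(markdown.splitlines(), start=1):
        text = line.strip()
        # A one- or two-character line in the middle of prose usually
        # means a paragraph was split in the wrong place.
        if text and len(text) <= max_len:
            hits.append((number, text))
    return hits


# The partial extract from this issue, elisions kept as-is.
sample = 'In\n[Skyrim]...\n....\n"\n*Legends ....\n'
print(find_stray_breaks(sample))  # [(1, 'In'), (4, '"')]
```

Run on the partial extract above, it reports the lone `In` and the lone `"` as stray fragments.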

The odd formatting is really obvious when converting back into HTML (e.g. using pandoc in gfm mode, or any other md2html tool).


There is no output option for HTML (only XML), which was my idea for a workaround.

I did poke around the code but I can't get a handle on why whitespace is being injected in the XML cleaning code (I can see there are reasons for it; my ham-fisted attempt to remove it all was unsuccessful :-D).

Thanks for making this tool available. I'm currently using the Python readability module, and trafilatura does a much better job at metadata extraction (so far, readability works better for me for content extraction). I'm not sure if I'm misusing the library.

clach04 · Aug 06 '23 22:08