Code tags are not parsed properly

Open charleshan opened this issue 2 years ago • 2 comments

It seems like trafilatura treats every <code> element as a new line, which is incorrect. I used this page for both examples below: https://www.django-rest-framework.org/api-guide/filtering/

Example 1

<p>The simplest way to filter the queryset of any view that subclasses <code>GenericAPIView</code> is to override the <code>.get_queryset()</code> method.</p>

Expected

The simplest way to filter the queryset of any view that subclasses GenericAPIView is to override the .get_queryset() method.

Actual

The simplest way to filter the queryset of any view that subclasses GenericAPIView is to override the .get_queryset() method.
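To make the expected behavior for Example 1 concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not trafilatura's implementation; the names `InlineAwareExtractor`, `extract_text`, and `BLOCK_TAGS` are hypothetical and exist only for illustration. The assumption is that block-level tags introduce line breaks while `<code>` is treated as inline.

```python
from html.parser import HTMLParser

# Assumption for this sketch: only these tags start or end a line.
# <code> is deliberately absent, so it is handled as inline content.
BLOCK_TAGS = {"p", "div", "pre", "li", "h1", "h2", "h3"}


class InlineAwareExtractor(HTMLParser):
    """Hypothetical extractor: line breaks around block tags only."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_starttag(self, tag, attrs):
        # Break before a block element, unless nothing was emitted yet.
        if tag in BLOCK_TAGS and self.parts:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        # Break after a block element closes.
        if tag in BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        # Text inside inline tags such as <code> is appended as-is.
        self.parts.append(data)


def extract_text(html):
    parser = InlineAwareExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


html = (
    "<p>The simplest way to filter the queryset of any view that subclasses "
    "<code>GenericAPIView</code> is to override the "
    "<code>.get_queryset()</code> method.</p>"
)
print(extract_text(html))
```

With this input the sentence comes out on a single line, which matches the "Expected" text above.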

Example 2

<pre class="prettyprint well"><code><span class="kwd">from</span><span class="pln"> myapp</span><span class="pun">.</span><span class="pln">models </span><span class="kwd">import</span><span class="pln"> </span><span class="typ">Purchase</span><span class="pln">
</span><span class="kwd">from</span><span class="pln"> myapp</span><span class="pun">.</span><span class="pln">serializers </span><span class="kwd">import</span><span class="pln"> </span><span class="typ">PurchaseSerializer</span><span class="pln">
</span><span class="kwd">from</span><span class="pln"> rest_framework </span><span class="kwd">import</span><span class="pln"> generics

</span><span class="kwd">class</span><span class="pln"> </span><span class="typ">PurchaseList</span><span class="pun">(</span><span class="pln">generics</span><span class="pun">.</span><span class="typ">ListAPIView</span><span class="pun">):</span><span class="pln">
    serializer_class </span><span class="pun">=</span><span class="pln"> </span><span class="typ">PurchaseSerializer</span><span class="pln">

    </span><span class="kwd">def</span><span class="pln"> get_queryset</span><span class="pun">(</span><span class="kwd">self</span><span class="pun">):</span><span class="pln">
        </span><span class="str">"""
        This view should return a list of all the purchases
        for the currently authenticated user.
        """</span><span class="pln">
        user </span><span class="pun">=</span><span class="pln"> </span><span class="kwd">self</span><span class="pun">.</span><span class="pln">request</span><span class="pun">.</span><span class="pln">user
        </span><span class="kwd">return</span><span class="pln"> </span><span class="typ">Purchase</span><span class="pun">.</span><span class="pln">objects</span><span class="pun">.</span><span class="pln">filter</span><span class="pun">(</span><span class="pln">purchaser</span><span class="pun">=</span><span class="pln">user</span><span class="pun">)</span></code></pre>

Expected

from myapp.models import Purchase
from myapp.serializers import PurchaseSerializer
from rest_framework import generics

class PurchaseList(generics.ListAPIView):
    serializer_class = PurchaseSerializer

    def get_queryset(self):
        """
        This view should return a list of all the purchases
        for the currently authenticated user.
        """
        user = self.request.user
        return Purchase.objects.filter(purchaser=user)

Actual

from myapp.models import Purchase from myapp.serializers import PurchaseSerializer from rest_framework import generics class PurchaseList(generics.ListAPIView): serializer_class = PurchaseSerializer def get_queryset(self): """ This view should return a list of all the purchases for the currently authenticated user. """ user = self.request.user return Purchase.objects.filter(purchaser=user)
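For Example 2, the expected behavior amounts to keeping whitespace verbatim inside `<pre>` while collapsing it elsewhere. The following is a rough sketch of that idea with the standard-library `html.parser`, again not trafilatura's actual code; `PreAwareExtractor` and `extract_preformatted` are hypothetical names, and the input is a shortened version of the markup above.

```python
import re
from html.parser import HTMLParser


class PreAwareExtractor(HTMLParser):
    """Hypothetical extractor: whitespace is preserved only inside <pre>."""

    def __init__(self):
        super().__init__()
        self.pre_depth = 0  # greater than zero while inside a <pre> element
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre" and self.pre_depth > 0:
            self.pre_depth -= 1

    def handle_data(self, data):
        if self.pre_depth:
            # Keep newlines and indentation exactly as written.
            self.parts.append(data)
        else:
            # Outside <pre>, collapse runs of whitespace to one space.
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_preformatted(html):
    parser = PreAwareExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts)


snippet = (
    '<pre class="prettyprint well"><code>'
    '<span class="kwd">def</span><span class="pln"> get_queryset</span>'
    '<span class="pun">(</span><span class="kwd">self</span>'
    '<span class="pun">):</span><span class="pln">\n'
    '    user </span><span class="pun">=</span><span class="pln"> </span>'
    '<span class="kwd">self</span><span class="pun">.</span>'
    '<span class="pln">request</span><span class="pun">.</span>'
    '<span class="pln">user</span></code></pre>'
)
print(extract_preformatted(snippet))
```

The syntax-highlighting `<span>` elements contribute no whitespace of their own, so the line break and the four-space indent inside the `<pre>` survive in the output.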

charleshan, Jul 08 '23 06:07

I understand your point, but in this case it's code within a p element, which affects its processing. The software has reached a balance, and although improvements are still possible, they are difficult to implement and test. Let's keep the issue open.

adbar, Jul 10 '23 10:07

I understand.

I frequently use this service, which converts web pages to EPUB. On the few pages I tested, it seems to do a much better job than Trafilatura: https://pushtokindle.fivefilters.org/send.php?url=https%3A%2F%2Fwww.django-rest-framework.org%2Fapi-guide%2Ffiltering%2F

charleshan, Jul 10 '23 20:07

This issue is now solved.

adbar, Apr 19 '24 11:04