Code tags are not parsed properly
It seems that trafilatura treats every <code> element as a new line, which is incorrect. I used this page for both examples below:
https://www.django-rest-framework.org/api-guide/filtering/
Example 1
<p>The simplest way to filter the queryset of any view that subclasses <code>GenericAPIView</code> is to override the <code>.get_queryset()</code> method.</p>
Expected
The simplest way to filter the queryset of any view that subclasses GenericAPIView is to override the .get_queryset() method.
Actual
The simplest way to filter the queryset of any view that subclasses GenericAPIView is to override the .get_queryset() method.
Example 2
<pre class="prettyprint well"><code><span class="kwd">from</span><span class="pln"> myapp</span><span class="pun">.</span><span class="pln">models </span><span class="kwd">import</span><span class="pln"> </span><span class="typ">Purchase</span><span class="pln">
</span><span class="kwd">from</span><span class="pln"> myapp</span><span class="pun">.</span><span class="pln">serializers </span><span class="kwd">import</span><span class="pln"> </span><span class="typ">PurchaseSerializer</span><span class="pln">
</span><span class="kwd">from</span><span class="pln"> rest_framework </span><span class="kwd">import</span><span class="pln"> generics
</span><span class="kwd">class</span><span class="pln"> </span><span class="typ">PurchaseList</span><span class="pun">(</span><span class="pln">generics</span><span class="pun">.</span><span class="typ">ListAPIView</span><span class="pun">):</span><span class="pln">
serializer_class </span><span class="pun">=</span><span class="pln"> </span><span class="typ">PurchaseSerializer</span><span class="pln">
</span><span class="kwd">def</span><span class="pln"> get_queryset</span><span class="pun">(</span><span class="kwd">self</span><span class="pun">):</span><span class="pln">
</span><span class="str">"""
This view should return a list of all the purchases
for the currently authenticated user.
"""</span><span class="pln">
user </span><span class="pun">=</span><span class="pln"> </span><span class="kwd">self</span><span class="pun">.</span><span class="pln">request</span><span class="pun">.</span><span class="pln">user
</span><span class="kwd">return</span><span class="pln"> </span><span class="typ">Purchase</span><span class="pun">.</span><span class="pln">objects</span><span class="pun">.</span><span class="pln">filter</span><span class="pun">(</span><span class="pln">purchaser</span><span class="pun">=</span><span class="pln">user</span><span class="pun">)</span></code></pre>
Expected
from myapp.models import Purchase
from myapp.serializers import PurchaseSerializer
from rest_framework import generics
class PurchaseList(generics.ListAPIView):
    serializer_class = PurchaseSerializer
    def get_queryset(self):
        """
        This view should return a list of all the purchases
        for the currently authenticated user.
        """
        user = self.request.user
        return Purchase.objects.filter(purchaser=user)
Actual
from myapp.models import Purchase from myapp.serializers import PurchaseSerializer from rest_framework import generics class PurchaseList(generics.ListAPIView): serializer_class = PurchaseSerializer def get_queryset(self): """ This view should return a list of all the purchases for the currently authenticated user. """ user = self.request.user return Purchase.objects.filter(purchaser=user)
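To make the expected behaviour concrete, here is a minimal sketch using only Python's standard-library `html.parser`. It is not trafilatura's implementation, and the names `TextExtractor` and `extract_text` are hypothetical. It only illustrates the two rules I would expect: inline <code> inside a paragraph stays on the same line, and text inside <pre> keeps its line breaks and indentation.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Hypothetical extractor used for illustration; not trafilatura code."""

    # Only these elements end a line; <code> and <span> are treated as inline.
    BLOCK_TAGS = {"p", "pre", "div"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.pre_depth = 0  # greater than zero while inside <pre>

    def handle_starttag(self, tag, attrs):
        if tag == "pre":
            self.pre_depth += 1

    def handle_endtag(self, tag):
        if tag == "pre":
            self.pre_depth = max(0, self.pre_depth - 1)
        if tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.pre_depth:
            # Preformatted text: keep newlines and indentation verbatim.
            self.parts.append(data)
        else:
            # Normal flow: collapse whitespace runs but keep boundary spaces,
            # so text around inline elements is not glued together.
            self.parts.append(re.sub(r"\s+", " ", data))


def extract_text(html):
    """Return the text of an HTML fragment under the two rules above."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "".join(parser.parts).strip()


if __name__ == "__main__":
    print(extract_text(
        "<p>Override the <code>.get_queryset()</code> method.</p>"
    ))
```

With this approach the paragraph from Example 1 comes out as a single line, and a highlighted <pre><code> block like the one in Example 2 keeps one statement per line.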
I understand your point, but in this case it's code within a p element, which affects its processing. The software has reached a balance, and although improvements are still possible, they are difficult to implement and test. Let's keep the issue open.
I understand.
This is a service I use frequently to convert webpages to EPUB. On the few pages I tested, it seems to do a much better job than Trafilatura: https://pushtokindle.fivefilters.org/send.php?url=https%3A%2F%2Fwww.django-rest-framework.org%2Fapi-guide%2Ffiltering%2F
This issue is now solved.