fortyfourforty
I'm also interested in getting the index range of the extracted keywords. Let's say we don't remove stop words, so the extracted keywords are the same as is from the...
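To make the "index range" idea concrete, here is a rough sketch of what I mean, not any library's actual API. The helper name `keyword_spans` is made up for illustration, and it assumes the keywords appear verbatim in the source text (i.e. no stop-word removal or other normalization), matched case-insensitively:

```python
import re

def keyword_spans(text, keywords):
    """Map each keyword to a list of (start, end) character offsets.

    Hypothetical helper: assumes each keyword occurs verbatim in `text`.
    Matching is case-insensitive and `end` is exclusive.
    """
    spans = {}
    for kw in keywords:
        # re.escape keeps punctuation inside a keyword from being read as regex syntax
        pattern = re.compile(re.escape(kw), flags=re.IGNORECASE)
        spans[kw] = [m.span() for m in pattern.finditer(text)]
    return spans

text = "Keyword extraction finds keywords. Keyword spans locate them."
print(keyword_spans(text, ["keyword", "spans"]))
```

Note that this matches substrings, so "keyword" is also found inside "keywords". Wrapping the pattern in `\b` word boundaries would be one way to restrict it to whole words.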
> Trafilatura's download utilities should stay simple in order not to confuse users. There are lots of alternatives and downloading at scale is a different challenge altogether. A worst case...
Sorry, I forgot about archive.is. Noted. I don't think using deduplicate = True is a valid workaround, as there are some pages that do have the exact same text segments on...
I wish I could but my little, self-taught knowledge of Python and GitHub does not allow me to get my hands on PRs. 😞