SmartReader
SmartReader is a library to extract the main content of a web page, based on a port of the Readability library by Mozilla.
Thresholds like `MinContentLengthReadearable` must be language-sensitive. For example, the default values work quite well for English, but for Chinese they should be something like 8x lower. The obvious...
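To make the idea of a language-sensitive threshold concrete, here is a minimal sketch. It does not use SmartReader's API: `LanguageThresholds`, `Scale`, and the divisor table are hypothetical names introduced only for illustration, and the single entry for Chinese simply mirrors the rough "8x lower" estimate above.

```cs
using System;
using System.Collections.Generic;

// Hypothetical helper, not part of SmartReader.
public static class LanguageThresholds
{
    // Assumed divisors relative to an English baseline. Only "zh" comes from
    // the estimate in the text; further entries would need real measurements.
    private static readonly Dictionary<string, int> Divisors =
        new Dictionary<string, int>(StringComparer.OrdinalIgnoreCase)
        {
            ["zh"] = 8,
        };

    // Scales a character-count threshold tuned for English to another language.
    public static int Scale(int englishThreshold, string languageTag)
    {
        if (string.IsNullOrWhiteSpace(languageTag))
        {
            return englishThreshold; // unknown language: keep the baseline
        }

        // Reduce tags such as "zh-CN" or "zh_TW" to their primary subtag "zh".
        string primary = languageTag.Trim().Split('-', '_')[0];

        return Divisors.TryGetValue(primary, out int divisor)
            ? Math.Max(1, englishThreshold / divisor)
            : englishThreshold;
    }
}
```

With an arbitrary baseline of 160 characters, `LanguageThresholds.Scale(160, "zh-CN")` yields 20, while `LanguageThresholds.Scale(160, "en")` leaves the value at 160. The scaled result could then be assigned to whichever threshold the reader exposes.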
Testing on https://kotaku.com/destiny-2-witch-deepsight-resonance-crafting-solstice-1849392326 I get no result with SmartReader, but I do with Readability.js. I use Playwright (headless Chrome) to get the HTML and feed it to SmartReader. > no...
This is not a typical request, so I understand if it doesn't make sense. I am using SmartReader in a project. I use AngleSharp to parse the document and extract...
Hello, we are using your library, but we encountered an issue with German sites: it cannot convert special characters such as Ö and Ä. I even tried putting it into...
[Fasttext](https://fasttext.cc/) will make it possible to implement effective language identification with little space and few resources required. This can be useful for a lot of content that has no language and also for...
The library can extract any manual excerpt that is contained in the article (i.e., the short summary that usually is shown in Facebook or Twitter). However, it can be useful...
The Reader class creates an IHtmlDocument object, which needs to be disposed of. Workaround using reflection: `(typeof(Reader).GetField("doc", BindingFlags.NonPublic | BindingFlags.Instance).GetValue(reader) as IHtmlDocument)?.Dispose();`
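The reflection workaround above can be wrapped in a small reusable helper. The following is only a sketch of that pattern using standard .NET reflection. `PrivateFieldDisposer`, `TryDisposeField`, and the `Owner` stand-in class are hypothetical names for illustration, and `Owner` holds a `MemoryStream` so that the example runs without SmartReader or AngleSharp.

```cs
using System;
using System.IO;
using System.Reflection;

// Hypothetical helper: disposes a private instance field if it holds an IDisposable.
public static class PrivateFieldDisposer
{
    public static bool TryDisposeField(object owner, string fieldName)
    {
        if (owner == null) throw new ArgumentNullException(nameof(owner));

        // Private fields are only visible on the type that declares them,
        // so walk up the inheritance chain one type at a time.
        for (Type type = owner.GetType(); type != null; type = type.BaseType)
        {
            FieldInfo field = type.GetField(
                fieldName,
                BindingFlags.NonPublic | BindingFlags.Instance | BindingFlags.DeclaredOnly);

            if (field != null && field.GetValue(owner) is IDisposable disposable)
            {
                disposable.Dispose();
                return true;
            }
        }

        return false; // field not found, null, or not disposable
    }
}

// Stand-in for a class that keeps a disposable in a private field named "doc".
public sealed class Owner
{
    private readonly MemoryStream doc = new MemoryStream();

    // MemoryStream.CanRead becomes false once the stream is disposed.
    public bool CanRead => doc.CanRead;
}
```

Calling `PrivateFieldDisposer.TryDisposeField(reader, "doc")` on a reader instance would correspond to the one-liner in the issue, assuming the private field is still named `doc`. Because it relies on a private field name, this kind of workaround can break silently when the library's internals change.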
For this code: ```cs Article article = Reader.ParseArticle(url); ``` If the request to the URL returns `HTTP 403 Forbidden`, it raises an `HttpRequestException`. The exception object will contain a `.StatusCode`...
Not an issue, but discussions aren't enabled so couldn't resist posting here. This package is great and saved me _a lot_ of time, thanks for the effort and thanks for...
The methods ConvertToPlaintext and ConvertToText could do with a few small performance improvements. We found a few HTML pages where those methods would take 10 minutes to run, and after...