GoOse
Not working with non-latin symbols
article.CleanedText is always empty
This should be fixed now (for Russian, Arabic and Chinese). If you want to recognize more languages, you have to add stopwords for them to the stopwords.go file.
Test:
package main

import (
	"github.com/advancedlogic/GoOse"
)

func main() {
	g := goose.New()
	article := g.ExtractFromUrl("http://habrahabr.ru/post/230885/")
	println("title", article.Title)
	println("description", article.MetaDescription)
	println("keywords", article.MetaKeywords)
	println("content", article.CleanedText)
	println("url", article.FinalUrl)
	println("top image", article.TopImage)
}
Result:
title Переезд в Лондон. Продолжение / Хабрахабр
description Часть 1
Моя новая работа была так себе. Я получил место младшего админа в отделе Windows Support. В обязанности входили дневные проверки, реагирование на алерты и первичная обработка запросов...
keywords великобритания, лондон
content
url http://habrahabr.ru/post/230885/
top image
The problem is that sometimes the page has no language metadata or reference. I will add a language detector based on stopword statistics.
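To illustrate the idea of a detector based on stopword statistics, here is a minimal sketch. It is not GoOse's implementation: `detectLanguage` and `sampleStopwords` are hypothetical names, and the tiny word lists are made up for the example rather than taken from stopwords.go. It simply counts how many words of the text appear in each language's stopword list and picks the language with the most hits.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// sampleStopwords is a tiny illustrative table (an assumption for this
// sketch), not the real list from GoOse's stopwords.go.
var sampleStopwords = map[string][]string{
	"en": {"the", "and", "is", "of", "to", "in"},
	"ru": {"и", "в", "не", "на", "что", "я"},
}

// detectLanguage returns the language code whose stopwords occur most often
// in text, or "" when no stopword from any language is found.
func detectLanguage(text string, stopwords map[string][]string) string {
	// Split on anything that is not a letter so punctuation does not stick to words.
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})

	// Count how often each word occurs in the text.
	counts := make(map[string]int)
	for _, w := range words {
		counts[w]++
	}

	best, bestHits := "", 0
	for lang, list := range stopwords {
		hits := 0
		for _, sw := range list {
			hits += counts[sw]
		}
		// Break ties by language code so the result does not depend on map order.
		if hits > bestHits || (hits == bestHits && hits > 0 && lang < best) {
			best, bestHits = lang, hits
		}
	}
	return best
}

func main() {
	text := "Я получил место младшего админа в отделе, и в обязанности входили проверки."
	fmt.Println(detectLanguage(text, sampleStopwords)) // prints "ru"
}
```

This sketch splits the text into words, so it only works for languages that separate words with spaces or punctuation. Chinese text would need a different matching strategy, for example counting stopword substrings instead of whole words.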
Hello there, I ran into this problem too. The page I need to extract is in Chinese. Do you have any plan to fix this, or is there a quick way to work around it, assuming I know the page is always in Chinese?