GoOse Not working with non-latin symbols

Open max107 opened this issue 11 years ago • 4 comments

article.CleanedText is always empty.

May 25 '14 16:05 max107

Should be fixed now (for Russian, Arabic and Chinese). If you want to recognize more languages, you have to add stopwords for them to the stopwords.go file.

Jul 23 '14 08:07 advancedlogic

Test:

package main

import (
    "github.com/advancedlogic/GoOse"
)

func main() {
    g := goose.New()
    article := g.ExtractFromUrl("http://habrahabr.ru/post/230885/")
    println("title", article.Title)
    println("description", article.MetaDescription)
    println("keywords", article.MetaKeywords)
    println("content", article.CleanedText)
    println("url", article.FinalUrl)
    println("top image", article.TopImage)
}

Result:

title Переезд в Лондон. Продолжение / Хабрахабр
description Часть 1

Моя новая работа была так себе. Я получил место младшего админа в отделе Windows Support. В обязанности входили дневные проверки, реагирование на алерты и первичная обработка запросов...
keywords великобритания, лондон
content
url http://habrahabr.ru/post/230885/
top image

Jul 24 '14 09:07 max107

The problem is that sometimes the page has no language metadata or reference. I will add a language detector based on stopword statistics.
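A minimal sketch of what a stopword-based detector could look like, for illustration only. This is not GoOse's actual implementation: the `stopwords` table and the `detectLanguage` function are hypothetical names, and the word lists are tiny samples. It assumes words are separated by non-letter characters, so a language written without spaces, such as Chinese, would need a different tokenizer.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// stopwords is a tiny, hypothetical sample keyed by language code.
// A real table would hold far more words per language.
var stopwords = map[string]map[string]bool{
	"en": {"the": true, "and": true, "is": true, "of": true, "to": true, "in": true},
	"ru": {"и": true, "в": true, "не": true, "на": true, "что": true, "я": true},
}

// detectLanguage returns the language code whose stopwords occur most
// often in text, or "" when no known stopword is found.
func detectLanguage(text string) string {
	// Lowercase the text and split it on anything that is not a letter.
	words := strings.FieldsFunc(strings.ToLower(text), func(r rune) bool {
		return !unicode.IsLetter(r)
	})

	// Count stopword hits per language.
	counts := make(map[string]int)
	for _, w := range words {
		for lang, set := range stopwords {
			if set[w] {
				counts[lang]++
			}
		}
	}

	// Pick the language with the most hits; break ties by language code
	// so the result does not depend on map iteration order.
	best, bestCount := "", 0
	for lang, n := range counts {
		if n > bestCount || (n == bestCount && lang < best) {
			best, bestCount = lang, n
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguage("Я не знаю, что и как"))
	fmt.Println(detectLanguage("The cat is in the house"))
}
```

The detected code could then be used to choose the stopword list when the page itself declares no language.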

Jul 28 '14 09:07 advancedlogic

Hello there, I ran into this problem too. The page I need to extract is in Chinese. Do you have any plan to fix this, or is there a quick workaround, given that I know the page is always in Chinese?

Aug 09 '15 12:08 zx9597446
