
Not working with non-Latin symbols

Open max107 opened this issue 11 years ago • 4 comments

article.CleanedText is always empty.

max107 avatar May 25 '14 16:05 max107

Should be fixed now (for Russian, Arabic and Chinese). If you want to recognize more languages, you have to add their stopwords to the stopwords.go file.

advancedlogic avatar Jul 23 '14 08:07 advancedlogic

Test:

package main

import (
    "github.com/advancedlogic/GoOse"
)

func main() {
    g := goose.New()
    article := g.ExtractFromUrl("http://habrahabr.ru/post/230885/")
    println("title", article.Title)
    println("description", article.MetaDescription)
    println("keywords", article.MetaKeywords)
    println("content", article.CleanedText)
    println("url", article.FinalUrl)
    println("top image", article.TopImage)
}

Result:

title Переезд в Лондон. Продолжение / Хабрахабр
description Часть 1

Моя новая работа была так себе. Я получил место младшего админа в отделе Windows Support. В обязанности входили дневные проверки, реагирование на алерты и первичная обработка запросов...
keywords великобритания, лондон
content
url http://habrahabr.ru/post/230885/
top image

max107 avatar Jul 24 '14 09:07 max107

The problem is that the page sometimes has no language metadata or other language reference. I will add a language detector based on stopword statistics.
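
For illustration, a stopword-based detector could look roughly like the sketch below. This is not GoOse's actual implementation: the names `language`, `languages` and `detectLanguage` are hypothetical, and the stopword lists are tiny samples made up for the example. It counts how many stopwords of each language occur in the text and picks the language with the highest count. Chinese is not separated by spaces, so it is matched by substring rather than by word.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// language describes one candidate language (hypothetical type, not part of GoOse).
type language struct {
	code      string
	stopwords []string
	segmented bool // true when words are separated by spaces or punctuation
}

// languages holds tiny sample lists; a real detector would need far larger ones.
var languages = []language{
	{"en", []string{"the", "and", "of", "to", "in", "is", "that", "was"}, true},
	{"ru", []string{"и", "в", "не", "на", "что", "я", "была", "так"}, true},
	{"zh", []string{"的", "了", "是", "在", "我", "有", "和", "不"}, false},
}

// detectLanguage returns the code of the language whose stopwords occur
// most often in text, or "" when no stopword matches at all.
func detectLanguage(text string) string {
	lower := strings.ToLower(text)

	// Split on anything that is not a letter and count each word.
	words := strings.FieldsFunc(lower, func(r rune) bool {
		return !unicode.IsLetter(r)
	})
	counts := make(map[string]int, len(words))
	for _, w := range words {
		counts[w]++
	}

	best, bestScore := "", 0
	for _, lang := range languages {
		score := 0
		for _, sw := range lang.stopwords {
			if lang.segmented {
				score += counts[sw] // whole-word matches only
			} else {
				score += strings.Count(lower, sw) // substring matches
			}
		}
		// Strictly greater, so ties go to the earlier entry in languages.
		if score > bestScore {
			best, bestScore = lang.code, score
		}
	}
	return best
}

func main() {
	fmt.Println(detectLanguage("Моя новая работа была так себе."))
	fmt.Println(detectLanguage("我的新工作是在伦敦"))
}
```

The raw counts are not normalized by text length or list size, so word-based and substring-based scores are only roughly comparable. That is acceptable for a sketch, but a real detector would probably want some normalization.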

advancedlogic avatar Jul 28 '14 09:07 advancedlogic

Hello there, I ran into this problem too. The page I need to extract is in Chinese. Do you have any plans to fix this, or is there a quick workaround, assuming I know the page is always in Chinese?

zx9597446 avatar Aug 09 '15 12:08 zx9597446