gse icon indicating copy to clipboard operation
gse copied to clipboard

feature: any plan on implementing the bm25 algorithm ?

Open CocaineCong opened this issue 1 year ago • 2 comments

Description

I see the bm25.go file in path(hmm/bm25/bm25.go), so I wanna ask author any plan on bm25 ? 😃

If author had the plan on implementing the bm25, I want to make it. 🫡

image

CocaineCong avatar Oct 30 '23 14:10 CocaineCong

Emm, A lot of features are planned, but my time is limited, and there any contributions welcome.

vcaesar avatar Oct 30 '23 15:10 vcaesar

Hey, @vcaesar there is my plan on the bm25 algorithm. 🧑🏻‍💻

image

First, in the file pathhmm/idf/tag_extracker.go, I think the struct of Segment should be extracted to the new file.

type Segment struct {
	text   string
	weight float64
}

because this structure of segment should be in the state of being called at all times. if in tag_extracker, maybe there will be have some cycle import trouble. 🥲

Secondly, still in the file hmm/idf/tag_extracker.go , the struct of TagExtracter , the field of Idf should be abstracted into a relevance computation to implementing more algorithm such as IDF, TFIDF, BM25 .

before:

type TagExtracter struct {
	seg gse.Segmenter

	Idf      *Idf
	stopWord *StopWord
}

after:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
	stopWord *StopWord
}

what's more, the field of stopWord should in the struct of Relevance. since this field is used in the word splitting, it should be used in the struct of relevance.

before:

type TagExtracter struct {
	seg gse.Segmenter
	// calculate weight by Relevance(including IDF,TF-IDF,BM25 and so on)
	Relevance relevance.Relevance
	stopWord *StopWord
}

after:

type IDF struct {
	median float64
	freqs []float64
	Base
}

type BM25 struct {
	K1 float64
	N float64
	Base
}

type Base struct {
	// loading some stop words
	StopWord *stop_word.StopWord

	// loading segmenter for cut word
	Seg gse.Segmenter
}

And then, I want to implementing the Relevance by Strategy Pattern.

such as :

// Relevance easily scalable Relevance calculations (for idf, tf-idf, bm25 and so on)
type Relevance interface {
	// AddToken add text, frequency, position on obj
	AddToken(text string, freq float64, pos ...string) error

	// LoadDict load file from incoming parameters,
	// if incoming params no exist, will load file from default file path
	LoadDict(files ...string) error

	// LoadDictStr loading dict file by file path
	LoadDictStr(pathStr string) error

	// LoadStopWord loading word file by filename
	LoadStopWord(fileName ...string) error

	// Freq find the frequency, position, existence information of the key
	Freq(key string) (float64, string, bool)

	// TotalFreq the total number of tokens in the dictionary
	TotalFreq() float64

	// FreqMap get frequency map
	// key: word, value: frequency
	FreqMap(text string) map[string]float64

	// ConstructSeg return the segment with weight
	ConstructSeg(text string) segment.Segments
}

default IDF:

func NewIDF() Relevance {
	idf := &IDF{
		freqs: make([]float64, 0),
	}
	idf.StopWord = stop_word.NewStopWord()
	return Relevance(idf)
}

implement the interface function

// AddToken add a new word with IDF into the dictionary.
func (i *IDF) AddToken(text string, freq float64, pos ...string) error {
	err := i.Seg.AddToken(text, freq, pos...)

	i.freqs = append(i.freqs, freq)
	sort.Float64s(i.freqs)
	i.median = i.freqs[len(i.freqs)/2]
	return err
}

// LoadDict load the idf dictionary
func (i *IDF) LoadDict(files ...string) error {
	if len(files) <= 0 {
		files = i.Seg.GetIdfPath(files...)
	}

	return i.Seg.LoadDict(files...)
}

// Freq return the IDF of the word
func (i *IDF) Freq(key string) (float64, string, bool) {
	return i.Seg.Find(key)
}

....
image

any problem about that? if no big problem, I'll pr bit by bit to make it possible. 🫡

  • [x] extracting stop word
  • [x] extracting Segment with weight
  • [x] extracting extracker
  • [x] converging idf,bm25 in relevance
  • [x] stop word should work and in relevance algorithm struct
  • [x] relevance algorithm implement by Strategy Pattern

CocaineCong avatar Nov 01 '23 16:11 CocaineCong