
Split types to optimize IDF?

Open yohanboniface opened this issue 10 years ago • 1 comments

I've just had a very interesting discussion with Julien, the guy who develops JDONREF. We compared how each of us handles some of the specifics of searching for addresses.

One key point that came out, and that may be interesting for us to test, is that he splits the data by type so that IDF is computed type by type: housenumbers, streets, places, etc. So a name that is very common in streets can be rare in places, and then get a better IDF score there. For example, in France we have a village named "Rue", which is also the French word for "street". This would also have an impact on searches for big cities like "Paris", whose names also appear in many streets: "rue de Paris", "boulevard de Paris", etc.
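To make the idea concrete, here is a minimal sketch comparing a single global IDF with an IDF computed per type. The toy corpus, the function names and the smoothed formula `log((N + 1) / (df + 1)) + 1` are all assumptions for illustration; this is not what JDONREF or photon actually compute.

```python
import math

# Hypothetical toy corpus of (type, name) pairs, for illustration only.
DOCS = [
    ("street", "rue de paris"),
    ("street", "boulevard de paris"),
    ("street", "rue de la gare"),
    ("street", "rue victor hugo"),
    ("place", "paris"),
    ("place", "rue"),
    ("place", "lyon"),
    ("place", "marseille"),
]


def idf(term, names):
    """Smoothed IDF of `term` over a list of names (assumed formula)."""
    num_docs = len(names)
    doc_freq = sum(1 for name in names if term in name.split())
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1


def global_idf(term):
    """IDF computed over every document, whatever its type."""
    return idf(term, [name for _, name in DOCS])


def typed_idf(term, doc_type):
    """IDF computed only over the documents of one type."""
    return idf(term, [name for t, name in DOCS if t == doc_type])


for term in ("rue", "paris"):
    print(
        term,
        "global=%.3f" % global_idf(term),
        "street=%.3f" % typed_idf(term, "street"),
        "place=%.3f" % typed_idf(term, "place"),
    )
```

In this toy data, "rue" is frequent among streets and rare among places, so its place-only IDF is higher than both its global IDF and its street-only IDF, which is the effect described above.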

So that may be a candidate for our next sprint :)

yohanboniface, Dec 12 '14 11:12

When we were still using the Solr implementation, we used a separate IDF for every field. With this approach, more than 80% of all search bugs were related to an IDF going crazy on one field. Let's say there are 1000 addresses that have "New York" in their name. Sometimes stuff was mapped incorrectly, and maybe "New York" ended up in `housenumber`. That address then popped up in first place because its IDF was very high.

But sure, if the data were perfect this would be the better and more powerful solution, and maybe there is a better way to circumvent those cases than using a collector field.
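A rough sketch of that failure mode, with made-up numbers: the index size, the smoothed IDF formula and the assumption that the mis-mapped address is a separate document are all hypothetical, and this is not the exact Solr scoring.

```python
import math


def idf(doc_freq, num_docs):
    """Smoothed IDF (assumed formula, not the exact Solr/Lucene one)."""
    return math.log((num_docs + 1) / (doc_freq + 1)) + 1


NUM_DOCS = 100_000     # hypothetical index size
NAME_DF = 1000         # addresses with the term in `name`
HOUSENUMBER_DF = 1     # one mis-mapped address with the term in `housenumber`

# Per-field IDF: the single bad `housenumber` value looks extremely rare.
idf_name = idf(NAME_DF, NUM_DOCS)
idf_housenumber = idf(HOUSENUMBER_DF, NUM_DOCS)

# Collector field: all fields share one document frequency, so the outlier
# is absorbed (assumes the mis-mapped address is not among the 1000 others).
idf_collector = idf(NAME_DF + HOUSENUMBER_DF, NUM_DOCS)

print("name:        %.2f" % idf_name)
print("housenumber: %.2f" % idf_housenumber)
print("collector:   %.2f" % idf_collector)
```

With per-field statistics the mis-mapped address gets roughly twice the IDF of the correct ones, while with a shared collector field the term weighs about the same wherever it was indexed.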

christophlingg, Dec 12 '14 12:12