photon
Split types to optimize IDF?
I've just had a very interesting discussion with Julien, the guy who develops JDONREF. We compared how each of us handles some of the specifics of searching for addresses.
One key point that came out of it, and that may be interesting for us to test, is that he splits the index by type so that IDF is computed type by type: housenumber, streets, places, etc. That way, a name that is very common in streets can be rare in places, and so get a better IDF score there. For example, in France we have a village named "Rue", which is also the French word for "street". This would also have an impact on searches for big cities like "Paris", whose names also appear in many streets: "rue de Paris", "boulevard de Paris", etc.
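To make the idea concrete, here is a minimal sketch on a toy corpus. It is not JDONREF's or photon's actual scoring: the smoothed formula `log(N / (1 + df)) + 1`, the sample data, and the helper names `global_idf` and `per_type_idf` are all assumptions made for illustration.

```python
import math
from collections import Counter

def idf(num_docs, doc_freq):
    # Textbook smoothed IDF, assumed here for illustration only.
    return math.log(num_docs / (1 + doc_freq)) + 1

# Hypothetical toy corpus of (type, name) pairs.
docs = [
    ("street", "rue de paris"),
    ("street", "rue victor hugo"),
    ("street", "rue de la gare"),
    ("street", "boulevard de paris"),
    ("place", "rue"),
    ("place", "paris"),
    ("place", "lyon"),
    ("place", "amiens"),
]

def doc_freqs(documents):
    # Number of documents in which each term appears at least once.
    df = Counter()
    for _, name in documents:
        df.update(set(name.split()))
    return df

def global_idf(term):
    # One IDF computed over every document, whatever its type.
    return idf(len(docs), doc_freqs(docs)[term])

def per_type_idf(term, doc_type):
    # IDF computed only over the documents of one type.
    subset = [d for d in docs if d[0] == doc_type]
    return idf(len(subset), doc_freqs(subset)[term])

for term in ("rue", "paris"):
    print(term,
          "global=%.2f" % global_idf(term),
          "street=%.2f" % per_type_idf(term, "street"),
          "place=%.2f" % per_type_idf(term, "place"))
```

In this toy data, "rue" gets a higher IDF among places than among streets, which is the effect described above: the village "Rue" is no longer drowned out by every "rue ..." street.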
So that may be a candidate for our next sprint :)
When we were still using the Solr implementation, we had a separate IDF for every field. With that approach, more than 80% of all search bugs were related to an IDF going crazy on one field. Let's say there are 1000 addresses that have `new york` in their name. Sometimes data was mapped incorrectly, and maybe `new york` ended up in `housenumber`. That address then popped up in first place because its IDF was very high.
But sure, if the data were perfect, this would be the better and more powerful solution, and maybe there is a better way to circumvent those cases than using a collector field.
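A rough sketch of that failure mode, with made-up numbers: the index size, the document frequencies, and the smoothed formula `log(N / (1 + df)) + 1` are assumptions for illustration, not values taken from the old Solr setup.

```python
import math

def idf(num_docs, doc_freq):
    # Textbook smoothed IDF, assumed here for illustration only.
    return math.log(num_docs / (1 + doc_freq)) + 1

NUM_DOCS = 100_000  # hypothetical index size

# Hypothetical document frequencies of the term "york" per field:
# 1000 correct names, plus one address wrongly mapped to housenumber.
df = {
    "name": {"york": 1000},
    "housenumber": {"york": 1},
}

# Separate IDF per field: the single bad housenumber looks extremely rare.
per_field = {field: idf(NUM_DOCS, terms["york"]) for field, terms in df.items()}

# Collector field: all fields share one document frequency.
combined = idf(NUM_DOCS, sum(terms["york"] for terms in df.values()))

print("name IDF:        %.2f" % per_field["name"])
print("housenumber IDF: %.2f" % per_field["housenumber"])
print("collector IDF:   %.2f" % combined)
```

With per-field statistics, the one mis-mapped document gets roughly twice the weight of a correct match on the name, so it floats to the top. With a shared collector field, the stray entry barely changes the statistics.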