LBeaudoux

Results 26 comments of LBeaudoux

> (1) For people who intend to translate or to find sentences with most translations could be useful, these sentences could be sometimes among the most popular/universal or the most...

> this might affect French as well

In French, these duplicates are quite common and most often due to the use or non-use of a (narrow) non-breaking space before the...
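As an illustration of the kind of duplicate described here, the following is a minimal sketch, not Tatoeba's actual deduplication logic. It assumes that the affected characters are the plain space, the no-break space (U+00A0) and the narrow no-break space (U+202F), and that they matter only when placed directly before `?`, `!`, `:` or `;`. The function names `dedup_key` and `duplicate_groups` are hypothetical.

```python
import re

# Assumption: these space variants, placed right before high punctuation,
# are the only difference between the near-duplicate sentences.
_SPACE_BEFORE_PUNCT = re.compile(r"[ \u00a0\u202f]+(?=[?!:;])")


def dedup_key(sentence: str) -> str:
    """Drop any (no-break) space that directly precedes ? ! : or ;."""
    return _SPACE_BEFORE_PUNCT.sub("", sentence)


def duplicate_groups(sentences):
    """Group sentences that become identical once the spacing is ignored."""
    groups = {}
    for sentence in sentences:
        groups.setdefault(dedup_key(sentence), []).append(sentence)
    # Keep only the groups that contain more than one spelling.
    return [group for group in groups.values() if len(group) > 1]


if __name__ == "__main__":
    sample = [
        "Comment ça va\u202f?",  # narrow no-break space
        "Comment ça va ?",       # plain space
        "Comment ça va?",        # no space
        "Bonjour.",
    ]
    for group in duplicate_groups(sample):
        print(group)
```

The key is used only for comparison; the original sentences are left untouched, so contributors' typographic choices are preserved.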

Such a leaderboard could be interesting. Here is what a top 40 might look like for the year 2021:

| contributor | language code | percentage unoriginal sentences | total...

> could you provide some examples of searches that result in what you would consider "a long list of very similar sentences that look like they were generated by a...

> I wonder if there could be a way to cluster search results by sentence similarity

I agree that this would be a clever way to increase the diversity of...

After experimenting with various approaches, I finally settled on a method that aims at clustering [semantically related sentences](https://en.wikipedia.org/wiki/Semantic_similarity) in addition to paraphrases, and can be generalized to all languages.

###...

By identifying clusters with sentence IDs, we can use the ID of a sentence added between two clustering runs as its cluster ID without the risk of interfering. I don't...
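The idea can be sketched as follows, under the assumption that each cluster is identified by the ID of one of its member sentences. The mapping `cluster_of` and the function `cluster_id_for` are hypothetical names used for illustration only; the actual storage scheme is not described in this excerpt.

```python
def cluster_id_for(sentence_id: int, cluster_of: dict) -> int:
    """Return the cluster ID of a sentence.

    `cluster_of` maps sentence IDs seen by the last clustering run to
    their cluster ID (assumed to be the ID of a member sentence).
    A sentence added after that run is not in the mapping, so it falls
    back to its own ID and forms a singleton cluster.
    """
    return cluster_of.get(sentence_id, sentence_id)


if __name__ == "__main__":
    # Result of a hypothetical clustering run: sentences 101 and 102
    # share a cluster, sentence 205 is alone.
    cluster_of = {101: 101, 102: 101, 205: 205}

    print(cluster_id_for(102, cluster_of))  # clustered sentence
    print(cluster_id_for(300, cluster_of))  # sentence added after the run
```

Because sentence IDs are unique and a new sentence cannot already belong to a cluster, its own ID cannot collide with a cluster ID produced by the previous run.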

> let’s be careful and take contributors’ input before going into production.

I agree. We should also allow users to disable the clustering feature.

> can you give us some...

@jiru It's been almost a year since I put my Tatoeba-related projects on the back burner. Thank you for reminding me of my past commitments. I'd like to take some...

I also encountered this cluttering issue when cleaning the Tatoeba [search log](https://downloads.tatoeba.org/stats/) for [Tatominer](https://tatominer.netlify.app). I fixed it by building a lexicon for each supported language: when a search query can...