
French à after abbreviation / Min. abbreviations

Open coezbek opened this issue 3 years ago • 0 comments

I noticed that à seems not to be detected as a lower-case letter after an abbreviation:

    assert_equal 1, segment("85,7 cm (33 3/4 po) min. à 88,9 cm (35 po) max.", 'fr').size # Fails

Digging into this, there seem to be two separate issues: min. seems to be missing as an abbreviation in French (it might be common in many languages), and the à is also not handled in English:

    assert_equal 1, segment("33-3/4” (85.7 cm) min. to 35” (88.9 cm) max.", 'en').size # Works
    assert_equal 1, segment("33-3/4” (85.7 cm) min. to 35” (88.9 cm) max.", 'fr').size # Fails
    assert_equal 1, segment("85,7 cm (33 3/4 po) min. à 88,9 cm (35 po) max.", 'en').size # Fails
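
One way the second issue could arise is a lower-case check that only covers ASCII letters. The sketch below is an assumption for illustration only, not pragmatic_segmenter's actual code: `continues_sentence?`, `ASCII_LOWER` and `UNICODE_LOWER` are hypothetical names. It shows how an ASCII-only pattern rejects à while a Unicode-aware one accepts it.

```ruby
# Hypothetical illustration, NOT taken from pragmatic_segmenter.
# Idea: after an abbreviation such as "min.", a following lower-case
# word suggests the sentence continues rather than ends.

ASCII_LOWER   = /\A[a-z]/    # only matches the 26 ASCII lower-case letters
UNICODE_LOWER = /\A\p{Ll}/   # matches any Unicode lower-case letter

# Returns true when next_word starts with a lower-case letter
# according to the given pattern.
def continues_sentence?(next_word, pattern)
  next_word.match?(pattern)
end

puts continues_sentence?("to", ASCII_LOWER)    # ASCII word, ASCII pattern
puts continues_sentence?("à",  ASCII_LOWER)    # accented word, ASCII pattern
puts continues_sentence?("à",  UNICODE_LOWER)  # accented word, Unicode pattern
```

If the segmenter relies on something like the ASCII-only pattern, "min. to" would be kept together while "min. à" would be split, which matches the results above.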

As a test for the test suite:

    it "French à after abbreviation" do
      sentence = "85,7 cm (33 3/4 po) min. à 88,9 cm (35 po) max."
      ps = PragmaticSegmenter::Segmenter.new(text: sentence, language: "fr")
      expect(ps.segment).to eq([sentence])
    end

coezbek, Jan 06 '22 12:01