
line breaks

Open · arademaker opened this issue 4 years ago • 1 comment

It seems that line breaks confuse Punkt.

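import Data.Text (pack)
import NLP.Punkt (split_sentences)

-- Read one file and print each sentence that Punkt finds in it.
segment_test :: FilePath -> IO ()
segment_test file = do
  content <- readFile file
  mapM_ print (split_sentences $ pack content)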

Passing the content of https://github.com/cpdoc/dhbb-nlp/blob/master/raw/1.raw to this function returns the first two sentences as a single segment:

λ> segment_test "/Users/ar/work/cpdoc/dhbb-nlp/raw/1.raw"
"\171Jos\233 Machado Coelho de Castro\187 nasceu em Lorena (SP).\nEstudou no Gin\225sio Diocesano de S\227o Paulo e bacharelou-se em 1910 pela Faculdade de Ci\234ncias Jur\237dicas e Sociais."
...

I would expect the line break to separate them:

λ> segment_test "/Users/ar/work/cpdoc/dhbb-nlp/raw/1.raw"
"\171Jos\233 Machado Coelho de Castro\187 nasceu em Lorena (SP)."
"Estudou no Gin\225sio Diocesano de S\227o Paulo e bacharelou-se em 1910 pela Faculdade de Ci\234ncias Jur\237dicas e Sociais."
...

arademaker, Feb 23 '21 01:02

Actually, after more tests, it looks like the error is caused by the closing parenthesis at the end of the sentence, not by the line break itself. Any clue about how to solve it? Can you point me to where I should investigate? Maybe I can send a PR.

arademaker, Feb 23 '21 18:02
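
Until the handling of a closing parenthesis is tracked down, one possible workaround is to split the input on line breaks first and run the sentence splitter on each line separately. The sketch below is not a fix to Punkt and it assumes that every line break in the raw file really does mark a sentence boundary, which holds for the excerpt above but may not hold for other inputs. `splitPerLine` is a hypothetical helper, not part of the library; it takes the splitter as an argument so it does not depend on any particular Punkt internals.

```haskell
import Data.Text (Text)
import qualified Data.Text as T

-- | Hypothetical helper (not part of punkt): apply a sentence splitter to
-- each non-empty line on its own, so a line break always ends a segment.
-- Assumes line breaks in the input coincide with sentence boundaries.
splitPerLine :: (Text -> [Text]) -> Text -> [Text]
splitPerLine splitter =
  concatMap splitter        -- run the splitter on every remaining line
    . filter (not . T.null) -- drop blank lines
    . map T.strip           -- trim surrounding whitespace
    . T.lines               -- break the input at '\n'
```

In `segment_test`, this would be used as `mapM_ print (splitPerLine split_sentences (pack content))`. A sentence that ends in `(SP).` in the middle of a line would still be affected, so this only hides the symptom for line-per-sentence files.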