punkt
line breaks
It seems that line breaks confuse Punkt.
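import Data.Text (pack)
import NLP.Punkt (split_sentences)

-- Read a file and print each sentence found by Punkt.
segment_test :: FilePath -> IO ()
segment_test file = do
  content <- readFile file
  mapM_ print (split_sentences $ pack content)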
The content of https://github.com/cpdoc/dhbb-nlp/blob/master/raw/1.raw is passed to split_sentences:
λ> segment_test "/Users/ar/work/cpdoc/dhbb-nlp/raw/1.raw"
"\171Jos\233 Machado Coelho de Castro\187 nasceu em Lorena (SP).\nEstudou no Gin\225sio Diocesano de S\227o Paulo e bacharelou-se em 1910 pela Faculdade de Ci\234ncias Jur\237dicas e Sociais."
...
I would expect:
λ> segment_test "/Users/ar/work/cpdoc/dhbb-nlp/raw/1.raw"
"\171Jos\233 Machado Coelho de Castro\187 nasceu em Lorena (SP)."
"Estudou no Gin\225sio Diocesano de S\227o Paulo e bacharelou-se em 1910 pela Faculdade de Ci\234ncias Jur\237dicas e Sociais."
...
Actually, after more tests, it looks like the error is caused by the closing parenthesis right before the period that ends the sentence, not by the line break itself. Any clue about how to solve it? Can you point me to where I should investigate? Maybe I can send a PR.
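To narrow this down, something like the following could be used to compare the two suspects in isolation. This is only a sketch: it assumes the punkt package is installed and that split_sentences is imported from NLP.Punkt, and the three inputs are shortened, illustrative variants of the sentences above rather than the real file content.

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main (main) where

import Data.Text (Text)
-- Assumption: split_sentences comes from the punkt package's NLP.Punkt module.
import NLP.Punkt (split_sentences)

-- Shortened, illustrative variants of the same two sentences. They differ
-- only in whether a closing parenthesis precedes the period and in whether
-- the sentences are separated by a newline or a space.
variants :: [(String, Text)]
variants =
  [ ("paren + newline",    "Nasceu em Lorena (SP).\nEstudou no Ginásio Diocesano.")
  , ("paren + space",      "Nasceu em Lorena (SP). Estudou no Ginásio Diocesano.")
  , ("no paren + newline", "Nasceu em Lorena.\nEstudou no Ginásio Diocesano.")
  ]

-- Print the label of each variant followed by the segments Punkt returns.
main :: IO ()
main = mapM_ report variants
  where
    report (label, input) = do
      putStrLn (label ++ ":")
      mapM_ print (split_sentences input)
```

If only the variants with "(SP)." come back as a single segment, that would point at how the closing parenthesis is tokenized rather than at the line break.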