cdec
tokenize-anything.sh does not preserve number of lines
I've observed that corpus/tokenize-anything.sh does not necessarily preserve the number of lines on the files it is fed. I'll try and isolate the problem and give more details.
Thanks for the heads up. If you find a repro case, I'll fix it. Ideally by just junking the whole thing...
On Mon, Dec 2, 2013 at 1:42 PM, Edward Grefenstette <[email protected]> wrote:
I've observed that corpus/tokenize-anything.sh does not necessarily preserve the number of lines on the files it is fed. I'll try and isolate the problem and give more details.
Reply to this email directly or view it on GitHub: https://github.com/redpony/cdec/issues/31
I have a repro case on my computer. I'll see which lines are causing trouble when I get back from dinner.
Okay, the problem seems to be with rogue DOS carriage return characters (\r) in the text rather than with tokenize-anything.sh. I think wc doesn't spot them but your script converts them to \n.
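For anyone who hits the same mismatch, here is a rough Python sketch for checking a file before tokenizing it. It assumes, based only on what was observed in this thread, that each \r ends up as one extra \n in the output; I have not verified how the script actually handles them. The function names and the sample bytes are made up for illustration.

```python
# Sketch for spotting stray carriage returns that can change line counts.
# Assumption (from this thread, not verified against the script): every
# carriage return byte becomes one additional newline after tokenization.

def count_newlines(data: bytes) -> int:
    # wc -l counts newline bytes only, so this mirrors what wc reports.
    return data.count(b"\n")

def count_carriage_returns(data: bytes) -> int:
    # Carriage returns are invisible to wc -l.
    return data.count(b"\r")

def lines_after_cr_conversion(data: bytes) -> int:
    # Expected line count if each carriage return were turned into a newline.
    return count_newlines(data) + count_carriage_returns(data)

def strip_carriage_returns(data: bytes) -> bytes:
    # One possible cleanup before tokenizing: drop carriage returns entirely.
    return data.replace(b"\r", b"")

# Hypothetical sample: three newline-terminated lines, one stray carriage return.
sample = b"first line\nsecond\rhalf\nthird line\n"

print(count_newlines(sample))             # what wc -l would report
print(lines_after_cr_conversion(sample))  # count under the assumption above
print(count_newlines(strip_carriage_returns(sample)))
```

If the first two numbers differ, the input contains carriage returns, and stripping them first should keep the line count stable under that assumption.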