tokenize-anything.sh does not preserve number of lines

Open egrefen opened this issue 11 years ago • 3 comments

I've observed that corpus/tokenize-anything.sh does not necessarily preserve the number of lines on the files it is fed. I'll try and isolate the problem and give more details.

egrefen, Dec 02 '13 18:12

Thanks for the heads up. If you find a repro case, I'll fix it. Ideally by just junking the whole thing...

Reply to this email directly or view it on GitHub: https://github.com/redpony/cdec/issues/31

redpony, Dec 02 '13 19:12

I have a repro case on my computer. Will see which lines are causing trouble when I get back from dinner.

egrefen, Dec 02 '13 19:12

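Okay, the problem seems to be with rogue DOS carriage return characters (\r) in the text rather than with tokenize-anything.sh itself. I think wc doesn't count them as line breaks (wc -l only counts \n), but your script converts them to \n, so the output ends up with more lines than the input.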
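For anyone who wants to check their own files for the same thing, here is a minimal sketch in Python. It is not part of cdec, and the function names (`count_lf`, `bare_cr_lines`, `neutralize_cr`) are made up for illustration. It counts lines the way `wc -l` does, reports which lines contain a \r that is not part of a \r\n pair, and shows one possible cleanup that keeps the line count unchanged. The "simulated" step only mimics the behaviour described above (every \r becoming \n); it is an assumption, not the script's actual code.

```python
import re
from typing import List


def count_lf(data: bytes) -> int:
    """Count lines the way `wc -l` does: one per LF byte."""
    return data.count(b"\n")


def bare_cr_lines(data: bytes) -> List[int]:
    """Return 1-based line numbers (LF-delimited) holding a CR not followed by LF."""
    lines = []
    for match in re.finditer(rb"\r(?!\n)", data):
        lineno = data.count(b"\n", 0, match.start()) + 1
        if not lines or lines[-1] != lineno:
            lines.append(lineno)
    return lines


def neutralize_cr(data: bytes) -> bytes:
    """Hypothetical cleanup: CRLF becomes LF, any remaining bare CR becomes a space."""
    data = data.replace(b"\r\n", b"\n")
    return data.replace(b"\r", b" ")


# Three LF-terminated lines; the second one hides a bare CR.
SAMPLE = b"first line\nsecond\rstill second\nthird\n"

# Assumption: mimic a tool that turns every CR into LF.
SIMULATED = SAMPLE.replace(b"\r", b"\n")

if __name__ == "__main__":
    print("LF count before:", count_lf(SAMPLE))
    print("lines with bare CR:", bare_cr_lines(SAMPLE))
    print("LF count after CR->LF conversion:", count_lf(SIMULATED))
    print("LF count after cleanup:", count_lf(neutralize_cr(SAMPLE)))
```

Running the cleanup on both sides of a parallel corpus before tokenizing should keep the two files aligned, assuming the stray \r characters are the only source of the mismatch.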

egrefen, Dec 03 '13 12:12