
Add lowercase script?

Open mayhewsw opened this issue 5 years ago • 20 comments

Moses scripts included a useful lowercasing script. Are there any plans to add this?

mayhewsw avatar Jan 14 '20 14:01 mayhewsw

This is actually quite trivial in Python and on the command line; I'm not sure whether adding a lowercase script would be beneficial.

In Python:

s = "abc" 
s.lower()

On command line:

tr '[:upper:]' '[:lower:]' < in.txt > out.txt

But if more people vote +1 on the idea, it's not hard to implement and add it =)

alvations avatar Jan 15 '20 08:01 alvations

It's definitely easy to do (although I always have to google the command-line version), but I often find myself looking for a script to do it, and the original Moses had one.

mayhewsw avatar Jan 20 '20 03:01 mayhewsw

Caution with tr. Most versions are not Unicode compliant: https://stackoverflow.com/a/13383175/674487

noe avatar Jan 20 '20 08:01 noe

For what it's worth, tr (the version that ships with macOS 10.15.2) seems to work fine.

$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
он жил в москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
έζησε στη μόσχα.

mayhewsw avatar Jan 20 '20 17:01 mayhewsw

Not in GNU coreutils 8.28 (Ubuntu 18.04.3):

$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
Он жил в Москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.

noe avatar Jan 20 '20 19:01 noe

Interesting. Hmm, so is this feature worth implementing in the sacremoses CLI?

@noe's pointer to https://stackoverflow.com/questions/13381746/tr-upper-lower-with-cyrillic-text/13383175#13383175 is right; on Ubuntu:

$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.

$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/'
έζησε στη Μόσχα.
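
(Side note: \L is a GNU sed extension, and without the g flag sed only rewrites the first match, which is why Μόσχα keeps its capital Μ above. Adding g should lowercase the whole line:)

$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/g'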

alvations avatar Jan 23 '20 09:01 alvations

To me, having the lowercasing in sacremoses CLI would be useful because:

  • It would relieve me from googling for the correct perl/awk one-liner every time I need to do it.
  • It would provide a single place to fix context-dependent problems that probably shouldn't be handled in awk/perl, like those described here (e.g. a word-final Σ should be converted to ς instead of σ); see the quick check below.
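
For what it's worth, since Python 3.3 str.lower() applies the full Unicode case mappings, including the context-dependent final-sigma rule, so a Python-based lowercase command should get this right for free. A quick check (plain Python, nothing sacremoses-specific):

print("ΣΙΣΥΦΟΣ".lower())        # σισυφος  (word-final Σ becomes ς)
print("Ο ΣΙΣΥΦΟΣ ΖΕΙ".lower())  # ο σισυφος ζει  (ς appears only at word ends)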

noe avatar Jan 23 '20 10:01 noe

+1.

It would also be nice to chain operations, e.g.,

echo This is a test | sacremoses normalize [options] lowercase [options]...

mjpost avatar Mar 31 '20 19:03 mjpost

@mayhewsw @noe @mjpost No promises, but lowercase is low-hanging fruit. Let's see how far I get by the end of this week's sprint =)


@mjpost good idea on pipelining. Any other interfaces to follow? Can anyone point to a similar pipelining interface in a CLI? Maybe we should start with how we want to do it within Python first, then move to the CLI? (A rough sketch of one possible shape follows the references below.)

References:

  • https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
  • https://huggingface.co/transformers/main_classes/pipelines.html
  • https://spacy.io/usage/processing-pipelines
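
To make the Python-first idea concrete, here's a rough, hypothetical sketch; sacremoses has no Pipeline class, this is just one possible shape for the Python side before porting it to the CLI:

from sacremoses import MosesPunctNormalizer, MosesTokenizer

class Pipeline:
    """Chain text processors left to right, one line at a time."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, line):
        for step in self.steps:
            line = step(line)
        return line

mpn = MosesPunctNormalizer(lang="en")
mt = MosesTokenizer(lang="en")
pipe = Pipeline(
    mpn.normalize,
    lambda s: mt.tokenize(s, return_str=True),
    str.lower,
)
print(pipe("This is a test."))  # this is a test .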

alvations avatar Apr 13 '20 04:04 alvations

@mjpost Good news on chaining the commands for pipelining: https://click.palletsprojects.com/en/7.x/commands/#multi-command-pipelines =)

Gonna be a fun Tuesday tomorrow, implementing this!!

alvations avatar Apr 13 '20 12:04 alvations

Here are some updates on a POC of the pipeline: it seems like doing any simplistic stdin pipelining with click requires storing the full data in memory first. https://github.com/alvations/warppipe

I'm not sure how UNIX does it, but keeping stdin/stdout in memory might be painful when the corpus is rather huge. Currently, if we do the processing stepwise, streaming in and out, theoretically nothing would be kept in memory, but the I/O time is costly since we have to save the stdout somewhere.

Does anyone know how UNIX does streams and pipes? Any pointers?

alvations avatar Apr 14 '20 04:04 alvations

All the processors are generators. So at the top level you should be able to just pass one sentence at a time through each of them, right? I don’t see what about this requires you to load all the data (but I agree that you cannot do that!)
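
Something like this minimal sketch is what I have in mind (hypothetical step names, nothing click-specific): chained generators pull one line at a time from stdin, so memory stays constant regardless of corpus size.

import sys

def normalize(lines):
    # Placeholder step: strip trailing newlines.
    for line in lines:
        yield line.rstrip("\n")

def lowercase(lines):
    for line in lines:
        yield line.lower()

# Chain the generators: each line flows through the whole pipeline
# before the next one is read, so nothing is buffered.
for line in lowercase(normalize(sys.stdin)):
    print(line)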

mjpost avatar Apr 14 '20 12:04 mjpost

Maybe there's some usefulness in loading the whole dataset into memory instead of processing one sentence at a time. Empirically it seems to be a few seconds faster on a dataset that takes 20-30 seconds to tokenize. Maybe this should be an option too, --load-in-ram or something. Given that processing usually gets done on servers with much more RAM than plain-text data (nowadays), this isn't a problem.

From some playing around with the UNIX CLI, it looks like it's processing the full pipeline in chunks instead of performing the processes sequentially. I've got to look at this a little more carefully. https://linux.die.net/man/7/pipe
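
For the record, pipe(7) says the kernel gives each pipe a fixed-size buffer (64 KiB since Linux 2.6.11); all processes in a pipeline run concurrently, and a writer simply blocks when the buffer is full, so nothing is ever fully materialized. A quick way to see the streaming behavior (yes never terminates on its own, yet the pipeline finishes as soon as head has seen three lines):

$ yes "HELLO WORLD" | head -n 3 | tr '[:upper:]' '[:lower:]'
hello world
hello world
hello world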

P/S: I'm fixing the last issues with the kwargs handling in click, and then the pipeline feature should be good to go for a PR.

alvations avatar Apr 14 '20 22:04 alvations

Loading into RAM by default is a huge mistake. You are introducing a hardware constraint where there doesn’t need to be one. IMO it’s not worth the complexity to even permit preloading, just to save a few seconds. It doesn’t matter.

mjpost avatar Apr 14 '20 23:04 mjpost

With the pipeline feature out of the way, coming back to lowercase: any ideas/suggestions for what options one would need for sacremoses lowercase?

I guess with the pipeline's global options, the lowercase command would look something like this:

cat big.txt | sacremoses -j 4 -l en lowercase [OPTIONS] 
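
And chaining would presumably just be more subcommands on the same line (a hypothetical invocation; the exact options are still open):

cat big.txt | sacremoses -j 4 -l en normalize [OPTIONS] lowercase [OPTIONS] tokenize [OPTIONS] > big.processed.txt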

alvations avatar May 04 '20 01:05 alvations

I can't think of any options for lowercase. That looks good above.

mjpost avatar May 04 '20 12:05 mjpost

I wonder if a "reverse lowercase" option would be useful. Sometimes you want everything in upper case.

mayhewsw avatar May 04 '20 12:05 mayhewsw

@mayhewsw I can't think of a frequent NLP use case where everything needs to be uppercase. What did you have in mind?

bricksdont avatar May 06 '20 10:05 bricksdont

I agree that it's not frequent, but sometimes it's useful, and if the pipeline is already there, it shouldn't be hard to add w.upper(). One example: in this paper the authors wanted to create all-uppercase training data for NER robustness.

mayhewsw avatar May 06 '20 13:05 mayhewsw

There's something better coming up: upper, lower, and a surprise. But it'll take a couple of days to free myself up for some more coding and to finish up the feature =)

alvations avatar Jun 04 '20 00:06 alvations