
Add lowercase script?

Open mayhewsw opened this issue 5 years ago • 20 comments

Moses scripts included a useful lowercasing script. Are there any plans to add this?

mayhewsw avatar Jan 14 '20 14:01 mayhewsw

This is actually quite trivial in Python and on the command line; I'm not sure whether adding a lowercase script would be beneficial.

In Python:

s = "abc" 
s.lower()

On command line:

tr '[:upper:]' '[:lower:]' < in.txt > out.txt

But if more people vote +1 on the idea, it's not hard to implement and add it =)

alvations avatar Jan 15 '20 08:01 alvations

It's definitely easy to do (although I always have to google the command-line version), but I often find myself looking for a script to do it, and the original Moses had one.

mayhewsw avatar Jan 20 '20 03:01 mayhewsw

Caution with tr. Most versions are not Unicode compliant: https://stackoverflow.com/a/13383175/674487

noe avatar Jan 20 '20 08:01 noe

For what it's worth, tr (the version that ships with macOS 10.15.2) seems to work fine.

$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
он жил в москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
έζησε στη μόσχα.

mayhewsw avatar Jan 20 '20 17:01 mayhewsw

Not in GNU coreutils 8.28 (Ubuntu 18.04.3):

$ echo "He lived in Moscow." | tr [:upper:] [:lower:]
he lived in moscow.
$ echo "Он жил в Москве." | tr [:upper:] [:lower:]
Он жил в Москве.
$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.

noe avatar Jan 20 '20 19:01 noe

Interesting. Hmm, so is this feature worth implementing in the sacremoses CLI?

@noe's pointer to https://stackoverflow.com/questions/13381746/tr-upper-lower-with-cyrillic-text/13383175#13383175 is right; on Ubuntu:

$ echo "Έζησε στη Μόσχα." | tr [:upper:] [:lower:]
Έζησε στη Μόσχα.

$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/'
έζησε στη Μόσχα.
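
(Side note: \L is a GNU sed extension, and without the g flag sed only rewrites the first match, which is why Μόσχα keeps its capital Μ above. Adding g should lowercase the whole line:)

$ echo "Έζησε στη Μόσχα." | sed 's/[[:upper:]]*/\L&/g'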

alvations avatar Jan 23 '20 09:01 alvations

To me, having the lowercasing in sacremoses CLI would be useful because:

  • It would relieve me from googling for the correct perl/awk one-liner every time I need to do it.
  • It would provide a single place to fix context-dependent problems that probably shouldn't be handled in awk/perl, like those described here (e.g. a word-final Σ should be converted to ς instead of σ); see the quick check below.
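
For what it's worth, since Python 3.3 str.lower() applies the full Unicode case mappings, including the context-dependent final-sigma rule, so a Python-based lowercase command should get this right for free. A quick check (plain Python, nothing sacremoses-specific):

print("ΣΙΣΥΦΟΣ".lower())        # σισυφος  (word-final Σ becomes ς)
print("Ο ΣΙΣΥΦΟΣ ΖΕΙ".lower())  # ο σισυφος ζει  (ς appears only at word ends)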

noe avatar Jan 23 '20 10:01 noe

+1.

It would also be nice to chain operations, e.g.,

echo This is a test | sacremoses normalize [options] lowercase [options]...

mjpost avatar Mar 31 '20 19:03 mjpost

@mayhewsw @noe @mjpost No promises, but lowercase is low-hanging fruit. Let's see how far I get by the end of this week's sprint =)


@mjpost good idea on pipelining. Any other interfaces to follow? Can anyone point to a similar pipelining interface in a CLI? Maybe we should start with how we want to do it within Python first, then move to the CLI? (A rough sketch of one possible shape follows the references below.)

References:

  • https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.Pipeline.html
  • https://huggingface.co/transformers/main_classes/pipelines.html
  • https://spacy.io/usage/processing-pipelines
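
To make the Python-first idea concrete, here's a rough, hypothetical sketch; sacremoses has no Pipeline class, this is just one possible shape for the Python side before porting it to the CLI:

from sacremoses import MosesPunctNormalizer, MosesTokenizer

class Pipeline:
    """Chain text processors left to right, one line at a time."""

    def __init__(self, *steps):
        self.steps = steps

    def __call__(self, line):
        for step in self.steps:
            line = step(line)
        return line

mpn = MosesPunctNormalizer(lang="en")
mt = MosesTokenizer(lang="en")
pipe = Pipeline(
    mpn.normalize,
    lambda s: mt.tokenize(s, return_str=True),
    str.lower,
)
print(pipe("This is a test."))  # this is a test .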

alvations avatar Apr 13 '20 04:04 alvations

@mjpost Good news on chaining the commands for pipelining: https://click.palletsprojects.com/en/7.x/commands/#multi-command-pipelines =)

Gonna be a fun Tuesday tomorrow, implementing this!!

alvations avatar Apr 13 '20 12:04 alvations

Here are some updates on a POC of the pipeline: it seems like doing any simplistic stdin pipelining with click requires storing the full data in memory first. https://github.com/alvations/warppipe

I'm not sure how UNIX does it, but keeping stdin/stdout in memory might be painful when the corpus is rather huge. Currently, if we do the processing stepwise, streaming in and out, theoretically nothing would be kept in memory, but the I/O time is costly since we have to save the stdout somewhere.

Does anyone know how UNIX does streams and pipes? Any pointers?

alvations avatar Apr 14 '20 04:04 alvations

All the processors are generators. So at the top level you should be able to just pass one sentence at a time through each of them, right? I don’t see what about this requires you to load all the data (but I agree that you cannot do that!)
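
Something like this minimal sketch is what I have in mind (hypothetical step names, nothing click-specific): chained generators pull one line at a time from stdin, so memory stays constant regardless of corpus size.

import sys

def normalize(lines):
    # Placeholder step: strip trailing newlines.
    for line in lines:
        yield line.rstrip("\n")

def lowercase(lines):
    for line in lines:
        yield line.lower()

# Chain the generators: each line flows through the whole pipeline
# before the next one is read, so nothing is buffered.
for line in lowercase(normalize(sys.stdin)):
    print(line)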

mjpost avatar Apr 14 '20 12:04 mjpost

Maybe there's some usefulness in loading the whole dataset into memory instead of processing one sentence at a time. Empirically it seems to be a few seconds faster on a dataset that takes 20-30 seconds to tokenize. Maybe this should be an option too, --load-in-ram or something. Given that processing usually gets done on servers with much more RAM than plain-text data (nowadays), this isn't a problem.

From some playing around with the UNIX CLI, it looks like it's processing the full pipeline in chunks instead of performing the processes sequentially. I've got to look at this a little more carefully. https://linux.die.net/man/7/pipe
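
For the record, pipe(7) says the kernel gives each pipe a fixed-size buffer (64 KiB since Linux 2.6.11); all processes in a pipeline run concurrently, and a writer simply blocks when the buffer is full, so nothing is ever fully materialized. A quick way to see the streaming behavior (yes never terminates on its own, yet the pipeline finishes as soon as head has seen three lines):

$ yes "HELLO WORLD" | head -n 3 | tr '[:upper:]' '[:lower:]'
hello world
hello world
hello world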

P/S: I'm fixing the last issues with the kwargs handling in click, and then the pipeline feature should be good to go for a PR.

alvations avatar Apr 14 '20 22:04 alvations

Loading into RAM by default is a huge mistake. You are introducing a hardware constraint where there doesn’t need to be one. IMO it’s not worth the complexity to even permit preloading, just to save a few seconds. It doesn’t matter.

mjpost avatar Apr 14 '20 23:04 mjpost

With the pipeline feature out of the way, coming back to lowercase: any ideas/suggestions for what options one would need for sacremoses lowercase?

I guess with the pipeline's global options, the lowercase command would look something like this:

cat big.txt | sacremoses -j 4 -l en lowercase [OPTIONS] 
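
And chaining would presumably just be more subcommands on the same line (a hypothetical invocation; the exact options are still open):

cat big.txt | sacremoses -j 4 -l en normalize [OPTIONS] lowercase [OPTIONS] tokenize [OPTIONS] > big.processed.txt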

alvations avatar May 04 '20 01:05 alvations

I can't think of any options for lowercase. That looks good above.

mjpost avatar May 04 '20 12:05 mjpost

I wonder if a "reverse lowercase" option would be useful. Sometimes you want everything in upper case.

mayhewsw avatar May 04 '20 12:05 mayhewsw

@mayhewsw I can't think of a frequent NLP use case where everything needs to be uppercase. What did you have in mind?

bricksdont avatar May 06 '20 10:05 bricksdont

I agree that it's not frequent, but sometimes it's useful, and if the pipeline is already there, it shouldn't be hard to add w.upper(). One example: in this paper the authors wanted to create all-uppercase training data for NER robustness.

mayhewsw avatar May 06 '20 13:05 mayhewsw

There's something better coming up: upper, lower, and a surprise. But it'll take a couple of days to free myself up for some more coding and to finish up the feature =)

alvations avatar Jun 04 '20 00:06 alvations