
Modified preprocess.py to accept syllabic prediction...

Open dreavjr opened this issue 8 years ago • 6 comments

I'm playing with torch-rnn to do computational poetry (didn't Turing's own interest in AI start with that?!), and I found that the letter-by-letter predictor requires really huge corpora (e.g., Shakespeare) to even start making sense, while the word-by-word predictor has limitations of its own. The syllable predictor converges quickly to something that... sounds correct, even when it means nothing. It might be an interesting compromise between size of vocabulary and amount of context for other explorations. The syllabic separation is based on PyHyphen, which uses LibreOffice's hyphenation dictionaries.
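The compromise described above can be sketched in a few lines: tokenize the corpus into syllables rather than single characters or whole words. The `syllabify` below is a crude vowel-group stand-in so the sketch is self-contained; the actual patch uses PyHyphen's dictionary-based splitter, which is far more accurate.

```python
import re

def syllabify(word):
    # Naive stand-in for a real hyphenator: split before each vowel group.
    # NOT linguistically correct; the patch uses LibreOffice dictionaries.
    parts = re.findall(r"[^aeiou]*[aeiou]+(?:[^aeiou]*$)?", word.lower())
    return parts if parts else [word.lower()]  # vowel-less words stay whole

def syllable_tokens(text):
    # Token stream for the RNN: one token per syllable instead of per
    # character or per word.
    tokens = []
    for word in re.findall(r"[a-zA-Z']+", text):
        tokens.extend(syllabify(word))
    return tokens

print(syllabify("compare"))        # ['co', 'mpa', 're']
print(syllable_tokens("Shall I compare thee to a summer's day"))
```

The syllable vocabulary sits between the tiny character alphabet and the huge word vocabulary, while each token still carries more context than a single letter.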

dreavjr avatar Apr 11 '16 01:04 dreavjr

Sounds great! I'll check it out. It just might be the perfect compromise to make a few of my small datasets (300-1000 KB) yield acceptable results.

ostrosablin avatar Apr 11 '16 18:04 ostrosablin

Hi, I had no problem installing PyHyphen with pip as root (inside a Docker container), but I talked to a colleague and he convinced me we should dump PyHyphen altogether and move to NLTK (http://www.nltk.org/). I'm looking forward to attempting it.


On 12 April 2016 at 06:01, Ostrosablin Vitaly [email protected] wrote:

I'm having trouble setting up the PyHyphen module. I installed it via pip; according to the docs it should autoconfigure itself (if I understood correctly), but it doesn't set up config.py with the proper repository values, etc. As a result, it gives me a 404 when attempting to install dictionaries. The config had the placeholder value $repo for the repository. I tried replacing it with https://cgit.freedesktop.org/libreoffice/dictionaries/plain/dictionaries but it still doesn't download the dictionaries.

https://github.com/jcjohnson/torch-rnn/pull/64#issuecomment-208801846

dreavjr avatar Apr 14 '16 05:04 dreavjr

Yes, that's probably the best option, because NLTK is well maintained, while PyHyphen seems to be abandoned and partially broken.

I tried to install the dictionaries manually by downloading them from the LibreOffice git repository and pointing the path variable in config.py to the directory with the dictionaries, but it doesn't seem to work. I installed it on a Gentoo system. I have no idea why it installs broken with pip as root.

ostrosablin avatar Apr 15 '16 14:04 ostrosablin

I've modified your preprocessing module to make it work with another Python hyphenation library, Pyphen, instead of PyHyphen. It has dictionaries built in, so it avoids the problems I had with PyHyphen. It looks fine as far as I can tell; I just need to check whether the preprocessed datasets train to anything sensible. If anyone is interested, I could share my changes.

Because I mostly train networks on non-English texts with occasional English words, I thought it would make sense to use two hyphenators: one for the specified language and one for en_US as a fallback. In the end, the script selects the list with the most items as the basis for syllabic splitting, because otherwise the hyphenator would fail to split syllables of one of the two languages. For English texts, it would use a single hyphenator.
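The two-hyphenator fallback described above boils down to "try both, keep whichever split yields more pieces". A minimal sketch, where the two `syllables_*` callables are hypothetical stand-ins for the language-specific hyphenator objects (e.g. a primary-language and an en_US Pyphen instance):

```python
def pick_split(word, primary_syllables, fallback_syllables):
    # Run both hyphenators and keep the split with the most items;
    # a word the primary dictionary can't split still gets broken up
    # by the en_US fallback, and vice versa.
    primary = primary_syllables(word)
    fallback = fallback_syllables(word)
    best = primary if len(primary) >= len(fallback) else fallback
    return best if best else [word]  # never return an empty token list

# Hypothetical stand-ins: a "primary" hyphenator that only knows one
# word, and a fallback that splits everything longer than 3 chars in two.
primary = lambda w: ["po", "et", "ry"] if w == "poetry" else [w]
fallback = lambda w: [w[: len(w) // 2], w[len(w) // 2:]] if len(w) > 3 else [w]

print(pick_split("poetry", primary, fallback))  # ['po', 'et', 'ry']
print(pick_split("corpus", primary, fallback))  # ['cor', 'pus']
```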

ostrosablin avatar Jun 19 '16 12:06 ostrosablin

Well, that worked quite well. The network converges to readability really quickly, even on really small datasets. I suspect it might pick up on features more poorly, since the effective dataset is smaller, but it works for me, because I use torch-rnn mostly for fun.

Here's my modified preprocess.py.

Update: There's still a problem: the sampler is not aware of the syllabic splitting, so it will fail to pre-seed with -start_text. It's difficult to do anything about that, because the sampler is written in Lua.
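One possible direction, sketched in Python: run the -start_text through the same syllable splitter used at preprocessing time, then map each syllable to its index via the vocabulary JSON that preprocess.py writes out. The Lua sampler could then be fed token indices rather than raw characters. This assumes the JSON uses a `token_to_idx` table (as torch-rnn's stock preprocess.py does); `syllabify` is whichever splitter the preprocessor used.

```python
import json

def encode_start_text(start_text, vocab_json_path, syllabify):
    # Map the seed text into the same syllable-token indices the
    # network was trained on. Unknown syllables are skipped here;
    # one could also fall back to per-character tokens instead.
    with open(vocab_json_path) as f:
        token_to_idx = json.load(f)["token_to_idx"]
    idxs = []
    for word in start_text.split():
        for syl in syllabify(word.lower()):
            if syl in token_to_idx:
                idxs.append(token_to_idx[syl])
    return idxs
```

The sampler itself would still need a small Lua-side change to accept pre-encoded indices, but the tricky dictionary-dependent splitting stays on the Python side.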

ostrosablin avatar Jun 20 '16 07:06 ostrosablin

ValueError: Word to be hyphenated may have at most 100 characters.

Maybe there should be a workaround like this:

diff --git a/scripts/preprocess.py b/scripts/preprocess.py
index 4881bca..6e13359 100644
--- a/scripts/preprocess.py
+++ b/scripts/preprocess.py
@@ -63,7 +63,7 @@ if __name__ == '__main__':
                   space = False
                   continue
               if len(word)>0 :
-                  syls = separator.syllables(word.lower())
+                  syls = separator.syllables(word.lower()[:80])
                   if len(syls) == 0 :
                     syls = [ word.lower() ]
                   word = ''
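Truncating at 80 characters silently drops the tail of any over-long word. An alternative sketch that loses nothing: split the word into chunks under the limit and hyphenate each chunk separately. Here `syllables` stands in for the `separator.syllables` call from the patch (PyHyphen rejects words longer than 100 characters, hence the cap).

```python
MAX_WORD = 80  # stay safely under PyHyphen's 100-character limit

def safe_syllables(word, syllables, limit=MAX_WORD):
    # Hyphenate a word of any length by feeding the hyphenator
    # limit-sized chunks instead of truncating the word.
    out = []
    for i in range(0, len(word), limit):
        chunk = word[i:i + limit]
        syls = syllables(chunk)
        out.extend(syls if syls else [chunk])  # keep unsplittable chunks whole
    return out
```

Chunk boundaries may split a real syllable in two, but for pathological 100+ character "words" (URLs, ASCII art, etc.) that is usually acceptable, and no text is discarded.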

vi avatar Aug 18 '16 01:08 vi