torch-rnn
Modified preprocess.py to accept syllabic prediction...
I'm playing with torch-rnn to do computational poetry (didn't Turing's own interest in AI start with that?!) and I found the letter-by-letter predictor requires really huge corpora (e.g., Shakespeare) to even start making sense, while the word-by-word predictor has limitations of its own. The syllable predictor converges quickly to something that... sounds correct, even when it means nothing. It might be an interesting compromise between size of vocabulary and amount of context for other explorations. The syllabic separation is based on PyHyphen, which uses LibreOffice's hyphenation dictionaries.
Sounds great! Will check it out. It just might be the perfect compromise to make a few of my small datasets (300-1000 KB) yield acceptable results.
Hi, I had no problem installing PyHyphen with pip as root (inside a Docker container); but I talked to a colleague and he convinced me we should dump PyHyphen altogether and move to NLTK http://www.nltk.org/. I'm looking forward to attempting it.
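For reference, a rough sketch of what that might look like, assuming NLTK's SyllableTokenizer (which splits by the Sonority Sequencing Principle rather than by hyphenation dictionaries); untested, just to show the shape of the API:

from nltk.tokenize import SyllableTokenizer

tokenizer = SyllableTokenizer()
print(tokenizer.tokenize('justification'))  # ['jus', 'ti', 'fi', 'ca', 'tion']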
On 12 April 2016 at 06:01, Ostrosablin Vitaly [email protected] wrote:
I'm having trouble setting up the PyHyphen module. I've installed it via pip; according to the docs it should autoconfigure itself (if I understood correctly), but it doesn't set up config.py with proper values for the repository, etc. As a result, it gives a 404 for me when attempting to install dicts. The config had the placeholder value $repo for the repository. I've tried replacing it with https://cgit.freedesktop.org/libreoffice/dictionaries/plain/dictionaries but it still doesn't download dicts.
Yes, that's probably the best option, because NLTK is well maintained, while PyHyphen seems to be abandoned and partially broken.
I've tried to install the dictionaries manually by downloading them from the LibreOffice git and pointing the path variable in config.py to the directory with the dicts, but it doesn't seem to work. I installed it on a Gentoo system. I have no idea why it installs broken with pip as root.
I've modified your preprocessing module to make it work with another Python hyphenation library, Pyphen, instead of PyHyphen. It has dictionaries built in, so it has none of the problems I had with PyHyphen. It looks fine as far as I can tell; I just need to check whether the preprocessed datasets train to anything sensible. If anyone is interested, I could share my changes.
Because I mostly train networks on non-English texts with occasional English words, I thought it would make sense to use two hyphenators: one for the specified language and one for en_US as a fallback. In the end, the script takes whichever hyphenator produces the list with the most items as the basis for syllabic splitting; otherwise, the hyphenator would fail to split syllables of one of the two languages. For English texts, it uses a single hyphenator.
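A minimal sketch of the dual-hyphenator logic, assuming Pyphen's standard API (pyphen.Pyphen(lang=...) and .inserted(word)); the language codes and the split_syllables name here are just illustrative, not the actual code from my patch:

import pyphen

primary = pyphen.Pyphen(lang='ru_RU')   # the dataset's main language (example)
fallback = pyphen.Pyphen(lang='en_US')  # fallback for stray English words

def split_syllables(word):
    # inserted() marks hyphenation points, e.g. 'example' -> 'ex-am-ple';
    # splitting on '-' turns that into a list of syllables.
    candidates = [
        primary.inserted(word).split('-'),
        fallback.inserted(word).split('-'),
    ]
    # Keep whichever hyphenator found more syllables; the wrong-language
    # hyphenator usually fails to split the word at all.
    return max(candidates, key=len)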
Well, that worked quite fine. The network converges to readability really quickly, even on really small datasets. I suspect it might pick up on features more poorly, since the dataset is smaller, but it works for me, because I use torch-rnn mostly for fun.
Here's my modified preprocess.py.
Update: there's still a problem: the sampler is not aware of the syllabic splitting, so it will fail to pre-seed with -start_text. It's difficult to do anything about that, because the sampler is written in Lua.
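To illustrate the mismatch, here is a hypothetical helper (not part of the patch) that checks whether a seed string can even be expressed in the syllable vocabulary, assuming the json layout that preprocess.py writes out (a 'token_to_idx' mapping):

import json
import pyphen

hyph = pyphen.Pyphen(lang='en_US')

def missing_seed_tokens(seed, vocab_json_path):
    # Load the token vocabulary produced by preprocess.py.
    with open(vocab_json_path) as f:
        token_to_idx = json.load(f)['token_to_idx']
    # Split the seed the same way the preprocessor splits the corpus.
    syllables = [syl for word in seed.lower().split()
                 for syl in hyph.inserted(word).split('-')]
    # Any syllable missing from the vocabulary cannot be used for seeding.
    return [syl for syl in syllables if syl not in token_to_idx]

A proper fix would still have to happen on the Lua side, in the sampler itself.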
ValueError: Word to be hyphenated may have at most 100 characters.
Maybe there should be a workaround like this:
diff --git a/scripts/preprocess.py b/scripts/preprocess.py
index 4881bca..6e13359 100644
--- a/scripts/preprocess.py
+++ b/scripts/preprocess.py
@@ -63,7 +63,7 @@ if __name__ == '__main__':
space = False
continue
if len(word)>0 :
- syls = separator.syllables(word.lower())
+ syls = separator.syllables(word.lower()[:80])
if len(syls) == 0 :
syls = [ word.lower() ]
word = ''
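Truncating to 80 characters stays safely under the 100-character limit, and tokens that long are almost never real words (URLs, ASCII art, and the like), so cutting them should barely affect the syllable statistics.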