Frequency transforms of text

Open danuep opened this issue 8 years ago • 4 comments

I didn't even get the idea until a couple of days ago, and mostly I'm hoping I can get this uploaded before midnight...

Reading @aparrish's comments in #23 about hoping to get a meaningful average novel got me thinking about the scales of variation in play, which led to wavelet transforms, which led to

Haar of Darkness

which is unfortunately 2000 words short of the limit, so in honor of a brilliant woman of letters and a brilliant woman of numbers:

The Wavelets, a Daubechies transform of The Waves, by Virginia Woolf

[edit: now with correct link to The Wavelets]

danuep avatar Dec 01 '17 07:12 danuep

(now that I've slept)

I'm grateful to @aparrish for sharing her word vectors generated from Project Gutenberg. I wouldn't have had the time to pull this together without that resource. If I had more time, I'd go back and be more content-aware about tokenizing the source texts -- I split on spaces and at each non-letter character, and the vector file contains entries for tokens like '--' and contractions. Entertainingly enough, The Waves isn't in Project Gutenberg, and so my lookup error log was a nice list of words that she coined in that book. For those, I greedily matched valid sub-words starting from the beginning of the word.
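For anyone curious what that tokenizing and fallback lookup might look like, here is a minimal sketch, not the actual scripts. The function names `tokenize` and `greedy_subwords` are hypothetical, and two details are assumptions: that each non-letter character is kept as its own token, and that "greedily" means taking the longest valid prefix first.

```python
import re


def tokenize(text):
    """Split on whitespace and at every non-letter character.

    Assumption: each non-letter character survives as its own token,
    so "don't" becomes ["don", "'", "t"] and "--" becomes two tokens.
    """
    return re.findall(r"[A-Za-z]+|[^A-Za-z\s]", text)


def greedy_subwords(word, vocab):
    """Break an out-of-vocabulary word into known sub-words.

    Starting from the beginning of the word, take the longest prefix
    found in vocab, then repeat on the remainder. A character that
    starts no known sub-word is skipped (an assumption of this sketch).
    """
    parts = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):  # try the longest candidate first
            if word[i:j] in vocab:
                parts.append(word[i:j])
                i = j
                break
        else:
            i += 1  # nothing in vocab starts here, so move on one character
    return parts


# Toy vocabulary standing in for the keys of the word-vector file.
toy_vocab = {"sun", "flower", "sunflower", "ed"}
print(tokenize("don't stop--now"))
print(greedy_subwords("sunflowered", toy_vocab))
```

This also shows why splitting at every non-letter character loses the vector file's entries for '--' and contractions: those tokens never survive the split intact.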

I used JWave for the Haar and Daubechies transforms, and Annoy for the nearest-neighbor matching.
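A rough sketch of the general idea in plain Python, under loud assumptions: the thread doesn't spell out the pipeline, so this guesses that each embedding dimension is treated as a signal along the text, transformed, and then matched back to the nearest word. The hand-rolled orthonormal Haar transform stands in for JWave, the brute-force cosine search stands in for Annoy's approximate index, and every name and the toy embeddings are hypothetical.

```python
import math


def haar_forward(signal):
    """Full orthonormal Haar transform of a power-of-two-length signal."""
    n = len(signal)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = [float(x) for x in signal]
    root2 = math.sqrt(2.0)
    length = n
    while length > 1:
        half = length // 2
        # Pairwise scaled sums (coarse part) and differences (detail part).
        sums = [(out[2 * i] + out[2 * i + 1]) / root2 for i in range(half)]
        diffs = [(out[2 * i] - out[2 * i + 1]) / root2 for i in range(half)]
        out[:length] = sums + diffs
        length = half
    return out


def transform_columns(vectors):
    """Apply haar_forward to each embedding dimension along the text."""
    dims = len(vectors[0])
    columns = [haar_forward([v[d] for v in vectors]) for d in range(dims)]
    return [[columns[d][t] for d in range(dims)] for t in range(len(vectors))]


def nearest_word(vector, embeddings):
    """Brute-force cosine nearest neighbor (a stand-in for Annoy)."""
    def cosine(a, b):
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return sum(x * y for x, y in zip(a, b)) / norm if norm else 0.0
    return max(embeddings, key=lambda word: cosine(vector, embeddings[word]))


# Toy 2-d embeddings; real word vectors have many more dimensions.
toy_embeddings = {"sea": [1.0, 0.0], "sky": [0.0, 1.0], "wave": [0.7, 0.7]}
text = ["sea", "sky", "wave", "sea"]
coefficients = transform_columns([toy_embeddings[w] for w in text])
print([nearest_word(c, toy_embeddings) for c in coefficients])
```

With real data the text length would need padding or truncating to a power of two, and an approximate index like Annoy matters because brute-force search over a full vocabulary is slow.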

danuep avatar Dec 01 '17 14:12 danuep

🎈

Is the source available somewhere?

hugovk avatar Dec 01 '17 14:12 hugovk

I'll put it up later today--was mostly rushing to meet the deadline (which I now see was UTC, not local, so oh well).

danuep avatar Dec 01 '17 14:12 danuep

Scripts are up at https://github.com/danuep/nanogenmo2017

danuep avatar Dec 01 '17 22:12 danuep