document_cluster
IPython notebook stuff just gets in the way?
It took me a while to figure out what an ipython notebook was and how to open it and run the code.
Then I tried to create a pull request for my earlier issue, but when I opened the notebook in Jupyter it apparently upgraded the notebook file to a more recent format, so my diff would have been huge.
I wanted to use this project in my own code, but it looks like I have to copy and paste snippets out of the notebook in order to do that?
I guess it's kind of cool to have that literate programming style, but mostly the notebook stuff just seems to get in the way and make life difficult. If you got rid of it and just had normal python code, it seems like this project would be significantly easier to work with.
Once you are in a notebook, you can actually save it as a python file. That will contain the code only.
You can also use nbconvert to convert the notebook into a number of different file formats.
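For reference, the command-line form is `jupyter nbconvert --to script <notebook>.ipynb`. If you would rather not depend on nbconvert at all, here is a rough sketch of pulling the code cells out by hand. It is not what nbconvert does internally; it only assumes the notebook is in nbformat 4, where the file is JSON with a top-level `cells` list. The function name `notebook_code_cells` and the inlined two-cell notebook are made up for illustration.

```python
import json


def notebook_code_cells(notebook_json):
    """Return the source text of each code cell in an nbformat 4 notebook."""
    nb = json.loads(notebook_json)
    sources = []
    for cell in nb.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        src = cell.get("source", "")
        # "source" may be stored as one string or as a list of lines.
        if isinstance(src, list):
            src = "".join(src)
        sources.append(src)
    return sources


# Hypothetical two-cell notebook, inlined so the example is self-contained.
example = json.dumps({
    "nbformat": 4,
    "nbformat_minor": 0,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title"]},
        {"cell_type": "code", "metadata": {}, "execution_count": None,
         "outputs": [], "source": ["import re\n", "x = 1"]},
    ],
})

cells = notebook_code_cells(example)
print("\n\n".join(cells))
```

With a real file you would pass in the text read from the `.ipynb` instead of `example`.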
I found "Download as -> Python". Is that what you mean? Thanks, that's a step in the right direction. But there are still problems, like "In [*]" comments scattered all over the code and imports in the middle of the file. This code was clearly meant to be run in the IPython notebook format and would need to be refactored to work outside of that environment. I may do it at some point if I find the time.
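A first pass at that cleanup could look something like the sketch below. It drops the cell-marker comments and moves top-level imports to the top of the file. Everything here is an assumption on my part: the helper name `tidy_exported_script` is made up, the marker pattern is a guess at the `# In[3]:` / `# In[ ]:` style lines seen in exported scripts, and the import handling is naive (single-line, unindented imports only), so the result would still need a manual review.

```python
import re

# Cell markers as they appear in exported scripts, e.g. "# In[3]:" or "# In[ ]:".
CELL_MARKER = re.compile(r"^# In\s?\[[^\]]*\]:?\s*$")
# Single-line, unindented import statements only (naive on purpose).
TOP_LEVEL_IMPORT = re.compile(r"^(import\s+\S|from\s+\S+\s+import\s+\S)")


def tidy_exported_script(text):
    """Strip cell markers and hoist top-level imports to the top of the script."""
    imports, body = [], []
    for line in text.splitlines():
        if CELL_MARKER.match(line):
            continue  # drop the marker comment entirely
        if TOP_LEVEL_IMPORT.match(line):
            if line not in imports:  # de-duplicate repeated imports
                imports.append(line)
            continue
        body.append(line)

    # Collapse the runs of blank lines left behind by the removed lines.
    cleaned = []
    for line in body:
        if line.strip() == "" and (not cleaned or cleaned[-1].strip() == ""):
            continue
        cleaned.append(line)
    while cleaned and cleaned[-1].strip() == "":
        cleaned.pop()

    return "\n".join(imports + [""] + cleaned) + "\n"


# Hypothetical exported script, just to exercise the function.
exported = """# In[1]:

import nltk

# In[ ]:

titles = ["a", "b"]
import re
print(len(titles))
"""

tidied = tidy_exported_script(exported)
print(tidied)
```

Hoisting imports changes the order in which modules load, which is usually harmless but worth checking if any cell relies on import side effects.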
Ok, I started on this: https://github.com/JesseAldridge/document_cluster
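Note: I found a big speed-up by caching the stemming like so:
def cached_stem(t, cache={}):
    # Assumes `stemmer` is already defined, as in the notebook.
    # The mutable default argument is deliberate: it persists across calls and serves as the cache.
    if t not in cache:
        cache[t] = stemmer.stem(t)
    return cache[t]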
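The same idea can be written with `functools.lru_cache` from the standard library, which avoids the default-argument trick. This is only a sketch: `CountingStemmer` is a made-up stand-in with a toy rule so the example runs without the project's real stemmer, and it counts calls to show how many are saved when tokens repeat.

```python
from functools import lru_cache


class CountingStemmer:
    """Hypothetical stand-in for the project's stemmer; counts calls to stem()."""

    def __init__(self):
        self.calls = 0

    def stem(self, token):
        self.calls += 1
        # Toy rule for illustration only, not a real stemming algorithm.
        return token[:-1] if token.endswith("s") else token


stemmer = CountingStemmer()


@lru_cache(maxsize=None)  # unbounded cache, like the dict-based version
def cached_stem(token):
    return stemmer.stem(token)


tokens = ["cats", "dogs", "cats", "cats", "dog"]
stems = [cached_stem(t) for t in tokens]
print(stems, "stem() calls:", stemmer.calls)
```

The underlying stemmer only runs once per distinct token, which is presumably where the speed-up comes from on a corpus with many repeated words. `cached_stem.cache_info()` reports hits and misses if you want to check.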