
Need some testers

Open jwijffels opened this issue 5 years ago • 22 comments

I am looking for testers to try out the package and compare it to other approaches. Inviting @jlacko: I see you used keras in this blog post http://www.jla-data.net/eng/vocabulary-based-text-classification/ and it would be nice to compare that to the embed_tagspace model. What do you think?

jwijffels avatar Jan 29 '19 18:01 jwijffels

I would be delighted to!

I am a great fan of your udpipe package. The lemmatization feature is a godsend for people doing text analysis in languages with complicated grammatical inflections (like Czech). The English speaking people have it easy.

I wrote the blog post you mention in response to a question on RStudio forum about the best tokenizer for keras (https://community.rstudio.com/t/what-is-the-best-tokenizer-to-be-used-for-keras/) because I honestly believe it really is the best tokenizer for keras :)

I am not familiar (yet) with the ruimtehol package, but I will familiarize myself with it and give you feedback / suggestion / whatever on what I find.

jlacko avatar Jan 29 '19 20:01 jlacko

All credit regarding udpipe should really go to @foxik who created the UDPipe C++ library. No doubt about it, he tuned the model well for Czech. Looking forward to seeing your feedback on the ruimtehol to keras comparison. FYI, there is more info on the ruimtehol package, which wraps Starspace, at http://www.bnosac.be/index.php/blog/86-neural-text-modelling-with-r-package-ruimtehol

jwijffels avatar Jan 29 '19 21:01 jwijffels

I need a bit of time to go through the embed_tagspace feature of the ruimtehol package. My first impression is that it is fast.

Both in the sense that it trains faster than Keras, and that it is faster to set up a model, as it does not require extensive preprocessing. Thumbs up for that.

But I need to develop some toy example to benchmark it against Keras for accuracy; this will take some time.

I will report back once I am ready, but I can already tell that I am intrigued :)

J.

jlacko avatar Jan 31 '19 18:01 jlacko

Take your time, it also took me some time to get familiar with the Starspace methodology and the training parameters and format. I only did training on larger datasets myself. My advice for a starting toy dataset is that it should contain at least 25,000 text records.

jwijffels avatar Jan 31 '19 19:01 jwijffels

Hi @systats Are you interested in chiming in on this exercise comparing different text modelling approaches? @jlacko made a repository for a simple, balanced 6-label classification problem at https://github.com/jlacko/celebrity-faceoff Results from that setup were basically:

  • penalised multinomial regression (glmnet): 74.39%
  • naive Bayes (quanteda): 75.78%
  • fasttext: 74.11%
  • ruimtehol (Starspace): vanilla 71.39%, tuned 73.28%, with transfer learning 75.11%
  • keras: single LSTM 75.83%, stacked LSTM 76.44%

I have a political dataset of questions/answers in the Belgian parliament at https://github.com/bnosac/ruimtehol/raw/master/inst/extdata/dekamer-2014-2018.RData Would love to see other approaches and results popping up, comparing different techniques on different types of datasets (preferably larger volumes of text), supervised as well as unsupervised. I've made an overview of NLP techniques at http://bnosac.be/index.php/blog/87-an-overview-of-the-nlp-ecosystem-in-r-nlproc-textasdata.
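To give testers a starting point, a minimal embed_tagspace run on that dataset could look roughly like the sketch below. The column names `question` and `question_theme_main` are assumptions based on the package's dekamer example, and the hyperparameters are just illustrative defaults, not tuned values:

```r
library(ruimtehol)
load("dekamer-2014-2018.RData")   # assumed to load a data.frame called dekamer

## basic cleanup: lowercase text, one question per element
x <- tolower(dekamer$question)
y <- dekamer$question_theme_main

set.seed(123456789)
model <- embed_tagspace(x = x, y = y,
                        dim = 50, epoch = 20, loss = "softmax",
                        adagrad = TRUE, ngrams = 2, minCount = 5)

## top 3 predicted labels for a new question
predict(model, tolower("Wat zijn de plannen voor de pensioenen?"), k = 3)
```

From there you can compare held-out accuracy against whatever keras/glmnet setup you prefer.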

jwijffels avatar Feb 19 '19 16:02 jwijffels

Hi guys: sounds very good to gather the collective wisdom on NLP. Although 9000 tweets seems like rather little data, at least for keras. But sure, I will look into the project.

systats avatar Feb 25 '19 10:02 systats

@systats If you have other datasets which you think are interesting, let us know.

jwijffels avatar Feb 25 '19 10:02 jwijffels

Honestly, I don't see value in this exercise. For text classification we can almost always get nearly maximum accuracy using bag-of-words/bag-of-ngrams and logistic regression.
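A sketch of that baseline with text2vec and glmnet (assuming a hypothetical data.frame `df` with columns `text` and `label`; the API calls follow the text2vec vignettes):

```r
library(text2vec)
library(glmnet)

## df is a hypothetical data.frame: text (character), label (factor)
it    <- itoken(df$text, preprocessor = tolower, tokenizer = word_tokenizer)
vocab <- create_vocabulary(it, ngram = c(1L, 2L))     # unigrams + bigrams
vocab <- prune_vocabulary(vocab, term_count_min = 5)  # drop rare terms
dtm   <- create_dtm(it, vocab_vectorizer(vocab))

## L1-penalised multinomial logistic regression, lambda by cross-validation
fit <- cv.glmnet(x = dtm, y = df$label, family = "multinomial", alpha = 1)
predict(fit, newx = dtm, type = "class", s = "lambda.min")
```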

dselivanov avatar Feb 25 '19 10:02 dselivanov

Just to clarify: I believe there are many interesting things we can do with the current set of NLP-related packages in R. Many thanks to Jan! But text classification is really the most boring one and can be considered "solved".

dselivanov avatar Feb 25 '19 10:02 dselivanov

That is certainly what comes out of the above exercise, and it's always the first approach I try. But I'm not 100% sure that statement holds, especially for larger datasets and for multi-label classification. Either way, I would mainly be interested in some embedding comparisons, because generic entity embeddings are what this Starspace model provides. If you have ideas on datasets & approaches, let me know.

jwijffels avatar Feb 25 '19 10:02 jwijffels

Sure, there are probably too many datasets to choose from.

But it strongly depends on what you are personally interested in.

systats avatar Feb 25 '19 11:02 systats

I too think classification and regression are well documented (in other languages) everywhere, but some specific problems remain difficult (unbalanced data / multiple outcomes). I'd love to work on projects involving Seq2Seq models (text generation, summarization, autoencoders for topic models), but there is little R documentation out there.

I thought it might be useful to build a unifying wrapper for all first-party algorithms in R (like RTextTools). This would enable us to benchmark (via purrr) all the datasets and models at once.

systats avatar Feb 25 '19 11:02 systats

FWIW, Starspace handles multi-outcome unbalanced classification data pretty gracefully.

I'm in if you want to compare a seq2seq model to a combination of textrank with some embedding models (either starspace/fasttext or glove) in order to summarise data. Do you have some data in mind that would make this a nice exercise? Alternatively, I don't mind doing the exercise of using starspace in a similar fashion as an autoencoder to provide labels/categories to images.
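To make the textrank side of that comparison concrete, a rough sketch of extractive summarisation with the textrank package could look as follows. The annotation step uses udpipe as in the textrank examples; `txt` is a hypothetical character string holding the document, and the noun/adjective filter is one common choice, not the only one:

```r
library(udpipe)
library(textrank)

## txt is a hypothetical character string with the document to summarise
dl      <- udpipe_download_model(language = "dutch")
udmodel <- udpipe_load_model(dl$file_model)
x <- as.data.frame(udpipe_annotate(udmodel, x = txt))

## sentences and the lemmas occurring in them, as textrank_sentences expects
sentences <- unique(x[, c("sentence_id", "sentence")])
colnames(sentences) <- c("textrank_id", "sentence")
terminology <- subset(x, upos %in% c("NOUN", "ADJ"))[, c("sentence_id", "lemma")]
colnames(terminology) <- c("textrank_id", "lemma")

tr <- textrank_sentences(data = sentences, terminology = terminology)
summary(tr, n = 3, keep.sentence.order = TRUE)  # 3-sentence extractive summary
```

A seq2seq model would instead generate an abstractive summary, which is what makes the comparison interesting.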

jwijffels avatar Feb 25 '19 11:02 jwijffels

Generally, I am interested in all the topics you mentioned. Some weeks ago I tried to build a Keras summariser, but the predictions were bad. From my point of view, Keras is not the best-suited library for this task, because one has to iterate over whole parts of a model per token. textgenrnn might be easier for predicting next words in a sequence.

At the moment I have to write a paper about insights from large collections of publications (N > millions) regarding a specific research field. Through unsupervised document clustering (topicmodels is too slow, autoencoders are very complex) I hope to recover hidden structures in the corpus. Do you think starspace allows me to map the documents, based on their vocabulary, to a latent subspace [10 <= k <= 30] that I can access and work with, e.g. for predictions on new data? There is a package for Keras autoencoders in R, ruta, which does exactly that, but I have to figure out how to adapt the model to text input. Maybe you guys have some suggestions on this topic.

Finally, I added three models which perform okay on the celebrity-faceoff dataset, like yours. And there are hundreds more to discover. Our results strongly depend on tokens that appear only once in the corpus; do the models therefore overfit massively?

As supervised learning is still the backbone of most NLP applications, this effort could be scaled. We could standardize model inputs/interfaces to easily run grid searches or specialised optimizers for hyperparameter selection. This would also be beneficial for automatically benchmarking models on a variety of datasets (comparing all R packages?). This article suggests that the choice of dataset strongly matters for how a specific model performs (maybe starspace slightly underperforms only on small datasets?). Maybe you could provide pretrained word embeddings for keras?

systats avatar Feb 26 '19 12:02 systats

Even more general (and better) than that: https://github.com/systats/tidykeras

systats avatar Feb 26 '19 12:02 systats

About the collection of publications: I haven't tried using the Starspace embeddings for clustering, but it can give embeddings of words as well as sentences, paragraphs, full documents or authors, which you can then cluster (with DBSCAN, for example) or visualise with the projector R package. The main advantage of starspace is that the embeddings of the labels and the ngrams which construct the embedding space lie in the same space, and that it works on a multitude of setups (it's a generic entity embedding framework).

Interesting package, https://github.com/systats/tidykeras. About a generic package for tuning NLP models: that would be interesting to have, but it is out of my personal scope. Is that collection of publications public domain?
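A hedged sketch of that clustering workflow with ruimtehol, under the assumption that embed_sentencespace accepts a character vector of texts (as in the package documentation) and using plain kmeans rather than DBSCAN to keep the example dependency-free:

```r
library(ruimtehol)

## texts is a hypothetical character vector of cleaned, lowercased documents
set.seed(42)
model <- embed_sentencespace(texts, dim = 20, epoch = 10, minCount = 5)

## embed every document into the learned 20-dimensional latent space
emb <- starspace_embedding(model, texts, type = "document")

## cluster the embeddings, e.g. into 15 groups, and inspect cluster sizes
cl <- kmeans(emb, centers = 15, nstart = 10)
table(cl$cluster)
```

New documents can be embedded with the same starspace_embedding call and assigned to the nearest cluster centroid.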

jwijffels avatar Feb 26 '19 14:02 jwijffels

Thanks for your kind words & for your work on the celebrity-faceoff repo. I managed to have only a quick look at it and it looked inspiring indeed. If you do not mind, I will rework it from Rmd format to plain R code to keep it more in line with the other sample data.

Unfortunately it seems (with the benefit of hindsight) that my choice of dataset was not a particularly good one. As intriguing as the idea of keeping up with the Kardashians seemed, the sample was just too small.

With regard to using the StarSpace algorithm on millions of documents: this seems interesting and could work. I was amazed by the speed with which ruimtehol / StarSpace worked, especially since I was used to working with Keras. A totally different ballpark!

jlacko avatar Feb 26 '19 14:02 jlacko

The main advantage of starspace is that the embeddings of the labels and ngrams which construct the embedding space are lying in the same space and that it works on a multitude of setups (it's a generic entity embedding framework)

Are starspace embeddings trained unsupervised? And do you think they provide a meaningful latent space (compared to topic proportions in topicmodels)?

Is that collection of publications public domain?

Personally I'm interested in social science literature, which comes partly from Scopus (a ton of labels and metadata) and a free DOI dump (for the abstracts).

Of course you can reshape the file to your needs. What's next for the celebrity-faceoff repo?

systats avatar Feb 26 '19 17:02 systats

Are starspace embeddings trained unsupervised? And do you think they provide a meaningful latent space (compared to topic proportions in topicmodels)?

They can be trained supervised, unsupervised or semi-supervised. I haven't compared them to LDA/BTM models yet; currently I've used starspace only for document proposals and multi-label classification. @systats I'll add some examples in the documentation on semi-supervised learning and also on transfer learning.

jwijffels avatar Feb 26 '19 17:02 jwijffels

I will of course look into it in the coming weeks. Thanks so far; your project inspired this: https://github.com/systats/textlearnR. I'm going to focus on this lightweight R package for benchmarking NLP models on a range of datasets, hoping not to reinvent the wheel. If you think I am, please let me know.

systats avatar Feb 26 '19 17:02 systats

What's next for the celebrity-faceoff repo?

I am open to advice in this matter. My current plan is to consolidate the various scripts into a more uniform format and publish a short summary post on some blog or other.

Even though the dataset proved too small to let the neural models truly shine, I believe it could serve as a nice showcase of possible approaches.

jlacko avatar Feb 26 '19 17:02 jlacko

I think there is certainly some value in comparing different approaches and best practices across different packages. I know there is https://github.com/tidymodels/textrecipes by @EmilHvitfeldt, which tries to standardise NLP data preparation, but mainly for non-neural supervised approaches. I'm not aware of similar neural tuning/comparison approaches on the R side. For datasets at textlearnR: last week I used the CRAN DESCRIPTION files to see which packages could be added to task views (https://github.com/jwijffels/starspace-examples). I think that an approach where different NLP package authors can publish an Rmd file on their best practices in a certain NLP area would certainly be a good thing for the R NLP community.

jwijffels avatar Feb 26 '19 19:02 jwijffels