Remove ngrams and topic number

Open AhmetCakar opened this issue 3 years ago • 9 comments

Hi Andrew, me again :) I want to ask two questions about the algorithm. When using the first BERT model, why do we remove n-grams, and can't we use it without removing them? My second question is that when using BERT we give the number of keywords and the number of topics. How does the number of threads work, and what is the logic behind it?

AhmetCakar avatar May 22 '21 17:05 AhmetCakar

@AhmetCakar, hi again :)

You could certainly try BERT without removing the n-grams, but I've found that kwx works better when they're removed. BERT is able to pick up semantics from sentences, so it's actually better if there's less cleaning and nothing is added that wasn't in them originally. Basically we don't need to add in n-grams as BERT's able to find the relationships between the words from context itself - we don't need to add in word tokens representing these relationships. I honestly think that some steps in the kwx cleaning process might even be too much for a BERT model - that maybe for BERT we should just be using the raw uncleaned texts. This could be something you could try :)
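
To make "adding in n-grams" a bit more concrete, here's a minimal sketch of what that kind of cleaning step does, assuming a simple count threshold. The function `add_bigrams` and its `min_count` parameter are hypothetical and are not how kwx builds its n-grams; they only show the sort of joined tokens that BERT shouldn't need.

```python
from collections import Counter


def add_bigrams(tokenized_texts, min_count=2):
    """Append a joined token (e.g. "new_york") for each frequent adjacent word pair.

    Hypothetical illustration only, not kwx's actual n-gram step.
    """
    # Count every pair of neighbouring tokens across all texts.
    pair_counts = Counter(
        pair for tokens in tokenized_texts for pair in zip(tokens, tokens[1:])
    )
    frequent = {pair for pair, count in pair_counts.items() if count >= min_count}

    # Keep the original tokens and add one joined token per frequent pair.
    return [
        tokens
        + ["_".join(pair) for pair in zip(tokens, tokens[1:]) if pair in frequent]
        for tokens in tokenized_texts
    ]


texts = [
    ["new", "york", "is", "big"],
    ["i", "love", "new", "york"],
]
with_bigrams = add_bigrams(texts)
```

A bag-of-words model like LDA benefits from the extra "new_york" token because it can't see word order, whereas BERT reads "new york" in context and picks up the relationship on its own.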

For your second question, could you explain what you mean by number of threads? I'd be happy to get back to you once I have a bit more background on what that is and what's confusing about it.

andrewtavis avatar May 23 '21 20:05 andrewtavis

Hello Andrew, I'm learning LDA, BERT and keyword extraction. You apply all of them in your algorithm, which is great work.

I would like your help in understanding some of your code.

What is the purpose of this piece of code:

import os
import sys

import numpy as np
import pandas as pd

from kwx.utils import load_data, prepare_data
from kwx.utils import organize_by_pos, translate_output
from kwx.model import extract_kws, gen_files
from kwx.visuals import graph_topic_num_evals, pyLDAvis_topics
from kwx.visuals import gen_word_cloud, t_sne

import matplotlib.pyplot as plt
import seaborn as sns

sns.set(style="darkgrid")
sns.set(rc={"figure.figsize": (15, 5)})

pd.set_option("display.max_rows", 16)
pd.set_option("display.max_columns", None)
from IPython.core.display import display, HTML

display(HTML("<style>.container { width:99% !important; }</style>"))

Another thing: suppose I want to apply your code to Arabic tweets. Would that work?

Lastly: I would like to apply it to a set of documents. Can you refer me to some helpful resources?

I really appreciate any help you can provide.

Keamww2021 avatar Nov 07 '21 09:11 Keamww2021

Hi @Eman-2021-PhD :) Thanks for your compliments and your questions!

First question: I'm assuming that the code that you're referring to is the imports at the top of examples/kw_extraction, but correct me if I'm wrong. The code is the imports of what's needed for running the notebook - everything from kwx, pandas, numpy, and the plotting packages - and along with that are some notebook specific imports that I always put at the top of my Jupyter notebooks. Again I'm assuming the notebook specific imports are what's confusing. Here's a rundown of those :)

sns.set(style="darkgrid")
sns.set(rc={"figure.figsize": (15, 5)})

The above sets the background style of the plots with seaborn, and also determines how big all the plots will be. You can see the output in the Graph of Topic Number Evaluations section.

from IPython.core.display import display, HTML
display(HTML("<style>.container { width:99% !important; }</style>"))

I'm noticing that the above should be together rather than separated by a line, and just fixed that. What this does is it expands the display of your Jupyter notebook to close to the full width of the screen so that you have more space to work with. You first import the Jupyter (IPython) notebook's ability to interact with the HTML display, and then you set the width to 99% of the width of the screen (I've found that 100% can cause the scroll bar to disappear).

Second question: I'm very much hoping that kwx can work for Arabic, and wish you luck on your project. In the example you should just need to change the language to "arabic" like so:

from kwx.utils import prepare_data

input_language = "arabic" # see kwx.languages for options

# kwx.utils.clean() can be used on a list of lists
text_corpus = prepare_data(
    data="df_or_csv_xlsx_path",
    target_cols="cols_where_texts_are",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

input_language would then also need to be passed to extract_kws as in the examples. kwx doesn't allow for lemmatisation for Arabic, so it instead will stem the words using NLTK's SnowballStemmer("arabic").

Third question: I'm not sure about resources, but you should be able to apply kwx to a set of documents directly. All you'd need to do is set up a list of the texts or put them into a pandas dataframe. Say that you have a dataframe df_arabic_texts where each row is a different text that can be found in the column "texts". In the above example you'd do:

from kwx.utils import prepare_data

input_language = "arabic"
text_corpus = prepare_data(
    data=df_arabic_texts,
    target_cols="texts",
    input_language=input_language,
    min_token_freq=0,  # for BERT
    min_token_len=0,  # for BERT
    remove_stopwords=False,  # for BERT
    verbose=True,
)

You'd then use text_corpus in extract_kws as directed by the examples.

Hope that the above helps :) Let me know if you have further questions, and again good luck!

andrewtavis avatar Nov 07 '21 10:11 andrewtavis

Many thanks for your reply. Your explanation is very clear and helpful.

Keamww2021 avatar Nov 07 '21 12:11 Keamww2021

You're very welcome!

andrewtavis avatar Nov 07 '21 15:11 andrewtavis

Hello again. Could you please explain the function of this code:

freq_kws = extract_kws(
    method="frequency",
    bert_st_model=None,
    text_corpus=text_corpus,
    input_language=input_language,
    output_language=None,
    num_keywords=num_keywords,
    num_topics=num_topics,
    corpuses_to_compare=None,
    ignore_words=None,
    prompt_remove_words=False,
)

Do you have any idea how to extract the top keywords after extracting all the keywords? I mean, how do I extract the keywords that have the highest frequency?

Thanks in advance,

Keamww2021 avatar Nov 10 '21 21:11 Keamww2021

Hi @Eman-2021-PhD :)

method="frequency" is just going to return the words that occur the most in the documents, which can be considered to be keywords in a simplistic sense.
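
As a rough picture of what "the words that occur the most" means, here's a minimal sketch using only the standard library. The function `top_frequency_keywords` and the toy data are hypothetical and this is not kwx's actual implementation; it assumes the texts are already cleaned and split into tokens.

```python
from collections import Counter


def top_frequency_keywords(tokenized_texts, num_keywords=10):
    """Return the num_keywords most common tokens across all texts."""
    # Flatten all texts into one stream of tokens and count them.
    counts = Counter(token for tokens in tokenized_texts for token in tokens)
    return [word for word, _ in counts.most_common(num_keywords)]


# Toy stand-in for a cleaned, tokenized corpus.
toy_corpus = [
    ["data", "model", "data"],
    ["model", "data", "topic"],
]
toy_freq_kws = top_frequency_keywords(toy_corpus, num_keywords=2)
```

Here "data" appears three times and "model" twice, so those are the two words that come back.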

One way that you could extract high frequency keywords is to run kwx twice over your documents: once with method="LDA" or method="BERT"; and a second time with method="frequency". You could then compare the outputs and take only those words from the first run that also appear in the second :) You might need to increase the value for num_keywords in extract_kws so that you get enough words that overlap in the two runs, but it definitely will work. This is actually an example of one of the original use cases for kwx :D
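
The comparison step could look something like the following. This is only a sketch: it assumes each run gives you a flat list of keyword strings, and `keep_overlapping_keywords` as well as the toy lists are hypothetical names for illustration, not part of kwx.

```python
def keep_overlapping_keywords(model_kws, freq_kws):
    """Keep the model keywords that also appear in the frequency keywords.

    The order of model_kws is preserved.
    """
    freq_set = set(freq_kws)
    return [word for word in model_kws if word in freq_set]


# Toy stand-ins for the outputs of the two extract_kws runs.
lda_kws = ["economy", "policy", "growth", "reform"]
freq_kws = ["policy", "people", "economy", "year"]
top_kws = keep_overlapping_keywords(lda_kws, freq_kws)
```

If `top_kws` comes back too short, that's the sign to raise num_keywords in both runs and compare again.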

Let me know if you have further questions about the arguments in extract_kws.

All the best!

andrewtavis avatar Nov 10 '21 23:11 andrewtavis

Thank you for the clarification.

Keamww2021 avatar Nov 11 '21 09:11 Keamww2021

You're very welcome, and further regards!

andrewtavis avatar Nov 11 '21 09:11 andrewtavis