data_science_in_julia_for_hackers

Chp 04 Review

Open · aloctavodia opened this issue 1 year ago · 1 comment

nitpick

  • [x] done

Change:

We all hate spam emails. How can Bayes help us with this? What we will be introducing in this chapter is a simple yet effective way of using Bayesian probability to make a spam filter of emails based on their content. There are many possible origins of the ‘Spam’ word. Some people suggest Spam...

Into:

We all hate spam emails. Some people suggest Spam...

Nitpick, consider rewording

  • [x] done

Change:

What we are facing here is a classification problem, which means we would like to group data (emails) into different categories (spam or ham). For this reaseon, it is important that our training data pre-classified (this step is frequently done by a human) so our model can “learn” to associate the target variable (email type) with the input variables (words contained in the email). For example, a successful model might infer from the training data that emails containing the word “discount” have a high probability of being spam. We will implement a solution with the help of Bayes’ theorem. What we are going to do is to treat each email just as a collection of words. This is why our methodology is called naive Bayes: The particular relationship between words and the context will not be taken into account here. Our strategy will be to estimate a probability of an incoming email of being ham or spam and making a decision based on that. Our general approach can be summarized as:

Into:

Given this data and Bayes' theorem, we will create a model that can "learn" to associate the target variable (email type) with the input variables (words contained in the email). For example, a successful model might infer from the training data that emails containing the word “discount” have a high probability of being spam.

We are going to treat each email just as a collection of words, without any relationship between them and no context where they appear, other than being part of a spam or ham e-mail. For this reason, this model is called naive Bayes. Our general approach can be summarized as:

Maybe:

Add a sentence or two saying something like: "When creating models, we usually make decisions that make our lives easier, and they usually come at the expense of making the model simpler. It is usually good practice to start by building simpler models and add complexity only as needed."

Also later in the same section, you say " We have to stress that this is not necessarily true, and most likely false. Words in a language are never independent of one another, but this simple assumption seems to be enough for the level of complexity our problem requires."

Maybe mention this earlier and later just say something like "we can multiply because of assumed independence"
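A short sketch the chapter could include to make the "multiply because of assumed independence" point concrete. All numbers and names below are toy values for illustration, not the chapter's actual data:

```julia
# Hypothetical per-word probabilities; in the chapter these come from training data.
p_word_given_spam = Dict("discount" => 0.05, "meeting" => 0.001)
p_word_given_ham  = Dict("discount" => 0.002, "meeting" => 0.03)
p_spam, p_ham = 0.4, 0.6   # priors, assumed here for illustration

# Because the words are assumed conditionally independent given the class,
# the likelihood of the whole email is just the product of word probabilities.
email = ["discount", "meeting"]
score_spam = p_spam * prod(p_word_given_spam[w] for w in email)
score_ham  = p_ham  * prod(p_word_given_ham[w]  for w in email)

prediction = score_spam > score_ham ? "spam" : "ham"
```

Here the single occurrence of "meeting" outweighs "discount", so the toy email scores as ham; the product is exactly where the independence assumption enters.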

First mention should be the previous chapter

  • [ ] done

"Where we use ∝ sign instead of = sign because the denominator from Bayes’ theorem is missing"
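If the redundant sentence is dropped, the proportionality could instead be stated once in the previous chapter and simply reused here, e.g.:

```latex
P(\text{spam} \mid \text{words}) \propto P(\text{spam}) \prod_{i=1}^{n} P(\text{word}_i \mid \text{spam})
```

where the denominator $P(\text{words})$ is omitted because it is the same constant for both classes and does not affect which class scores higher.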

Redundant

  • [x] done

This was mentioned too recently to say "remember". I think it can be safely omitted.

"we have to remember that we are interpreting each email just as a collection of words, with no importance on their order within the text. In this naive approach, the semantics are not taken into account."

Disconnected from previous and next paragraphs

  • [x] done

"The technique we applied to consider –in principle– two different words as the same is called lemmatization. It is a standard technique in linguistics which groups together inflected forms of a word, letting us consider ‘win’ and ‘won’, in our example, as the same word."

Can you have footnotes or info/tip boxes?
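If a tip box is added, a tiny sketch could make lemmatization concrete. The lemma table below is a toy stand-in; a real lemmatizer would cover the whole vocabulary:

```julia
# Toy lemma table mapping inflected forms to a base form (illustrative only).
lemmas = Dict("won" => "win", "winning" => "win", "wins" => "win")

lemmatize(word) = get(lemmas, word, word)  # fall back to the word itself

words = ["win", "won", "prize"]
lemmatize.(words)  # 'win' and 'won' now count as the same word
```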

Detail:

  • [x] done

Maybe Change : "The multiplication of each of the word probabilities here stands from the supposition that all the words in the email are statistically independent."

Into:

"The multiplication of each of the word probabilities here stems from the supposition that all the words in the email are conditionally independent given the class (spam or ham)"

Split into two paragraphs, maybe extend the first one

  • [ ] done

First, we would like to filter some words that are very common in the English language, such as articles and pronouns, which will most likely add noise rather than information to our classification algorithm. For this we will use two Julia packages that are specially designed for working with texts of any type. These are Languages.jl and TextAnalysis.jl.

A good practice when dealing with models that learn from data like the one we are going to implement, is to divide our data into two: a training set and a testing set. We need to measure how good our model is performing, so we will train it with some data, and test it with some other data the model has never seen. This way we may be sure that the model is not tricking us. In Julia, the package MLDataUtils has some nice functionalities for data manipulations like this. We will use the functions splitobs to split our dataset in a train set and a test set and shuffleobs to randomize the order of our data in the split. It is important also to pass a labels array to our split function so that it knows how to properly split our dataset.

Note: I have the impression there are many places where sentences should have been split into two or more paragraphs. Maybe this is a rendering issue and the sentences are separated in the source?
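The split/shuffle paragraph might also benefit from a sketch of what `shuffleobs` and `splitobs` accomplish, written here in plain Julia with toy data (the chapter itself uses MLDataUtils; everything below is illustrative):

```julia
using Random

# Toy dataset standing in for the chapter's email corpus.
emails = ["free discount!!", "meeting at noon", "win a prize", "lunch tomorrow"]
labels = ["spam", "ham", "spam", "ham"]

Random.seed!(42)                       # reproducible shuffle
perm = randperm(length(emails))        # randomize the order of the data
n_train = floor(Int, 0.75 * length(emails))

train_idx, test_idx = perm[1:n_train], perm[n_train+1:end]
x_train, y_train = emails[train_idx], labels[train_idx]
x_test,  y_test  = emails[test_idx],  labels[test_idx]
```

The model is then fit on `x_train`/`y_train` and evaluated on `x_test`/`y_test`, data it has never seen.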

Explain formulas in one or two sentences, consider making crossreference to the section in chapter 2.

  • [ ] done

The probability of finding a particular word in an email, given that we have a spam email, can be calculated like so:
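A counting sketch could support this sentence, and would also set up the $\alpha$ that surprises readers later. Toy data, illustrative names:

```julia
# Estimate P(word | spam) by counting, with additive (Laplace) smoothing α
# so that unseen words don't zero out the whole product of probabilities.
spam_words = ["discount", "win", "discount", "prize"]   # words seen in spam
vocabulary = ["discount", "win", "prize", "meeting"]    # all known words
α = 1

function p_word_given_spam(word)
    (count(==(word), spam_words) + α) /
    (length(spam_words) + α * length(vocabulary))
end

p_word_given_spam("discount")  # (2 + 1) / (4 + 4)
p_word_given_spam("meeting")   # (0 + 1) / (4 + 4), smoothed away from zero
```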

rendering issue

  • [ ] done

You might be surprised to read the $α$ value in the equations.

How to compute the priors is not explained in the text

  • [ ] done
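Explaining the priors could take only a sentence plus two lines of code; they are just the class frequencies in the training set (toy labels below):

```julia
# The prior P(spam) is simply the fraction of training emails labeled spam.
labels = ["spam", "ham", "spam", "ham", "ham"]

p_spam = count(==("spam"), labels) / length(labels)  # 2/5
p_ham  = 1 - p_spam                                  # 3/5
```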

Spend more time explaining the results

  • [x] done

  • Explain what accuracy means (even if this can be inferred from the code)

  • Show the confusion matrix and discuss its values

  • Explain ham_accuracy and spam_accuracy, computation and interpretation of results
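The three bullets above could be anchored by one small worked example; the predictions below are invented purely to show the computations:

```julia
# Toy predictions against true labels (illustrative only).
y_true = ["spam", "spam", "ham", "ham", "ham", "spam"]
y_pred = ["spam", "ham",  "ham", "ham", "spam", "spam"]

# Accuracy: fraction of emails classified correctly, regardless of class.
accuracy = count(y_true .== y_pred) / length(y_true)

# Confusion matrix entries: true class vs. predicted class.
tp = count((y_true .== "spam") .& (y_pred .== "spam"))  # spam caught
fn = count((y_true .== "spam") .& (y_pred .== "ham"))   # spam missed
fp = count((y_true .== "ham")  .& (y_pred .== "spam"))  # ham flagged as spam
tn = count((y_true .== "ham")  .& (y_pred .== "ham"))   # ham passed through

spam_accuracy = tp / (tp + fn)  # fraction of true spam correctly caught
ham_accuracy  = tn / (tn + fp)  # fraction of true ham correctly passed
```

Discussing these per-class numbers also motivates why overall accuracy alone can mislead: for a spam filter, `ham_accuracy` (not losing real mail) usually matters more than catching every spam.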

aloctavodia · Jan 16 '23 15:01

As a general theme, here and in other chapters: readers may benefit from more structure in the text, including more sections and more paragraphs, and perhaps some text moved to footnotes or boxes, or more figures. I like the chapters being relatively short, but sometimes the text seems to move too fast from one idea to the next. Adding structure will help reduce that feeling and will give readers space to reflect on what they are reading. In some parts, the text may need one or two more sentences.

aloctavodia · Jan 16 '23 15:01