Chapter-2 code

Open shwetacs12 opened this issue 6 years ago • 1 comments

Hi Ageron,

I'm new to ML and recently started reading your book. I have the following questions about Chapter 2.

def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]

1- In the above code, why do we use the same variable "id_" twice, once with lambda and a second time with the test_set_check function?
2- Exactly what value of "id_" is passed to the "test_set_check" function? We didn't assign a value to it anywhere.

Please respond in detail.

Thanks, Shweta

shwetacs12, Nov 19 '19 05:11

Hi @shwetacs12 ,

Thanks for your question! If you're not familiar with lambda functions, I strongly recommend you go through the entire Python 3 tutorial, especially the sections about lambdas and about how scopes work.

Here's a short example. Let's define a function to compute the square of a number:

def square(x):
    return x*x

print(square(5)) # prints 25

Another way to define this function is using a lambda:

square = lambda x: x * x
print(square(5)) # prints 25

The main differences between a lambda function and a regular function are that the lambda function can only contain a single expression, and that you don't explicitly use return (it is implicit). There are other, less important differences, such as the fact that the lambda function's name is <lambda> rather than square, but in this case that does not matter.
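As a quick illustration of those differences, here is a minimal sketch that defines the same function both ways and inspects the names. The name square_lambda is only used here so that both versions can coexist:

```python
def square(x):
    # Regular function: explicit return, may contain several statements
    return x * x

# Lambda: a single expression whose value is returned implicitly
square_lambda = lambda x: x * x

print(square(5), square_lambda(5))  # prints 25 25
print(square.__name__)              # prints square
print(square_lambda.__name__)       # prints <lambda>
```

Both objects are ordinary functions and are called in exactly the same way; only the way they are written differs.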

So we could use a regular function rather than a lambda function in the book's code. Perhaps this will make things clearer for you:

import hashlib

# test_set_check() is the helper defined earlier in the Chapter 2 code
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    def is_in_test_set(id_):
        return test_set_check(id_, test_ratio, hash)

    ids = data[id_column]
    in_test_set = ids.apply(is_in_test_set)
    return data.loc[~in_test_set], data.loc[in_test_set]

As you can see, I replaced the lambda function with a regular function named is_in_test_set. It takes a single argument id_. Also, since it's defined within the split_train_test_by_id function, it has access to all the variables in that function, and it uses the variables test_ratio and hash (it's a bit as if these variables were implicitly passed as invisible arguments to the function).
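To see this scoping rule in isolation, here is a small sketch with no pandas involved. The names flag_small_values, is_small and threshold are made up for this example and do not come from the book:

```python
def flag_small_values(values, threshold):
    def is_small(value):
        # threshold is not a parameter of is_small: Python finds it
        # in the scope of the enclosing function flag_small_values
        return value < threshold

    # is_small is called once per value, with a single argument each time
    return [is_small(v) for v in values]

print(flag_small_values([1, 5, 10], threshold=5))  # prints [True, False, False]
```

The inner function only receives value explicitly, just as is_in_test_set only receives id_, while threshold plays the role of test_ratio and hash.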

So to answer your two questions:

  1. The first occurrence of id_ on that line is required by the syntax of lambda functions: it is the name of the lambda function's parameter (it's equivalent to id_ in the signature of the function is_in_test_set). The second occurrence on that line is within the expression that will be evaluated when the function is called (it's equivalent to the use of id_ inside the body of is_in_test_set).
  2. The lambda function is passed as an argument to the apply() method of the ids pandas Series. This method calls the lambda function once for each element of the ids Series, passing that element as id_. So nothing needs to assign a value to id_ beforehand: on each call, it holds one value from the data[id_column] column.
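The calling pattern described in point 2 can be sketched in plain Python. Note that apply_each below is a hypothetical, heavily simplified stand-in for the apply() method, not pandas's actual implementation (the real method returns a Series and does much more), and record_and_check is a made-up helper that stands in for test_set_check:

```python
def apply_each(values, func):
    # Hypothetical stand-in for Series.apply(): call func once per element
    return [func(value) for value in values]

ids = [101, 102, 103]
seen = []  # records every value that id_ takes

def record_and_check(id_):
    # Made-up stand-in for test_set_check: remember the id, flag even ids
    seen.append(id_)
    return id_ % 2 == 0

# Same shape as the book's line: a lambda that forwards id_ to another function
in_test_set = apply_each(ids, lambda id_: record_and_check(id_))

print(seen)         # prints [101, 102, 103]
print(in_test_set)  # prints [False, True, False]
```

Each element of ids becomes the value of id_ for exactly one call, which is why id_ never has to be initialized by hand.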

Hope this helps!

ageron, Dec 03 '19 04:12