Chapter-2 code
Hi Ageron,
I'm new to ML and recently started reading your book. I have the following questions about Chapter 2.
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio, hash))
    return data.loc[~in_test_set], data.loc[in_test_set]
1- In the above code, why do we use the same variable "id_" twice, once with lambda and once with the test_set_check function?
2- Exactly what value of "id_" is passed to the "test_set_check" function? We didn't assign any value to it anywhere.
Please respond in detail.
Thanks Shweta
Hi @shwetacs12 ,
Thanks for your question! If you're not familiar with lambda functions, I strongly recommend you go through the entire Python 3 tutorial, especially the section about lambdas, and how scopes work.
Here's a short example. Let's define a function to compute the square of a number:
def square(x):
    return x * x
print(square(5)) # prints 25
Another way to define this function is using a lambda:
square = lambda x: x * x
print(square(5)) # prints 25
The main differences between a lambda function and a regular function are that the lambda function can only contain a single expression, and that you don't explicitly use return (it is implicit). There are other less important differences, such as the fact that the lambda function's name is <lambda> rather than square, but in this case it does not matter.
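If you want to check the name difference yourself, here is a small sketch. The names square_def and square_lambda are just made up for this illustration, they are not from the book:

```python
# Same computation, defined two ways (names are hypothetical, for illustration only).
def square_def(x):
    return x * x

square_lambda = lambda x: x * x

# Both behave the same when called...
print(square_def(5) == square_lambda(5))  # prints True

# ...but only the regular function remembers the name it was defined with.
print(square_def.__name__)     # prints square_def
print(square_lambda.__name__)  # prints <lambda>
```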
So we could use a regular function rather than a lambda function in the book's code. Perhaps it will make things clearer for you:
def split_train_test_by_id(data, test_ratio, id_column, hash=hashlib.md5):
    def is_in_test_set(id_):
        return test_set_check(id_, test_ratio, hash)
    ids = data[id_column]
    in_test_set = ids.apply(is_in_test_set)
    return data.loc[~in_test_set], data.loc[in_test_set]
As you can see, I replaced the lambda function with a regular function named is_in_test_set. It takes a single argument id_. Also, since it's defined within the split_train_test_by_id function, it has access to all the variables in that function, and it uses the variables test_ratio and hash (it's a bit as if these variables were implicitly passed as invisible arguments to the function).
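To see this scoping rule in isolation, here is a minimal sketch that does not depend on the book's code. The function count_below and its arguments are hypothetical, invented only for this example:

```python
# Hypothetical example: an inner function reading a variable of its enclosing function.
def count_below(values, threshold):
    def is_below(value):
        # `threshold` is not a parameter of is_below: it is looked up
        # in the enclosing count_below() call.
        return value < threshold
    return sum(1 for v in values if is_below(v))

def count_below_with_lambda(values, threshold):
    # Same thing with a lambda: it also sees `threshold` from the enclosing scope.
    is_below = lambda value: value < threshold
    return sum(1 for v in values if is_below(v))

print(count_below([1, 5, 12, 30], 10))              # prints 2
print(count_below_with_lambda([1, 5, 12, 30], 10))  # prints 2
```

The inner function only declares one argument, yet it can still use threshold, just like is_in_test_set uses test_ratio and hash.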
So to answer your two questions:
- The first occurrence of id_ on that line is just required by the syntax of lambda functions: it is the name of the argument to the lambda function (it's equivalent to the first use of the name id_ in the definition of the function is_in_test_set). The second occurrence on that line is within the expression that will be evaluated when the function is called (it's the equivalent of the use of id_ inside the function is_in_test_set).
- The lambda function is passed as an argument to the apply() method of the ids Pandas Series. This method will call the lambda function once for each element of the ids Series, so id_ takes the value of each ID in turn.
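Here is a rough sketch of what apply() does, using a tiny made-up Series and a stand-in check (is_even is hypothetical, it is not the book's test_set_check). It records every value it receives so you can see what id_ contains on each call:

```python
import pandas as pd

# A tiny, made-up Series of IDs (not the book's dataset).
ids = pd.Series([10, 11, 12, 13])

seen = []  # records every value that gets passed in as id_

def is_even(id_):
    # Stand-in for test_set_check: apply() calls this once per element.
    seen.append(int(id_))
    return id_ % 2 == 0

in_test_set = ids.apply(is_even)

print(seen)                  # the values id_ took, one per element of ids
print(in_test_set.tolist())  # [True, False, True, False]
```

So nothing needs to assign id_ explicitly: apply() supplies the value each time it calls the function.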
Hope this helps!