Hello-Kaggle-Guide
For someone who is new at Kaggle
Hello Kaggle!:wave:
I summarized the definitions of Kaggle and basic usage after reading Kaggle's Official Document and Kaggle Guide
I hope it will help those who are just introduced to Kaggle like me.
If there is anything that needs to be corrected, please leave it in Issue.
FYI, the Hello Kaggle! document rarely deals with Python programming or machine learning theory
and focuses on Kaggle usage.
For those of you who are looking for programming, data science, and machine learning materials, I'll leave you with some links that have helped me.
- DATA SCIENCE ROADMAP 2020
- data engineer roadmap by datastacktv
- My Data Science Online Learning Journey on Coursera
Table of contents
- What is Kaggle?
  - Kaggler? Kaggling?
  - Kaggle Service and Features
  - Required Kaggling Knowledge
  - Prepare before becoming Kaggler
- How is Kaggle used?
  - Infrastructure for data analytics
  - Notebook
  - Dataset
  - Company Training
  - Discussion
- Kaggle Competition?
  - Featured, the most common Competition
  - Research
  - Getting Started for New Kaggler
  - Playground for data scientists and engineers
  - Recruitment for job opportunities
  - Annual Competition held regularly
  - Analytics to effectively explain the results
- Getting Started with Kaggle
  - Sign Up
  - Take a look at Kaggle Courses
  - Kaggle Tiers
  - Medal
  - Being Contributor
  - Kaggle Rankings
- Getting to know Notebook
  - Introduction to Notebook
  - What can you do with your Notebook?
  - Create and use Notebook
  - Various settings for Notebook
  - How to import data from Notebook
  - Use external packages in Notebook
  - Use Source Code from Dataset in Notebook
- Competitions and Notebooks
  - What else can the Notebook be used for besides data analysis Competition?
  - How to handle Data Files to use in the Competition Notebook?
- Competitions Progress Flow
  - Baseline implementing the general purpose algorithm
  - Data analysis notebook
  - Fork Notebook
  - Merge, Blending, Stacking, Ensemble Notebook
  - Conclusion of Competitions Progress Flow
- Rule of Competitions
  - What rules should I check?
- Flow of Technology in Kaggle
  - Exploring in Closed Competition
  - Winner Solutions at a Glance
- Kaggle Dataset and API
  - Use Public Dataset
  - Use it as a Data Repository
  - Kaggle API
  - Install Kaggle API
  - Use Kaggle API
- Finished!
What is Kaggle?
- Kaggle is a platform that hosts data analysis Competitions.
- Competitions are commonly hosted by companies that provide data to be analyzed for their research challenges or key services.
- The artificial intelligence and machine learning boom has steadily increased the number of participants, and Kaggle was acquired by Google, a subsidiary of Alphabet, in 2017.
- Since the acquisition, Kaggle has become a critical site for data scientists and engineers, not just a platform.
Kaggler? Kaggling?
- Just as searching on Google is called Googling, Kaggle users are called Kagglers, and participating in a Competition is called Kaggling.
Kaggle Service and Features
- Jobs
  - A Jobs service was originally provided, but it ended on December 22, 2020. Simply put, this was because the number of users was small. For more information, see https://www.kaggle.com/jobs-board-closed.
- Courses
  - Provides practical lectures on Python, machine learning, visualization, and so on. Kaggle's courses can be quite useful if you haven't learned these topics step by step or if you studied from an outdated course.
  - All lectures are in English, are free, and offer a certificate of completion.
- English
  - Data scientists from all over the world gather together and use English by default.
  - Competition notices, Datasets, and Discussions are also in English. Below is a photo of Discussion and Site Forums.
  - If you look at the profiles of Competition winners, they come from a variety of countries: USA, Korea, Russia, China, India, and so on.
- Programming Language
  - Python and R are generally used a lot.
Required Kaggling Knowledge
| Purpose | Knowledge Required |
| --- | --- |
| Competition participation | Python, R, data analysis |
| Competition organizer | Data analysis, English |
| Discussion with Kaggler | English |
| Learning through Courses | English |
Prepare before becoming Kaggler
- Required: Internet, Python and R, PC
- Recommended: Server with GPU or Workstation, and a high-capacity HDD or SSD
How is Kaggle used?
Infrastructure for data analytics
- Kaggle is web-based and provides tools for data analysis (Notebook).
- It is a community with a variety of Kagglers that enables competition and cooperation.
Notebook
- The programming environment for data analysis provided by Kaggle.
- A SaaS environment that runs the code written in your Notebook on a server.
- It provides a programming environment, so there is no need to build a separate development environment. (No Python installation, Anaconda installation, etc.)
- It is similar to Jupyter Notebook.
- Provides 4 Core CPU + 16GB RAM by default. GPU Server provides 2 Core CPU + GPU + 13GB RAM.
- Provided free of charge, and the GPU can be used for 30 hours a week.
Dataset

- The first thing to do when developing a machine learning-based data analysis program is to prepare a Dataset.
- Datasets are either opened for academic purposes or created and released by Kagglers.
- If you don't want to share your Dataset, you can use the Private setting to keep it private.
- Once a Dataset or Notebook is set to Public, the Apache 2.0 License is applied, so decide carefully.
Company Training
- Example: staff training for creating neural network-based machine learning programs
  1. Sign up for Kaggle
  2. Employees copy and execute the moderator's Notebook
  3. Modify the neural network model in the Notebook
  4. Submit the results of the modified model to the Competition and check the score
- What if we didn't use Kaggle?
  1. Establish a development environment on a training computer
  2. Distribute examples of machine learning programs (neural network models)
  3. Create a program that evaluates neural network model execution results by converting them into scores
  4. Check the evaluation score of the executed model
  5. Modify the neural network model
  6. Confirm that the score varies depending on the outcome of the run
- Kaggle is much easier and less expensive for building a development environment, checking the score, and deployment.
Discussion
- If you don't know something, you can ask in Site Forums or in the Competition section of Communities.
- Communities
- Site Forums
Kaggle Competition?
Refer to Competitions Documentation.
Featured, the most common Competition

- These are difficult competitions, generally held for commercial purposes.
- Most Kagglers participate in these competitions. Among those held so far, prizes have ranged from $100 to $1,500,000.
Research

- It mainly deals with research topics and generally does not have prize money or rewards. (All the ongoing Research Competitions have prize money.)
- Instead, you can do research by discussing with less competitive and intellectually curious Kagglers.
Getting Started for New Kaggler

- The Competitions shown here are for beginners.
- In particular, three competitions are the most recommended and helpful for newcomers to machine learning: Titanic: Machine Learning from Disaster, House Prices: Advanced Regression Techniques, and Digit Recognizer.
Playground for data scientists and engineers

- Competition is held mainly with topics that data scientists and engineers might find interesting.
- Playground is not an easy task. It usually covers recent academic/technical issues and public social issues.
- In some cases, the organizers may offer prize money or reward.
Recruitment for job opportunities

- Companies host these Competitions, and the prize is mostly a job interview opportunity. Participants can upload a resume at the end of the Competition.
Annual Competition held regularly
- Kaggle has several regularly held Competitions. You can find the following information at the current Kaggle.
Analytics to effectively explain the results
- This is not explained in the Documentation, so I wrote this after reading the Analytics Competitions that are currently listed.
- Judging from the evaluation and submission formats of each Competition, Analytics entries are submitted directly as a notebook and scored by a person.
The analyzed data should be described according to the organizers' requirements. It resembles persuading a company's management through a presentation.
Getting Started with Kaggle
Sign Up
- Before starting with Kaggle, click the Register button in the upper right to sign up.
Take a look at Kaggle Courses
- For those of you who do not have enough knowledge about machine learning or data analytics, it is also a good idea to study the areas you need at Courses, as described above.
- Each course consists of 2 to 8 classes and offers a variety of hands-on examples.

Refer to Kaggle Progression System.
Before I explain how to become a Contributor, I will explain Kaggle Tiers and Medal.
Kaggle Tiers
- Kaggle has a Progression System, which is simply the Kaggler Tier. This rating is a good indicator of your ability as a data scientist. It also intuitively shows how much you've grown.
- The Kaggle Tiers are divided into five levels, and each has conditions to achieve it.
  - Novice
  - Contributor
  - Expert
  - Master
  - Grandmaster
- Also, as you can see in the pictures above, the Kaggle Tier is rated separately for Competitions, Datasets, Notebooks, and Discussion.
- Click the account icon in the upper right and select My Profile to go to the profile page. There you can check your profile information, Kaggle activity, and tiers.
Medal
- Medal shows a Kaggler's performance in each field.
  - A Kaggler with excellent results in Competition
  - A Kaggler who writes and shares a popular Notebook
  - A Kaggler who shares a useful Dataset
  - A Kaggler who writes a good Comment
- Contributor only requires satisfying the conditions. From Expert onward, however, you must collect the medals required in each discipline.
- Competitions have different medal criteria depending on the number of teams participating.
- Datasets, Notebooks, and Discussion are evaluated by Vote. The higher the number of Votes, the more Kagglers recommended it.
- Note that each post holds only one type of medal in each category. For example, if a post in Dataset receives 20 Votes, its bronze medal is replaced by a silver medal.
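The one-medal-per-post rule can be sketched as a small lookup. Only the 20-vote silver threshold for Datasets comes from the text above. The bronze (5) and gold (50) thresholds and the function name `dataset_medal` are assumptions for illustration, so check the Kaggle Progression System page for the current values.

```python
# Hypothetical vote thresholds for a Dataset medal, highest tier first.
# Only the silver threshold (20) is taken from the text; the others are assumed.
DATASET_THRESHOLDS = [("gold", 50), ("silver", 20), ("bronze", 5)]


def dataset_medal(votes):
    """Return the single medal a Dataset post holds for a given vote count."""
    for medal, minimum in DATASET_THRESHOLDS:
        if votes >= minimum:
            return medal  # a higher medal replaces the lower one
    return None  # not enough votes for any medal


print(dataset_medal(4))   # None
print(dataset_medal(20))  # silver
```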
Being Contributor
1. Adding User Profile Information
- Go to your profile, click Edit Profile, and enter the following: Bio (self-introduction), Occupation, Organization, City
- In addition, you can freely set a profile image and Social Media.
2. SMS Verification
- Click Phone Verification on the profile screen.
- Fill in the Country Code and Phone Number, check the Not a Robot box, and click Send Code.
- Enter the code you receive and click Verify to complete authentication.
3. Run Script
- You can achieve this by learning at Courses or by creating your own Notebook and executing any code. Step 4 (Participate in the Competition) also runs a notebook, so you can skip this step.
4. Participate in the Competition
- Select one Competition in the 'Getting Started' category.
- Once inside, you can see a menu in the middle of the screen.
- Click 'Notebooks' here and take a look at other people's notebooks.
- Pick one notebook and open it. In the upper right corner you'll see a button for copying the notebook. Click it to copy the notebook.
- Once the copy is complete, click Save Version in the upper right corner.
  - Version Name: You can enter a name.
  - Version Type: There are two options, Quick Save and Save & Run All (Commit). Quick Save saves without executing, and Save & Run All (Commit) executes the notebook.
- Select Save & Run All here and press the Save button.
- Go back to your profile and click Notebook to see the notebook you just copied. When you click on this notebook, Output appears in the right menu. Select Submission.csv, which you can view under Output, and click Submit to Competition on the right.
- The screen then moves to the Leaderboard menu and the submitted file is scored automatically. After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
5. Comment on other people's posts or comments and cast an upvote (Make 1 comment & Cast 1 upvote)
- In Discussion, choose the topic you want and click any article you are interested in (Getting Started in Site Forums is recommended).
- Read it carefully and write a comment. If the text is useful or you like it, press Vote as well.
6. Now you are a Contributor!
Wait!
- Let me add one more thing, Kaggle Rankings.
- Rankings are separated by Competitions, Datasets, Notebooks, and Discussion.
- The photo below shows the ranking in Competitions. You can also check how many people are in each tier.
Getting to know Notebook
For a brief introduction to Notebook, please re-read the Notebook section above!
What can you do with your Notebook?
- Programming for data analysis is the primary purpose, and the programs you create run on the Kaggle server.
- Submit to a Competition or share your Notebook with other Kagglers. Some Notebooks are shared only for training or skills.
- Use Code Cells and Markdown Cells to write code and descriptions of the code, text, images, etc.
How to use Markdown
Markdown emoji-cheat-sheet
I referred to the two links above when I first used Markdown, and I still look up emoji there whenever I need one.
Create and Use Notebook
- Go to the Notebook menu and look in the upper right corner. There is a button for creating a new Notebook. Click it.
- Kaggle Notebook has two types: Script and Notebook.
  - Script is a method of writing and executing code as in a commonly used code editor.
  - Notebook is an interactive development environment similar to Jupyter Notebook. Its characteristic is that you can divide the code into cells and execute only the cells you want.
- Press File in the upper left corner and hover your cursor over Edit Type to select the type. In addition, you can choose between Python and R in Language.
- You can change the name by clicking on the title field at the top left.
- The first time you create a Notebook, you will see the following code:

      # This Python 3 environment comes with many helpful analytics libraries installed
      # It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
      # For example, here's several helpful packages to load

      import numpy as np # linear algebra
      import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

      # Input data files are available in the read-only "../input/" directory
      # For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

      import os
      for dirname, _, filenames in os.walk('/kaggle/input'):
          for filename in filenames:
              print(os.path.join(dirname, filename))

      # You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All"
      # You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

  The above code loads the NumPy and pandas libraries and then lists the files under /kaggle/input, the directory from which data is imported.
- I will print Hello Kaggle! in the Notebook. Place the cursor in any code cell and press the + Code button.
- Then complete the following:
- Press the play button at the top left of the cell, or press Ctrl + Enter or Shift + Enter, to execute the code. The output will be like this:
- These are the functions of the buttons that can be seen in the cell.
  - Moves the cell up one position.
  - Moves the cell down one position.
  - Deletes the cell.
  - Hides or shows the cell.
  - Provides additional features.
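The contents of the Hello Kaggle! cell are not preserved in this text. A minimal sketch of what such a cell might contain, assuming a Python Notebook (the variable name `message` is only for illustration):

```python
# A single code cell: build the greeting, then print it.
message = "Hello Kaggle!"
print(message)
```

Running the cell with the play button or Shift + Enter prints the greeting directly below the cell.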
Various settings for Notebook
- Set Public & Private
  - A Notebook can be released for sharing with other Kagglers. If you don't want to share it, or when you work as a team, you can set it to Private or share it with specific users.
  - Press the Share button in the upper right corner to open a window for the public or private setting.
  - If Privacy is set to Public, the Notebook is released under the Apache 2.0 License.
  - Use Collaborators to add users as collaborators.
- Settings
  - Language: Sets the programming language to use, Python or R.
  - Environment: Sets the Docker image. Original keeps the development environment from when the Notebook was created, and Latest Available uses the latest development environment provided by Kaggle.
  - Accelerator: Sets whether to use a GPU or TPU.
  - GPU/TPU Quota: Shows the time and usage of the GPU and TPU.
  - Internet: Sets whether or not to connect to the Internet. You can install certain packages by setting Internet to On.
- With a Google account, you can also use the BigQuery, Cloud Storage, and AutoML services from GCP (Google Cloud Platform).
How to import Data from Notebook
- Kaggle Notebook can use not only Competition data but also a variety of shared Datasets. In this case, the files must be added to the Notebook separately.
- i. How to create a new Notebook
  - Go to the Dataset you want to use and press New Notebook to set up the files automatically.
- ii. How to add to an existing Notebook
  - To add new data to your existing Notebook, first open the Notebook. Then click the + Add Data button in the upper right corner. A window appears where you can search for the desired Dataset. Choose it and press Add.
- iii. How to upload yourself
  - Go into the Data menu and click the + New Data button in the upper right corner. Enter a name in Enter Dataset Title and click Select Files to Upload to upload the files. (Compressed file types such as zip or tar.gz are also possible.) Finally, press Create to upload the Dataset. You can then import the uploaded Dataset using method i or ii.
- iv. How to use output data from another Notebook
  - If you follow method ii, a window will appear where you can click the Kernel Output Files tab to use the output data from another Notebook.
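Whichever method is used, the added files show up under the input directory of the Notebook. The sketch below lists them so that each path can be passed to a loader. The helper name `list_input_files` and the `.csv` filter are assumptions for illustration. The `/kaggle/input` path is the one used by the default starter code.

```python
import os


def list_input_files(root="/kaggle/input", extension=".csv"):
    """Return sorted paths of files under `root` that end with `extension`."""
    matches = []
    for dirname, _, filenames in os.walk(root):
        for filename in filenames:
            if filename.endswith(extension):
                matches.append(os.path.join(dirname, filename))
    return sorted(matches)


# In a Kaggle Notebook the attached data appears under /kaggle/input.
# Outside Kaggle that directory does not exist, so the list is simply empty.
for path in list_input_files():
    print(path)  # each path can then be loaded, e.g. with pd.read_csv(path)
```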
Use external packages in Notebook
- External packages that are available through pip can be installed by clicking Console at the bottom of the Notebook and running pip install package_name.
- You can also use pip directly in a code cell, as shown in these two examples:

      !pip install package_name

      import os
      os.system('pip install package_name')
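As a variation on the second example, pip can also be started through `subprocess` with the interpreter that is running the Notebook, which makes a failed install raise an error instead of passing silently. This is only a sketch: the helper names are hypothetical, `package_name` is still a placeholder, and the install itself assumes Internet is set to On.

```python
import subprocess
import sys


def pip_install_command(package_name):
    """Build a pip command that targets the interpreter running the Notebook."""
    return [sys.executable, "-m", "pip", "install", package_name]


def pip_install(package_name):
    """Run the install and raise CalledProcessError if pip fails."""
    subprocess.run(pip_install_command(package_name), check=True)


# `package_name` is a placeholder, as in the examples above.
print(pip_install_command("package_name"))
# pip_install("package_name")  # uncomment in a Notebook with Internet turned on
```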
Use Source Code from Dataset in Notebook
- If you add an example dataset that contains the package hello_kaggle to the Notebook, you can add the ../input/example-dataset/hello_kaggle directory to the import path. The code to add is as follows:

      import sys
      sys.path.append("../input/example-dataset/hello_kaggle")
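A slightly more defensive sketch of the same idea is shown below. The helper `add_to_import_path` is hypothetical, and the dataset path is the example path from the text, so it only exists if such a dataset is attached.

```python
import os
import sys


def add_to_import_path(directory):
    """Append `directory` to sys.path once, and report whether it exists."""
    if directory not in sys.path:
        sys.path.append(directory)
    return os.path.isdir(directory)


# Appending the package directory itself exposes the modules inside it.
found = add_to_import_path("../input/example-dataset/hello_kaggle")
# Appending the parent directory instead would allow `import hello_kaggle`,
# assuming hello_kaggle is an importable package.
print("directory exists:", found)
```

Which directory to append depends on how you want to import the code: the package directory for its inner modules, or its parent for the package as a whole.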
Competitions and Notebooks
What else can the Notebook be used for besides data analysis Competition?
- In general, if the goal is to win a prize, a Notebook is shared (Public) after the Competition is finished.
However, there is also room to discuss with other Kagglers even while the Competition is in progress.
How to handle Data File to use in Competition Notebook?
- When working on a Competition, the Data tab is located in the upper right corner of the Notebook. There are three types of files you can click on:
  - train.csv: Training data with the correct answer labels.
  - test.csv: Test data without the correct answer labels.
  - Sample_submission.csv: An example of the data format for submission.
- View the Data menu in the Competition to see what data each file contains. For example, let's look at Titanic - Machine Learning from Disaster. Click the Data menu to read the Overview. Further down, you can select each file to view its data and download it.
- Let's use these files to create a model and a csv file for submission. (The same is explained in 4. Participate in the Competition.)
  - Click Save Version in the upper right corner of the Notebook screen. (If the code has not been executed, choose Save & Run All (Commit).)
  - In Save & Run All (Commit), Commit has the same meaning as a Git commit on GitHub, where I am writing this guide. Kaggle Notebook can therefore refer back to previously written versions of the source code.
- Now return to your profile and click Notebook to see the notebook you just saved. When you click on this notebook, Output appears in the right menu. Select Submission.csv, which you can view under the Output menu, and click Submit to Competition on the right.
- The screen then moves to the Leaderboard menu and the submitted file is scored automatically. After scoring, you can check your score and click Jump to your position on the leaderboard to see your ranking.
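To make the submission file concrete, here is a minimal sketch of writing one for the Titanic Competition. The column names `PassengerId`, `Survived`, and `Sex` follow that Competition's data files. The two rows are made up, and the rule in `predict_by_sex` is a toy stand-in, not a trained model.

```python
import csv
import io


def write_submission(rows, out_file):
    """Write (PassengerId, Survived) pairs in the sample submission layout."""
    writer = csv.writer(out_file, lineterminator="\n")
    writer.writerow(["PassengerId", "Survived"])
    writer.writerows(rows)


def predict_by_sex(test_rows):
    """Toy rule, not a trained model: predict survival for female passengers."""
    return [(row["PassengerId"], 1 if row["Sex"] == "female" else 0)
            for row in test_rows]


# Two made-up rows standing in for test.csv (read it with csv.DictReader).
test_rows = [{"PassengerId": "892", "Sex": "male"},
             {"PassengerId": "893", "Sex": "female"}]

buffer = io.StringIO()  # in a Notebook, open("submission.csv", "w", newline="")
write_submission(predict_by_sex(test_rows), buffer)
print(buffer.getvalue())
```

In a Notebook, a file written to the working directory appears under Output after Save & Run All, which is where the Submit to Competition step picks it up.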
Competitions Progress Flow
- The types and order presented here are the personal opinion of Toshiyuki Sakamoto, author of Kaggle Guide.
Baseline implementing the general-purpose algorithm
- First, when you start analyzing the data, produce output with a general-purpose algorithm.
- Then develop machine learning models in earnest and compare their output with the results from the general-purpose algorithm.
- If your model performs worse than the general-purpose algorithm, you can assume that the model has a problem.
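The comparison described above can be sketched with a very simple baseline. A majority-class predictor is used here as a stand-in for the general-purpose algorithm, which is an assumption for illustration. All labels and predictions are made up.

```python
from collections import Counter


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def majority_baseline(train_labels, n_predictions):
    """Always predict the most common training label."""
    most_common_label = Counter(train_labels).most_common(1)[0][0]
    return [most_common_label] * n_predictions


# Made-up labels and model predictions, for illustration only.
train_labels = [0, 0, 0, 1, 1]
valid_labels = [0, 1, 0, 0]
model_predictions = [0, 1, 0, 0]

baseline_predictions = majority_baseline(train_labels, len(valid_labels))
baseline_score = accuracy(baseline_predictions, valid_labels)
model_score = accuracy(model_predictions, valid_labels)
print(baseline_score, model_score)
if model_score < baseline_score:
    print("The model is worse than the baseline; check it for problems.")
```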
Data Analysis Notebook
- This refers to a Notebook that analyzes Competition data and shows visualizations.
- The focus is on identifying correlations, rules, and structure in the analyzed data, without creating data to submit. It also looks for independent variables that fit well with the dependent variable.
- If you have little Competition experience, a good start is to build knowledge and insight by looking at data analyzed by other Kagglers.
Fork Notebook
- For those who are new to machine learning and Kaggle, one way to start is to fork a public notebook instead of doing the data analysis or model development yourself.
- Fork means to copy a version of the source code.
- Press the copy button at the top right of the Notebook you'd like to fork.
Merge, Blending, Stacking, Ensemble Notebook
- A Notebook with words such as Merge, Blending, Stacking, or Ensemble in its title.
- As the name suggests, it is a Notebook that combines several Notebooks.
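One simple way such Notebooks combine results is blending, that is, averaging the predictions of several models row by row. The sketch below shows a weighted average. The function name `blend` and the numbers are made up, and stacking or other ensembles work differently.

```python
def blend(prediction_sets, weights=None):
    """Weighted average of several models' predictions, row by row."""
    if weights is None:
        weights = [1.0] * len(prediction_sets)
    total = sum(weights)
    blended = []
    for row in zip(*prediction_sets):
        blended.append(sum(w * p for w, p in zip(weights, row)) / total)
    return blended


# Made-up probabilities from two hypothetical Notebooks.
notebook_a = [0.2, 0.8, 0.6]
notebook_b = [0.4, 0.6, 1.0]
print(blend([notebook_a, notebook_b]))
```

Passing `weights` lets a stronger Notebook count for more, for example `blend([notebook_a, notebook_b], weights=[1, 3])`.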
Conclusion of Competitions Progress Flow

- When a Competition proceeds in this order, I think it is better to study a variety of Notebooks to understand the process rather than only looking at the winner's notebook.
- Also, a Competition is literally a competition, so a Notebook that is shared (public) is one whose author judged that sharing would not seriously affect their score. In fact, if you look at winners' Notebooks, you can often see that they used the latest technology or a different solution from the shared notebooks.
Rule of Competitions
Competitions in Kaggle sometimes have specific rules. This is because Competitions are usually hosted by a company or organization, and special rules are often created to achieve the results that the company or organization wants.
What rules should I check?
- Rules: To win the Competition, you must first know its rules. Check the Rules menu of each Competition.
- Evaluation: On the Evaluation page of Overview, look at the evaluation function to see which evaluation method is applied. Usually, statistics-based functions are used.
- Per-person score check limit: If you could check the score freely by submitting a result file each time you change the data slightly, the competition would not produce meaningful results, so there is usually a limit on the number of submissions that are scored.
- Notebook Only Competition: Results are submitted using Kaggle Notebook only. When only Kaggle Notebook is used, Kagglers are more likely to share Notebooks, and all participants can easily find good ideas by viewing shared Notebooks. Also, all participants have the same computing resources, which helps address inequality between those who have personal workstations and those who do not.
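As an illustration of what an evaluation function looks like, the sketch below implements two common ones, accuracy for classification and RMSE for regression. They are examples only. The metric that a particular Competition actually uses is stated on its Evaluation page.

```python
import math


def accuracy(predictions, labels):
    """Fraction of predictions that match the labels (classification)."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


def rmse(predictions, targets):
    """Root mean squared error (regression); lower is better."""
    squared = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return math.sqrt(sum(squared) / len(targets))


# Made-up predictions and answers, for illustration only.
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
print(rmse([3.0, 5.0], [1.0, 5.0]))
```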
Flow of Technology in Kaggle
Exploring in Closed Competition
- One characteristic of Kaggle is that it keeps the discussions and notebooks of Competitions that ended a long time ago. If you look at these, you can see which technologies were applied where and in what ways.
- Example

| Competition | Used Technology | Description |
| --- | --- | --- |
| Mercari Price Suggestion Challenge (2018.2) | TF-IDF Vector + Fully Connected Neural Network | Learn the frequency of each word with neural networks |
| Toxic Comment Classification Challenge (2018.3) | FastText, GloVe + GRU + LightGBM | A combination of word vector dictionaries learned from time series data |
| Avito Demand Prediction Challenge (2018.6) | FastText + LSTM + 2D-CNN | Learn data and images of sentences simultaneously with neural networks |
| Quora Insincere Questions Classification (2019.1) | GloVe, para + OOV Token + LSTM + 1D-CNN | Learn vocabularies through OOV token |
| Jigsaw Unintended Bias in Toxicity Classification (2019.6) | BERT + XLNet + GPT2 | The BERT model appeared on Kaggle |
Winner Solutions at a Glance
- Data-Science-Competitions is a GitHub repository that presents Competition-winning solutions topic by topic. (When I checked, its last commit was 11 months ago.)
- A winning solution is based on the technology of its time, so we need to check whether better technology is available today.
- Most Competitions continue to publish solutions that use the latest technology on their Private Leaderboard page after they end.
Kaggle Dataset and API
Use public Dataset
- When studying common algorithms, it is recommended to test performance with a widely known Dataset. The UCI Machine Learning Repository is a famous source and is also used in many academic papers.
Use it as a Data Repository
- As with GitHub, you can use Kaggle as a convenient place to store Datasets and Notebooks. (Free!)
- It also has the advantage that a Dataset can be connected directly to a Notebook.
- There is a capacity limit of up to 20GB per public Dataset and up to 20GB in total for all private Datasets.
Kaggle API
- Kaggle API is an API that lets you use various Kaggle functions from various development environments.
- It is written in Python 3 and is used by entering commands in a terminal.
Install Kaggle API
- You must install Python and pip before starting.
1. First, install Kaggle API using pip install kaggle.
2. Then go to your profile and press Accounts.
3. Click Create New API Token here to download the json file.
4. Save the downloaded json file in your home directory as .kaggle/kaggle.json. Now you are ready to use Kaggle API.
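Before running any command, it can help to confirm that the token file is where the API expects it. The sketch below reads `.kaggle/kaggle.json` from the home directory and checks for the `username` and `key` fields. Those field names reflect the usual layout of the token file and are stated here as an assumption. The helper `load_kaggle_token` is hypothetical.

```python
import json
import os


def load_kaggle_token(path=None):
    """Read the API token file and return its contents as a dict."""
    if path is None:
        path = os.path.join(os.path.expanduser("~"), ".kaggle", "kaggle.json")
    with open(path, encoding="utf-8") as token_file:
        token = json.load(token_file)
    missing = {"username", "key"} - token.keys()
    if missing:
        raise ValueError(f"kaggle.json is missing fields: {sorted(missing)}")
    return token


try:
    token = load_kaggle_token()
    print("Token found for user:", token["username"])
except (OSError, ValueError) as error:
    # No file yet, unreadable file, or unexpected contents.
    print("Token not usable:", error)
```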
Use Kaggle API
- You can open a terminal on your PC and run commands.
- Run the kaggle competitions list command to see which Competitions are currently in progress.
- To view a Competition's files, run kaggle competitions files COMPETITION_NAME. To download them, run kaggle competitions download COMPETITION_NAME.
- To learn more about the Kaggle API, please visit Kaggle Public API Documentation.
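The same commands can also be started from Python, for example inside a script that prepares data. This is a sketch under assumptions: the helpers `kaggle_command` and `run_kaggle` are hypothetical wrappers around the command line shown above, and nothing is executed unless the `kaggle` command is installed.

```python
import shutil
import subprocess


def kaggle_command(*args):
    """Build a `kaggle` command line as a list of arguments."""
    return ["kaggle", *args]


def run_kaggle(*args):
    """Run the command if the CLI is installed; return its output or None."""
    if shutil.which("kaggle") is None:
        return None  # Kaggle API is not installed in this environment
    result = subprocess.run(kaggle_command(*args), capture_output=True, text=True)
    return result.stdout


# COMPETITION_NAME is a placeholder, as in the commands above.
print(kaggle_command("competitions", "files", "COMPETITION_NAME"))
# output = run_kaggle("competitions", "list")  # needs kaggle.json to be in place
```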
Finished!
First of all, thank you for reading Hello Kaggle!
I studied Python for the first time in April 2020 and was unable to concentrate fully on my studies because I started military service in July of the same year.
That's why I couldn't study data science in depth, and I still need more knowledge to understand it.
Now I'm finally stepping into machine learning and Kaggle.
By writing Hello Kaggle!, I've improved my understanding of Kaggle, and I'm going to start with a Getting Started Competition.
I'm also eager to keep up with the latest technology by looking at other outstanding Kagglers' Notebooks.
Hopefully, everyone who reads Hello Kaggle! will have a great 2021. Let's Keep Going!
