Kannan Balakrishnan Teaches Data Science: The Titanic Challenge Step by Step in Python: Part I, Introduction

One of the basic data science learning challenges that interests a budding data scientist is the Titanic challenge on Kaggle.

https://www.kaggle.com/c/titanic

Two files are given as data:

1. train.csv

The tragedy of the Titanic is well known. This challenge provides training data with the following fields.

Survived: 0 means the passenger did not survive, 1 means the passenger survived
Pclass: Ticket class (1 = 1st, 2 = 2nd, 3 = 3rd)
Sex: Sex of the passenger
Age: Age in years
SibSp: Number of siblings / spouses aboard the Titanic
Parch: Number of parents / children aboard the Titanic
Ticket: Ticket number
Fare: Passenger fare
Cabin: Cabin number
Embarked: Port of embarkation (C = Cherbourg, Q = Queenstown, S = Southampton)

2. test.csv

test.csv also has data about passengers. All the above fields are present except Survived.

The challenge is to use machine learning to predict the survival of the passengers in test.csv.

Finally, we have to submit a file to Kaggle with only two fields, PassengerId and Survived. That is, we have to predict the survival of every passenger in test.csv and submit the predictions to Kaggle as a CSV file.
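
The submission file described above could be assembled roughly as follows. This is only a minimal sketch: the tiny `test` frame with made-up ids stands in for the real test.csv, the all-zero predictions are a placeholder rather than a model, and the file name `submission.csv` is an assumption.

```python
import pandas as pd

# Made-up stand-in for the real test.csv (only PassengerId matters here).
test = pd.DataFrame({"PassengerId": [892, 893, 894]})

# Placeholder predictions: every passenger marked as not surviving.
# A real solution would replace this with the output of a trained model.
predictions = [0] * len(test)

submission = pd.DataFrame({
    "PassengerId": test["PassengerId"],
    "Survived": predictions,
})

# index=False keeps the pandas row index out of the file,
# so the CSV contains exactly the two required columns.
submission.to_csv("submission.csv", index=False)
```

Writing with `index=False` matters here: without it pandas adds an extra unnamed index column, and the file would no longer have only two fields.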

We will be using a Kaggle kernel to solve this task.

The kernel provides an environment like a Jupyter notebook, and you can work from anywhere because your code and data are in the cloud.

Next, we will see how to load data and explore it.

The libraries needed are numpy and pandas. Import them if you have not already done so.

import numpy as np
import pandas as pd

Now let us read the CSV files into data frames.

train = pd.read_csv("../input/train.csv")
train.head()

This gives the following output:

   PassengerId  Survived  Pclass  ...     Fare  Cabin  Embarked
0            1         0       3  ...   7.2500    NaN         S
1            2         1       1  ...  71.2833    C85         C
2            3         1       3  ...   7.9250    NaN         S
3            4         1       1  ...  53.1000   C123         S
4            5         0       3  ...   8.0500    NaN         S

[5 rows x 12 columns]

This shows us the first five rows of the dataset. It also tells us that there are 12 columns in the dataset.
To learn more about this data, we try the following command:

train.info()

and get the following information as output:

RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId 891 non-null int64
Survived 891 non-null int64
Pclass 891 non-null int64
Name 891 non-null object
Sex 891 non-null object
Age 714 non-null float64
SibSp 891 non-null int64
Parch 891 non-null int64
Ticket 891 non-null object
Fare 891 non-null float64
Cabin 204 non-null object
Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

First, there are 891 rows. Age, Cabin and Embarked have null values; all the other columns have no null values.
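
The missing-value counts can also be computed directly instead of being read off the info() listing. A minimal sketch, using a small made-up frame (called `sample` here, a hypothetical name) in place of the real train data:

```python
import numpy as np
import pandas as pd

# Small made-up sample standing in for the real train data frame.
sample = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [np.nan, "C85", np.nan, "C123"],
    "Embarked": ["S", "C", "S", np.nan],
    "Fare": [7.25, 71.2833, 7.925, 53.1],
})

# isnull() marks each missing cell as True; sum() counts them per column.
missing = sample.isnull().sum()
print(missing)

# Keep only the columns that have at least one missing value.
print(missing[missing > 0])
```

On the real data, `train.isnull().sum()` should agree with the listing above: 891 - 714 = 177 missing ages, 891 - 204 = 687 missing cabins, and 891 - 889 = 2 missing ports of embarkation.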

Similarly, we can do the same for test.csv.
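
For test.csv the steps might look like the sketch below. The path "../input/test.csv" is an assumption that mirrors the train.csv path used earlier, and the helper name `load_and_inspect` is hypothetical. A small made-up in-memory CSV stands in for the file so that the example runs anywhere.

```python
import io
import pandas as pd

def load_and_inspect(source):
    """Read a CSV (path or file-like object) and print a quick overview."""
    df = pd.read_csv(source)
    print(df.head())  # first few rows
    df.info()         # column types and non-null counts
    return df

# Made-up two-row stand-in for test.csv; note there is no Survived column.
fake_test_csv = io.StringIO(
    "PassengerId,Pclass,Sex,Age\n"
    "892,3,male,34.5\n"
    "893,3,female,\n"
)
test = load_and_inspect(fake_test_csv)

# In a Kaggle kernel this would presumably be:
# test = load_and_inspect("../input/test.csv")
```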
To be continued...


Thursday, 13 September 2018
