Titanic Data Science Solutions

The notebook walks us through a typical workflow for solving data science competitions at sites like Kaggle.

Question

  • On April 15, 1912, during her maiden voyage, the Titanic sank after colliding with an iceberg, killing 1,502 of the 2,224 passengers and crew, which translates to a survival rate of about 32%.
  • One of the reasons that the shipwreck led to such loss of life was that there were not enough lifeboats for the passengers and crew.
  • Although there was some element of luck involved in surviving the sinking, some groups of people were more likely to survive than others, such as women, children, and the upper-class.

Steps

  • Download the dataset from the Kaggle website and copy it to input/
  • Install the Python requirements: pip install -r requirements.txt
  • Run the notebook
#importing libraries
import pandas as pd
import numpy as np
import random as rnd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
#**********************************************************
#Acquiring dataset
train_data=pd.read_csv('../input/train.csv')
test_data=pd.read_csv('../input/test.csv')
together=[train_data, test_data]
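The `together` list keeps references to both frames so that later cleaning steps can be applied to train and test in a single loop. A minimal sketch of the pattern (toy frames, not the Kaggle files):

```python
import pandas as pd

# toy stand-ins for train_data and test_data
train = pd.DataFrame({"Embarked": ["S", None]})
test = pd.DataFrame({"Embarked": [None, "C"]})
together = [train, test]

# the list holds references, so assigning a column inside the loop updates both originals
for dataset in together:
    dataset["Embarked"] = dataset["Embarked"].fillna("S")

print(train["Embarked"].tolist(), test["Embarked"].tolist())  # ['S', 'S'] ['S', 'C']
```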
#**************************************************************
#describing data
print(train_data.columns.values)
train_data.head()
train_data.drop(columns="Survived")  # preview of the features only; drop returns a copy, train_data itself is unchanged
#****************************************************************
#Checking for missing data
print(train_data.isnull().sum())
print('_'*40)
print(test_data.isnull().sum())
#************************************************************
['PassengerId' 'Survived' 'Pclass' 'Name' 'Sex' 'Age' 'SibSp' 'Parch'
 'Ticket' 'Fare' 'Cabin' 'Embarked']
PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
________________________________________
PassengerId      0
Pclass           0
Name             0
Sex              0
Age             86
SibSp            0
Parch            0
Ticket           0
Fare             1
Cabin          327
Embarked         0
dtype: int64
train_data.head(30)
#we have categorical and numerical features in this dataset
#Categorical: Survived, Sex, Embarked. Ordinal: Pclass
#Continuous: Age, Fare. Discrete: SibSp, Parch

#Ticket feature is an alphanumeric feature

PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
5 6 0 3 Moran, Mr. James male NaN 0 0 330877 8.4583 NaN Q
6 7 0 1 McCarthy, Mr. Timothy J male 54.0 0 0 17463 51.8625 E46 S
7 8 0 3 Palsson, Master. Gosta Leonard male 2.0 3 1 349909 21.0750 NaN S
8 9 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) female 27.0 0 2 347742 11.1333 NaN S
9 10 1 2 Nasser, Mrs. Nicholas (Adele Achem) female 14.0 1 0 237736 30.0708 NaN C
10 11 1 3 Sandstrom, Miss. Marguerite Rut female 4.0 1 1 PP 9549 16.7000 G6 S
11 12 1 1 Bonnell, Miss. Elizabeth female 58.0 0 0 113783 26.5500 C103 S
12 13 0 3 Saundercock, Mr. William Henry male 20.0 0 0 A/5. 2151 8.0500 NaN S
13 14 0 3 Andersson, Mr. Anders Johan male 39.0 1 5 347082 31.2750 NaN S
14 15 0 3 Vestrom, Miss. Hulda Amanda Adolfina female 14.0 0 0 350406 7.8542 NaN S
15 16 1 2 Hewlett, Mrs. (Mary D Kingcome) female 55.0 0 0 248706 16.0000 NaN S
16 17 0 3 Rice, Master. Eugene male 2.0 4 1 382652 29.1250 NaN Q
17 18 1 2 Williams, Mr. Charles Eugene male NaN 0 0 244373 13.0000 NaN S
18 19 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... female 31.0 1 0 345763 18.0000 NaN S
19 20 1 3 Masselmani, Mrs. Fatima female NaN 0 0 2649 7.2250 NaN C
20 21 0 2 Fynney, Mr. Joseph J male 35.0 0 0 239865 26.0000 NaN S
21 22 1 2 Beesley, Mr. Lawrence male 34.0 0 0 248698 13.0000 D56 S
22 23 1 3 McGowan, Miss. Anna "Annie" female 15.0 0 0 330923 8.0292 NaN Q
23 24 1 1 Sloper, Mr. William Thompson male 28.0 0 0 113788 35.5000 A6 S
24 25 0 3 Palsson, Miss. Torborg Danira female 8.0 3 1 349909 21.0750 NaN S
25 26 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... female 38.0 1 5 347077 31.3875 NaN S
26 27 0 3 Emir, Mr. Farred Chehab male NaN 0 0 2631 7.2250 NaN C
27 28 0 1 Fortune, Mr. Charles Alexander male 19.0 3 2 19950 263.0000 C23 C25 C27 S
28 29 1 3 O'Dwyer, Miss. Ellen "Nellie" female NaN 0 0 330959 7.8792 NaN Q
29 30 0 3 Todoroff, Mr. Lalio male NaN 0 0 349216 7.8958 NaN S
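The categorical/numerical split noted in the comments above can also be derived programmatically with `select_dtypes`; a small sketch on a toy frame with the same dtype mix:

```python
import pandas as pd

# toy frame standing in for train_data (int, float, and string columns)
df = pd.DataFrame({
    "Survived": [0, 1],             # int
    "Age": [22.0, 38.0],            # float
    "Name": ["Braund", "Cumings"],  # object (string)
})
numerical = df.select_dtypes(include="number").columns.tolist()
categorical = df.select_dtypes(include="object").columns.tolist()
print(numerical)    # ['Survived', 'Age']
print(categorical)  # ['Name']
```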
train_data.info()
print('__'*50)
test_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
____________________________________________________________________________________________________
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 418 entries, 0 to 417
Data columns (total 11 columns):
PassengerId    418 non-null int64
Pclass         418 non-null int64
Name           418 non-null object
Sex            418 non-null object
Age            332 non-null float64
SibSp          418 non-null int64
Parch          418 non-null int64
Ticket         418 non-null object
Fare           417 non-null float64
Cabin          91 non-null object
Embarked       418 non-null object
dtypes: float64(2), int64(4), object(5)
memory usage: 36.0+ KB
train_data.describe()

#the total number of passengers is 891.
print("total number of passengers is: " + str(train_data["PassengerId"].count()))

ax=train_data.hist(column= "Age", bins=10)
plt.title("Histogram of Age")
plt.xlabel("Age")
plt.ylabel("number of people")
#Percentage of Survived persons

print("Percentage of survived passengers out of all the passengers is: "+ str((train_data["Survived"]== 1).sum()*100/train_data["PassengerId"].count()))
print("Percentage of passengers who travel without siblings or spouse :"+ str((train_data["SibSp"]== 0).sum()*100/train_data["PassengerId"].count()))
print("Percentage of passengers who travel with one sibling or spouse :"+ str((train_data["SibSp"]== 1).sum()*100/train_data["PassengerId"].count()))      
print("Percentage of passengers who paid at most $222: " + str((train_data["Fare"] <= 222).sum()*100/train_data["PassengerId"].count())) 
print(" Percentage of passengers who were 40 or more than 40 years old: "+ str((train_data["Age"] >= 40).sum()*100/train_data["PassengerId"].count())) 
print("passengers over 70 years old, as a percentage of survivors: "+str((train_data["Age"] > 70).sum()*100/(train_data["Survived"]==1).sum()))

print("-------survival rate by passenger class-------")
print("Pclass=1: "+str(round(((train_data["Pclass"]==1) & (train_data["Survived"]==1)).sum()*100/(train_data["Pclass"]==1).sum(), 2)))
print("Pclass=2: "+str(round(((train_data["Pclass"]==2) & (train_data["Survived"]==1)).sum()*100/(train_data["Pclass"]==2).sum(), 2)))
print("Pclass=3: "+str(round(((train_data["Pclass"]==3) & (train_data["Survived"]==1)).sum()*100/(train_data["Pclass"]==3).sum(), 2)))


ax7=train_data.hist(column= "Fare", bins=10, weights=np.ones_like(train_data["Fare"]) * 100. / len(train_data["Fare"]))
# train_data["Fare"].plot(kind='hist', density=1, bins=20, stacked=False, alpha=.5)
# ax2=train_data["Fare"].hist(bins=20, weights=np.ones_like(train_data["Fare"]) * 100. / len(train_data["Fare"]))
plt.title("Histogram of Fare")
plt.xlabel("Fare")
plt.ylabel("Percentage of passengers")

print("________________passenger breakdown and survival rate by sex_________________________")
print("the percentage of male passengers is: "+str((train_data["Sex"]== "male").sum()*100/(train_data["PassengerId"]).count()))
print("the percentage of female passengers is: "+ str((train_data["Sex"]== "female").sum()*100/(train_data["PassengerId"]).count()))
print("the percentage of male passengers who survived is: "+str(round(((train_data["Sex"]=="male") & (train_data["Survived"]==1)).sum()*100/(train_data["Sex"]=="male").sum(), 2)))
print("the percentage of female passengers who survived is: "+str(round(((train_data["Sex"]=="female") & (train_data["Survived"]==1)).sum()*100/(train_data["Sex"]=="female").sum(), 2)))


total number of passengers is: 891
Percentage of survived passengers out of all the passengers is: 38.38383838383838
Percentage of passengers who travel without siblings or spouse :68.23793490460157
Percentage of passengers who travel with one sibling or spouse :23.45679012345679
Percentage of passengers who paid at most $222: 98.31649831649831
 Percentage of passengers who were 40 or more than 40 years old: 18.29405162738496
passengers over 70 years old, as a percentage of survivors: 1.4619883040935673
-------survival rate by passenger class-------
Pclass=1: 62.96
Pclass=2: 47.28
Pclass=3: 24.24
________________passenger breakdown and survival rate by sex_________________________
the percentage of male passengers is: 64.75869809203142
the percentage of female passengers is: 35.24130190796858
the percentage of male passengers who survived is: 18.89
the percentage of female passengers who survived is: 74.2

[figure: histogram of Age]

[figure: histogram of Fare]
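The per-class and per-sex percentages above are computed with manual boolean arithmetic, which is easy to get wrong; `groupby(...).mean()` on the 0/1 Survived column gives survival rates directly. A sketch on synthetic rows (not the real dataset):

```python
import pandas as pd

# synthetic stand-in rows for train_data
df = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3],
    "Survived": [1, 0, 1, 1, 0, 0],
})
# the mean of a 0/1 column within each group is that group's survival rate
rates = df.groupby("Pclass")["Survived"].mean()
print(rates.to_dict())  # {1: 0.5, 2: 1.0, 3: 0.0}
```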

train_data.describe(include=['O'])

Name Sex Ticket Cabin Embarked
count 891 891 891 204 889
unique 891 2 681 147 3
top Bourke, Mrs. John (Catherine) male 347082 G6 S
freq 1 577 7 4 644

print(type(train_data.groupby('Sex').sum()))
table1=train_data.groupby("Sex").sum()
print(table1)
table2=table1.drop("PassengerId", axis=1)
table2.plot(kind='bar')  # table1 is already grouped by Sex, so no second groupby is needed

<class 'pandas.core.frame.DataFrame'>
        PassengerId  Survived  Pclass     ...      SibSp  Parch        Fare
Sex                                       ...                              
female       135343       233     678     ...        218    204  13966.6628
male         262043       109    1379     ...        248    136  14727.2865

[2 rows x 7 columns]





<matplotlib.axes._subplots.AxesSubplot at 0x7f01cf1a7240>

[figure: bar chart of column sums by Sex]

train_data[["Pclass", "PassengerId"]].groupby(["Pclass"], as_index=False).count()

Pclass PassengerId
0 1 216
1 2 184
2 3 491
df=train_data[["Sex", "Survived"]].groupby(['Sex'], as_index=False).sum() 
plt.bar(x=df["Sex"], height=df["Survived"])  # plt.bar returns a BarContainer, not (fig, ax)
# print(df)
print(train_data["Sex"].value_counts())

male      577
female    314
Name: Sex, dtype: int64

[figure: bar chart of survivor counts by Sex]

da=train_data[["SibSp", "Survived"]].groupby(['SibSp'], as_index=False).mean().sort_values(by="Survived", ascending=False)
print(da)

   SibSp  Survived
1      1  0.535885
2      2  0.464286
0      0  0.345395
3      3  0.250000
4      4  0.166667
5      5  0.000000
6      8  0.000000
train_data[["Parch", "Survived"]].groupby(['Parch'], as_index=False).mean().sort_values(by='Survived', ascending=False)


Parch Survived
3 3 0.600000
1 1 0.550847
2 2 0.500000
0 0 0.343658
5 5 0.200000
4 4 0.000000
6 6 0.000000

# passengers aged 20 or younger, relative to the number of deaths and of survivals
print((train_data["Age"] <= 20).sum()/(train_data["Survived"]==0).sum())
print((train_data["Age"] <= 20).sum()/(train_data["Survived"]==1).sum())

g=sns.FacetGrid(train_data, col= "Survived")
g.map(plt.hist,"Age", bins=5)
0.32604735883424407
0.5233918128654971





<seaborn.axisgrid.FacetGrid at 0x7f01cfd6f128>

[figure: Age histograms faceted by Survived]



g1=sns.FacetGrid(train_data, col="Survived", row="Pclass")
g1.map(plt.hist, "Age", alpha=0.5, bins=20)
g1.add_legend();

[figure: Age histograms faceted by Pclass and Survived]


grid = sns.FacetGrid(train_data, row='Embarked', height=2.2, aspect=1.6)
grid.map(sns.pointplot, 'Pclass', 'Survived', 'Sex', palette='deep', order=[1, 2, 3], hue_order=['male', 'female'])
grid.add_legend()





<seaborn.axisgrid.FacetGrid at 0x7f01cf0851d0>

[figure: Pclass vs Survived pointplots by Embarked and Sex]

#correcting
#Ticket, Cabin and PassengerId can be dropped: Ticket contains many duplicates, Cabin is mostly missing, and PassengerId does not contribute to the survival rate.

print("Before", train_data.shape, test_data.shape, together[0].shape, together[1].shape)
#now we drop 
train_datan = train_data.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1)
test_datan = test_data.drop(['Ticket', 'Cabin', 'PassengerId'], axis=1)
togethern = [train_datan, test_datan]

"After", train_datan.shape, test_datan.shape, togethern[0].shape, togethern[1].shape
Before (891, 12) (418, 11) (891, 12) (418, 11)





('After', (891, 9), (418, 8), (891, 9), (418, 8))

Creating new feature extracting from existing

We want to analyze whether the Name feature can be engineered to extract titles, and to test the correlation between titles and survival, before dropping the Name and PassengerId features.

In the following code we extract a Title feature using regular expressions. The RegEx pattern (\w+\.) matches the first word which ends with a dot character within the Name feature. The expand=False flag returns a Series.
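That extraction step is not shown in this notebook's code cells; a minimal sketch of it, with illustrative sample names and a hypothetical `Title` column:

```python
import pandas as pd

df = pd.DataFrame({"Name": [
    "Braund, Mr. Owen Harris",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
    "Heikkinen, Miss. Laina",
]})
# capture the first word that ends with a dot; expand=False yields a Series
df["Title"] = df["Name"].str.extract(r" (\w+)\.", expand=False)
print(df["Title"].tolist())  # ['Mr', 'Mrs', 'Miss']
```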

Observations.

When we plot Title, Age, and Survived, we note the following observations.

  • Most titles band Age groups accurately. For example: Master title has Age mean of 5 years.
  • Survival among Title Age bands varies slightly.
  • Certain titles mostly survived (Mme, Lady, Sir) or did not (Don, Rev, Jonkheer).

Decision.

  • We decide to retain the new Title feature for model training.

Converting a categorical feature

Now we can convert features which contain strings to numerical values. This is required by most model algorithms. Doing so will also help us in achieving the feature completing goal.

Let us start by converting the Sex feature to numeric values, with female=1 and male=0.
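A sketch of that conversion on a toy frame (the notebook applies the same `map` to Sex in place further below):

```python
import pandas as pd

df = pd.DataFrame({"Sex": ["male", "female", "female"]})
# map strings to integers; astype(int) fails loudly if any value was left unmapped
df["Gender"] = df["Sex"].map({"female": 1, "male": 0}).astype(int)
print(df["Gender"].tolist())  # [0, 1, 1]
```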

train_datan.head()
train_datan["Sex"].isnull().sum()


0

guess_ages=np.zeros((2,3))

# Note: this loop assumes Sex is already encoded as 0/1 (see the mapping cell below);
# with string values, dataset['Sex'] == i would never match.
for dataset in togethern:
    for i in range(0, 2):
        for j in range(0, 3):
            guess_df = dataset[(dataset['Sex'] == i) & \
                                  (dataset['Pclass'] == j+1)]['Age'].dropna()

            age_guess = guess_df.median()

            # Round the guessed (median) age to the nearest .5
            guess_ages[i,j] = int( age_guess/0.5 + 0.5 ) * 0.5
            
    for i in range(0, 2):
        for j in range(0, 3):
            dataset.loc[ (dataset.Age.isnull()) & (dataset.Sex == i) & (dataset.Pclass == j+1),\
                    'Age'] = guess_ages[i,j]

    dataset['Age'] = dataset['Age'].astype(int)
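The `int(age_guess/0.5 + 0.5) * 0.5` expression in the loop rounds a positive value to the nearest half-year; isolated as a helper for clarity:

```python
def round_to_half(x):
    # round a positive number to the nearest 0.5, as in the imputation loop above
    return int(x / 0.5 + 0.5) * 0.5

print(round_to_half(26.7), round_to_half(26.8), round_to_half(28.0))  # 26.5 27.0 28.0
```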


train_datan.head(50)

Survived Pclass Name Sex Age Fare Embarked
0 0 3 Braund, Mr. Owen Harris 0 22 7.2500 0
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 1 38 71.2833 1
2 1 3 Heikkinen, Miss. Laina 1 26 7.9250 0
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 1 35 53.1000 0
4 0 3 Allen, Mr. William Henry 0 35 8.0500 0
5 0 3 Moran, Mr. James 0 25 8.4583 2
6 0 1 McCarthy, Mr. Timothy J 0 54 51.8625 0
7 0 3 Palsson, Master. Gosta Leonard 0 2 21.0750 0
8 1 3 Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg) 1 27 11.1333 0
9 1 2 Nasser, Mrs. Nicholas (Adele Achem) 1 14 30.0708 1
10 1 3 Sandstrom, Miss. Marguerite Rut 1 4 16.7000 0
11 1 1 Bonnell, Miss. Elizabeth 1 58 26.5500 0
12 0 3 Saundercock, Mr. William Henry 0 20 8.0500 0
13 0 3 Andersson, Mr. Anders Johan 0 39 31.2750 0
14 0 3 Vestrom, Miss. Hulda Amanda Adolfina 1 14 7.8542 0
15 1 2 Hewlett, Mrs. (Mary D Kingcome) 1 55 16.0000 0
16 0 3 Rice, Master. Eugene 0 2 29.1250 2
17 1 2 Williams, Mr. Charles Eugene 0 30 13.0000 0
18 0 3 Vander Planke, Mrs. Julius (Emelia Maria Vande... 1 31 18.0000 0
19 1 3 Masselmani, Mrs. Fatima 1 21 7.2250 1
20 0 2 Fynney, Mr. Joseph J 0 35 26.0000 0
21 1 2 Beesley, Mr. Lawrence 0 34 13.0000 0
22 1 3 McGowan, Miss. Anna "Annie" 1 15 8.0292 2
23 1 1 Sloper, Mr. William Thompson 0 28 35.5000 0
24 0 3 Palsson, Miss. Torborg Danira 1 8 21.0750 0
25 1 3 Asplund, Mrs. Carl Oscar (Selma Augusta Emilia... 1 38 31.3875 0
26 0 3 Emir, Mr. Farred Chehab 0 25 7.2250 1
27 0 1 Fortune, Mr. Charles Alexander 0 19 263.0000 0
28 1 3 O'Dwyer, Miss. Ellen "Nellie" 1 21 7.8792 2
29 0 3 Todoroff, Mr. Lalio 0 25 7.8958 0
30 0 1 Uruchurtu, Don. Manuel E 0 40 27.7208 1
31 1 1 Spencer, Mrs. William Augustus (Marie Eugenie) 1 35 146.5208 1
32 1 3 Glynn, Miss. Mary Agatha 1 21 7.7500 2
33 0 2 Wheadon, Mr. Edward H 0 66 10.5000 0
34 0 1 Meyer, Mr. Edgar Joseph 0 28 82.1708 1
35 0 1 Holverson, Mr. Alexander Oskar 0 42 52.0000 0
36 1 3 Mamee, Mr. Hanna 0 25 7.2292 1
37 0 3 Cann, Mr. Ernest Charles 0 21 8.0500 0
38 0 3 Vander Planke, Miss. Augusta Maria 1 18 18.0000 0
39 1 3 Nicola-Yarred, Miss. Jamila 1 14 11.2417 1
40 0 3 Ahlin, Mrs. Johan (Johanna Persdotter Larsson) 1 40 9.4750 0
41 0 2 Turpin, Mrs. William John Robert (Dorothy Ann ... 1 27 21.0000 0
42 0 3 Kraeff, Mr. Theodor 0 25 7.8958 1
43 1 2 Laroche, Miss. Simonne Marie Anne Andree 1 3 41.5792 1
44 1 3 Devaney, Miss. Margaret Delia 1 19 7.8792 2
45 0 3 Rogers, Mr. William John 0 25 8.0500 0
46 0 3 Lennon, Mr. Denis 0 25 15.5000 2
47 1 3 O'Driscoll, Miss. Bridget 1 21 7.7500 2
48 0 3 Samaan, Mr. Youssef 0 25 21.6792 1
49 0 3 Arnold-Franchi, Mrs. Josef (Josefine Franchi) 1 18 17.8000 0
train_datan = train_datan.drop(['Parch', 'SibSp'], axis=1)
test_datan = test_datan.drop(['Parch', 'SibSp'], axis=1)
togethern = [train_datan, test_datan]


We can also create an artificial feature combining Pclass and Age.
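A sketch of that combination (the product form Age*Pclass follows the original Kaggle tutorial; this notebook does not actually build the column, so the name `Age*Class` is illustrative):

```python
import pandas as pd

df = pd.DataFrame({"Age": [22, 38, 26], "Pclass": [3, 1, 3]})
# interaction feature: large for old third-class passengers, small for young first-class ones
df["Age*Class"] = df["Age"] * df["Pclass"]
print(df["Age*Class"].tolist())  # [66, 38, 78]
```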

freq_data = train_datan.Embarked.mode()[0]
freq_data
for dataset in togethern:
    dataset["Embarked"]=dataset["Embarked"].fillna(freq_data)
train_datan[["Embarked", "Survived"]].groupby(["Embarked"], as_index=False).mean().sort_values(by="Survived", ascending=False)

Embarked Survived
0 C 0.553571
1 Q 0.389610
2 S 0.339009
for dataset in togethern:
     dataset['Embarked'] = dataset['Embarked'].map( {'S': 0, 'C': 1, 'Q': 2} ).astype(int)




togethern[0].head()

Survived Pclass Name Sex Age Fare Embarked
0 0 3 Braund, Mr. Owen Harris male 22.0 7.2500 0
1 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 71.2833 1
2 1 3 Heikkinen, Miss. Laina female 26.0 7.9250 0
3 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 53.1000 0
4 0 3 Allen, Mr. William Henry male 35.0 8.0500 0
for dataset in togethern:
    dataset['Sex'] = dataset['Sex'].map( {'female': 1, 'male': 0} ).astype(int)


togethern[1].head()

Pclass Name Sex Age Fare Embarked
0 3 Kelly, Mr. James 0 34.5 7.8292 2
1 3 Wilkes, Mrs. James (Ellen Needs) 1 47.0 7.0000 0
2 2 Myles, Mr. Thomas Francis 0 62.0 9.6875 2
3 3 Wirz, Mr. Albert 0 27.0 8.6625 0
4 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 1 22.0 12.2875 0
test_datan.head(10)

Pclass Name Sex Age Fare Embarked
0 3 Kelly, Mr. James 0 34.5 7.8292 2
1 3 Wilkes, Mrs. James (Ellen Needs) 1 47.0 7.0000 0
2 2 Myles, Mr. Thomas Francis 0 62.0 9.6875 2
3 3 Wirz, Mr. Albert 0 27.0 8.6625 0
4 3 Hirvonen, Mrs. Alexander (Helga E Lindqvist) 1 22.0 12.2875 0
5 3 Svensson, Mr. Johan Cervin 0 14.0 9.2250 0
6 3 Connolly, Miss. Kate 1 30.0 7.6292 2
7 2 Caldwell, Mr. Albert Francis 0 26.0 29.0000 0
8 3 Abrahim, Mrs. Joseph (Sophie Halaut Easu) 1 18.0 7.2292 1
9 3 Davies, Mr. John Samuel 0 21.0 24.1500 0
freq_data = train_datan.Fare.mode()[0]  # mode of Fare; the median is a common alternative for a skewed feature
freq_data
for dataset in togethern:
    dataset["Fare"]=dataset["Fare"].fillna(freq_data)

test_datan.isnull().sum()
Pclass      0
Name        0
Sex         0
Age         0
Fare        0
Embarked    0
dtype: int64

Model, predict and solve

X_train = train_datan.drop(["Survived","Name"], axis=1)
Y_train = train_datan["Survived"]
X_test  = test_datan.drop(["Name"], axis=1)
X_train.shape, Y_train.shape, X_test.shape
X_test

Logistic Regression is a useful model to run early in the workflow. Logistic regression measures the relationship between the categorical dependent variable (feature) and one or more independent variables (features) by estimating probabilities using a logistic function, which is the cumulative logistic distribution. Reference Wikipedia.

Note the confidence score generated by the model based on our training dataset.

X_test.isnull().sum()
# Logistic Regression

reg = LogisticRegression()
reg.fit(X_train, Y_train)
Y_pred = reg.predict(X_test)
acc_log = round(reg.score(X_train, Y_train) * 100, 2)
acc_log

We can use Logistic Regression to validate our assumptions and decisions for feature creating and completing goals. This can be done by calculating the coefficient of the features in the decision function.

Positive coefficients increase the log-odds of the response (and thus increase the probability), and negative coefficients decrease the log-odds of the response (and thus decrease the probability).

  • Sex has the highest positive coefficient, implying that as the Sex value increases (male: 0 to female: 1), the probability of Survived=1 increases the most.
  • Inversely as Pclass increases, probability of Survived=1 decreases the most.
  • This way Age*Class is a good artificial feature to model as it has second highest negative correlation with Survived.
  • So is Title as second highest positive correlation.
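Reading the coefficients out of a fitted model looks like this; the data below is a synthetic stand-in for (X_train, Y_train) in which survival tracks the Sex column exactly, so only the sign of the Sex coefficient is meaningful here:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# synthetic stand-in: columns are [Sex, Pclass], label equals the Sex column
X = np.array([[1, 1], [1, 3], [0, 1], [0, 3], [1, 2], [0, 2]])
y = np.array([1, 1, 0, 0, 1, 0])
model = LogisticRegression().fit(X, y)
for name, coef in zip(["Sex", "Pclass"], model.coef_[0]):
    print(name, coef)
# the Sex coefficient comes out positive: Sex=1 raises the log-odds of survival
```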
knn = KNeighborsClassifier(n_neighbors = 3)
knn.fit(X_train, Y_train)
Y_pred = knn.predict(X_test)
knnscore=round(knn.score(X_train, Y_train) * 100, 2)
knnscore

# Support Vector Machines

svc = SVC()
svc.fit(X_train, Y_train)
Y_pred = svc.predict(X_test)
acc_svc = round(svc.score(X_train, Y_train) * 100, 2)
acc_svc
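The accuracies above are computed on the training data, which flatters both models; `cross_val_score` gives a less biased estimate. A sketch on synthetic data (in the notebook you would pass X_train and Y_train instead):

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for the training set
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
y = (X[:, 0] > 0).astype(int)  # label depends only on the first feature

# 5-fold cross-validation: each score is accuracy on a held-out fold
scores = cross_val_score(KNeighborsClassifier(n_neighbors=3), X, y, cv=5)
print(round(scores.mean(), 3))
```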
Shiva Shabro
Technical Analyst at the University Health Network (UHN)