Project 2 Report - Titanic Data Analysis

Meili Yang 10/25/2016

Question to investigate: Does sex have an effect on survival rate? If so, what might be the root causes?

According to the data table, the variables that could directly affect survival rate are 'Sex', 'Pclass', and 'Age'. This report focuses on how sex relates to survival.

Two areas of investigation:

a. Survival rate factors:

survival rate (or number of people who survived) vs. sex (calculate and compare the female and male survival rates)
survival rate (or number of people who survived) vs. class (calculate and compare the survival rates of classes 1, 2, and 3)

b. Analysis of passenger information to see how other variables relate to sex

patterns of 'Pclass' based on sex
age patterns based on sex

%matplotlib inline 
import numpy as np # imports a fast numerical programming library
import scipy as sp #imports stats functions, amongst other things
import matplotlib as mpl # this actually imports matplotlib
import matplotlib.cm as cm #allows us easy access to colormaps
import matplotlib.pyplot as plt #sets up plotting under plt
import pandas as pd #lets us handle data as dataframes
#sets up pandas table display
pd.set_option('display.width', 500)
pd.set_option('display.max_columns', 100)
pd.set_option('display.notebook_repr_html', True)
import seaborn as sns #sets up styles and gives us more plotting options

filename="titanic-data.csv"
titanic_df=pd.read_csv(filename)
titanic_df.head()

titanic_df.dtypes

PassengerId      int64
Survived         int64
Pclass           int64
Name            object
Sex             object
Age            float64
SibSp            int64
Parch            int64
Ticket          object
Fare           float64
Cabin           object
Embarked        object
dtype: object

titanic_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB

The data sample contains 891 passengers in total. Three variables have missing values: 'Age', 'Cabin', and 'Embarked'. This report focuses mainly on the relationship between 'Sex' and 'Survived'. 'Age' is analyzed together with 'Sex' to see how the two relate. 'Age' has 177 missing values, so for the 'Age' and 'Sex' analysis I drop the rows with missing ages and consider the impact of those missing values in the conclusion. For the other variables used, which have no missing values, I use the entire dataset.
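The missing-value handling described above can be sketched with pandas. This is a minimal illustration on a small made-up frame: `toy_df` and its values are hypothetical and are not rows from titanic-data.csv. It counts missing entries per column and drops rows only where 'Age' is missing, leaving the full frame available for the other analyses.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data for illustration, NOT taken from titanic-data.csv.
toy_df = pd.DataFrame({
    'Sex': ['male', 'female', 'female', 'male', 'male'],
    'Age': [22.0, np.nan, 26.0, np.nan, 35.0],
    'Survived': [0, 1, 1, 0, 0],
})

# Number of missing values in each column.
missing_per_column = toy_df.isnull().sum()

# Drop rows only where 'Age' is missing; toy_df itself keeps every row.
age_sex_df = toy_df.dropna(subset=['Age'])

print(missing_per_column)
print(len(age_sex_df))
```

Applied to the real data, `titanic_df.isnull().sum()` should agree with the `info()` output above: 177 missing for 'Age' (891 - 714), 687 for 'Cabin' (891 - 204), and 2 for 'Embarked' (891 - 889).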

Survival rate factors:

Survival rate vs. Sex

#create a new DataFrame from groupby objects which grouped by if survived and each sex,
#and calculate the number of people in each group
survival_groupby_sex=pd.DataFrame({'count' : titanic_df.groupby( ['Survived','Sex'] ).size()}).reset_index()
survival_groupby_sex

#check that the total number of people in the two DataFrames matches,
#then store the total number of people and the total number of people by sex in variables
print titanic_df['Sex'].count()
print survival_groupby_sex['count'].sum()
total_people=survival_groupby_sex['count'].sum()
total_people_by_sex=survival_groupby_sex.groupby('Sex')['count'].sum()
total_people_by_sex

891
891

Sex
female    314
male      577
Name: count, dtype: int64

#Calculate survival rate of all the passengers
total_survival=survival_groupby_sex.groupby('Survived').sum().reset_index()
total_survival_rate=total_survival['count'][1]/float(total_people)
print ('ratio of survived people out of all passengers', total_survival_rate)

('ratio of survived people out of all passengers', 0.38383838383838381)

# Calculate and assign variables to survival rate for each sex
female_survival_rate=233/314.0
male_survival_rate=109/577.0
print female_survival_rate
print male_survival_rate

0.742038216561
0.188908145581
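The two rates above are typed in by hand from the grouped counts. A minimal sketch of deriving them from the counts instead is shown below; the names `counts`, `total_by_sex`, `survivors_by_sex`, and `rate_by_sex` are illustrative and not part of the original notebook, and the frame is rebuilt from the counts used in this report rather than read from the CSV.

```python
import pandas as pd

# Counts used in this report (233/81 females, 109/468 males).
counts = pd.DataFrame({
    'Survived': [0, 0, 1, 1],
    'Sex': ['female', 'male', 'female', 'male'],
    'count': [81, 468, 233, 109],
})

# Total passengers per sex, and survivors per sex, both indexed by 'Sex'.
total_by_sex = counts.groupby('Sex')['count'].sum()
survivors_by_sex = counts[counts['Survived'] == 1].set_index('Sex')['count']

# Element-wise division aligns on the 'Sex' index label.
rate_by_sex = survivors_by_sex / total_by_sex.astype(float)
print(rate_by_sex)
```

With the full DataFrame, `titanic_df.groupby('Sex')['Survived'].mean()` should give the same rates, because 'Survived' is coded as 0/1.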

# Plot stacked column chart to compare number of male and female among survived and non-survived passengers 
female=[233,81]
male=[109,468]
ind = np.arange(2)
width=0.4

fig, ax=plt.subplots()
rects1=ax.bar(ind, female, width, color=['#e9967a'])
rects2=ax.bar(ind, male, width,bottom=female, color=['#4169e1'])
ax.set_ylabel('Number of People')
ax.set_xticks(ind+0.2)
ax.set_xticklabels(('Survived','Non_survived'))
ax.set_yticks(np.arange(0,700,100))
ax.legend((rects1[0], rects2[0]), ('female', 'male'),loc=2)
ax.set_title("Counts of Genders by Survivors and Non-Survivors")

The chart shows that more females than males survived, and far more males than females did not survive, which suggests that sex could affect survival rate.

# plot stacked bars to compare survived vs. non-survived among all passengers, females only, and males only.
survived=[342,233,109]
non_survived=[549,81,468] # 891 passengers in total - 342 survivors = 549
ind = np.arange(3)
width=0.4

fig, ax=plt.subplots()
rects1=ax.bar(ind, survived, width, color='#9370db')
rects2=ax.bar(ind, non_survived, width,bottom=survived, color='#696969')

ax.set_ylabel('Number of People')
ax.set_xticks(ind+0.2)
ax.set_xticklabels(('total_people','female','male'))
ax.set_yticks(np.arange(0,1000,100))
ax.legend((rects1[0], rects2[0]), ('Survived', 'Non-survived'))
ax.set_title('Counts of Survivors and Non-Survivors for All Passengers, and Each Gender (Percentages are Survival Rates)')


labels=['{percent:.2%}'.format(percent=total_survival_rate),'{percent:.2%}'.format(percent=female_survival_rate),'{percent:.2%}'.format(percent=male_survival_rate)]
for rect, label in zip(rects1, labels):
    height=rect.get_height()
    ax.text(rect.get_x()+rect.get_width()/2,height/2, label, ha='center', va='bottom')

Observation: From the chart and calculations above, about 38.38% of all passengers survived, which means that more than half of the passengers died. The previous column chart showed that females make up a much larger share of the survivors, while most of those who did not survive are males. In fact, the survival rate is 74.20% for females but only 18.89% for males. So survival rate is related to sex.
However, other factors that are related to sex, such as passenger class and age, could also affect survival rate. The analysis below explores those factors.

Analysis of patterns in other variables that relate to sex

# Select 'Age' column for each sex, and clean empty age rows.
age_by_female=titanic_df[(titanic_df['Sex']=='female')]['Age'].dropna()
age_by_male=titanic_df[(titanic_df['Sex']=='male')]['Age'].dropna()

# Plot available age data as box chart for female and male
data=[age_by_female,age_by_male]
fig = plt.figure(1,figsize=(9,6))
ax=fig.add_subplot(111)
bp=ax.boxplot(data,patch_artist=True)
for box in bp['boxes']:
    # change outline color
    box.set( color='#7570b3', linewidth=2)
    # change fill color
    box.set_facecolor('#1b9e77')

## change color and linewidth of the whiskers
for whisker in bp['whiskers']:
    whisker.set(color='#7570b3', linewidth=2)

## change color and linewidth of the caps
for cap in bp['caps']:
    cap.set(color='#7570b3', linewidth=2)

## change color and linewidth of the medians
for median in bp['medians']:
    median.set(color='#b2df8a', linewidth=2)

## change the style of fliers and their fill
for flier in bp['fliers']:
    flier.set(marker='o', color='#e7298a', alpha=0.5)

ax.set_xticklabels(['Female','Male'],fontsize=14)
ax.set_ylabel('Age')
ax.set_yticks(np.arange(0,100,10))
ax.set_xlabel('Gender')
ax.set_title('Box Chart for Age by Gender')

The median age of female passengers is slightly lower than that of male passengers, and there are some outliers among the older male ages.

# Plot histogram for female and male ages to compare distribution
plt.style.use('seaborn-deep') # set the style before drawing so it applies to this figure
fig2 = plt.figure(1,figsize=(9,6))
ax2=fig2.add_subplot(111)
hist=ax2.hist(data,bins=15,label=['Female','Male'])
plt.legend(loc='upper right')
ax2.set_xlabel('Age')
ax2.set_ylabel('Number of passengers')
ax2.set_title('Histogram of Age by Gender')

From the charts above, female passengers are slightly younger on average than male passengers, and there are more older males than older females, especially above age 60. Overall, the age distributions of females and males are similar.
The available age data therefore show no clear pattern that could explain the large difference in survival rate between the sexes.

# Create a DataFrame to count the number of passengers grouped by sex and class.
sex_by_class=pd.DataFrame({'count' : titanic_df.groupby( ['Sex','Pclass'] ).size()}).reset_index()
sex_by_class

# Plot female and male as stacked bar chart for each class level.
class_female=sex_by_class['count'][0:3]
class_male=sex_by_class['count'][3:6]
ind = np.arange(3)
width=0.4

fig, ax=plt.subplots()
rects1=ax.bar(ind, class_female, width, color='#e9967a')
rects2=ax.bar(ind, class_male, width,bottom=class_female, color='#4169e1')

ax.set_ylabel('Number of People')
ax.set_xticks(ind+0.2)
ax.set_xticklabels(('Class1','Class2','Class3'))
ax.set_yticks(np.arange(0,600,100))
ax.legend((rects1[0], rects2[0]), ('Female', 'Male'), loc=2)
ax.set_xlabel('Passenger Class')
ax.set_title('Count of Passenger Gender by Class')

The graph above shows that the first and second classes have roughly equal numbers of males and females, whereas the third class has far more males than females.
This distribution suggests that the much lower survival rate for males could be related to the larger number of males in third class. Is passenger class a variable related to survival rate?

Survival rate vs Passenger's class

# Create a DataFrame to count number of people grouped by if or not survived and 'Pclass'
survival_groupby_class=pd.DataFrame({'count' : titanic_df.groupby( ['Survived','Pclass'] ).size()}).reset_index()
survival_groupby_class

# Calculate survival rate for each class 
survival_rate_calculation=survival_groupby_class[survival_groupby_class['Survived']==1]['count'].reset_index()/survival_groupby_class.groupby('Pclass')['count'].sum().reset_index()
print survival_groupby_class[survival_groupby_class['Survived']==1]['count'].reset_index()
print survival_groupby_class.groupby('Pclass')['count'].sum().reset_index()
survival_rate=survival_rate_calculation['count']

   index  count
0      3    136
1      4     87
2      5    119
   Pclass  count
0       1    216
1       2    184
2       3    491

# Plot survived and non-survived people as stacked bar for each class level.
class_non_survival=survival_groupby_class['count'][0:3]
class_survival=survival_groupby_class['count'][3:6]
ind = np.arange(3)
width=0.4

fig, ax=plt.subplots()
rects1=ax.bar(ind, class_survival, width, color='#9370db')
rects2=ax.bar(ind, class_non_survival, width,bottom=class_survival, color='#696969')
#ax1=plt.bar(ind, survived, 0.4, color='r')
#ax2=plt.bar(ind, non_survived, 0.4,bottom=survived, color='k')
ax.set_ylabel('Number of People')
ax.set_xticks(ind+0.2)
ax.set_xticklabels(('Class1','Class2','Class3'))
ax.set_yticks(np.arange(0,600,100))
ax.legend((rects1[0], rects2[0]), ('Survived','Non_survived'), loc=2)
ax.set_xlabel('Passenger Class')
ax.set_title('Counts of Survivors and Non-Survivors by Passenger Class (Percentages are Survival Rates)')

labels=[]
for number in survival_rate:
    labels.append('{percent:.2%}'.format(percent=number))
    
for rect, label in zip(rects1, labels):
    height=rect.get_height()
    ax.text(rect.get_x()+rect.get_width()/2,height/2, label, ha='center', va='bottom')

The chart above shows that first class has the highest survival rate, above 50%, and third class has the lowest, below 30%. Hence, passengers in third class were less likely to survive.
According to the previous chart, the majority of third-class passengers were male. So the low survival rate for class 3 could be directly associated with the lower survival rate for males.

Conclusion: The major factors that could directly relate to survival are a passenger's sex, class, and age. The analysis of sex and survival rate shows that sex has a large impact: 74.20% of females survived, compared with only 18.89% of males. The analysis of age vs. sex, which neglects the missing age data, found that the age distributions of males and females are similar. The analysis of passenger class found that third class held the most passengers and that a large portion of them were male. Third class also has the lowest survival rate (24.24%), much lower than the rates for first class (62.96%) and second class (47.28%).
Among the passenger classes, Class 1 has the highest survival rate, 62.96%. If class alone determined survival, the female survival rate could not exceed 62.96% even if every female had been in Class 1, and in fact fewer than half of the females were in Class 1. Yet the observed female survival rate is 74.20%. This indicates that the difference in survival rate between the sexes is not entirely due to the different survival rates among classes.
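The class argument can be checked with a short calculation. The sketch below computes the survival rate a group would have if only its class mix mattered, using the class rates reported above. The function `class_only_rate` and the two class mixes are hypothetical and for illustration only; they are not the real distribution of female passengers.

```python
# Class survival rates and the female survival rate reported above.
class_rates = {1: 0.6296, 2: 0.4728, 3: 0.2424}
female_rate_observed = 0.7420

def class_only_rate(class_shares, rates=class_rates):
    """Survival rate predicted if only class mattered.

    class_shares maps class -> share of the group in that class (sums to 1).
    """
    return sum(class_shares[c] * rates[c] for c in class_shares)

# Hypothetical mixes, for illustration only.
all_first_class = {1: 1.0, 2: 0.0, 3: 0.0}
even_split = {1: 1 / 3.0, 2: 1 / 3.0, 3: 1 / 3.0}

best_case = class_only_rate(all_first_class)
print('{0:.4f}'.format(best_case))
print('{0:.4f}'.format(class_only_rate(even_split)))
print(best_case < female_rate_observed)
```

Because the prediction is a weighted average of the three class rates, no class mix can exceed the first-class rate, which is still below the observed 74.20%.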
So the overall conclusion is that survival rate is related to a passenger's sex. The relationship between survival rate and sex is also partly explained by the relationship between survival rate and class. There is no strong evidence that age contributes to the relationship between survival rate and sex.
Limitations: The analysis does not test whether the differences in survival rate between sexes and between classes are statistically significant. Without regression analysis, the conclusions cannot quantify how much survival rate depends on sex and on class. In addition, other variables that could relate to survival rate are not discussed: some are in the table, such as ticket fare, port of embarkation, and whether the passenger traveled with relatives and how closely related they were, and some are not, such as a passenger's weight, faith, and health condition. Last but not least, missing values are excluded only in the age and sex analysis. One advantage is that the other variables are analyzed on the entire dataset, which gives more precise conclusions about them. One disadvantage is that the age and sex relationship is based on only part of the data, and the passengers with missing ages could include extremes or outliers that would change the pattern. Also, the different variables are not analyzed on a consistent dataset.
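One possible way to address the first limitation is a chi-square test of independence between sex and survival. The sketch below is not part of the original analysis; it is written with the standard library only and uses the counts from this report. It computes the Pearson statistic without a continuity correction and relies on the fact that, for one degree of freedom, the chi-square tail probability equals erfc(sqrt(chi2 / 2)).

```python
import math

# Observed counts from this report: rows = sex, columns = (survived, not survived).
observed = [[233, 81],    # female
            [109, 468]]   # male

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
grand_total = float(sum(row_totals))

# Pearson chi-square statistic, without continuity correction.
chi2 = 0.0
for i, row in enumerate(observed):
    for j, obs in enumerate(row):
        expected = row_totals[i] * col_totals[j] / grand_total
        chi2 += (obs - expected) ** 2 / expected

# A 2x2 table has 1 degree of freedom, where the tail probability
# of the chi-square distribution is erfc(sqrt(chi2 / 2)).
p_value = math.erfc(math.sqrt(chi2 / 2.0))
print(chi2)
print(p_value)
```

`scipy.stats.chi2_contingency` performs the same kind of test on a contingency table; by default it applies Yates' continuity correction to 2x2 tables, so its statistic would be slightly smaller than the one computed here.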

First five rows (output of titanic_df.head()):

   PassengerId  Survived  Pclass  Name                                               Sex     Age   SibSp  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                            male    22.0  1      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th...  female  38.0  1      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                             female  26.0  0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)       female  35.0  1      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                           male    35.0  0      373450            8.0500   NaN    S