Survival Analysis¶

TOC¶

Survival Analysis
CODE
- Kaplan-Meier fitter
- Kaplan-Meier fitter Based on Different Groups
- Nelson-Aalen fitter Theory
- Log-Rank-Test
- Cox-Regression
- AIC and model selection for parametric models

Library: https://pypi.org/project/lifelines/

Document: https://lifelines.readthedocs.io/en/latest/index.html

Dataset: http://www-eio.upc.edu/~pau/cms/rdata/datasets.html

Survival Analysis¶

Staticial approaches to determine the time an event of interest to occurs. The event of interest could be death, recover, birth, retirement, etc.

What Analysis¶

To find number of days until patients showed symptoms.
To compare number of day of event of interest amoung various groups.
To find which treatment has the highest survival probability.
To find the median number of survial for patients.
To compare which factor has more impact on patients's survival.

Use case¶

Medical

Cancer patinents survial time analysis
Medical treatment time analysis

Operation

Machine failure time analysis
Operation production failure analysis
Machine critical level
Repair period analysis

Business

Warranty claim period analysis
Lead to sales time analysis
Hire to salesperson hires to first sale analysis
Hire to termination or quit time analysis

Other

Event history analysis

Concept¶

Survival time refers to as an amount of time until when a subject is alive or actively participates in a sutdy period.

Main types of events:¶

Relapse (relapse-free survival): The length of time after treatment and get well.
Progression (progression-free survival): The length of time after treatment and still lives with disease but not get worse.
Death: Destruction or permanent end of study.

Censoring is a form of missing data problem in which time to event is not observed for reasons.

Survival and hazard functions

Survival Function(S): The probability that an subject survives from the time origin (diagnosis of a disease) to a specified future time t. For example, S(100)=0.8 means that after 100 days, a subject’s survival probability is 0.8.
Hazard Function (H): The probability that an subject who is under observation at a time t has an event (death) at that time. For example, If h(100) = 0.8 means that after 100 days, the probability of being dead is 0.8.

CODE¶

!pip install lifelines

Requirement already satisfied: lifelines in /usr/local/lib/python3.6/dist-packages (0.25.5)
Requirement already satisfied: scipy>=1.2.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.4.1)
Requirement already satisfied: autograd-gamma>=0.3 in /usr/local/lib/python3.6/dist-packages (from lifelines) (0.4.3)
Requirement already satisfied: autograd>=1.3 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.3)
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.18.5)
Requirement already satisfied: matplotlib>=3.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (3.2.2)
Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.1.2)
Requirement already satisfied: patsy>=0.5.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (0.5.1)
Requirement already satisfied: future>=0.15.2 in /usr/local/lib/python3.6/dist-packages (from autograd>=1.3->lifelines) (0.16.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (2.4.7)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.23.0->lifelines) (2018.9)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from patsy>=0.5.0->lifelines) (1.15.0)

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import urllib
from urllib import request
import lifelines
from lifelines import KaplanMeierFitter

print('lifelines version: ', lifelines.__version__)

lifelines version:  0.25.5

# Download data 

csv_src = "http://www-eio.upc.edu/~pau/cms/rdata/csv/survival/lung.csv"
csv_path = 'lung.csv'
urllib.request.urlretrieve(csv_src, csv_path)

('lung.csv', <http.client.HTTPMessage at 0x7f26d1459160>)

#  Create dataframe

df = pd.read_csv(csv_path)

#  Precessing
df.loc[df.status == 1, 'dead'] = 0
df.loc[df.status == 2, 'dead'] = 1
df = df.drop(['Unnamed: 0', 'status'], axis=1)

df.head()

#  Get statistical info

df.describe()

#  Kaplan-Meier-Fitter
kmf = KaplanMeierFitter()
kmf.fit(durations = df.time, event_observed = df.dead)

<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 228 total observations, 63 right-censored observations>

#  Create event table

#  event_at: Timeline of dataset
#  removed:  The values of subject that are no longer part of our experiment
#  observed: The value of the number of subjects that died during the experiment
#  censored: The number of subject does not have an event during the observation time
#  entrance: The value of new subject in a given timeline
#  at_risk:  The number of current subject under observation

kmf.event_table

S(t) = (Number of subjects at risk at the start - Number of subjects that died) / Number of subjects at risk at the start

#  Calculate survival probability  for given time

print("Survival probablity for t=0: ", kmf.predict(0))
print("Survival probablity for t=5: ", kmf.predict(5))
print("Survival probablity for t=11: ", kmf.predict(11))
print("Survival probablity for t=840: ", kmf.predict(840))

print(kmf.predict([0,5,11,840]))

Survival probablity for t=0:  1.0
Survival probablity for t=5:  0.9956140350877193
Survival probablity for t=11:  0.9824561403508766
Survival probablity for t=840:  0.06712742409441387
0      1.000000
5      0.995614
11     0.982456
840    0.067127
Name: KM_estimate, dtype: float64

#  Get survial full list
print(kmf.survival_function_)

          KM_estimate
timeline             
0.0          1.000000
5.0          0.995614
11.0         0.982456
12.0         0.978070
13.0         0.969298
...               ...
840.0        0.067127
883.0        0.050346
965.0        0.050346
1010.0       0.050346
1022.0       0.050346

[187 rows x 1 columns]

#  Plot the survival graph

kmf.plot()
plt.title("Kaplan-Meier")
plt.xlabel("Number o days")
plt.ylabel("Probablity of survival")
plt.show()

#  Get survial time median

survival_time_median = kmf.median_survival_time_
print(" The median survial time: ", survival_time_median)

 The median survial time:  310.0

#  Survival probability wiht confidence interval

survival_confidence_interval = kmf.confidence_interval_survival_function_
survival_confidence_interval

#  Plot survival function with confidence interval

plt.plot(survival_confidence_interval['KM_estimate_lower_0.95'], label="Lower")
plt.plot(survival_confidence_interval['KM_estimate_upper_0.95'], label="Upper")
plt.title("Survival Function with confidence interval")
plt.xlabel("Number of days")
plt.ylabel("Survial Probablity")
plt.legend()
plt.show()

#  Cumulative survival
cumu_surv = kmf.cumulative_density_
cumu_surv

#  Plot cumulative density

kmf.plot_cumulative_density()
plt.title("Cumulative density")
plt.xlabel("Number of days")
plt.ylabel("Probablity of death")
plt.show()

# Median time left for event

median_time_to_event = kmf.conditional_time_to_event_
plt.plot(median_time_to_event, label = "Median  time left")
plt.title("Median time to event")
plt.xlabel("Total days")
plt.ylabel("Conditional median time to event")
plt.legend()
plt.show()

Kaplan-Meier Estimator with groups¶

# KM for Male and Female

kmf_m = KaplanMeierFitter()
kmf_f = KaplanMeierFitter()

Male = df.query("sex == 1")
Female = df.query("sex == 2")

kmf_m.fit(durations=Male.time, event_observed=Male.dead, label="Male")
kmf_f.fit(durations=Female.time, event_observed=Female.dead, label="Female")

<lifelines.KaplanMeierFitter:"Female", fitted with 90 total observations, 37 right-censored observations>

# Plot survival function

kmf_m.plot()
kmf_f.plot()
plt.title("KM Survival Probaility: Male vs Female")
plt.xlabel("Days")
plt.ylabel("Survival Probability")
plt.show()

#  Plot cumulative density

kmf_m.plot_cumulative_density()
kmf_f.plot_cumulative_density()
plt.title("Cumulative density")
plt.xlabel("Number of days")
plt.ylabel("Probablity of death")
plt.show()

Hazard function: Nelson-Aalen¶

from lifelines import NelsonAalenFitter

naf = NelsonAalenFitter()

naf.fit(durations = df.time, event_observed = df.dead)

<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 228 total observations, 63 right-censored observations>

H(t) = Number of subjects that died / Number of subjects at risk

#  Cumulative hazard

naf.cumulative_hazard_

#  Plot cumulative hazard

naf.plot_cumulative_hazard()
plt.title("Cumulative Hazard")
plt.xlabel("Number of days")
plt.ylabel("Cumulative Probability of death")
plt.show()

#  Predict

print("t = 500 days: ", naf.predict(500))

t = 500 days:  1.219546171331098

#  cumulative hazard with confidence interval

hazard_confidence_interval = naf.confidence_interval_

plt.plot(hazard_confidence_interval['NA_estimate_lower_0.95'], label="Lower")
plt.plot(hazard_confidence_interval['NA_estimate_upper_0.95'], label="Upper")
plt.title("Cumulative harzard Function with confidence interval")
plt.xlabel("Number of days")
plt.ylabel("Cumulative hazard")
plt.legend()
plt.show()

#  Cumulative hazard vs cumulative density

kmf.plot_cumulative_density(label="Cumulative Hazard")
naf.plot_cumulative_hazard(label="Cumulative Density")
plt.xlabel("Number of days")
plt.show()

Male vs Female

naf_m = NelsonAalenFitter()
naf_f = NelsonAalenFitter()

naf_m.fit(durations = Male.time, event_observed = Male.dead)
naf_f.fit(durations = Female.time, event_observed = Female.dead)

<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 90 total observations, 37 right-censored observations>

#  Plot cumulative hazard Male vs Female

naf_m.plot_cumulative_hazard(label="Male")
naf_f.plot_cumulative_hazard(label="Female")
plt.title("Cumulative Hazard")
plt.xlabel("Number of days")
plt.ylabel("Cumulative Probability of death")
plt.show()

Log-rank test¶

Log-rank test is a hypothesis test survival distribution of two samples.

A p-value between 0 and 1 denotes the statistical significance. The smaller the p-value, the more significant the statistical difference between groups being studied is.

Less than (5% = 0.05) P-value means there is a significant difference between the groups we compared.

from lifelines.statistics import logrank_test

time_m = Male.time
event_m = Male.dead

time_f = Female.time
event_f = Female.dead

results = logrank_test(time_m, time_f, event_m, event_f)
results.print_summary()
print("P-value: ", results.p_value)

P-value:  0.001311164520355457

survival function for both groups is significantly different.

Cox-proportional hazard model¶

The cox-proportional hazard model is a regression model generally used by medical researchers to determine the relationship between the survival time of a subject and one or more predictor variables.

h(t) = h0(t) * exp(b1X1 + b2X2 + ... + bnXn)

Where,

t = survivaltime

h(t) = hazard function

X1, X2, ..., Xn = covariates

b1, b2, ..., bn = measures the impact of covariates

from lifelines import CoxPHFitter

df2 = pd.read_csv("lung.csv")
df2 = df2.drop(['Unnamed: 0'], axis=1)
df2 = df2.dropna(subset=['inst', 'time', 'status', 'age', 'sex', 'ph.ecog', 'ph.karno', 'pat.karno', 'meal.cal', 'wt.loss'])
df2.loc[df2.status == 1, 'dead'] = 0
df2.loc[df2.status == 2, 'dead'] = 1
df2 = df2.drop(['status'], axis=1)

df2.head()

df2.describe()

cph = CoxPHFitter()
cph.fit(df2, duration_col='time', event_col='dead')
cph.print_summary()

#  All covariates forest plot

cph.plot()
plt.show()

#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='sex', values=[1, 2], cmap='coolwarm')
plt.title("Survival curves comparison: sex")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()

#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='ph.ecog', values=[0, 1, 2, 3], cmap='coolwarm')
plt.title("Survival curves comparison: ph.ecog")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()

#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='inst', values=[1, 3, 11, 15, 32], cmap='coolwarm')
plt.title("Survival curves comparison: inst")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()

#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='ph.karno', values=[50, 70, 80, 90, 100], cmap='coolwarm')
plt.title("Survival curves comparison: ph.karno")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()

AIC and model selection for parametric models¶

semi-parametric models (Cox model and Aalen’s model)

Cox Proportional Hazards regression can still fit survival models without knowing (or assuming) the distribution.

a family of parametric models

a log-normal, or a Weibull, or other parametric distribution make that assumption the maximum likelihood parametric approach

from lifelines import LogLogisticAFTFitter, WeibullAFTFitter, LogNormalAFTFitter

llf = LogLogisticAFTFitter().fit(df2, 'time', 'dead')
lnf = LogNormalAFTFitter().fit(df2, 'time', 'dead')
wf = WeibullAFTFitter().fit(df2, 'time', 'dead')

print(llf.AIC_)
print(lnf.AIC_)
print(wf.AIC_)

1681.6308575665694
1694.8367594080894
1668.3198611758

	inst	time	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss	dead
count	227.000000	228.000000	228.000000	228.000000	227.000000	227.000000	225.000000	181.000000	214.000000	228.000000
mean	11.088106	305.232456	62.447368	1.394737	0.951542	81.938326	79.955556	928.779006	9.831776	0.723684
std	8.303491	210.645543	9.073457	0.489870	0.717872	12.327955	14.623177	402.174707	13.139902	0.448159
min	1.000000	5.000000	39.000000	1.000000	0.000000	50.000000	30.000000	96.000000	-24.000000	0.000000
25%	3.000000	166.750000	56.000000	1.000000	0.000000	75.000000	70.000000	635.000000	0.000000	0.000000
50%	11.000000	255.500000	63.000000	1.000000	1.000000	80.000000	80.000000	975.000000	7.000000	1.000000
75%	16.000000	396.500000	69.000000	2.000000	1.000000	90.000000	90.000000	1150.000000	15.750000	1.000000
max	33.000000	1022.000000	82.000000	2.000000	3.000000	100.000000	100.000000	2600.000000	68.000000	1.000000

	KM_estimate_lower_0.95	KM_estimate_upper_0.95
0.0	1.000000	1.000000
5.0	0.969277	0.999381
11.0	0.953935	0.993379
12.0	0.948120	0.990813
13.0	0.936682	0.985244
...	...	...
840.0	0.030728	0.123060
883.0	0.017866	0.108662
965.0	0.017866	0.108662
1010.0	0.017866	0.108662
1022.0	0.017866	0.108662

t_0	-1
null_distribution	chi squared
degrees_of_freedom	1
test_name	logrank_test

	inst	time	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss	dead
count	167.000000	167.000000	167.000000	167.000000	167.000000	167.000000	167.000000	167.000000	167.000000	167.000000
mean	10.706587	309.934132	62.568862	1.383234	0.958084	82.035928	79.580838	929.125749	9.718563	0.718563
std	8.167900	209.435591	9.210706	0.487637	0.731011	12.778885	15.104188	413.489837	13.378587	0.451053
min	1.000000	5.000000	39.000000	1.000000	0.000000	50.000000	30.000000	96.000000	-24.000000	0.000000
25%	3.000000	174.500000	57.000000	1.000000	0.000000	70.000000	70.000000	619.000000	0.000000	0.000000
50%	11.000000	268.000000	64.000000	1.000000	1.000000	80.000000	80.000000	975.000000	7.000000	1.000000
75%	15.000000	419.500000	70.000000	2.000000	1.000000	90.000000	90.000000	1162.500000	15.000000	1.000000
max	32.000000	1022.000000	82.000000	2.000000	3.000000	100.000000	100.000000	2600.000000	68.000000	1.000000

model	lifelines.CoxPHFitter
duration col	'time'
event col	'dead'
baseline estimation	breslow
number of observations	167
number of events observed	120
partial log-likelihood	-491.27
time fit was run	2020-10-11 01:38:38 UTC

	inst	time	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss	dead
0	3.0	306	74	1	1.0	90.0	100.0	1175.0	NaN	1.0
1	3.0	455	68	1	0.0	90.0	90.0	1225.0	15.0	1.0
2	3.0	1010	56	1	0.0	90.0	90.0	NaN	15.0	0.0
3	5.0	210	57	1	1.0	90.0	60.0	1150.0	11.0	1.0
4	1.0	883	60	1	0.0	100.0	90.0	NaN	0.0	1.0

	removed	observed	censored	entrance	at_risk
event_at
0.0	0	0	0	228	228
5.0	1	1	0	0	228
11.0	3	3	0	0	227
12.0	1	1	0	0	224
13.0	2	2	0	0	223
...	...	...	...	...	...
840.0	1	0	1	0	5
883.0	1	1	0	0	4
965.0	1	0	1	0	3
1010.0	1	0	1	0	2
1022.0	1	0	1	0	1

	KM_estimate
timeline
0.0	0.000000
5.0	0.004386
11.0	0.017544
12.0	0.021930
13.0	0.030702
...	...
840.0	0.932873
883.0	0.949654
965.0	0.949654
1010.0	0.949654
1022.0	0.949654

	coef	exp(coef)	se(coef)	coef lower 95%	coef upper 95%	exp(coef) lower 95%	exp(coef) upper 95%	z	p	-log2(p)
inst	-0.03	0.97	0.01	-0.06	-0.00	0.95	1.00	-2.31	0.02	5.60
age	0.01	1.01	0.01	-0.01	0.04	0.99	1.04	1.07	0.28	1.82
sex	-0.57	0.57	0.20	-0.96	-0.17	0.38	0.84	-2.81	<0.005	7.68
ph.ecog	0.91	2.48	0.24	0.44	1.38	1.55	3.96	3.80	<0.005	12.77
ph.karno	0.03	1.03	0.01	0.00	0.05	1.00	1.05	2.29	0.02	5.49
pat.karno	-0.01	0.99	0.01	-0.03	0.01	0.97	1.01	-1.34	0.18	2.47
meal.cal	0.00	1.00	0.00	-0.00	0.00	1.00	1.00	0.01	0.99	0.01
wt.loss	-0.02	0.98	0.01	-0.03	-0.00	0.97	1.00	-2.11	0.03	4.85

Concordance	0.65
Partial AIC	998.54
log-likelihood ratio test	33.70 on 8 df
-log2(p) of ll-ratio test	14.41