Survival Analysis

TOC

  • Survival Analysis

  • CODE

    • Kaplan-Meier fitter

    • Kaplan-Meier fitter Based on Different Groups

    • Nelson-Aalen fitter Theory

    • Log-Rank-Test

    • Cox-Regression

    • AIC and model selection for parametric models

Survival Analysis

  • Staticial approaches to determine the time an event of interest to occurs. The event of interest could be death, recover, birth, retirement, etc.

What Analysis

  • To find number of days until patients showed symptoms.
  • To compare number of day of event of interest amoung various groups.
  • To find which treatment has the highest survival probability.
  • To find the median number of survial for patients.
  • To compare which factor has more impact on patients's survival.

Use case

Medical

  • Cancer patinents survial time analysis
  • Medical treatment time analysis

Operation

  • Machine failure time analysis
  • Operation production failure analysis
  • Machine critical level
  • Repair period analysis

Business

  • Warranty claim period analysis
  • Lead to sales time analysis
  • Hire to salesperson hires to first sale analysis
  • Hire to termination or quit time analysis

Other

  • Event history analysis

Concept

Survival time refers to as an amount of time until when a subject is alive or actively participates in a sutdy period.

Main types of events:

  1. Relapse (relapse-free survival): The length of time after treatment and get well.
  2. Progression (progression-free survival): The length of time after treatment and still lives with disease but not get worse.
  3. Death: Destruction or permanent end of study.

Censoring is a form of missing data problem in which time to event is not observed for reasons.

Survival and hazard functions

  1. Survival Function(S): The probability that an subject survives from the time origin (diagnosis of a disease) to a specified future time t. For example, S(100)=0.8 means that after 100 days, a subject’s survival probability is 0.8.

  2. Hazard Function (H): The probability that an subject who is under observation at a time t has an event (death) at that time. For example, If h(100) = 0.8 means that after 100 days, the probability of being dead is 0.8.

CODE

In [1]:
!pip install lifelines
Requirement already satisfied: lifelines in /usr/local/lib/python3.6/dist-packages (0.25.5)
Requirement already satisfied: scipy>=1.2.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.4.1)
Requirement already satisfied: autograd-gamma>=0.3 in /usr/local/lib/python3.6/dist-packages (from lifelines) (0.4.3)
Requirement already satisfied: autograd>=1.3 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.3)
Requirement already satisfied: numpy>=1.14.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.18.5)
Requirement already satisfied: matplotlib>=3.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (3.2.2)
Requirement already satisfied: pandas>=0.23.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (1.1.2)
Requirement already satisfied: patsy>=0.5.0 in /usr/local/lib/python3.6/dist-packages (from lifelines) (0.5.1)
Requirement already satisfied: future>=0.15.2 in /usr/local/lib/python3.6/dist-packages (from autograd>=1.3->lifelines) (0.16.0)
Requirement already satisfied: python-dateutil>=2.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (2.8.1)
Requirement already satisfied: kiwisolver>=1.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (1.2.0)
Requirement already satisfied: cycler>=0.10 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (0.10.0)
Requirement already satisfied: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in /usr/local/lib/python3.6/dist-packages (from matplotlib>=3.0->lifelines) (2.4.7)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.6/dist-packages (from pandas>=0.23.0->lifelines) (2018.9)
Requirement already satisfied: six in /usr/local/lib/python3.6/dist-packages (from patsy>=0.5.0->lifelines) (1.15.0)
In [2]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import urllib
from urllib import request
import lifelines
from lifelines import KaplanMeierFitter

print('lifelines version: ', lifelines.__version__)
lifelines version:  0.25.5
In [3]:
# Download data 

csv_src = "http://www-eio.upc.edu/~pau/cms/rdata/csv/survival/lung.csv"
csv_path = 'lung.csv'
urllib.request.urlretrieve(csv_src, csv_path)
Out[3]:
('lung.csv', <http.client.HTTPMessage at 0x7f26d1459160>)
In [4]:
#  Create dataframe

df = pd.read_csv(csv_path)

#  Precessing
df.loc[df.status == 1, 'dead'] = 0
df.loc[df.status == 2, 'dead'] = 1
df = df.drop(['Unnamed: 0', 'status'], axis=1)

df.head()
Out[4]:
inst time age sex ph.ecog ph.karno pat.karno meal.cal wt.loss dead
0 3.0 306 74 1 1.0 90.0 100.0 1175.0 NaN 1.0
1 3.0 455 68 1 0.0 90.0 90.0 1225.0 15.0 1.0
2 3.0 1010 56 1 0.0 90.0 90.0 NaN 15.0 0.0
3 5.0 210 57 1 1.0 90.0 60.0 1150.0 11.0 1.0
4 1.0 883 60 1 0.0 100.0 90.0 NaN 0.0 1.0
In [5]:
#  Get statistical info

df.describe()
Out[5]:
inst time age sex ph.ecog ph.karno pat.karno meal.cal wt.loss dead
count 227.000000 228.000000 228.000000 228.000000 227.000000 227.000000 225.000000 181.000000 214.000000 228.000000
mean 11.088106 305.232456 62.447368 1.394737 0.951542 81.938326 79.955556 928.779006 9.831776 0.723684
std 8.303491 210.645543 9.073457 0.489870 0.717872 12.327955 14.623177 402.174707 13.139902 0.448159
min 1.000000 5.000000 39.000000 1.000000 0.000000 50.000000 30.000000 96.000000 -24.000000 0.000000
25% 3.000000 166.750000 56.000000 1.000000 0.000000 75.000000 70.000000 635.000000 0.000000 0.000000
50% 11.000000 255.500000 63.000000 1.000000 1.000000 80.000000 80.000000 975.000000 7.000000 1.000000
75% 16.000000 396.500000 69.000000 2.000000 1.000000 90.000000 90.000000 1150.000000 15.750000 1.000000
max 33.000000 1022.000000 82.000000 2.000000 3.000000 100.000000 100.000000 2600.000000 68.000000 1.000000
In [6]:
#  Kaplan-Meier-Fitter
kmf = KaplanMeierFitter()
kmf.fit(durations = df.time, event_observed = df.dead)
Out[6]:
<lifelines.KaplanMeierFitter:"KM_estimate", fitted with 228 total observations, 63 right-censored observations>
In [7]:
#  Create event table

#  event_at: Timeline of dataset
#  removed:  The values of subject that are no longer part of our experiment
#  observed: The value of the number of subjects that died during the experiment
#  censored: The number of subject does not have an event during the observation time
#  entrance: The value of new subject in a given timeline
#  at_risk:  The number of current subject under observation

kmf.event_table
Out[7]:
removed observed censored entrance at_risk
event_at
0.0 0 0 0 228 228
5.0 1 1 0 0 228
11.0 3 3 0 0 227
12.0 1 1 0 0 224
13.0 2 2 0 0 223
... ... ... ... ... ...
840.0 1 0 1 0 5
883.0 1 1 0 0 4
965.0 1 0 1 0 3
1010.0 1 0 1 0 2
1022.0 1 0 1 0 1

187 rows × 5 columns

S(t) = (Number of subjects at risk at the start - Number of subjects that died) / Number of subjects at risk at the start

In [8]:
#  Calculate survival probability  for given time

print("Survival probablity for t=0: ", kmf.predict(0))
print("Survival probablity for t=5: ", kmf.predict(5))
print("Survival probablity for t=11: ", kmf.predict(11))
print("Survival probablity for t=840: ", kmf.predict(840))

print(kmf.predict([0,5,11,840]))
Survival probablity for t=0:  1.0
Survival probablity for t=5:  0.9956140350877193
Survival probablity for t=11:  0.9824561403508766
Survival probablity for t=840:  0.06712742409441387
0      1.000000
5      0.995614
11     0.982456
840    0.067127
Name: KM_estimate, dtype: float64
In [9]:
#  Get survial full list
print(kmf.survival_function_)
          KM_estimate
timeline             
0.0          1.000000
5.0          0.995614
11.0         0.982456
12.0         0.978070
13.0         0.969298
...               ...
840.0        0.067127
883.0        0.050346
965.0        0.050346
1010.0       0.050346
1022.0       0.050346

[187 rows x 1 columns]
In [10]:
#  Plot the survival graph

kmf.plot()
plt.title("Kaplan-Meier")
plt.xlabel("Number o days")
plt.ylabel("Probablity of survival")
plt.show()
In [11]:
#  Get survial time median

survival_time_median = kmf.median_survival_time_
print(" The median survial time: ", survival_time_median)
 The median survial time:  310.0
In [12]:
#  Survival probability wiht confidence interval

survival_confidence_interval = kmf.confidence_interval_survival_function_
survival_confidence_interval
Out[12]:
KM_estimate_lower_0.95 KM_estimate_upper_0.95
0.0 1.000000 1.000000
5.0 0.969277 0.999381
11.0 0.953935 0.993379
12.0 0.948120 0.990813
13.0 0.936682 0.985244
... ... ...
840.0 0.030728 0.123060
883.0 0.017866 0.108662
965.0 0.017866 0.108662
1010.0 0.017866 0.108662
1022.0 0.017866 0.108662

187 rows × 2 columns

In [13]:
#  Plot survival function with confidence interval

plt.plot(survival_confidence_interval['KM_estimate_lower_0.95'], label="Lower")
plt.plot(survival_confidence_interval['KM_estimate_upper_0.95'], label="Upper")
plt.title("Survival Function with confidence interval")
plt.xlabel("Number of days")
plt.ylabel("Survial Probablity")
plt.legend()
plt.show()
In [14]:
#  Cumulative survival
cumu_surv = kmf.cumulative_density_
cumu_surv
Out[14]:
KM_estimate
timeline
0.0 0.000000
5.0 0.004386
11.0 0.017544
12.0 0.021930
13.0 0.030702
... ...
840.0 0.932873
883.0 0.949654
965.0 0.949654
1010.0 0.949654
1022.0 0.949654

187 rows × 1 columns

In [15]:
#  Plot cumulative density

kmf.plot_cumulative_density()
plt.title("Cumulative density")
plt.xlabel("Number of days")
plt.ylabel("Probablity of death")
plt.show()
In [16]:
# Median time left for event

median_time_to_event = kmf.conditional_time_to_event_
plt.plot(median_time_to_event, label = "Median  time left")
plt.title("Median time to event")
plt.xlabel("Total days")
plt.ylabel("Conditional median time to event")
plt.legend()
plt.show()

Kaplan-Meier Estimator with groups

In [17]:
# KM for Male and Female

kmf_m = KaplanMeierFitter()
kmf_f = KaplanMeierFitter()

Male = df.query("sex == 1")
Female = df.query("sex == 2")

kmf_m.fit(durations=Male.time, event_observed=Male.dead, label="Male")
kmf_f.fit(durations=Female.time, event_observed=Female.dead, label="Female")
Out[17]:
<lifelines.KaplanMeierFitter:"Female", fitted with 90 total observations, 37 right-censored observations>
In [18]:
# Plot survival function

kmf_m.plot()
kmf_f.plot()
plt.title("KM Survival Probaility: Male vs Female")
plt.xlabel("Days")
plt.ylabel("Survival Probability")
plt.show()
In [19]:
#  Plot cumulative density

kmf_m.plot_cumulative_density()
kmf_f.plot_cumulative_density()
plt.title("Cumulative density")
plt.xlabel("Number of days")
plt.ylabel("Probablity of death")
plt.show()

Hazard function: Nelson-Aalen

In [20]:
from lifelines import NelsonAalenFitter
In [21]:
naf = NelsonAalenFitter()

naf.fit(durations = df.time, event_observed = df.dead)
Out[21]:
<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 228 total observations, 63 right-censored observations>

H(t) = Number of subjects that died / Number of subjects at risk

In [22]:
#  Cumulative hazard

naf.cumulative_hazard_
Out[22]:
NA_estimate
timeline
0.0 0.000000
5.0 0.004386
11.0 0.017660
12.0 0.022125
13.0 0.031114
... ...
840.0 2.641565
883.0 2.891565
965.0 2.891565
1010.0 2.891565
1022.0 2.891565

187 rows × 1 columns

In [23]:
#  Plot cumulative hazard

naf.plot_cumulative_hazard()
plt.title("Cumulative Hazard")
plt.xlabel("Number of days")
plt.ylabel("Cumulative Probability of death")
plt.show()
In [24]:
#  Predict

print("t = 500 days: ", naf.predict(500))
t = 500 days:  1.219546171331098
In [25]:
#  cumulative hazard with confidence interval

hazard_confidence_interval = naf.confidence_interval_

plt.plot(hazard_confidence_interval['NA_estimate_lower_0.95'], label="Lower")
plt.plot(hazard_confidence_interval['NA_estimate_upper_0.95'], label="Upper")
plt.title("Cumulative harzard Function with confidence interval")
plt.xlabel("Number of days")
plt.ylabel("Cumulative hazard")
plt.legend()
plt.show()
In [26]:
#  Cumulative hazard vs cumulative density

kmf.plot_cumulative_density(label="Cumulative Hazard")
naf.plot_cumulative_hazard(label="Cumulative Density")
plt.xlabel("Number of days")
plt.show()

Male vs Female

In [27]:
naf_m = NelsonAalenFitter()
naf_f = NelsonAalenFitter()

naf_m.fit(durations = Male.time, event_observed = Male.dead)
naf_f.fit(durations = Female.time, event_observed = Female.dead)
Out[27]:
<lifelines.NelsonAalenFitter:"NA_estimate", fitted with 90 total observations, 37 right-censored observations>
In [28]:
#  Plot cumulative hazard Male vs Female

naf_m.plot_cumulative_hazard(label="Male")
naf_f.plot_cumulative_hazard(label="Female")
plt.title("Cumulative Hazard")
plt.xlabel("Number of days")
plt.ylabel("Cumulative Probability of death")
plt.show()

Log-rank test

Log-rank test is a hypothesis test survival distribution of two samples.

A p-value between 0 and 1 denotes the statistical significance. The smaller the p-value, the more significant the statistical difference between groups being studied is.

Less than (5% = 0.05) P-value means there is a significant difference between the groups we compared.

In [29]:
from lifelines.statistics import logrank_test

time_m = Male.time
event_m = Male.dead

time_f = Female.time
event_f = Female.dead

results = logrank_test(time_m, time_f, event_m, event_f)
results.print_summary()
print("P-value: ", results.p_value)
t_0 -1
null_distribution chi squared
degrees_of_freedom 1
test_name logrank_test
test_statistic p -log2(p)
0 10.33 <0.005 9.57
P-value:  0.001311164520355457

survival function for both groups is significantly different.

Cox-proportional hazard model

The cox-proportional hazard model is a regression model generally used by medical researchers to determine the relationship between the survival time of a subject and one or more predictor variables.

h(t) = h0(t) * exp(b1X1 + b2X2 + ... + bnXn)

Where,

t = survivaltime

h(t) = hazard function

X1, X2, ..., Xn = covariates

b1, b2, ..., bn = measures the impact of covariates

In [30]:
from lifelines import CoxPHFitter

df2 = pd.read_csv("lung.csv")
df2 = df2.drop(['Unnamed: 0'], axis=1)
df2 = df2.dropna(subset=['inst', 'time', 'status', 'age', 'sex', 'ph.ecog', 'ph.karno', 'pat.karno', 'meal.cal', 'wt.loss'])
df2.loc[df2.status == 1, 'dead'] = 0
df2.loc[df2.status == 2, 'dead'] = 1
df2 = df2.drop(['status'], axis=1)

df2.head()
Out[30]:
inst time age sex ph.ecog ph.karno pat.karno meal.cal wt.loss dead
1 3.0 455 68 1 0.0 90.0 90.0 1225.0 15.0 1.0
3 5.0 210 57 1 1.0 90.0 60.0 1150.0 11.0 1.0
5 12.0 1022 74 1 1.0 50.0 80.0 513.0 0.0 0.0
6 7.0 310 68 2 2.0 70.0 60.0 384.0 10.0 1.0
7 11.0 361 71 2 2.0 60.0 80.0 538.0 1.0 1.0
In [31]:
df2.describe()
Out[31]:
inst time age sex ph.ecog ph.karno pat.karno meal.cal wt.loss dead
count 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000 167.000000
mean 10.706587 309.934132 62.568862 1.383234 0.958084 82.035928 79.580838 929.125749 9.718563 0.718563
std 8.167900 209.435591 9.210706 0.487637 0.731011 12.778885 15.104188 413.489837 13.378587 0.451053
min 1.000000 5.000000 39.000000 1.000000 0.000000 50.000000 30.000000 96.000000 -24.000000 0.000000
25% 3.000000 174.500000 57.000000 1.000000 0.000000 70.000000 70.000000 619.000000 0.000000 0.000000
50% 11.000000 268.000000 64.000000 1.000000 1.000000 80.000000 80.000000 975.000000 7.000000 1.000000
75% 15.000000 419.500000 70.000000 2.000000 1.000000 90.000000 90.000000 1162.500000 15.000000 1.000000
max 32.000000 1022.000000 82.000000 2.000000 3.000000 100.000000 100.000000 2600.000000 68.000000 1.000000
In [32]:
cph = CoxPHFitter()
cph.fit(df2, duration_col='time', event_col='dead')
cph.print_summary()
model lifelines.CoxPHFitter
duration col 'time'
event col 'dead'
baseline estimation breslow
number of observations 167
number of events observed 120
partial log-likelihood -491.27
time fit was run 2020-10-11 01:38:38 UTC
coef exp(coef) se(coef) coef lower 95% coef upper 95% exp(coef) lower 95% exp(coef) upper 95% z p -log2(p)
inst -0.03 0.97 0.01 -0.06 -0.00 0.95 1.00 -2.31 0.02 5.60
age 0.01 1.01 0.01 -0.01 0.04 0.99 1.04 1.07 0.28 1.82
sex -0.57 0.57 0.20 -0.96 -0.17 0.38 0.84 -2.81 <0.005 7.68
ph.ecog 0.91 2.48 0.24 0.44 1.38 1.55 3.96 3.80 <0.005 12.77
ph.karno 0.03 1.03 0.01 0.00 0.05 1.00 1.05 2.29 0.02 5.49
pat.karno -0.01 0.99 0.01 -0.03 0.01 0.97 1.01 -1.34 0.18 2.47
meal.cal 0.00 1.00 0.00 -0.00 0.00 1.00 1.00 0.01 0.99 0.01
wt.loss -0.02 0.98 0.01 -0.03 -0.00 0.97 1.00 -2.11 0.03 4.85

Concordance 0.65
Partial AIC 998.54
log-likelihood ratio test 33.70 on 8 df
-log2(p) of ll-ratio test 14.41
In [33]:
#  All covariates forest plot

cph.plot()
plt.show()
In [34]:
#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='sex', values=[1, 2], cmap='coolwarm')
plt.title("Survival curves comparison: sex")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()
In [35]:
#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='ph.ecog', values=[0, 1, 2, 3], cmap='coolwarm')
plt.title("Survival curves comparison: ph.ecog")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()
In [36]:
#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='inst', values=[1, 3, 11, 15, 32], cmap='coolwarm')
plt.title("Survival curves comparison: inst")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()
In [37]:
#  Plotting the effect of varying a covariate

cph.plot_partial_effects_on_outcome(covariates='ph.karno', values=[50, 70, 80, 90, 100], cmap='coolwarm')
plt.title("Survival curves comparison: ph.karno")
plt.xlabel("Number of days")
plt.ylabel("Survival")
plt.legend()
plt.show()

AIC and model selection for parametric models

  • semi-parametric models (Cox model and Aalen’s model)

Cox Proportional Hazards regression can still fit survival models without knowing (or assuming) the distribution.

  • a family of parametric models

a log-normal, or a Weibull, or other parametric distribution make that assumption the maximum likelihood parametric approach

In [38]:
from lifelines import LogLogisticAFTFitter, WeibullAFTFitter, LogNormalAFTFitter

llf = LogLogisticAFTFitter().fit(df2, 'time', 'dead')
lnf = LogNormalAFTFitter().fit(df2, 'time', 'dead')
wf = WeibullAFTFitter().fit(df2, 'time', 'dead')

print(llf.AIC_)
print(lnf.AIC_)
print(wf.AIC_)
1681.6308575665694
1694.8367594080894
1668.3198611758
In [38]: