Predicting Survival Outcomes Using Logistic Regression

Author

Yafet Mekonnen (Advisor: Dr. Cohen)

Published

March 23, 2026

Introduction

Surviving or not, sick or healthy: many real-life outcomes come down to a simple binary choice of “yes” or “no.” Logistic regression is one of the most widely used methods for predicting these kinds of binary results because it is designed specifically for categorical outcomes. In contrast, multiple regression analysis relies on a continuous outcome variable, meaning the output can fall anywhere on a numerical range. This makes linear regression unsuitable for predicting binary outcomes because the model does not restrict predictions to the interval between 0 and 1. As (Dayton 1992) explains, linear regression applied to a categorical outcome can produce negative values or values greater than 1, which are uninterpretable as probabilities. This limitation is exactly why logistic regression is needed. To solve this issue, logistic regression applies the logarithm of the odds, also known as the log-odds (logit) function, which converts probabilities into a scale that behaves linearly with respect to the predictors. This makes it possible to model the relationship between the variables and the outcome more accurately. After the log-odds are calculated, the logistic function transforms them back into valid probabilities between 0 and 1, giving a meaningful estimate of the likelihood that the event occurs. Researchers such as (Joshi and Dhakal 2021), who predicted type 2 diabetes, have shown how logistic regression helps identify which factors increase the chances of an outcome occurring. Because it is a supervised learning method, logistic regression uses labeled data to learn relationships and estimate the probability of an event, making it a reliable approach for health-related prediction.

Similar to other machine learning models, logistic regression requires the data to be cleaned and well prepared before the model is run. Preparation work such as handling missing values and selecting meaningful predictors plays a major role in how well the model performs. As (Shipe et al. 2019) emphasize, logistic regression models perform poorly when data quality is low or when predictors are chosen without careful consideration. Their work highlights the importance of using predictors that are relevant and correlated with the outcome so that the model identifies the most meaningful factors. (Stoltzfus 2011) points out that logistic regression depends on several key assumptions, such as a linear relationship between continuous predictors and the log-odds of the target outcome, the absence of multicollinearity, and the absence of extreme outliers that can distort the model. Additionally, (Bewick, Cheek, and Ball 2005) note that logistic regression models must be checked for overall goodness of fit, and they emphasize the need for a sufficient number of outcome events per independent variable to avoid overfitting, an issue that often weakens model performance in medical prediction studies.
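The multicollinearity assumption mentioned above can be checked numerically with variance inflation factors (VIFs). The sketch below uses made-up blood-pressure-style data, not the Framingham dataset, and computes VIFs from first principles with NumPy; `statsmodels` provides an equivalent `variance_inflation_factor` helper.

```python
import numpy as np

def vif(X):
    """Variance inflation factors for the columns of a design matrix X.

    VIF_j = 1 / (1 - R^2_j), where R^2_j comes from regressing column j
    on the remaining columns plus an intercept. Values above roughly
    5-10 are commonly read as problematic multicollinearity.
    """
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, y, rcond=None)
        ss_res = ((y - others @ beta) ** 2).sum()
        ss_tot = ((y - y.mean()) ** 2).sum()
        out[j] = ss_tot / ss_res  # algebraically equal to 1 / (1 - R^2_j)
    return out

# Hypothetical data: age is independent of the others, but diaBP is
# constructed to track sysBP closely, so those two columns should
# show clearly inflated VIFs.
rng = np.random.default_rng(0)
age = rng.normal(50, 8, 500)
sysBP = rng.normal(130, 15, 500)
diaBP = 0.6 * sysBP + rng.normal(0, 3, 500)
v = vif(np.column_stack([age, sysBP, diaBP]))
print(np.round(v, 2))  # age near 1; sysBP and diaBP far above it
```

A predictor with an inflated VIF is largely redundant given the others, so dropping or combining one of the collinear pair usually stabilizes the coefficient estimates.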

The main goal of this project is to build a logistic regression model that predicts a patient’s likelihood of developing heart disease using the Framingham Heart Study dataset. By identifying which features have the strongest correlation with the target outcome, the model can highlight risk patterns that support early prevention and clinical decision making. As (Tripepi et al. 2008) describe, logistic regression is specifically designed for binary medical outcomes, such as whether a patient does or does not develop a disease, and their example involving chronic kidney disease demonstrates how the method estimates risk by comparing differences across patient groups. In a similar way, this project examines how factors such as sex, age, BMI, smoking status, cholesterol levels, and other clinical indicators contribute to the probability of heart disease occurrence. The dataset used in this study is publicly available on Kaggle and originates from the long-running Framingham Heart Study, which includes over 4,000 patient records and 15 demographic, behavioral, and medical attributes used to predict a patient’s 10-year risk of coronary heart disease (CHD). Additionally, the findings from (Wang and Jayathilaka 2024) show that logistic regression can effectively identify key predictors of cardiovascular outcomes, reinforcing its use for this type of analysis. Overall, this project demonstrates how logistic regression can transform real-world patient data into meaningful insights that help explain why some individuals face a higher risk of heart disease than others.

Methods

The model applied in this project is logistic regression, a supervised machine learning method appropriate when the target variable has only two possible values (0 or 1). The linear form of the model is written as \(y_i = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_k X_k\). Since the outcome is binary, logistic regression transforms this function using the log-odds (logit):

\(\log\left(\frac{p}{1 - p}\right) = \beta_0 + \beta_1 X_1 + \dots + \beta_k X_k\)

Where:

  • \(p\) is the probability that the event occurs

  • \(\beta_0\) is the intercept

  • \(\beta_1, \dots, \beta_k\) are coefficients

  • \(X_1, \dots, X_k\) are predictor variables

A logistic regression model requires multiple steps to achieve the best performance. First, the data must be cleaned, and any missing values should be handled, typically by replacing them with zero, the mean, or the median. After preprocessing, the model is trained on the training data and then used to predict outcomes on the testing data to evaluate its performance.

Model performance is evaluated using standard classification metrics. Accuracy is calculated as \(Accuracy = \frac{TP + TN}{TP + TN + FP + FN}\), precision as \(Precision = \frac{TP}{TP + FP}\), recall as \(Recall = \frac{TP}{TP + FN}\), and the F1 score as \(F1 = 2 \times \frac{Precision \times Recall}{Precision + Recall}\). A confusion matrix is used to compare predicted and actual outcomes. The ROC curve and AUC score measure how well the model separates the two groups.
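As a worked example, the four formulas reduce to simple arithmetic once the confusion-matrix counts are known. The counts below are hypothetical, not output from the model in this project:

```python
# Hypothetical confusion-matrix counts (illustrative, not model output).
TP, TN, FP, FN = 30, 700, 50, 70

accuracy = (TP + TN) / (TP + TN + FP + FN)          # 730 / 850
precision = TP / (TP + FP)                          # 30 / 80
recall = TP / (TP + FN)                             # 30 / 100
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean

print(f"accuracy={accuracy:.3f}  precision={precision:.3f}  "
      f"recall={recall:.3f}  F1={f1:.3f}")
```

With imbalanced outcomes like these, accuracy alone is misleading: a model that always predicts the majority class would score (TN + FP) / total = 750/850 here, which is why precision, recall, F1, and AUC are also reported.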

Analysis and Results

Data Exploration and Visualization

The dataset used for this project is publicly available on Kaggle and originates from the Framingham Heart Study (FraminghamHeartStudy), a large cardiovascular research project involving residents of Framingham, Massachusetts. The study followed participants over a ten-year period from 1958 to 1968 and recorded whether each individual eventually developed coronary heart disease (CHD). The dataset contains a little over 4,000 patient records and 15 attributes that capture a range of demographic, behavioral, and medical characteristics. Each feature represents a potential risk factor, allowing the model to analyze how different aspects of a patient’s health profile relate to the long-term development of heart disease.

Code
library(readr)
library(dplyr)
library(gt)

df <- read_csv("framingham.csv",show_col_types = FALSE)


variables <- colnames(df)


table <- data.frame(
  Variable = variables,
  Description = c(
    "Sex of the participant, (1 = male, 0 = female).",
    "Age of the participant.",
    "Education level (1 = some high school, 2 = high school graduate or GED, 3 = some college, 4 = college graduate).",
    "Indicates whether or not the patient is a current smoker.",
    "Average number of cigarettes smoked per day.",
    "Whether the participant is taking medication for blood pressure.",
    "Whether the participant has ever experienced a stroke.",
    "Whether or not the patient was hypertensive.",
    "Whether the participant has been diagnosed with diabetes.",
    "Cholesterol level of the participant.",
    "Systolic blood pressure.",
    "Diastolic blood pressure.",
    "Body Mass Index (BMI).",
    "Heart rate of the participant.",
    "Glucose level.",
    "Outcome variable indicating whether the participant developed coronary heart disease (CHD) within 10 years (1 = Yes, 0 = No)."
  )
)

table %>%
  gt() %>%
  tab_header(
    title = "Table: Description of Variables in the Framingham Heart Study Dataset"
  ) 
Table: Description of Variables in the Framingham Heart Study Dataset
Variable Description
male Sex of the participant, (1 = male, 0 = female).
age Age of the participant.
education Education level (1 = some high school, 2 = high school graduate or GED, 3 = some college, 4 = college graduate).
currentSmoker Indicates whether or not the patient is a current smoker.
cigsPerDay Average number of cigarettes smoked per day.
BPMeds Whether the participant is taking medication for blood pressure.
prevalentStroke Whether the participant has ever experienced a stroke.
prevalentHyp Whether or not the patient was hypertensive.
diabetes Whether the participant has been diagnosed with diabetes.
totChol Cholesterol level of the participant.
sysBP Systolic blood pressure.
diaBP Diastolic blood pressure.
BMI Body Mass Index (BMI).
heartRate Heart rate of the participant.
glucose Glucose level.
TenYearCHD Outcome variable indicating whether the participant developed coronary heart disease (CHD) within 10 years (1 = Yes, 0 = No).
Code
import pandas as pd
import seaborn as sns        
import matplotlib.pyplot as plt  
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFE
from sklearn.model_selection import train_test_split



df = pd.read_csv("framingham.csv")


counts = df['TenYearCHD'].value_counts().sort_index()


total = counts.sum()
labels = ["No (0)", "Yes (1)"]

pct = counts / total * 100
labels_with_values = [f"{l}\n{c} ({p:.1f}%)" for l, c, p in zip(labels, counts, pct)]

plt.pie(
    counts.values,
    labels=labels_with_values,
    startangle=90,
    colors=sns.color_palette("Reds")
);
plt.title("Ten Year Heart Disease Outcomes")
plt.tight_layout()
plt.show()

The pie chart above shows how many people developed heart disease over ten years. Most people did not develop heart disease (3,594 people, or 84.8% of the group), while a smaller number (644 people, or 15.2%) did. Overall, the chart shows that only a small portion of the group developed heart disease during the ten-year period.

Code
df["BPMeds"] = pd.to_numeric(df["BPMeds"], errors="coerce")
df = df.dropna(subset=["BPMeds"]).copy()
df["BPMeds"] = df["BPMeds"].astype(int)


df['sex'] = df['male'].astype(int)

df['sex_label'] = df['sex'].map({1: 'Male', 0: 'Female'})


binary_features = ["sex", "currentSmoker", "BPMeds", "prevalentStroke", "prevalentHyp", "diabetes"]  
target = "TenYearCHD" 

subplot_titles = {
    "sex": "Sex",
    "currentSmoker": "Current Smoker",
    "BPMeds": "Blood Pressure Meds",
    "prevalentStroke": "Prevalent Stroke",
    "prevalentHyp": "Prevalent Hypertension",
    "diabetes": "Diabetes"
}

label_map = {
    "sex": {0: "Female", 1: "Male"},     
    "default": {0: "No", 1: "Yes"}        
}

df[target] = pd.to_numeric(df[target], errors='coerce').astype('Int64')

sns.set_theme(style="whitegrid", context="talk")  

fig, axes = plt.subplots(2, 3, figsize=(16, 12))
axes = axes.flatten()  

for ax, col in zip(axes, binary_features):
    sub = df[[col, target]].dropna(subset=[col, target]).copy()
    
    x_order = [0, 1]
    hue_order = [0, 1]

    sns.countplot(
        data=sub, x=col, hue=target, ax=ax,
        order=x_order, hue_order=hue_order, palette='Blues'
    )

    ax.set_title(subplot_titles.get(col, col), fontsize=16)

    if col == "sex":
        ax.set_xticks([0, 1], labels=[label_map["sex"][0], label_map["sex"][1]])
    else:
        ax.set_xticks([0, 1], labels=[label_map["default"][0], label_map["default"][1]])

    ax.set_xlabel("")          
    ax.set_ylabel("Count")     

    heights = [p.get_height() for p in ax.patches if p.get_height() is not None]
    max_h = max(heights) if heights else 1
    y_off = 0.03 * max_h  
    for p in ax.patches:
        h = p.get_height()
        if h is None or h <= 0:
            continue  
        x = p.get_x() + p.get_width() / 2
        ax.text(x, h + y_off, f"{int(h)}", ha="center", va="bottom", fontsize=12)


for i, ax in enumerate(axes):
    leg = ax.legend(title=target, loc='upper right', fontsize=20, title_fontsize=20)
    if i != len(binary_features) - 1 and leg is not None:
        ax.legend_.remove()

fig.suptitle("Count of Ten Year Heart Disease Outcomes for binary features", fontsize=30, y=1)
plt.tight_layout()
plt.show()

The chart above compares each binary health feature with the ten-year heart disease outcome. In all six groups (sex, smoking, blood-pressure medication, stroke, hypertension, and diabetes), most people did not develop heart disease: the lighter bars, showing people without heart disease, are much taller in every category, while the darker bars, showing people who did develop it, are smaller in every group. No matter which feature is examined, only a small number of people developed heart disease over ten years.

Code
continuous_features = [
    "age", "education", "cigsPerDay", "totChol", "sysBP",
    "diaBP", "BMI", "heartRate", "glucose"
]
target = "TenYearCHD"

subplot_titles = {
    "age": "Age (years)",
    "education": "Education (1–4 levels)",
    "cigsPerDay": "Cigarettes per Day",
    "totChol": "Total Cholesterol (mg/dL)",
    "sysBP": "Systolic Blood Pressure (mmHg)",
    "diaBP": "Diastolic Blood Pressure (mmHg)",
    "BMI": "Body Mass Index (kg/m²)",
    "heartRate": "Resting Heart Rate (bpm)",
    "glucose": "Glucose (mg/dL)"
}

df[target] = pd.to_numeric(df[target], errors="coerce").astype("Int64")

sns.set_theme(style="whitegrid", context="talk")

ncols = 3
nrows = int(np.ceil(len(continuous_features) / ncols))
fig, axes = plt.subplots(nrows=nrows, ncols=ncols, figsize=(16, 12))
axes = axes.flatten()

last_plot_index = len(continuous_features) - 1

for i, (ax, col) in enumerate(zip(axes, continuous_features)):
    sub = df[[col, target]].dropna(subset=[col, target]).copy()

    if sub.empty:
        ax.text(0.5, 0.5, f"No data for '{col}'", ha="center", va="center", fontsize=12)
        ax.set_xticks([]); ax.set_yticks([])
        ax.set_xlabel(""); ax.set_ylabel("")
        continue

    g = sns.histplot(
        data=sub,
        x=col,
        hue=target,
        bins="auto",
        kde=False,
        multiple="layer", 
        palette="Purples",
        alpha=0.6,
        edgecolor="white",
        ax=ax
    )

    ax.set_xlabel(subplot_titles.get(col, col), fontsize=12)
    ax.set_ylabel("Count", fontsize=12)
    ax.tick_params(axis="both", labelsize=11)

    leg = ax.legend_
    if i != last_plot_index and leg is not None:
        ax.legend_.remove()

for j in range(len(continuous_features), nrows * ncols):
    axes[j].axis("off")

fig.suptitle("Count of Ten Year Heart Disease Outcomes for Continuous Features", fontsize=30, y=1)

plt.tight_layout()
plt.show()

The chart above shows the distribution of each continuous health measurement, split by whether people did or did not develop heart disease within ten years. For every feature, such as age, blood pressure, heart rate, cholesterol, BMI, and glucose, there are many more people who did not develop heart disease than people who did: the light purple bars for people without heart disease are much taller than the dark purple bars for people with it. Overall, the chart shows that heart disease cases are the minority across all of these measurements.

Modeling and Results

Data Preparation

Seven features contain missing values: education, cigarettes per day, blood pressure medication status, total cholesterol, body mass index (BMI), heart rate, and glucose level. To prepare the data for modeling, these missing values were replaced using statistical methods appropriate to the type of each variable. For categorical and ordinal variables such as blood pressure medication status and education level, the mode was used because it represents the most frequent category and helps preserve the original distribution of the data. For continuous variables such as cigarettes per day, cholesterol, BMI, heart rate, and glucose, the median was used because medical measurements often contain extreme values, and the median provides a more stable and representative estimate of the typical value than the mean.

Code

df_model = pd.read_csv("framingham.csv")

df_model["education"] = df_model["education"].fillna(df_model["education"].mode()[0])

df_model["BPMeds"] = df_model["BPMeds"].fillna(df_model["BPMeds"].mode()[0])

df_model["cigsPerDay"] = df_model["cigsPerDay"].fillna(df_model["cigsPerDay"].median())

df_model["totChol"] = df_model["totChol"].fillna(df_model["totChol"].median())

df_model["BMI"] = df_model["BMI"].fillna(df_model["BMI"].median())

df_model["heartRate"] = df_model["heartRate"].fillna(df_model["heartRate"].median())

df_model["glucose"] = df_model["glucose"].fillna(df_model["glucose"].median())

#df_model.isnull().sum() # Missing values sum 

For feature selection, both correlation and recursive feature elimination (RFE) were used to find the most important predictors. The data was first split into 80% training data and 20% testing data, giving 3,390 patients in the training set and 848 in the testing set. The predictor variables in both sets were then scaled so they would be on the same measurement scale, and feature selection was performed on the training data. RFE selected male, age, cigarettes per day (cigsPerDay), systolic blood pressure (sysBP), and glucose as its top five features. The correlation method selected age, systolic blood pressure (sysBP), prevalent hypertension (prevalentHyp), diastolic blood pressure (diaBP), and glucose. The final set of predictors was taken as the union of both results, giving seven features in total.

Code
X = df_model.drop("TenYearCHD", axis=1)
y = df_model["TenYearCHD"]

#Train 80%
# Test 20 %

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.2,
    random_state=42,
    stratify=y
)

scaler = StandardScaler()

X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)


model = LogisticRegression(max_iter=1000)  # raised from the default 100 so the solver converges

rfe = RFE(model, n_features_to_select=5)
rfe.fit(X_train_scaled, y_train);

selected_features = X.columns[rfe.support_]

train_df = X_train.copy()
train_df["TenYearCHD"] = y_train  # add target column back to it 

corr = train_df.corr()
top_corr = corr["TenYearCHD"].abs().sort_values(ascending=False)


top_corr_features = list(top_corr.index[1:6])
rfe_features = list(selected_features)

final_features = [col for col in X.columns if col in set(top_corr_features).union(set(rfe_features))]

#print("Final selected features:", final_features)
Code
#print(df.columns)
Code
#df["TenYearCHD"].value_counts(normalize=True)
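The cells above stop after feature selection. A minimal sketch of the remaining fit-and-evaluate step is shown below. It uses synthetic stand-in data so that it runs on its own; the real pipeline would instead pass the scaled Framingham columns listed in `final_features`:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the seven selected Framingham features; the
# outcome is drawn from a logistic model so a signal exists to recover.
rng = np.random.default_rng(42)
n = 1000
X = rng.normal(size=(n, 7))
logits = -2.0 + X @ np.array([0.8, 0.6, 0.4, 0.3, 0.2, 0.1, 0.05])
y = (rng.uniform(size=n) < 1 / (1 + np.exp(-logits))).astype(int)

# Same split-and-scale recipe as the feature-selection cell above.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

clf = LogisticRegression(max_iter=1000).fit(X_train_s, y_train)
pred = clf.predict(X_test_s)
proba = clf.predict_proba(X_test_s)[:, 1]  # probabilities for the ROC/AUC

print("accuracy :", round(accuracy_score(y_test, pred), 3))
print("precision:", round(precision_score(y_test, pred, zero_division=0), 3))
print("recall   :", round(recall_score(y_test, pred), 3))
print("F1       :", round(f1_score(y_test, pred, zero_division=0), 3))
print("AUC      :", round(roc_auc_score(y_test, proba), 3))
```

Because the CHD outcome is imbalanced (about 15% positive), recall and AUC on the held-out test set are more informative here than accuracy alone.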

References

Bewick, Viv, Liz Cheek, and Jonathan Ball. 2005. “Statistics Review 14: Logistic Regression.” Critical Care 9 (1): 112–18. https://doi.org/10.1186/cc3045.
Dayton, C Mitchell. 1992. “Logistic Regression Analysis.” Stat 474: 574.
FraminghamHeartStudy. n.d. “Heart Disease Prediction.” Kaggle. https://www.kaggle.com/datasets/dileep070/heart-disease-prediction-using-logistic-regression.
Joshi, Ram D, and Chandra K Dhakal. 2021. “Predicting Type 2 Diabetes Using Logistic Regression and Machine Learning Approaches.” International Journal of Environmental Research and Public Health 18 (14): 7346.
Shipe, Maren E, Stephen A Deppen, Farhood Farjah, and Eric L Grogan. 2019. “Developing Prediction Models for Clinical Use Using Logistic Regression: An Overview.” Journal of Thoracic Disease 11 (Suppl 4): S574.
Stoltzfus, Jill C. 2011. “Logistic Regression: A Brief Primer.” Academic Emergency Medicine 18 (10): 1099–1104. https://doi.org/10.1111/j.1553-2712.2011.01185.x.
Tripepi, Giovanni, KJ Jager, FW Dekker, and Carmine Zoccali. 2008. “Linear and Logistic Regression Analysis.” Kidney International 73 (7): 806–10.
Wang, Haolin, and Aruni Jayathilaka. 2024. “Exploring Predictive Factors for Heart Disease: A Comprehensive Analysis Using Logistic Regression.” University of Rochester Journal of Undergraduate Research 23 (1). https://doi.org/10.47761/YRCU1073.