Logistic Regression: Titanic Survival Prediction → Customer Churn Prediction

Why Logistic Regression?
In real life, many decisions are yes/no:
- Will a passenger on the Titanic survive?
- Will a customer cancel their subscription?
- Will a patient develop a certain condition?
Logistic Regression is one of the simplest yet most powerful tools for answering these binary classification problems. It is used across industries, from healthcare (disease prediction) and finance (loan defaults) to tech (user retention).
In this tutorial, we’ll first explore Logistic Regression through the classic Titanic dataset, where the goal is to predict survival. Then we’ll scale it up to a real-world business problem: predicting customer churn.
How Logistic Regression Works
Unlike Linear Regression, which predicts continuous outcomes (like house prices), Logistic Regression predicts probabilities of belonging to a class.
At its core:
- Linear regression estimates a raw score from the inputs:

$$z = \beta_0 + \beta_1 x$$

- Logistic regression transforms this score using the sigmoid (logistic) function:

$$p = \sigma(z) = \frac{1}{1 + e^{-z}}$$

Here:
- $p$ is the probability of the positive class (e.g., survival).
- The output ranges from 0 to 1, making it ideal for classification.

For multiple features:

$$z = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_n x_n$$

The decision rule:
- If $p \geq 0.5$, classify as 1 (positive class).
- If $p < 0.5$, classify as 0 (negative class).
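To make the mapping concrete, here is a minimal sketch of the sigmoid and decision rule in plain Python with NumPy (the scores are illustrative, not fitted values):

import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

scores = np.array([-2.0, 0.0, 3.0])  # illustrative linear scores z
probs = sigmoid(scores)              # approx. [0.12, 0.50, 0.95]
labels = (probs >= 0.5).astype(int)  # decision rule -> [0, 1, 1]
print(probs, labels)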
Cost Function:
Instead of Mean Squared Error, Logistic Regression uses Log Loss (Cross-Entropy Loss):

$$J(\beta) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y_i \log(\hat{p}_i) + (1 - y_i) \log(1 - \hat{p}_i) \right]$$
This ensures the model penalizes wrong predictions heavily, especially when the model is confident but wrong.
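The penalty is easy to see numerically; this small sketch (with made-up probabilities) compares a confident correct prediction against a confident wrong one:

import numpy as np

def single_log_loss(y_true, p):
    # Cross-entropy for one prediction with true label y_true
    return -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

print(single_log_loss(1, 0.95))  # confident and right -> ~0.05
print(single_log_loss(1, 0.05))  # confident but wrong -> ~3.00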
Toy Problem – Titanic Survival Prediction
The Titanic dataset is one of the most famous beginner datasets. Each passenger record contains features like age, gender, and ticket class, along with a label indicating survival.
Dataset Snapshot
| PassengerId | Pclass | Sex | Age | SibSp | Parch | Fare | Survived |
|---|---|---|---|---|---|---|---|
| 1 | 3 | male | 22 | 1 | 0 | 7.25 | 0 |
| 2 | 1 | female | 38 | 1 | 0 | 71.28 | 1 |
| 3 | 3 | female | 26 | 0 | 0 | 7.92 | 1 |
- Survived = 1 → Passenger survived.
- Survived = 0 → Passenger did not survive.
Step 1: Load dataset
Loads Titanic data into a dataframe.
import pandas as pd
data = pd.read_csv("titanic.csv")
print(data.head())
Step 2: Prepare Dataset
Encodes categorical variables and handles missing values.
# Convert categorical to numeric
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
# Fill missing values
data['Age'] = data['Age'].fillna(data['Age'].median())
# Select features and labels
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']
Step 3: Train-test split
Splits dataset for training and testing.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 4: Train Logistic Regression model
Trains a logistic regression classifier.
from sklearn.linear_model import LogisticRegression
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
Step 5: Evaluate model
Evaluates accuracy and precision/recall.
from sklearn.metrics import accuracy_score, classification_report
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
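Since logistic regression produces probabilities rather than only hard labels, it is worth inspecting them, along with a confusion matrix, instead of relying on accuracy alone. A short sketch continuing from the variables above:

from sklearn.metrics import confusion_matrix

# Probability of the positive class (Survived = 1) for each test passenger
probs = model.predict_proba(X_test)[:, 1]
print(probs[:5])

# Rows are actual classes, columns are predicted classes
print(confusion_matrix(y_test, y_pred))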
Quick Reference: Titanic Toy Problem Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
data = pd.read_csv("titanic.csv")
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data['Age'] = data['Age'].fillna(data['Age'].median())
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Real-World Application: Customer Churn Prediction
Business Problem: A telecom company wants to know whether a customer will cancel their subscription (churn).
- Features: monthly charges, contract type, internet service, tenure, support tickets.
- Label: churn (1 = yes, 0 = no).
Step 1: Load dataset
Loads churn dataset.
df = pd.read_csv("customer_churn.csv")
print(df.head())
Step 2: Preprocess
Encodes categorical data and prepares features.
# Convert categorical variables using one-hot encoding
# (Churn is assumed to be numeric 0/1 already, as described above,
#  so get_dummies leaves it untouched)
df = pd.get_dummies(df, drop_first=True)
# Separate features and target
X = df.drop("Churn", axis=1)
y = df["Churn"]
Step 3: Train-test split
Splits data into train/test.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
Step 4: Train Logistic Regression model
Trains logistic regression.
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
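Note that scikit-learn's LogisticRegression applies L2 regularization by default, so features on very different scales (e.g., tenure vs. monthly charges) are penalized unevenly. An optional variant, sketched below, standardizes the features inside a Pipeline; the steps that follow continue with the unscaled model:

from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Scale features to zero mean / unit variance before fitting,
# so the regularization penalty treats all columns comparably
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=500))
pipe.fit(X_train, y_train)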
Step 5: Evaluate
Measures prediction performance.
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
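One practical payoff of logistic regression is interpretability: the fitted coefficients show how each feature pushes the churn probability up or down. A quick sketch for inspecting them, assuming the model above was trained on the one-hot-encoded frame:

import pandas as pd

# Pair each encoded column with its learned weight;
# positive weights push predictions toward churn (class 1)
coefs = pd.Series(model.coef_[0], index=X.columns).sort_values()
print(coefs.head())  # strongest churn-reducing features
print(coefs.tail())  # strongest churn-increasing features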
Full Code Collection: Churn Application Code
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report
df = pd.read_csv("customer_churn.csv")
df = pd.get_dummies(df, drop_first=True)  # Churn assumed numeric 0/1, left untouched
X = df.drop("Churn", axis=1)
y = df["Churn"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))
Strengths & Limitations
Strengths
- Simple, interpretable, and widely used.
- Works well when classes are linearly separable.
- Outputs probabilities, not just labels.
Limitations
- Struggles with complex nonlinear relationships.
- Sensitive to outliers and multicollinearity.
- Performance degrades when classes are highly imbalanced (one common mitigation is sketched after this list).
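When churners are rare, accuracy becomes misleading and the default 0.5 threshold favors the majority class. A minimal sketch of one mitigation, assuming the positive (churn) class is the minority, is to reweight the classes during training:

# Weight each class inversely to its frequency, so rare churners
# contribute as much to the loss as the majority class
balanced_model = LogisticRegression(max_iter=500, class_weight='balanced')
balanced_model.fit(X_train, y_train)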
Final Notes
In this tutorial, we:
- Understood Logistic Regression theory (sigmoid, cost function).
- Applied it on the Titanic dataset to predict survival.
- Scaled it to a real-world business problem of customer churn.
Logistic Regression remains a cornerstone of applied AI: interpretable, efficient, and practical, and still one of the most widely applied techniques in business and research.
Next Steps for You:
Explore Regularization (L1/L2 penalties) in Logistic Regression to avoid overfitting (a starter sketch follows this list).
Compare Logistic Regression with tree-based models (Decision Trees, Random Forests) for the same dataset.
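As a starting point for the regularization suggestion, scikit-learn's LogisticRegression exposes the penalty type and its strength directly. A minimal sketch; the C value here is arbitrary and should be tuned (e.g., with cross-validation):

# L1 (lasso) penalty can drive some coefficients exactly to zero;
# it requires a compatible solver such as 'liblinear'
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.5, max_iter=500)
l1_model.fit(X_train, y_train)

# L2 (ridge) penalty is scikit-learn's default; smaller C = stronger penalty
l2_model = LogisticRegression(penalty='l2', C=0.5, max_iter=500)
l2_model.fit(X_train, y_train)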
References
[1] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[2] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2009.
[3] Kaggle Titanic Dataset. [Online]. Available: https://www.kaggle.com/c/titanic
[4] IBM Customer Churn Dataset. [Online]. Available: https://www.ibm.com/communities/analytics/watson-analytics-blog/customer-churn-dataset/
