Logistic Regression: Predicting Probabilities with Linear Boundaries

Titanic Survival Prediction → Customer Churn Prediction

Why Logistic Regression?

In real life, many decisions are yes/no:

  • Will a passenger on the Titanic survive?
  • Will a customer cancel their subscription?
  • Will a patient develop a certain condition?

Logistic Regression is one of the simplest yet most powerful tools for answering these binary classification problems. It’s used across industries, from healthcare (disease prediction) and finance (loan defaults) to tech (user retention).

In this tutorial, we’ll first explore Logistic Regression through the classic Titanic dataset, where the goal is to predict survival. Then we’ll scale it up to a real-world business problem: predicting customer churn.

How Logistic Regression Works

Unlike Linear Regression, which predicts continuous outcomes (like house prices), Logistic Regression predicts probabilities of belonging to a class.

At its core:

  • Linear regression estimates Y = \beta_0 + \beta_1X + \epsilon
  • Logistic regression transforms this using the sigmoid (logistic) function:

p = \dfrac{1}{1+e^{-(\beta_0 + \beta_1X)}}

Here:

  • p is the probability of the positive class (e.g., survival).
  • Output ranges from 0 to 1, making it ideal for classification.

For multiple features:

p = \dfrac{1}{1+e^{-(\beta_0 + \beta_1X_1 + \beta_2X_2 + \cdots + \beta_nX_n)}}

The decision rule:

  • If p \geq 0.5, classify as 1 (positive class).
  • If p < 0.5, classify as 0 (negative class).
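
To make the mapping concrete, here is a minimal NumPy sketch of the sigmoid and this decision rule (the coefficient values below are made up for illustration):

import numpy as np

def sigmoid(z):
    # Squash any real-valued score into the (0, 1) range
    return 1 / (1 + np.exp(-z))

# Hypothetical fitted coefficients: beta_0 (intercept) and beta_1
beta_0, beta_1 = -1.0, 0.8
X = np.array([0.5, 1.0, 3.0])

p = sigmoid(beta_0 + beta_1 * X)   # predicted probabilities
labels = (p >= 0.5).astype(int)    # apply the 0.5 threshold

print(p)       # approx. [0.354 0.450 0.802]
print(labels)  # [0 0 1]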

Cost Function:
Instead of Mean Squared Error, Logistic Regression uses Log Loss (Cross-Entropy Loss):

J(\beta) = -\dfrac{1}{n}\sum_{i=1}^n \Big[ y_i \log(\hat{y}_i) + (1-y_i)\log(1-\hat{y}_i) \Big]

This loss penalizes wrong predictions heavily, especially when the model is confident but wrong.
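
To make the formula concrete, here is a small sketch that computes the log loss by hand and checks it against scikit-learn's log_loss (the labels and probabilities are made up):

import numpy as np
from sklearn.metrics import log_loss

y_true = np.array([1, 0, 1, 1])
y_prob = np.array([0.9, 0.1, 0.8, 0.3])  # predicted P(class = 1)

# Direct translation of the cross-entropy formula above
manual = -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

print(manual)                    # approx. 0.41; the confident-but-wrong y=1, p=0.3 case dominates
print(log_loss(y_true, y_prob))  # matches the manual computation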

Toy Problem – Titanic Survival Prediction

The Titanic dataset is one of the most famous beginner datasets. Each passenger record contains features like age, gender, class, and survival status.

Dataset Snapshot

PassengerId | Pclass | Sex    | Age | SibSp | Parch | Fare  | Survived
1           | 3      | male   | 22  | 1     | 0     | 7.25  | 0
2           | 1      | female | 38  | 1     | 0     | 71.28 | 1
3           | 3      | female | 26  | 0     | 0     | 7.92  | 1
  • Survived = 1 → Passenger survived.
  • Survived = 0 → Passenger did not survive.

Step 1: Load dataset

Loads Titanic data into a dataframe.

import pandas as pd

data = pd.read_csv("titanic.csv")
print(data.head())

Step 2: Prepare Dataset

Encodes categorical variables and handles missing values.

# Convert categorical to numeric
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})

# Fill missing Age values with the median (plain assignment avoids the
# deprecated chained inplace=True pattern)
data['Age'] = data['Age'].fillna(data['Age'].median())

# Select features and labels
X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']

Step 3: Train-test split

Splits dataset for training and testing.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 4: Train Logistic Regression model

Trains a logistic regression classifier.

from sklearn.linear_model import LogisticRegression

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

Step 5: Evaluate model

Evaluates accuracy and precision/recall.

from sklearn.metrics import accuracy_score, classification_report

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Quick Reference: Titanic Toy Problem Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

data = pd.read_csv("titanic.csv")
data['Sex'] = data['Sex'].map({'male': 0, 'female': 1})
data['Age'] = data['Age'].fillna(data['Age'].median())

X = data[['Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare']]
y = data['Survived']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=200)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Real-World Application – Customer Churn Prediction

Business Problem: A telecom company wants to know whether a customer will cancel their subscription (churn).

  • Features: monthly charges, contract type, internet service, tenure, support tickets.
  • Label: churn (1 = yes, 0 = no).

Step 1: Load dataset

Loads churn dataset.

df = pd.read_csv("customer_churn.csv")
print(df.head())

Step 2: Preprocess

Encodes categorical data and prepares features.

# Convert categorical variables using one-hot encoding
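# (assumption: if the Churn label is stored as 'Yes'/'No' strings, map it to
#  1/0 first, e.g. df['Churn'] = df['Churn'].map({'Yes': 1, 'No': 0}), so that
#  get_dummies leaves the label column intact rather than renaming it)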
df = pd.get_dummies(df, drop_first=True)

# Separate features and target
X = df.drop("Churn", axis=1)
y = df["Churn"]

Step 3: Train-test split

Splits data into train/test.

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

Step 4: Train Logistic Regression model

Trains logistic regression.

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

Step 5: Evaluate

Measures prediction performance.

y_pred = model.predict(X_test)

print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Full Code Collection: Churn Application Code

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report

df = pd.read_csv("customer_churn.csv")
df = pd.get_dummies(df, drop_first=True)

X = df.drop("Churn", axis=1)
y = df["Churn"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

model = LogisticRegression(max_iter=500)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

Strengths & Limitations

Strengths

  • Simple, interpretable, and widely used.
  • Works well when classes are linearly separable.
  • Outputs probabilities, not just labels (see the short example after this list).
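
For instance, scikit-learn's predict_proba exposes those probabilities directly, which lets you choose a threshold other than 0.5. A sketch, continuing from the fitted churn model above:

# Probability of each class for the first few test rows
proba = model.predict_proba(X_test[:5])
print(proba)  # columns: [P(class 0), P(class 1)]

# A custom threshold, e.g. flag likely churners when P(churn) >= 0.3
flagged = (model.predict_proba(X_test)[:, 1] >= 0.3).astype(int)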

Limitations

  • Struggles with complex nonlinear relationships.
  • Sensitive to outliers and multicollinearity.
  • Performance degrades when classes are highly imbalanced (one common mitigation is sketched after this list).
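
For imbalanced classes, one common mitigation is to reweight the loss by inverse class frequency; a minimal sketch using scikit-learn's class_weight option, reusing the churn split from above:

from sklearn.linear_model import LogisticRegression

# 'balanced' weights each class inversely to its frequency in y_train
balanced_model = LogisticRegression(class_weight='balanced', max_iter=500)
balanced_model.fit(X_train, y_train)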

Final Notes

In this tutorial, we:

  • Understood Logistic Regression theory (sigmoid, cost function).
  • Applied it on the Titanic dataset to predict survival.
  • Scaled it to a real-world business problem of customer churn.

Logistic Regression remains a cornerstone of applied AI: interpretable, efficient, and practical, and still one of the most widely applied techniques in business and research.

Next Steps for You:

Explore Regularization (L1/L2 penalties) in Logistic Regression to avoid overfitting.
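
As a starting point, scikit-learn's LogisticRegression already applies an L2 penalty by default; the C parameter is the inverse regularization strength (smaller C means stronger regularization). A minimal sketch of trying both penalties on the churn split:

from sklearn.linear_model import LogisticRegression

# L2 (ridge) penalty is the default
l2_model = LogisticRegression(penalty='l2', C=0.1, max_iter=500)

# L1 (lasso) penalty requires a compatible solver and can zero out coefficients
l1_model = LogisticRegression(penalty='l1', solver='liblinear', C=0.1)

l2_model.fit(X_train, y_train)
l1_model.fit(X_train, y_train)
print("Nonzero L1 coefficients:", (l1_model.coef_ != 0).sum())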

Compare Logistic Regression with tree-based models (Decision Trees, Random Forests) for the same dataset.
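
As a quick baseline for that comparison, here is a hedged sketch that fits a Random Forest on the same train/test split (the hyperparameters are illustrative defaults):

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

print("Random Forest accuracy:      ", accuracy_score(y_test, rf.predict(X_test)))
print("Logistic Regression accuracy:", accuracy_score(y_test, model.predict(X_test)))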

References

[1] C. Bishop, Pattern Recognition and Machine Learning, Springer, 2006.
[2] T. Hastie, R. Tibshirani, J. Friedman, The Elements of Statistical Learning, Springer, 2009.
[3] Kaggle Titanic Dataset. [Online]. Available: https://www.kaggle.com/c/titanic
[4] IBM Customer Churn Dataset. [Online]. Available: https://www.ibm.com/communities/analytics/watson-analytics-blog/customer-churn-dataset/
