Business Analytics

Why This Matters

Every business decision rests on a claim about reality: "raising our price will reduce sales," "this marketing campaign will lift conversion," "men are paid more than women at our firm." Business Analytics gives you the quantitative machinery to test those claims with data rather than intuition. The course builds one tool -- regression analysis -- from a single equation with two variables all the way to classification models, time series forecasting, and controlled experiments. By the end, you will be able to open a data set, fit a model, diagnose its quality, and translate the output into an actionable recommendation -- or explain to a room of executives exactly why the data does not support their pet theory.

The textbook (Quantitative Methods for Management by Canela, Alegre, and Ibarra) is deliberately pragmatic: it treats regression as an algorithm executed via software, skips derivations, and focuses relentlessly on interpretation and business judgment. That philosophy shapes this entry.

How It All Connects

The course follows a clean progression. You start with the simplest possible model -- a single regression line relating one X to one Y (Sessions 1--2). You then add complexity: multiple independent variables (Session 2), statistical testing of coefficients (Session 3), categorical data via dummy variables (Session 4), and time dependence via lagged variables and trends (Session 5). Sessions 6--8 pivot to classification -- predicting group membership (yes/no, churn/stay, default/pay) rather than a continuous number. Session 9 introduces A/B testing -- the gold standard for establishing causation. Session 10 confronts the ethical limits of algorithmic decision-making.

Cross-references abound:

  • Marketing Management: Pricing sensitivity (Session 1's Greenchips case), customer segmentation, churn modeling (Session 7), and A/B testing of campaigns are all direct applications of regression and classification.
  • Operations Management: Demand forecasting (Session 5's time series), exponential smoothing, and moving averages are the backbone of production planning and inventory management.
  • Corporate Finance: Beta estimation in the Capital Asset Pricing Model (CAPM) is a simple linear regression of stock returns on market returns. R-squared tells you how much systematic risk the market explains.
  • Decision Analysis: Regression provides the probability estimates that feed into decision trees and expected value calculations. Bayes' Theorem connects to the confusion matrix logic of classification.


Session 1: The Regression Line

Case: Greenchips -- price versus sales sensitivity

Core Concept

Simple linear regression finds the single straight line that best fits a set of data points (X, Y). The equation is:

Y = a + bX

Symbol Name Meaning
Y Dependent variable The outcome you want to predict or explain
X Independent variable The factor you believe drives Y
a Intercept The predicted value of Y when X = 0
b Slope The average change in Y for a one-unit increase in X

Interpreting the Coefficients

The slope (b) is where the action is. It tells you: "For every additional unit of X, Y changes by b units on average." If b > 0, the association is positive; if b < 0, negative. In the Greenchips case, a negative slope means higher price leads to lower sales -- exactly what economic theory predicts.

The intercept (a) is the predicted value of Y when X = 0. The textbook explicitly warns that the intercept rarely deserves attention: in most business contexts, X = 0 lies outside the range of your data, so the intercept is a dangerous extrapolation (e.g., "predicted weight for a person with height = 0 cm" is nonsense).

The Method of Least Squares

The regression algorithm chooses a and b to minimize the Sum of Squared Residuals (SSR). A residual is the prediction error for a single observation:

Residual = Actual value - Predicted value

Properties of the regression line:

  • The sum of all residuals equals zero (positive and negative errors cancel out perfectly).
  • Points above the line have positive residuals; points below have negative residuals.
  • The line passes through the point (mean of X, mean of Y).

Correlation (R) and R-Squared (R^2)

The regression decomposes the total variance in Y into two parts:

Var(Actual values) = Var(Predicted values) + Var(Residuals)

Metric Formula Range Interpretation
R (correlation) -- -1 to +1 Direction and strength of linear association; same sign as slope
R^2 (coefficient of determination) 1 - Var(Residuals) / Var(Actual values) 0 to 1 Proportion of variance in Y explained by the model

Reading R^2 in plain English: An R^2 of 0.75 means "75% of the variation in sales can be explained by price; 25% is unexplained noise." There is no universal threshold for "good enough" -- it depends on the domain. An R^2 of 0.30 in social science may be excellent; in a physics experiment, 0.95 might be mediocre. In business, focus on whether the model's predictions are accurate enough to support the decision at hand.
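The whole Session 1 machinery fits in a few lines of Python. This is a minimal sketch; the price/sales numbers are invented for illustration and are not the actual Greenchips data:

```python
from statistics import mean

def fit_line(xs, ys):
    """Least-squares fit of Y = a + bX.

    The slope is cov(X, Y) / var(X); because the line passes through
    (mean X, mean Y), the intercept is a = mean(Y) - b * mean(X).
    """
    mx, my = mean(xs), mean(ys)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / \
        sum((x - mx) ** 2 for x in xs)
    a = my - b * mx
    return a, b

def r_squared(xs, ys, a, b):
    """R^2 = 1 - Var(Residuals) / Var(Actual values)."""
    my = mean(ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - my) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical price/sales pairs (illustrative only)
price = [1.0, 1.2, 1.4, 1.6, 1.8]
sales = [100, 92, 85, 78, 70]

a, b = fit_line(price, sales)
r2 = r_squared(price, sales, a, b)
print(round(b, 1), round(r2, 3))  # -37.0 0.999
```

Note the negative slope (higher price, lower sales) and the near-zero residual sum, exactly the properties listed above.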

Back-of-napkin: Before trusting any regression, do three sanity checks: (1) Does the sign of b make business sense? (2) Is R^2 high enough to be useful? (3) Is the sample large and representative?

Cross-reference -- Corporate Finance: Estimating a stock's beta is exactly this: regress stock returns (Y) on market returns (X). The slope is beta; R^2 tells you how much of the stock's risk is systematic versus idiosyncratic.


Session 2: Multiple Linear Regression

Case: Barcelona Real Estate

Core Concept

When more than one factor drives your dependent variable, you extend the equation:

Y = a + b1X1 + b2X2 + ... + bkXk

Symbol Name Meaning
Y Dependent variable e.g., apartment price
X1, X2, ..., Xk Independent variables e.g., square meters, number of rooms, distance to center
b1, b2, ..., bk Slope coefficients Average change in Y for a one-unit increase in that Xi, holding all other X's constant
a Intercept Predicted Y when all X's equal zero
R Multiple correlation Overall goodness-of-fit; always positive; ranges from 0 to 1

Interpreting Coefficients: The "Holding All Else Constant" Clause

In multiple regression, each coefficient bi tells you the partial effect of Xi on Y, assuming all other variables are unchanged. This is both the power and the trap:

  • Power: You can isolate the effect of one variable (e.g., square meters) while controlling for others (e.g., location, age of building).
  • Trap: If two X variables move together in reality, the assumption of "holding one constant while changing the other" is unrealistic. Adding or dropping a variable will cause the other coefficients to shift.

Multiple R vs. R-Squared vs. Adjusted R-Squared

Adding any variable to the model will increase R (and R^2) -- or at worst leave them unchanged -- even if that variable is random noise. This creates a false sense of improvement.

Metric Formula Key Property
R^2 Var(Predicted) / Var(Actual) Always increases when you add variables
Adjusted R^2 1 - [(1 - R^2)(n - 1)] / (n - k - 1) Penalizes for adding useless variables; can decrease
Symbol Definition
n Number of observations (sample size)
k Number of independent variables

Rule of thumb: If Adjusted R^2 drops when you add a variable, that variable is not helping -- drop it.

Back-of-napkin: Always compare Adjusted R^2 across competing models, not raw R^2. If adding a variable barely moves Adjusted R^2, the added complexity is not worth it.
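The penalty is easy to see numerically. In this sketch (the R^2 values and sample size are hypothetical), a noise variable nudges raw R^2 up, yet Adjusted R^2 falls:

```python
def adjusted_r2(r2, n, k):
    """Adjusted R^2 = 1 - (1 - R^2)(n - 1) / (n - k - 1)."""
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

# Hypothetical: a junk variable lifts R^2 from 0.700 to 0.705 on n = 30
before = adjusted_r2(0.700, n=30, k=3)
after = adjusted_r2(0.705, n=30, k=4)
print(round(before, 3), round(after, 3))  # 0.665 0.658 -- drop the variable
```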


Session 3: Testing Regression Coefficients

Case: Orange juice market share

The Core Question

You have estimated a coefficient from sample data, but does it reflect a real relationship in the population, or could it be zero (pure noise)?

Confidence Intervals

Because your regression is based on a sample, the estimated coefficients have sampling error. The software reports a Standard Error (SE) for each coefficient, and you build a 95% confidence interval:

95% Confidence Interval = Coefficient +/- (2 x SE)

(The factor is approximately 2 for large samples; technically it comes from the Student t-distribution.)

Decision rule: If the 95% confidence interval does not contain zero, the coefficient is statistically significant -- you can conclude the variable has a real effect on Y.

p-Values

The p-value is a probability (range 0 to 1) that answers: "If this variable truly had zero effect, how likely would I be to observe a coefficient this large (or larger) just by chance?"

p-value Interpretation
p < 0.05 Statistically significant (by universal consensus)
p = 0.057 Borderline -- "maybe with a bigger sample..."
p = 0.315 or p = 0.623 Not significant -- both mean the same thing operationally

The p < 0.05 rule is equivalent to the 95% confidence interval rule. They will always give the same answer.
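The confidence-interval rule can be coded directly. The coefficients below are hypothetical software output, not figures from the orange juice case:

```python
def conf_interval(coef, se, t=2.0):
    """95% CI ~ coefficient +/- 2 x SE (large-sample approximation)."""
    return coef - t * se, coef + t * se

def is_significant(coef, se):
    """Significant at the 5% level iff the 95% CI excludes zero."""
    lo, hi = conf_interval(coef, se)
    return not (lo <= 0 <= hi)

# Hypothetical coefficients
print(is_significant(-1.8, 0.6))  # CI ~ (-3.0, -0.6): excludes 0 -> True
print(is_significant(0.9, 0.7))   # CI ~ (-0.5, 2.3): contains 0 -> False
```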

The t-Statistic

The t-statistic is an intermediate calculation: t = Coefficient / SE. The textbook explicitly dismisses it as having "no interest in business applications." The software uses it to compute the p-value. Focus on p-values and confidence limits.

What Significance Is Not

Statistical significance is not the same as practical relevance. With a massive data set (hundreds of thousands of observations), tiny, operationally useless effects will register as "significant." Conversely, with a small sample, genuinely important effects may fail the p < 0.05 test. Decisions must be made even without perfect statistical significance -- quantitative analysis supports managerial judgment; it does not replace it.

The F-Test for Overall Model Significance

While each coefficient has its own p-value, the F-test asks: "Is this entire model, taken as a whole, doing better than just guessing the average?"

F = [R^2 / k] / [(1 - R^2) / (n - k - 1)]

Component Meaning
Numerator Variance explained per independent variable
Denominator Unexplained variance per remaining degree of freedom
H0 (null hypothesis) All slope coefficients are simultaneously zero (b1 = b2 = ... = bk = 0)
Decision If p-value of F < 0.05, reject H0; the model as a whole is significant

Note: The textbook considers the ANOVA table "irrelevant in business applications," but the F-test is a standard exam topic.
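Computing F from the quantities the software already reports is one line. The R^2, n, and k below are hypothetical:

```python
def f_statistic(r2, n, k):
    """F = (R^2 / k) / ((1 - R^2) / (n - k - 1))."""
    return (r2 / k) / ((1 - r2) / (n - k - 1))

# Hypothetical model: R^2 = 0.60 with n = 50 observations and k = 3 variables
print(round(f_statistic(0.60, n=50, k=3), 1))  # 23.0
```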


Session 4: Dummy Variables

Case: Scandia gender pay gap

The Problem

Regression math cannot process text like "Male/Female" or "Germany/France." You need to convert categorical variables into numbers.

The Solution: Dummy Variables

A dummy variable takes only two values: 1 (belongs to this group) or 0 (does not belong). For a categorical variable with k groups, you create exactly k - 1 dummies. The omitted group is the baseline (reference category).

Group D (Male dummy)
Female 0
Male 1

Why k - 1? If you include k dummies for k groups, you create perfect multicollinearity (the dummies sum to 1, which is a linear function of the intercept). The software will crash or produce garbage.

Interpreting Dummy Coefficients

For the equation Y = a + bD:

  • When D = 0 (baseline group, e.g., Female): predicted Y = a. The intercept is the average Y for the baseline group.
  • When D = 1 (e.g., Male): predicted Y = a + b. The coefficient b is the average difference between the dummy group and the baseline.

With additional variables (Y = a + bD + cX):

  • b is the average difference between the two groups for a given value of X (i.e., holding X constant).
  • Example from the Scandia case: with Tenure controlled, the Male dummy coefficient of +$7,907 means males earn $7,907 more than females at the same tenure level.

Three or More Groups

For k groups, create k - 1 dummies. Example with marital status (Single, Married, Divorced), using Single as the baseline:

Group D1 (Married) D2 (Divorced)
Single 0 0
Married 1 0
Divorced 0 1

In the equation Y = a + b1D1 + b2D2:

  • a = average Y for Singles
  • b1 = average difference (Married minus Single)
  • b2 = average difference (Divorced minus Single)

Back-of-napkin: Name your dummy after the group coded as 1 (e.g., MALE, MARRIED). This prevents confusion about which direction the difference runs.


Session 5: Time Series and Lagged Variables

Case: Sales Trend at Guarini

Core Concept

A time series is data observed over time (weekly sales, monthly prices, quarterly revenue). Time series analysis exploits the fact that past values help predict future values -- observations are not independent.

Decomposition

A time series can be broken into two components:

Component Definition Modeled By
Trend Underlying stable tendency over time TIME as an independent variable
Seasonality Recurrent calendar patterns (e.g., December spikes) Dummy variables or multiplicative factors

Trend Models

Parametric trends use a fixed mathematical function of time (t):

Trend Type Equation Meaning
Linear Y = a + bt Constant growth rate in units per period
Quadratic Y = a + bt + ct^2 Growth accelerates (c > 0) or decelerates (c < 0) over time

Warning: Parametric trends put the same weight on events that happened years ago as on recent events. If the environment has changed, old data can severely bias your forecast. Drop very old data or switch to a nonparametric trend.

Nonparametric trends use moving averages or exponential smoothing to adapt to recent data:

Exponential Smoothing Formula:

sm(t) = alpha x(t) + (1 - alpha) sm(t - 1)

Symbol Definition
sm(t) Smoothed value at time t
x(t) Actual observation at time t
sm(t - 1) Previous smoothed value
alpha Smoothing parameter (0 < alpha < 1)
  • alpha = 0.2 (common in business): 20% weight on current data, 80% on history. Produces a very smooth trend that filters out noise.
  • alpha = 0.7: Reacts quickly to sudden changes but amplifies noise.
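The recursion is short enough to run by hand. This sketch seeds the smoothed series with the first observation (a common convention, though other initializations exist); the series itself is invented:

```python
def exp_smooth(xs, alpha):
    """sm(t) = alpha * x(t) + (1 - alpha) * sm(t - 1), seeded with x(0)."""
    sm = [xs[0]]
    for x in xs[1:]:
        sm.append(alpha * x + (1 - alpha) * sm[-1])
    return sm

series = [100, 100, 100, 140, 100, 100]   # a one-period spike
smooth = exp_smooth(series, alpha=0.2)    # low alpha: filters the spike
fast = exp_smooth(series, alpha=0.7)      # high alpha: chases the spike
print(round(smooth[3], 1), round(fast[3], 1))  # 108.0 128.0
```

The low-alpha series barely registers the spike; the high-alpha series nearly follows it, illustrating the responsiveness-versus-noise tradeoff.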

Lagged Variables

A lagged variable is a past value of the same series used as a predictor. For example, predicting this month's sales using last month's sales:

Y(t) = a + b Y(t - 1)

  • The coefficient b captures persistence: how much of last period's level carries forward.
  • The sum of lag coefficients (if using multiple lags) is a measure of overall persistence or momentum.
  • If b is close to 1, the series is highly persistent (today looks a lot like yesterday). If b is close to 0, past values carry little predictive power.
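Persistence can be checked numerically by regressing Y(t) on Y(t - 1) with the same least-squares formula as Session 1. The upward-drifting series below is invented for illustration:

```python
from statistics import mean

def lag1_coef(ys):
    """Slope b of Y(t) = a + b * Y(t - 1), fit by least squares on lagged pairs."""
    x, y = ys[:-1], ys[1:]          # predictor = yesterday, outcome = today
    mx, my = mean(x), mean(y)
    return sum((u - mx) * (v - my) for u, v in zip(x, y)) / \
        sum((u - mx) ** 2 for u in x)

drifting = [10, 12, 13, 15, 16, 18, 19, 21]  # illustrative persistent series
print(round(lag1_coef(drifting), 2))         # near 1: highly persistent
```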

Seasonality

Additive seasonality: The seasonal effect is a fixed amount added to or subtracted from the trend.

Predicted value = Trend value + Seasonal

Use when the amplitude of seasonal fluctuations is roughly constant over time.

Multiplicative seasonality: The seasonal effect is a factor that scales with the trend level.

Predicted value = Trend value x Seasonal

A seasonal factor of 1.2 means sales in that month are typically 20% above trend; 0.8 means 20% below. Use when fluctuations grow or shrink as the trend rises or falls -- which is the norm with monthly sales data.

Back-of-napkin: Forecasting methods are not effective for long-term predictions (e.g., sales five years from now). Use your model to predict the next period, then recalibrate with new observations.

Cross-reference -- Operations Management: Moving averages and exponential smoothing are the same methods used in demand planning. The alpha parameter trades off responsiveness versus stability -- exactly the dilemma an operations manager faces when setting safety stock levels.


Sessions 6--7: Classification Models

Cases: UW Health Hospital (tumor detection); Customer churn (cost/benefit analysis)

The Shift from Regression to Classification

In regression, your dependent variable is continuous (sales, price, salary). In classification, it is binary -- coded as 1 (positive: defaults, churns, is spam) or 0 (negative: pays, stays, is legitimate).

Step 1: Generate a Score

You run a regression with a binary dependent variable. Each observation gets a predictive score that functions as a ranking -- how much does this observation "look like" a positive case?

Linear regression flaw: The score can fall below 0 or above 1, which makes no sense as a probability.

Logistic regression fix: A nonlinear method that constrains the score strictly within the 0-to-1 range, so it can be interpreted as a probability or propensity score.

The Logistic (Logit) Function:

Logistic regression models the log-odds (logit) of the event occurring:

ln[P / (1 - P)] = a + b1X1 + b2X2 + ... + bkXk

Symbol Definition
P Probability that Y = 1
1 - P Probability that Y = 0
P / (1 - P) The odds of the event
ln[P / (1 - P)] The log-odds (logit)

This transforms the linear equation into an S-shaped (sigmoid) curve that asymptotes at 0 and 1 -- no impossible probabilities.

Odds ratios: In logistic regression, you exponentiate the coefficient to get the odds ratio. An odds ratio of 1.5 means that for every one-unit increase in X, the odds of the event multiply by 1.5 (a 50% increase in odds).
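A minimal sketch of the scoring side: the function name predict_prob and the coefficients are hypothetical, but the sigmoid guarantees the score stays strictly between 0 and 1:

```python
import math

def predict_prob(a, bs, xs):
    """P(Y = 1) = 1 / (1 + exp(-(a + b1*x1 + ... + bk*xk))), always in (0, 1)."""
    z = a + sum(b * x for b, x in zip(bs, xs))
    return 1 / (1 + math.exp(-z))

# Hypothetical churn model: intercept -2.0, one coefficient 0.4
p = predict_prob(-2.0, [0.4], [5.0])
print(round(p, 3))              # 0.5: the logit is exactly zero here
print(round(math.exp(0.4), 2))  # 1.49: odds ratio per one-unit increase in X
```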

Step 2: Set a Cutoff Threshold

To convert the continuous score into a binary prediction, you pick a cutoff:

Score > Cutoff --> Predict Positive (Y = 1)
Score <= Cutoff --> Predict Negative (Y = 0)

Although 0.5 seems like the obvious default, the optimal cutoff depends entirely on business context.

Step 3: Evaluate with the Confusion Matrix

Cross-tabulate actual outcomes against predicted outcomes:

Actual Positive Actual Negative
Predicted Positive True Positive (TP) False Positive (FP) -- "False Alarm"
Predicted Negative False Negative (FN) -- "Missed Target" True Negative (TN)

Key Performance Indicators (KPIs)

Metric Formula Plain English
Accuracy (TP + TN) / Total Overall % correct
True Positive Rate (Sensitivity, Recall) TP / (TP + FN) Of all actual positives, what % did we catch?
False Positive Rate FP / (FP + TN) Of all actual negatives, what % did we falsely flag?
Specificity (True Negative Rate) TN / (TN + FP) Of all actual negatives, what % did we correctly clear?
Precision TP / (TP + FP) Of all predicted positives, what % are actually positive?

Warning on Accuracy: If the event is rare (e.g., only 1% of loans default), you can achieve 99% accuracy by predicting "no default" for everyone -- which defeats the entire purpose of the model. Always look at Sensitivity and the False Positive Rate alongside Accuracy.
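The rare-event warning is easy to reproduce. The loan counts below are hypothetical -- 10 real defaults among 1,000 loans, and a model that flags 30 of them:

```python
def kpis(tp, fp, fn, tn):
    """Standard confusion-matrix metrics."""
    total = tp + fp + fn + tn
    return {
        "accuracy": (tp + tn) / total,
        "sensitivity": tp / (tp + fn),          # of actual positives, % caught
        "false_positive_rate": fp / (fp + tn),  # of actual negatives, % flagged
        "precision": tp / (tp + fp),            # of flagged cases, % truly positive
    }

# Hypothetical: 10 defaults in 1,000 loans; the model catches 7 of them
m = kpis(tp=7, fp=23, fn=3, tn=967)
print(round(m["accuracy"], 3))             # 0.974 -- looks great on its own
print(round(m["sensitivity"], 2))          # 0.7  -- the number that matters
print(round(m["false_positive_rate"], 3))  # 0.023
```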

The ROC Curve and AUC

The Receiver Operating Characteristic (ROC) curve plots Sensitivity (y-axis) against the False Positive Rate (x-axis) across every possible cutoff threshold.

  • Area Under the Curve (AUC) = 0.5: The model is useless (equivalent to flipping a coin).
  • AUC = 1.0: The model perfectly separates the two classes.
  • AUC between 0.7 and 0.8: Acceptable discrimination in most business contexts.

The Cost-Benefit Analysis of Thresholds

You do not blindly accept 0.5 as the cutoff. You adjust it based on the relative cost of each type of error:

  • Lowering the threshold --> more Predicted Positives --> catches more True Positives (fewer Missed Targets) but generates more False Alarms.
  • Raising the threshold --> more Predicted Negatives --> fewer False Alarms but misses more actual positives.

Example (loan default): Giving a loan to someone who defaults (False Negative) destroys a large amount of capital. Denying a loan to a good customer (False Positive) only costs you the margin on that loan. Therefore, you lower the threshold to catch more defaulters, accepting more false alarms as the lesser evil.
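Sweeping the cutoff makes the tradeoff concrete. The six scored observations below are invented:

```python
def classify(scores, actuals, cutoff):
    """Count (TP, FP, FN, TN) for a given cutoff on predictive scores."""
    tp = sum(1 for s, y in zip(scores, actuals) if s > cutoff and y == 1)
    fp = sum(1 for s, y in zip(scores, actuals) if s > cutoff and y == 0)
    fn = sum(1 for s, y in zip(scores, actuals) if s <= cutoff and y == 1)
    tn = sum(1 for s, y in zip(scores, actuals) if s <= cutoff and y == 0)
    return tp, fp, fn, tn

scores = [0.9, 0.7, 0.6, 0.4, 0.3, 0.2]  # hypothetical model scores
actual = [1,   1,   0,   1,   0,   0]    # 1 = defaulted
print(classify(scores, actual, 0.5))   # (2, 1, 1, 2): one defaulter missed
print(classify(scores, actual, 0.25))  # (3, 2, 0, 1): all caught, more false alarms
```

Lowering the cutoff from 0.5 to 0.25 eliminates the Missed Target at the price of an extra False Alarm -- the lesser evil in the loan example above.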

Back-of-napkin: Before setting a cutoff, ask: "Which mistake is more expensive -- a False Positive or a False Negative?" Then tilt the threshold toward minimizing the costlier error.

Cross-reference -- Marketing Management: Churn modeling is a classification problem. The marketing team identifies likely churners (Predicted Positive) and targets them with retention offers. The cost of a False Positive (offering a discount to someone who was going to stay anyway) is much lower than the cost of a False Negative (losing a high-value customer).


Session 8: HR Analytics

Case: Barney -- demotivation and stress analysis

Application: Multiple Regression in People Analytics

This session applies the full regression toolkit to human resources data. The dependent variables are employee outcomes (motivation, stress, performance), and the independent variables include job characteristics, management practices, demographics, and compensation.

Key Analytical Moves

  1. Dummy variables for categorical HR data: Department, job level, gender, and education level are all encoded as dummies with a reference category.
  2. Coefficient interpretation: "Holding all other factors constant, employees in Department X have a stress score 3.2 points higher than those in the baseline department."
  3. Multicollinearity check: HR variables often correlate heavily (tenure with salary, education with job level). Run a correlation matrix and flag pairs with |R| >= 0.85.
  4. Significance testing: Use p-values to determine which factors genuinely drive demotivation versus which are statistical noise.
  5. Practical versus statistical significance: A coefficient may be statistically significant (p < 0.05) but too small to justify an organizational intervention. Always assess magnitude alongside significance.

Session 9: A/B Testing

Case: Vungle -- algorithm comparison

Core Concept

An A/B test is a randomized controlled experiment with two conditions:

Condition Label Example
A (Control) Current policy/design Old website layout
B (Treatment) Proposed policy/design New website layout

The purpose is to answer a specific causal question: "What is the average effect of using version B instead of version A on outcome Y?"

Why Randomization Matters

Because you use random assignment, the two groups differ systematically only in the specific manipulation you introduce. Everything else (demographics, behavior, preferences) is similar on average. This allows you to draw causal inferences -- not just correlations.

Execution via Regression

While the industry standard is the t-test, using regression is actually more powerful because you can control for additional variables:

  1. Define a treatment dummy variable: D = 1 if assigned to the treatment condition (B), D = 0 if assigned to the control condition (A).
  2. Run: Y = a + bD
  3. The intercept a = average outcome in the control group.
  4. The coefficient b = the causal effect of the treatment (average difference between groups).
  5. Test significance using the p-value of b.
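For a 0/1 treatment dummy, the regression identities in steps 3-4 are exact: the intercept equals the control-group mean and the slope equals the difference in group means. The conversion outcomes below are hypothetical:

```python
from statistics import mean

def ab_effect(control, treatment):
    """For Y = a + bD with a 0/1 treatment dummy,
    a = mean(control) and b = mean(treatment) - mean(control)."""
    a = mean(control)
    b = mean(treatment) - a
    return a, b

# Hypothetical conversion outcomes (1 = converted)
control = [0, 0, 1, 0, 1, 0, 0, 1]    # old layout
treatment = [1, 0, 1, 1, 0, 1, 1, 0]  # new layout
a, b = ab_effect(control, treatment)
print(a, round(b, 3))  # 0.375 0.25 -- a 25-point lift (before significance testing)
```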

Hypothesis Testing Framework

Concept Definition
Null Hypothesis (H0) There is no effect; the treatment makes no difference (b = 0)
Alternative Hypothesis (H1) The treatment does have an effect (b != 0)
Significance Level (alpha) The threshold you set in advance (usually 0.05) for rejecting H0
p-value The probability of observing a result this extreme if H0 were actually true
Type I Error (False Positive) You reject H0 when it is actually true -- you conclude the treatment works when it does not
Type II Error (False Negative) You fail to reject H0 when it is actually false -- you miss a real effect

Decision rule: If p-value < alpha (0.05), reject H0 and conclude the treatment effect is statistically significant.

Independent t-Test vs. Paired t-Test

Test When to Use Example
Independent t-test Comparing averages of two completely separate groups Group 1 sees Ad A; Group 2 sees Ad B
Paired t-test Comparing averages of the same group measured at two different times Same employees measured before and after a training program

The paired t-test is more powerful because it controls for individual differences (each person serves as their own control).

Back-of-napkin: An A/B test is only as good as its randomization. If the groups differ systematically before the treatment, the coefficient b captures selection bias, not the treatment effect. Always check that observable characteristics are balanced across groups.


Session 10: AI Ethics and Algorithmic Bias

Reading: ProPublica -- Machine Bias (COMPAS recidivism algorithm)

The Problem

Algorithms trained on historical data inherit the biases embedded in that data. When deployed for consequential decisions -- criminal sentencing, hiring, lending, insurance pricing -- biased algorithms can systematically disadvantage protected groups.

The COMPAS Case

ProPublica's investigation of the COMPAS (Correctional Offender Management Profiling for Alternative Sanctions) recidivism prediction algorithm found:

  • Black defendants were nearly twice as likely to be falsely flagged as future criminals (higher False Positive Rate).
  • White defendants were more likely to be incorrectly labeled low risk when they actually did reoffend (higher False Negative Rate).
  • Overall accuracy was similar across racial groups -- but the distribution of errors was starkly unequal.

Key Ethical Concepts

Concept Definition
Algorithmic bias Systematic errors in an algorithm's output that create unfair outcomes for specific groups
Fairness No single mathematical definition exists; different fairness criteria (equal accuracy, equal false positive rates, equal false negative rates) are often mutually exclusive
Transparency Can stakeholders understand how the algorithm makes decisions? Black-box models resist scrutiny
Accountability Who is responsible when an algorithm causes harm -- the developer, the company deploying it, or the data provider?
Human oversight Data-Driven Decision-Making (DDDM) uses analytics alongside human judgment; extreme reliance on pure calculation without ethical oversight can ignore the rights of minorities

The Fairness Impossibility Problem

You cannot simultaneously equalize all fairness metrics across groups (except in trivial cases). Optimizing for equal accuracy across races may produce unequal false positive rates. Business leaders must make an explicit choice about which errors matter most -- and that choice is a values decision, not a technical one.

Practical Takeaways

  1. Audit your training data for historical bias before building any classification model.
  2. Examine the confusion matrix by subgroup (race, gender, age) -- overall accuracy can mask wildly different error rates.
  3. Do not confuse correlation with causation. If your algorithm uses zip code as a predictor, and zip code correlates with race, you may be building a racially biased model without ever including race as a variable.
  4. Document your modeling choices. Which cutoff did you use? Why? What is the cost-benefit tradeoff? Who bears the cost of errors?

Back-of-napkin: Before deploying any classification model in a high-stakes context, ask: "If I computed the confusion matrix separately for each demographic group, would the False Positive and False Negative rates be roughly equal? If not, who is being harmed, and is that acceptable?"
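The per-group audit in that question takes only a few lines. The counts below are illustrative, not ProPublica's figures:

```python
def fpr(fp, tn):
    """False Positive Rate = FP / (FP + TN) within one subgroup."""
    return fp / (fp + tn)

# Illustrative subgroup counts (hypothetical data)
group_a = {"fp": 40, "tn": 160}   # FPR = 0.20
group_b = {"fp": 90, "tn": 110}   # FPR = 0.45
print(fpr(**group_a), fpr(**group_b))  # similar overall accuracy can hide this gap
```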


Multicollinearity: A Deeper Look

This concept appears in Sessions 2, 3, and 8. Because it is the most common trap in applied regression, it deserves a dedicated section.

What It Is

Multicollinearity occurs when one independent variable is (exactly or approximately) equal to a linear function of the other independent variables in the model. In plain terms: two or more X variables move together so tightly that the model cannot separate their individual effects.

Why It Matters

Consequence Explanation
Redundancy The collinear variable provides no extra information beyond what the others already capture
Unreliable coefficients Confidence intervals widen dramatically; coefficients may flip sign or become non-significant
Interpretation breakdown The "holding all else constant" assumption makes no sense if two variables always move together

How to Detect It

  1. Correlation matrix: Compute pairwise correlations among all independent variables. Flag any pair with |R| >= 0.85 (conservative rule from the course slides).
  2. Coefficient sign check: If a coefficient has a sign that contradicts business logic (e.g., higher advertising spending predicts lower sales), multicollinearity is a likely culprit.
  3. Variance Inflation Factor (VIF): Regress each Xi on all the other X's; VIF_i = 1 / (1 - R_i^2). A VIF > 5 is a warning; VIF > 10 indicates severe multicollinearity.

How to Fix It

Drop the redundant variable and rerun the model. The textbook's advice is simple and direct: if two variables are collinear, one of them is not adding value. Remove it.
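With only two predictors, regressing one X on the other reduces to a simple regression, so the VIF can be sketched directly (the general case regresses each Xi on all the other X's). The square-meters/rooms numbers are invented:

```python
from statistics import mean

def r2_of(xs, ys):
    """R^2 of a simple regression of ys on xs (equals the squared correlation)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov * cov / (vx * vy)

def vif(xs, ys):
    """VIF = 1 / (1 - R^2) from regressing one predictor on the other."""
    return 1 / (1 - r2_of(xs, ys))

# Hypothetical near-duplicate predictors: square meters and number of rooms
sqm = [50, 60, 70, 80, 90]
rooms = [2.0, 2.5, 3.1, 3.4, 4.0]
print(round(vif(sqm, rooms), 1))  # far beyond the VIF > 10 severity line
```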


Overfitting and Model Validation

What Is Overfitting?

Overfitting occurs when a model fits the current data too well -- it memorizes the noise, not just the signal -- and performs significantly worse on new data. This typically happens with:

  • Too many variables relative to the sample size
  • Models that are too complex for the amount of data available

The Train/Test Split

To prove your model is not overfitted, never evaluate it on the same data you used to build it.

  1. Randomly split your data (e.g., 50/50 or 70/30) into a Training set and a Test set.
  2. Build the regression equation using only the Training set.
  3. Lock the equation and apply it to the Test set.
  4. Compare Accuracy, Sensitivity, and False Positive Rate between the two sets.

If performance on the Test set closely matches the Training set, your model is robust and ready for deployment. If performance drops substantially, you have overfitted.
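Step 1 of the procedure can be sketched as a random partition (the 70/30 split and fixed seed are illustrative choices):

```python
import random

def train_test_split(rows, test_frac=0.3, seed=42):
    """Randomly partition rows into a training set and a held-out test set."""
    rows = rows[:]                       # copy so the caller's list is untouched
    random.Random(seed).shuffle(rows)    # fixed seed only for reproducibility
    cut = int(len(rows) * (1 - test_frac))
    return rows[:cut], rows[cut:]

data = list(range(100))                  # stand-in for 100 observations
train, test = train_test_split(data)
print(len(train), len(test))             # 70 30
```

The model is then fit on train only; test is touched exactly once, to compare Accuracy, Sensitivity, and False Positive Rate against the training figures.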

Back-of-napkin: More variables is not always better. Every variable you add increases the risk of overfitting. Use Adjusted R^2, p-value screening, and train/test validation to keep models lean.


Formulas Reference

Simple Linear Regression

Formula Purpose
Y = a + bX The regression equation
Residual = Actual - Predicted Prediction error for each observation
1 - R^2 = Var(Residuals) / Var(Actual) Proportion of variance unexplained
R^2 = Var(Predicted) / Var(Actual) Proportion of variance explained

Multiple Regression

Formula Purpose
Y = a + b1X1 + b2X2 + ... + bkXk Multiple regression equation
Adjusted R^2 = 1 - [(1 - R^2)(n - 1)] / (n - k - 1) Penalized goodness-of-fit
F = [R^2 / k] / [(1 - R^2) / (n - k - 1)] Overall model significance test

Coefficient Testing

Formula Purpose
95% CI = Coefficient +/- (2 x SE) Confidence interval for a coefficient
t = Coefficient / SE t-statistic (intermediate calculation for p-value)
VIF_i = 1 / (1 - R_i^2) Multicollinearity diagnostic

Classification

Formula Purpose
ln[P / (1 - P)] = a + b1X1 + ... + bkXk Logistic regression (logit model)
Accuracy = (TP + TN) / Total Overall classification correctness
Sensitivity = TP / (TP + FN) True positive rate
Specificity = TN / (TN + FP) True negative rate
False Positive Rate = FP / (FP + TN) Rate of false alarms

Time Series

Formula Purpose
Y = a + bt Linear trend
Y = a + bt + ct^2 Quadratic trend
sm(t) = alpha x(t) + (1 - alpha) sm(t - 1) Exponential smoothing
Predicted = Trend + Seasonal Additive seasonality
Predicted = Trend x Seasonal Multiplicative seasonality

The Complete Diagnostic Checklist

Before presenting any regression result to a business audience, run through this sequence:

  1. Does it make sense? Check the sign of every coefficient against business logic. If price has a positive coefficient on sales, something is wrong (unless you are in a Veblen-good market).
  2. Is it significant? Check p-values. If p > 0.05 for a coefficient, consider dropping that variable and rerunning.
  3. Is the model useful? Check R (or Adjusted R^2). If the model explains very little variance, it may not be worth acting on.
  4. Is there multicollinearity? Run a correlation matrix on the X variables. Flag pairs above |R| = 0.85. If coefficients have wrong signs, suspect multicollinearity.
  5. Is it overfit? Split the data into train/test. If performance degrades sharply on the test set, simplify the model.
  6. Is it fair? For classification models, compute the confusion matrix by subgroup. Check for disparate error rates.
  7. Is it actionable? Statistical significance does not equal practical relevance. A coefficient of 0.001 may be significant with n = 1,000,000 but useless for decision-making.

Quick Reference

  1. Simple Linear Regression (Y = a + bX): The slope b measures the average change in Y per one-unit increase in X. The intercept a is rarely meaningful. → See: Session 1 The Regression Line
  2. R-Squared (R^2): Proportion of variance in Y explained by the model. No universal "good" threshold -- depends on domain and decision context. → See: Session 1 The Regression Line
  3. Multiple Regression (Y = a + b1X1 + ... + bkXk): Each coefficient is a partial effect, holding all other variables constant. Always compare models using Adjusted R^2, not raw R^2. → See: Session 2 Multiple Linear Regression
  4. Adjusted R^2: Penalizes adding useless variables. If it drops when you add a variable, drop that variable. → See: Session 2 Multiple Linear Regression
  5. p-Value < 0.05: The coefficient is statistically significant -- the variable has a real effect on Y (not just noise). → See: Session 3 Testing Regression Coefficients
  6. Dummy Variables: Encode categorical data as 0/1. For k groups, create k - 1 dummies. The coefficient measures the average difference versus the baseline group. → See: Session 4 Dummy Variables
  7. Exponential Smoothing (sm(t) = alpha x(t) + (1 - alpha) sm(t-1)): Low alpha = smooth trend (filters noise); high alpha = responsive to recent changes. → See: Session 5 Time Series and Lagged Variables
  8. Logistic Regression: Classification model that constrains predicted probabilities between 0 and 1 using the logit function. Exponentiate coefficients to get odds ratios. → See: Sessions 6--7 Classification Models
  9. Confusion Matrix (TP/FP/TN/FN): Cross-tabulates actual versus predicted outcomes. Accuracy alone is misleading for rare events -- always check Sensitivity and False Positive Rate. → See: Sessions 6--7 Classification Models
  10. ROC Curve and AUC: AUC = 0.5 is useless (coin flip); AUC = 1.0 is perfect. AUC between 0.7 and 0.8 is acceptable for most business applications. → See: Sessions 6--7 Classification Models
  11. Cutoff Threshold: Do not default to 0.5. Set it based on the relative cost of False Positives versus False Negatives. → See: Sessions 6--7 Classification Models
  12. A/B Testing: Randomized controlled experiment. The treatment dummy coefficient is the causal effect. Reject H0 if p < 0.05. → See: Session 9 A/B Testing
  13. Multicollinearity: Two X variables moving together. Detect with |R| >= 0.85 or VIF > 5. Fix by dropping the redundant variable. → See: Multicollinearity A Deeper Look
  14. Train/Test Split: Build the model on training data, evaluate on held-out test data. If performance drops sharply, the model is overfit. → See: Overfitting and Model Validation
  15. Algorithmic Bias: Audit confusion matrices by demographic subgroup. Overall accuracy can mask unequal error rates across protected groups. → See: Session 10 AI Ethics and Algorithmic Bias

Glossary

Term Definition
AUC Area Under the (ROC) Curve; overall measure of a classification model's discriminatory power
COMPAS Correctional Offender Management Profiling for Alternative Sanctions; recidivism prediction algorithm
DDDM Data-Driven Decision-Making
DV Dependent Variable
ETS Exponential Smoothing (also called Exponential Trend Smoothing in some contexts)
FN False Negative
FP False Positive
IV Independent Variable
KPI Key Performance Indicator
OLS Ordinary Least Squares; the method that minimizes the sum of squared residuals
ROC Receiver Operating Characteristic
SE Standard Error
SSR Sum of Squared Residuals
TN True Negative
TP True Positive
VIF Variance Inflation Factor