Math, Probability & Statistical Modeling
Formula (when all outcomes are equally likely): P(A) = (Number of favorable outcomes) / (Total number of outcomes)
Key Characteristics of Probability:
- 0 ≤ P(A) ≤ 1 for any event A.
- P(sample space) = 1 (something must happen).
- If A and B are mutually exclusive: P(A or B) = P(A) + P(B).
- Complement rule: P(not A) = 1 − P(A).
Types of Probability:
- Classical: All outcomes equally likely (fair coin, dice).
- Empirical: Based on observed frequency. P(A) = frequency of A / total observations.
- Subjective: Based on personal judgment or expert opinion.
- Conditional: P(A|B) = P(A ∩ B) / P(B), the probability of A given that B has occurred (requires P(B) > 0).
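The rules above can be checked on a small example. The sketch below assumes one roll of a fair six-sided die; the helper `prob` and the chosen events are illustrative, not part of the original notes.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (classical probability).
outcomes = set(range(1, 7))

def prob(event):
    """P(event) = favorable outcomes / total outcomes (equally likely outcomes)."""
    return Fraction(len(event & outcomes), len(outcomes))

even = {2, 4, 6}
greater_than_3 = {4, 5, 6}

p_even = prob(even)                       # 3/6 = 1/2
p_not_even = 1 - p_even                   # complement rule
p_one_or_two = prob({1}) + prob({2})      # addition rule for mutually exclusive events
# Conditional probability: P(even | roll > 3) = P(even and > 3) / P(> 3)
p_even_given_gt3 = prob(even & greater_than_3) / prob(greater_than_3)

print(p_even, p_not_even, p_one_or_two, p_even_given_gt3)
```

Using `Fraction` keeps the results exact, which makes it easy to compare them with hand calculations.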
| Feature | Descriptive Statistics | Inferential Statistics |
|---|---|---|
| Purpose | Describe/summarize data | Draw conclusions about population |
| Scope | Only given dataset | Generalizes beyond the sample |
| Tools | Mean, median, mode, SD | Hypothesis testing, regression, CI |
| Example | Average score of 30 students | Predict nationwide average from 30 students |
1. Normal Distribution (Gaussian):
2. Binomial Distribution:
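Formula: P(X=k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ, where n is the number of trials, k the number of successes, and p the probability of success on each trial.
Example: Probability of getting exactly 3 heads in 5 fair coin flips: P(X=3) = C(5,3) × 0.5³ × 0.5² = 10/32 = 0.3125.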
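A minimal sketch of the binomial formula, applied to the coin-flip example. The function name `binomial_pmf` is an illustrative choice.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 flips of a fair coin.
p_three_heads = binomial_pmf(3, 5, 0.5)
print(p_three_heads)  # 0.3125

# The probabilities over all possible k sum to 1.
total = sum(binomial_pmf(k, 5, 0.5) for k in range(6))
```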
3. Categorical Distribution:
Naive Bayes Variants:
| Variant | Assumes Features Follow | Used For |
|---|---|---|
| GaussianNB | Normal (Gaussian) distribution | Continuous features |
| BernoulliNB | Bernoulli (binary) distribution | Binary/boolean features (0 or 1) |
| MultinomialNB | Multinomial distribution | Count data (text classification) |
r = +1: Perfect positive correlation | r = 0: No linear correlation | r = −1: Perfect negative correlation
Interpretation of r:
| r value | Interpretation |
|---|---|
| 0.9 to 1.0 | Very strong positive |
| 0.7 to 0.9 | Strong positive |
| 0.4 to 0.7 | Moderate positive |
| 0 to 0.4 | Weak positive |
| 0 | No linear correlation |
| −0.4 to 0 | Weak negative |
| −0.7 to −0.4 | Moderate negative |
| −0.9 to −0.7 | Strong negative |
| −1.0 to −0.9 | Very strong negative |
Other Correlation Methods:
- Spearman's Rank Correlation: Non-parametric; computed on the ranks of the data. Measures monotonic (not only linear) relationships. Used when data is ordinal or not normally distributed.
- Kendall's Tau: Based on concordant/discordant pairs. More robust for small datasets with ties.
- Point-Biserial: Correlation between a continuous and a binary variable.
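To illustrate how a rank-based method differs from Pearson's r, here is a small sketch that computes both on a monotonic but non-linear relationship. The helper names and the toy data are assumptions for illustration, and the ranking helper deliberately ignores ties.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def ranks(values):
    """Ranks 1..n of the values (simplification: assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, i in enumerate(order, start=1):
        result[i] = rank
    return result

def spearman_rho(x, y):
    """Spearman correlation = Pearson correlation of the ranks."""
    return pearson_r(ranks(x), ranks(y))

x = [1, 2, 3, 4, 5]
y = [v**3 for v in x]  # monotonic but not linear

r = pearson_r(x, y)       # below 1 because the relationship is curved
rho = spearman_rho(x, y)  # 1 (up to rounding) because the ordering is identical
print(round(r, 3), round(rho, 3))
```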
Impact on Regression: Outliers can heavily distort regression lines, inflate standard errors, and mislead predictions. A single extreme point can pull the regression line toward it, giving a poor fit for the rest of the data.
Detection Techniques:
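- Z-Score Method: A common rule flags a point as an outlier if |z| > 3, where z = (x − μ) / σ. Works best for roughly normally distributed data and larger samples, because extreme values themselves inflate μ and σ.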
- IQR Method (Box Plot):
  - IQR = Q3 − Q1
  - Lower bound = Q1 − 1.5 × IQR
  - Upper bound = Q3 + 1.5 × IQR
  - Any value outside these bounds = outlier.
- Scatter Plot: Visually inspect data; outliers appear far from the cluster.
- Box Plot: Points beyond the whiskers are outliers.
- DBSCAN: Points that cannot be assigned to any cluster (noise points) are outliers.
- Isolation Forest: ML method — outliers are isolated faster (fewer splits needed in a random tree).
Handling Outliers: Remove them (if they are data entry errors), cap them (winsorization), transform the variable (e.g., log transform), or use robust statistics and models (e.g., median instead of mean).
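A minimal sketch of the IQR method, together with capping values at the IQR bounds as one way to handle the flagged points. The data values are made up, and the quartiles use `statistics.quantiles` with the "inclusive" method, which is only one of several quartile conventions.

```python
import statistics

def iqr_bounds(data):
    """Return (lower, upper) fences: Q1 - 1.5*IQR and Q3 + 1.5*IQR."""
    q1, _, q3 = statistics.quantiles(data, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def find_outliers(data):
    """Values that fall outside the IQR fences."""
    lower, upper = iqr_bounds(data)
    return [x for x in data if x < lower or x > upper]

def cap_outliers(data):
    """Cap values at the fences instead of removing them."""
    lower, upper = iqr_bounds(data)
    return [min(max(x, lower), upper) for x in data]

values = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical measurements
lower, upper = iqr_bounds(values)
outliers = find_outliers(values)
capped = cap_outliers(values)
print(outliers, max(capped))
```

Capping keeps the row in the dataset, which matters when the other columns of that row are still informative.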
Why needed?
- High-dimensional data leads to the "curse of dimensionality": data becomes sparse, so models train more slowly and tend to generalize worse.
- Many features may be redundant or irrelevant.
- Visualization is only possible in 2D/3D.
Methods:
- PCA (Principal Component Analysis):
  - Linear technique. Finds new axes (principal components) that capture maximum variance.
  - PC1 = direction of greatest variance. PC2 = orthogonal to PC1, next most variance.
  - Reduces correlated features into uncorrelated components.
  - Assumes the important structure is linear, so it can miss non-linear relationships between features.
- Factor Analysis:
  - Identifies latent (hidden) factors that explain correlations among observed variables.
  - E.g., "intelligence" as a hidden factor explaining scores in Math, Reading, Writing.
  - Used for understanding underlying structure, not just compression.
- Feature Selection vs Feature Extraction:
| Feature | Feature Selection | Feature Extraction |
|---|---|---|
| Method | Selects a subset of original features | Creates new features from original |
| Interpretability | High (original features kept) | Lower (new abstract features) |
| Example | Select top 5 of 20 features | PCA transforms 20 features → 5 components |
| Techniques | Filter, Wrapper, Embedded methods | PCA, LDA, Autoencoders |
- Linear Algebra Approach (SVD / Matrix Decomposition): Singular Value Decomposition decomposes a matrix into U, Σ, Vᵀ. Keeping only the top-k singular values gives a lower-dimensional approximation. Used in recommendation systems, text analysis (LSA).
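As a sketch of the PCA idea (PC1 is the direction of greatest variance), the example below handles only the two-feature case, where the eigenvectors of the 2×2 covariance matrix have a closed form. The function name, the toy points, and the restriction to 2D are assumptions for illustration; real use would rely on a linear algebra library. It also assumes the points are not all identical.

```python
import math

def pca_2d(points):
    """Return (PC1 unit vector, share of variance on PC1, 1-D scores)."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the sample covariance matrix [[sxx, sxy], [sxy, syy]].
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    # Eigenvalues of a symmetric 2x2 matrix = variance along PC1 and PC2.
    half_trace = (sxx + syy) / 2
    spread = math.hypot((sxx - syy) / 2, sxy)
    var_pc1, var_pc2 = half_trace + spread, half_trace - spread
    # Angle of the major axis gives the PC1 direction.
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)
    pc1 = (math.cos(theta), math.sin(theta))
    # Project centered points onto PC1: two features reduced to one.
    scores = [(p[0] - mx) * pc1[0] + (p[1] - my) * pc1[1] for p in points]
    return pc1, var_pc1 / (var_pc1 + var_pc2), scores

points = [(1, 2), (2, 4), (3, 6), (4, 8), (5, 10)]  # perfectly correlated toy data
pc1, explained, scores = pca_2d(points)
print(pc1, explained)
```

Because the two toy features are perfectly correlated, a single component keeps all of the variance, which is the situation where dimensionality reduction loses nothing.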
1. Linear Regression:
Simple Linear: y = mx + b (one independent variable)
Multiple Linear: y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ
Example: Predict house price based on area. As area increases, price increases linearly.
2. Non-Linear Regression:
Example: y = ax² + bx + c (polynomial regression, which fits a curve although it remains linear in the coefficients a, b, c). Used when data follows a curve, e.g., population growth or drug dose-response curves.
3. Logistic Regression:
Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)
Example: Predict if email is spam (1) or not (0). Predict if a patient has a disease.
4. OLS (Ordinary Least Squares):
Residual = actual y − predicted y. OLS chooses the coefficients that minimize the sum of squared residuals, Σ(yᵢ − ŷᵢ)².
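For simple linear regression, the OLS solution has a closed form: slope = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and intercept = ȳ − slope × x̄. The sketch below applies it to made-up data; the function name and values are illustrative.

```python
def ols_fit(x, y):
    """Fit y = slope * x + intercept by minimizing the sum of squared residuals."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum((a - mx) ** 2 for a in x)
    intercept = my - slope * mx
    return slope, intercept

x = [1, 2, 3, 4, 5]
y = [2.2, 4.1, 6.2, 7.9, 10.1]  # hypothetical observations, roughly y = 2x

slope, intercept = ols_fit(x, y)
residuals = [b - (slope * a + intercept) for a, b in zip(x, y)]
sse = sum(r**2 for r in residuals)  # the quantity OLS minimizes
print(round(slope, 2), round(intercept, 2), round(sse, 4))
```

A useful sanity check is that OLS residuals sum to (numerically) zero whenever the model includes an intercept.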
| Method | Output | Use Case |
|---|---|---|
| Linear Regression | Continuous value | House price prediction |
| Non-linear Regression | Continuous value (curved) | Growth curves, biological models |
| Logistic Regression | Probability (0–1) → class | Spam detection, disease diagnosis |
| OLS | Regression coefficients | Parameter estimation |
Examples: Daily stock prices, monthly sales, annual temperature, hourly website traffic, ECG signals.
Components of Time Series:
- Trend: Long-term upward or downward movement. E.g., population growing over decades.
- Seasonality: Regular repeating patterns tied to calendar time. E.g., higher ice cream sales in summer every year.
- Cyclical: Long-term wave-like fluctuations (not fixed periods). E.g., economic boom-bust cycles over several years.
- Irregular/Noise: Random, unpredictable variation. E.g., sudden spike in sales due to a viral social media post.
Time Series Analysis Techniques:
- Moving Average (MA): Smooths data by averaging over a window of time periods. Removes short-term fluctuations to reveal trends.
- Exponential Smoothing: Weighted average giving more importance to recent observations. Good for short-term forecasting.
- ARIMA (Auto-Regressive Integrated Moving Average):
  - AR: uses past values to predict future.
  - I: differencing to make data stationary.
  - MA: models error as a moving average of past errors.
- Seasonal Decomposition: Separates the series into Trend + Seasonal + Residual components.
- ACF/PACF: Autocorrelation and partial autocorrelation plots — used to identify the order of AR and MA components.
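The two simplest techniques in the list, the moving average and simple exponential smoothing, can be sketched in a few lines. The function names, the smoothing factor `alpha`, and the toy series are assumptions for illustration.

```python
def moving_average(series, window):
    """Average of each run of `window` consecutive values."""
    return [
        sum(series[i : i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def exponential_smoothing(series, alpha):
    """s[0] = x[0]; s[t] = alpha * x[t] + (1 - alpha) * s[t-1]."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [1, 2, 3, 4, 5]  # hypothetical series

ma = moving_average(sales, window=3)            # [2.0, 3.0, 4.0]
es = exponential_smoothing(sales, alpha=0.5)    # [1, 1.5, 2.25, 3.125, 4.0625]
print(ma, es)
```

A larger `alpha` reacts faster to recent observations, while a smaller one smooths more heavily.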
Using Clustering to Subdivide Data
Goal: Discover natural groupings in data without labels.
Applications: Customer segmentation, document grouping, gene expression analysis, anomaly detection, image segmentation.
Characteristics of Good Clustering:
- High intra-cluster similarity (within group, objects are similar).
- Low inter-cluster similarity (between groups, objects are dissimilar).
- Scalable to large datasets.
- Ability to handle noise and outliers.
- Interpretable results.
Distance Measures:
| Measure | Formula | Description |
|---|---|---|
| Euclidean | √Σ(xᵢ−yᵢ)² | Straight-line distance. Most common. |
| Manhattan | Σ|xᵢ−yᵢ| | Grid/city-block distance. |
| Minkowski | (Σ|xᵢ−yᵢ|ᵖ)^(1/p) | Generalizes Euclidean (p=2) and Manhattan (p=1). |
| Cosine | cos(θ) = (A·B)/(|A||B|) | Angle between vectors. Used in text mining. |
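The formulas in the table translate directly into code. This is a plain-Python sketch with illustrative function names; vectors are given as equal-length sequences of numbers.

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def manhattan(x, y):
    return sum(abs(a - b) for a, b in zip(x, y))

def minkowski(x, y, p):
    """Generalizes Manhattan (p=1) and Euclidean (p=2)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def cosine_similarity(x, y):
    """cos(theta) between two non-zero vectors; 1 means same direction."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

a, b = (0, 0), (3, 4)
print(euclidean(a, b), manhattan(a, b), minkowski(a, b, 2))
print(cosine_similarity((1, 0), (0, 1)), cosine_similarity((1, 2), (2, 4)))
```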
Types of Clustering:
| Type | Description | Example Algorithm |
|---|---|---|
| Partitioning | Divide n objects into k clusters | K-Means, K-Medoids (PAM) |
| Hierarchical | Build tree of clusters (dendrogram) | Agglomerative, Divisive |
| Density-based | Clusters = dense regions separated by low-density regions | DBSCAN |
| Grid-based | Quantize space into grid cells | STING, CLIQUE |
| Model-based | Assume data fits a statistical model | EM algorithm, Gaussian Mixture |
K-Means Steps:
- Choose k (number of clusters).
- Randomly initialize k centroids.
- Assignment step: Assign each data point to the nearest centroid.
- Update step: Recalculate each centroid as the mean of all points assigned to it.
- Repeat the assignment and update steps until centroids no longer change (convergence).
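The K-Means loop described above can be sketched in plain Python as follows. The function name, the seeded random initialization, the handling of empty clusters (keep the old centroid), and the toy points are all assumptions for illustration.

```python
import random

def kmeans(points, k, max_iter=100, seed=0):
    """Return (centroids, clusters) for points given as tuples of numbers."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initialization from the data
    clusters = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centroids[i])),
            )
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its assigned points.
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # convergence: nothing moved
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (1.5, 2), (1, 1.5), (8, 8), (9, 8.5), (8.5, 9)]
centroids, clusters = kmeans(points, k=2)
print(sorted(centroids))
```

In practice the result depends on the initialization, which is why library implementations usually run several restarts and keep the best one.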
Limitations:
- Must specify k in advance.
- Sensitive to initial centroid placement.
- Assumes spherical clusters of similar size.
- Sensitive to outliers (outlier can pull centroid).
- Not suitable for non-convex shaped clusters.
DBSCAN Key Parameters:
- ε (epsilon): Radius of the neighborhood around a point.
- MinPts: Minimum number of points required to form a dense region (core point).
Point Types:
- Core Point: Has at least MinPts points within distance ε (including itself).
- Border Point: Within ε of a core point, but has fewer than MinPts neighbors.
- Noise Point (Outlier): Not a core point and not within ε of any core point.
Algorithm:
- Start with an unvisited point.
- If it's a core point, create a new cluster and expand it by adding all density-reachable points.
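- If it's within ε of a core point but is not a core point itself, it's a border point: assign it to that core point's cluster (it joins the cluster but does not expand it).
- Otherwise, label it as noise.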
- Repeat for all unvisited points.
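A compact sketch of the DBSCAN procedure, using a brute-force neighbor search. The function names, the `NOISE = -1` label convention, and the toy points are assumptions for illustration. A point first labeled as noise can later be relabeled as a border point when a cluster reaches it.

```python
import math

NOISE = -1

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    labels = [None] * len(points)
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue  # already visited
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster_id += 1  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is also core: keep expanding
                queue.extend(j_neighbors)
    return labels

points = [(1, 1), (1.2, 1.1), (0.9, 1.0), (5, 5), (5.1, 5.2), (4.9, 5.0), (10, 0)]
labels = dbscan(points, eps=0.5, min_pts=3)
print(labels)  # [0, 0, 0, 1, 1, 1, -1]
```

The brute-force search makes this quadratic in the number of points; production implementations typically use a spatial index for the neighbor queries.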
Advantages over K-Means: Discovers clusters of arbitrary shape, automatically finds number of clusters, handles noise/outliers well.
Disadvantage: Struggles with varying density clusters and high-dimensional data.
How Kernel Density Estimation (KDE) works:
- Place a kernel (smooth bump function) centered at each data point.
- Sum all kernels — regions with many points have high density.
- Peaks in the density estimate suggest cluster centers.
Bandwidth (h): Controls smoothness. Small h → spiky/overfit. Large h → over-smooth. Commonly chosen by cross-validation or a rule of thumb such as Silverman's rule.
Use in Clustering: KDE identifies modes (peaks) in the density, which correspond to cluster centers. Useful for finding natural groups without specifying k.
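A one-dimensional sketch of kernel density estimation with a Gaussian kernel, followed by a simple search for peaks on an evaluation grid. The bandwidth `h = 0.3`, the grid spacing, and the data values are arbitrary choices for illustration.

```python
import math

def gaussian_kernel(u):
    """Standard normal density: the smooth bump placed on each data point."""
    return math.exp(-0.5 * u * u) / math.sqrt(2 * math.pi)

def kde(x, data, h):
    """Density estimate at x: average of kernels centered on the data, scaled by h."""
    return sum(gaussian_kernel((x - xi) / h) for xi in data) / (len(data) * h)

data = [0.8, 1.0, 1.2, 4.9, 5.0, 5.1]      # two visible groups
grid = [i / 10 for i in range(0, 71)]      # evaluation points 0.0 .. 7.0
density = [kde(x, data, h=0.3) for x in grid]

# A grid point is a mode if its density exceeds both neighbors.
modes = [
    grid[i]
    for i in range(1, len(grid) - 1)
    if density[i] > density[i - 1] and density[i] > density[i + 1]
]
print(modes)  # [1.0, 5.0]
```

The two peaks match the two groups in the data, so the number of clusters comes out of the density estimate instead of being specified up front.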
1. Agglomerative (Bottom-up) — Most Common:
- Start: each point is its own cluster.
- Find the two closest clusters and merge them.
- Repeat until all points are in one cluster.
- Cut the dendrogram at desired level to get clusters.
Linkage Criteria (how to measure distance between clusters):
| Linkage | Distance Between Clusters | Effect |
|---|---|---|
| Single | Min distance between any two points | Chaining effect, elongated clusters |
| Complete | Max distance between any two points | Compact, spherical clusters |
| Average | Average of all pairwise distances | Balanced compromise |
| Ward's | Minimizes total within-cluster variance | Compact, equal-size clusters |
2. Divisive (Top-down): Start with all points in one cluster, recursively split into smaller clusters. Less common.
Strengths: No need to specify k, produces intuitive dendrogram, can capture hierarchical structure.
Weaknesses: At least O(n²) time and memory (O(n³) for the naive agglomerative algorithm), sensitive to noise, cannot undo a wrong merge.
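The agglomerative procedure and three of the linkage criteria can be sketched naively as follows. Instead of building a full dendrogram, this version simply stops merging once `n_clusters` remain, which corresponds to cutting the tree at that level. Names and data are illustrative, and Ward's linkage is omitted for brevity.

```python
import math
from itertools import combinations

def linkage_distance(a, b, linkage="single"):
    """Distance between clusters a and b under the chosen linkage."""
    dists = [math.dist(p, q) for p in a for q in b]
    if linkage == "single":
        return min(dists)
    if linkage == "complete":
        return max(dists)
    if linkage == "average":
        return sum(dists) / len(dists)
    raise ValueError(f"unknown linkage: {linkage}")

def agglomerative(points, n_clusters, linkage="single"):
    """Merge the two closest clusters until n_clusters remain."""
    clusters = [[p] for p in points]  # start: every point is its own cluster
    while len(clusters) > n_clusters:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage_distance(clusters[ij[0]], clusters[ij[1]], linkage),
        )
        clusters[i] = clusters[i] + clusters[j]  # merge j into i (i < j)
        del clusters[j]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerative(points, n_clusters=3)
print(result)
```

Recomputing every pairwise cluster distance at each merge is what makes the naive algorithm cubic; libraries cache and update a distance matrix instead.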
How Random Forest Works:
- Bootstrap Sampling (Bagging): Create n random subsets of the training data (with replacement). Train one decision tree on each subset.
- Feature Randomness: At each node split, only a random subset of features is considered (typically √total_features for classification). This decorrelates the trees.
- Prediction:
  - Classification: Majority vote across all trees.
  - Regression: Average of all tree predictions.
Training Data
│
├── Bootstrap Sample 1 → Decision Tree 1 → Prediction 1
├── Bootstrap Sample 2 → Decision Tree 2 → Prediction 2
├── Bootstrap Sample 3 → Decision Tree 3 → Prediction 3
│ ... ... ...
└── Bootstrap Sample n → Decision Tree n → Prediction n
│
Majority Vote / Average
│
FINAL PREDICTION
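The flow in the diagram can be imitated with a deliberately simplified toy: each "tree" below is a depth-1 decision stump, so the random feature choice happens once per tree instead of at every node. This is a sketch of bagging plus feature randomness plus majority voting, not a full random forest; all names and data are assumptions for illustration.

```python
import random
from collections import Counter

def fit_stump(X, y, feature):
    """Best single-threshold split on one feature (a depth-1 tree)."""
    majority = Counter(y).most_common(1)[0][0]
    values = sorted(set(row[feature] for row in X))
    if len(values) < 2:  # nothing to split on: always predict the majority class
        return (feature, values[0], majority, majority)
    best = None
    for lo, hi in zip(values, values[1:]):
        thr = (lo + hi) / 2
        left = [label for row, label in zip(X, y) if row[feature] <= thr]
        right = [label for row, label in zip(X, y) if row[feature] > thr]
        left_label = Counter(left).most_common(1)[0][0]
        right_label = Counter(right).most_common(1)[0][0]
        errors = sum(l != left_label for l in left) + sum(l != right_label for l in right)
        if best is None or errors < best[0]:
            best = (errors, (feature, thr, left_label, right_label))
    return best[1]

def fit_forest(X, y, n_trees=25, seed=0):
    rng = random.Random(seed)
    n, n_features = len(X), len(X[0])
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample (with replacement)
        feature = rng.randrange(n_features)         # random feature for this stump
        forest.append(fit_stump([X[i] for i in idx], [y[i] for i in idx], feature))
    return forest

def predict(forest, row):
    """Majority vote across all stumps."""
    votes = Counter(left if row[f] <= thr else right for f, thr, left, right in forest)
    return votes.most_common(1)[0][0]

X = [(1, 1), (2, 1), (1, 2), (2, 2), (8, 8), (9, 8), (8, 9), (9, 9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
forest = fit_forest(X, y)
print(predict(forest, (1.5, 1.5)), predict(forest, (8.5, 8.5)))
```

Individual stumps trained on an unlucky bootstrap sample can be wrong, but the vote across many of them is stable, which is the variance-reduction effect the notes attribute to averaging.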
Key Hyperparameters:
- n_estimators: Number of trees. More trees generally give more stable predictions, with diminishing returns, at the cost of slower training and prediction.
- max_features: Number of features considered at each split.
- max_depth: Maximum depth of each tree (controls overfitting).
Advantages:
- Handles high-dimensional data well.
- Resistant to overfitting (due to averaging).
- Can compute feature importance.
- Works well even without parameter tuning.
- Maintains accuracy for large datasets; some implementations can also handle missing values.
Disadvantages:
- Not interpretable (black box).
- Computationally expensive for very large forests.
- Slower than a single decision tree.
Modeling Instances (Classification & Nearest Neighbor)
| Feature | Clustering | Classification |
|---|---|---|
| Learning Type | Unsupervised (no labels) | Supervised (labeled data) |
| Goal | Discover natural groups | Assign data to known classes |
| Output | Groups/clusters (unlabeled) | Class labels |
| Prior Knowledge | No class labels needed | Training data with labels required |
| Process | Single step (no training) | Two-step: train model → predict |
| Example | Group customers by behavior | Identify if customer will churn (yes/no) |
| Algorithms | K-Means, DBSCAN, Hierarchical | KNN, Decision Tree, Random Forest |
Classification — Two-Step Process:
- Training (Model Construction): A classifier is built using labeled training data. The model learns patterns that distinguish between classes.
- Testing (Prediction): The trained model is applied to new, unseen data to predict class labels. Accuracy is evaluated on the test set.
Signs of Overfitting: Very high training accuracy, much lower test accuracy. The gap between training error and test error is large.
Causes: Too many features, too deep a model, too small a training dataset, training for too long.
Methods to Avoid Overfitting:
- Cross-Validation: Use k-fold CV to evaluate the model on multiple held-out subsets of the data, which reveals overfitting and guides model selection.
- Pruning: In decision trees, remove branches that have little power to classify data.
- Regularization: Add a penalty for model complexity (L1/Lasso, L2/Ridge).
- Feature Selection: Remove irrelevant or redundant features.
- Get More Data: More training examples help the model generalize.
- Ensemble Methods: Combine many models (e.g., Random Forest) — averaging reduces variance.
- Early Stopping: In neural networks, stop training when validation error starts increasing.
- Dropout: In neural networks, randomly deactivate neurons during training.
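As an illustration of the cross-validation item, the sketch below generates the train/test index splits for k-fold CV. The function name is hypothetical, and the version shown does not shuffle, which real workflows usually do first.

```python
def k_fold_indices(n_samples, k):
    """Return k (train_indices, test_indices) pairs covering every sample once as test."""
    # Spread the remainder so fold sizes differ by at most one.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    indices = list(range(n_samples))
    folds, start = [], 0
    for size in fold_sizes:
        test_idx = indices[start : start + size]
        train_idx = indices[:start] + indices[start + size :]
        folds.append((train_idx, test_idx))
        start += size
    return folds

folds = k_fold_indices(n_samples=10, k=3)
for train_idx, test_idx in folds:
    print(len(train_idx), test_idx)
```

A model would be trained on each `train_idx` and scored on the matching `test_idx`; a large gap between the training score and the average held-out score is the overfitting signal described above.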
Basic Concept: Given a new unknown point, find the k most similar (nearest) known points and use their characteristics to predict or describe the unknown.
Applications:
- Geographic analysis (nearest hospital, nearest store).
- Recommendation systems (users similar to you liked X).
- Anomaly detection (point with no near neighbors = outlier).
- Data imputation (fill missing values with nearest neighbor's value).
Types of NN Analysis:
- 1-NN: Find the single nearest neighbor.
- K-NN: Find k nearest neighbors and vote.
- Average NN: Use average of multiple neighbors.
- Radius-NN: All points within a distance r.
How Average Nearest Neighbor (ANN) Works:
- For a new query point q, find all neighbors within a radius or find k nearest neighbors.
- Compute the average feature values (or average distance) of those neighbors.
- For classification, assign the class that is most common among those neighbors.
- For regression, predict the average target value of the neighbors.
Example:
ANN vs Standard NN: Standard NN uses only the single closest point and is sensitive to noise. ANN uses multiple neighbors, making it more robust. However, ANN requires more computation.
KNN Algorithm Steps:
- Choose the value of k.
- Calculate the distance from the new point to every training point.
- Sort the distances and identify k nearest neighbors.
- For classification: take majority vote among k neighbors.
- For regression: compute average of k neighbors' values.
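These steps map onto a short brute-force implementation. The sketch assumes Euclidean distance and training data stored as `(features, target)` pairs; the function names and toy data are illustrative.

```python
import math
from collections import Counter

def k_nearest(train, query, k):
    """The k training items closest to the query (brute-force distance sort)."""
    return sorted(train, key=lambda item: math.dist(item[0], query))[:k]

def knn_classify(train, query, k):
    """Majority vote among the k nearest neighbors."""
    labels = [label for _, label in k_nearest(train, query, k)]
    return Counter(labels).most_common(1)[0][0]

def knn_regress(train, query, k):
    """Average target value of the k nearest neighbors."""
    values = [value for _, value in k_nearest(train, query, k)]
    return sum(values) / len(values)

train_cls = [
    ((1, 1), "A"), ((1, 2), "A"), ((2, 1), "A"),
    ((6, 6), "B"), ((7, 6), "B"), ((6, 7), "B"),
]
train_reg = [((1,), 10.0), ((2,), 20.0), ((3,), 30.0), ((10,), 100.0)]

print(knn_classify(train_cls, (2, 2), k=3))   # A
print(knn_classify(train_cls, (6, 5), k=3))   # B
print(knn_regress(train_reg, (2,), k=3))      # 20.0
```

There is no training phase: all the work happens at prediction time, which is why prediction cost grows with the size of the training set.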
Choosing k:
- Small k (k=1): Very sensitive to noise, complex boundary.
- Large k: Smoother boundary, may include irrelevant neighbors.
- Rule of thumb: k ≈ √n, where n is the number of training samples. Prefer an odd k for binary classification to avoid tied votes.
- Best: Use cross-validation to find optimal k.
| Feature | KNN |
|---|---|
| Type | Supervised, lazy learning |
| Training | None — memorizes data |
| Prediction | Slow (computes all distances) |
| Advantages | Simple, no training time, naturally handles multi-class |
| Disadvantages | Slow prediction, sensitive to irrelevant features, needs feature scaling |
| Applications | Recommendation systems, medical diagnosis, image recognition |
Feature Scaling in KNN: Since KNN is distance-based, attributes with larger ranges dominate the distance. Scale features before applying KNN, for example with Min-Max normalization, x' = (x − min) / (max − min), or with z-score standardization, x' = (x − μ) / σ.
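A minimal sketch of Min-Max normalization applied column by column. The feature names and values are made up; mapping a constant column to 0.0 is an arbitrary choice made here to avoid dividing by zero.

```python
def min_max_scale(column):
    """Rescale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(column), max(column)
    if hi == lo:  # constant column: avoid division by zero
        return [0.0 for _ in column]
    return [(x - lo) / (hi - lo) for x in column]

ages = [20, 30, 40]                  # small range
incomes = [20000, 50000, 80000]      # large range that would dominate distances

scaled_ages = min_max_scale(ages)
scaled_incomes = min_max_scale(incomes)
print(scaled_ages, scaled_incomes)   # both [0.0, 0.5, 1.0]
```

After scaling, a difference of one "step" in age counts as much as one step in income. The minimum and maximum should come from the training data only and then be reused for new points.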
Data Science for E-commerce Growth:
- Customer Segmentation (Clustering): K-Means segments customers by RFM (Recency, Frequency, Monetary value). Different marketing strategies for each segment (VIP customers, discount-seekers, churned customers).
- Recommendation Systems (KNN / Collaborative Filtering): "Customers like you also bought X." Netflix-style recommendations based on user similarity. Increases average order value.
- Churn Prediction (Classification): Predict which customers are likely to stop buying. Logistic regression / Random Forest on behavioral data. Proactive retention: targeted discounts, emails.
- Fraud Detection (Outlier Detection / Classification): Unusual transactions flagged as potential fraud. Isolation Forest, DBSCAN for anomaly detection. Saves revenue and protects customers.
- Demand Forecasting (Time Series): ARIMA/LSTM models predict future demand. Optimizes inventory, reduces stockouts and overstock.
- A/B Testing (Inferential Statistics): Test two versions of a webpage and determine which converts better. Statistical significance testing (p-values, confidence intervals).
- Price Optimization (Regression): Predict optimal prices using regression models on competitor prices, demand elasticity, and seasonality.
- AARRR Funnel (E-commerce Metrics):
  - Acquisition: How do users find you?
  - Activation: Do they have a good first experience?
  - Retention: Do they come back?
  - Referral: Do they tell others?
  - Revenue: Do they pay?
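For the A/B testing item above, one common significance check is a two-proportion z-test. The sketch below is one possible approach under the usual large-sample assumptions; the visitor and conversion counts are hypothetical, and the function name is illustrative.

```python
import math

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # rate under "no difference"
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Standard normal CDF via the error function.
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))
    p_value = 2 * (1 - cdf)
    return z, p_value

# Hypothetical experiment: version A converts 200 of 1000, version B 250 of 1000.
z, p_value = two_proportion_z_test(200, 1000, 250, 1000)
print(round(z, 2), round(p_value, 4))
```

With these made-up counts the p-value falls below the conventional 0.05 threshold, so the difference would usually be treated as statistically significant.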