DAV Mid-Sem Study Guide | GTU BE SEM-VI 3161613

Chapter 1

Math, Probability & Statistical Modeling

Probability & Inferential Statistics GTU 2024 Q1a / 2025S Q1a

Probability

Probability is the measure of the likelihood that a specific event will occur. It ranges from 0 (impossible) to 1 (certain).

Formula (when all outcomes are equally likely): P(A) = (Number of favorable outcomes) / (Total number of outcomes)

Key Characteristics of Probability:

0 ≤ P(A) ≤ 1 for any event A.
P(sample space) = 1 (something must happen).
If A and B are mutually exclusive: P(A or B) = P(A) + P(B).
Complement rule: P(not A) = 1 − P(A).

Types of Probability:

Classical: All outcomes equally likely (fair coin, dice).
Empirical: Based on observed frequency. P(A) = frequency of A / total observations.
Subjective: Based on personal judgment or expert opinion.
Conditional: P(A|B) = P(A ∩ B) / P(B) — probability of A given B has occurred.
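
As a quick check of the complement and conditional-probability rules above, the sketch below enumerates one roll of a fair six-sided die. The events A (roll is even) and B (roll is greater than 3) and the helper name `prob` are illustrative assumptions, not part of the syllabus.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die (classical probability).
sample_space = {1, 2, 3, 4, 5, 6}

def prob(event):
    """Classical probability: favorable outcomes / total outcomes."""
    return Fraction(len(event & sample_space), len(sample_space))

A = {2, 4, 6}   # illustrative event: roll is even
B = {4, 5, 6}   # illustrative event: roll is greater than 3

p_not_a = 1 - prob(A)                 # complement rule: P(not A) = 1 - P(A)
p_a_given_b = prob(A & B) / prob(B)   # conditional: P(A|B) = P(A ∩ B) / P(B)

print("P(not A) =", p_not_a)
print("P(A|B)   =", p_a_given_b)
```

Using `Fraction` keeps the answers exact, which is handy when an exam expects a fraction rather than a decimal.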

Descriptive Statistics

Describes and summarizes the main features of a dataset. Includes mean, median, mode, variance, standard deviation, range. Does NOT make predictions beyond the given data.

Inferential Statistics

Uses sample data to make inferences (generalizations) about a larger population. Involves hypothesis testing, confidence intervals, and regression. Example: Using a sample of 500 customers to predict buying behavior of 1 million customers.

Feature	Descriptive Statistics	Inferential Statistics
Purpose	Describe/summarize data	Draw conclusions about population
Scope	Only given dataset	Generalizes beyond the sample
Tools	Mean, median, mode, SD	Hypothesis testing, regression, CI
Example	Average score of 30 students	Predict nationwide average from 30 students

Probability Distributions GTU 2024 Q3a

1. Normal Distribution (Gaussian):

Bell-shaped, symmetric curve. Mean = Median = Mode. Defined by mean (μ) and standard deviation (σ). 68% of data lies within ±1σ, 95% within ±2σ, 99.7% within ±3σ. Example: heights of people, exam scores.
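
The 68 / 95 / 99.7 figures can be checked numerically. This is a minimal sketch using the standard identity P(|X − μ| ≤ kσ) = erf(k/√2); the function name `prob_within` is an illustrative choice.

```python
import math

def prob_within(k):
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within ±{k}σ: {prob_within(k):.4f}")
```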

2. Binomial Distribution:

Models the number of successes in n independent trials, each with probability p of success.
Formula: P(X=k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ
Example: Probability of getting exactly 3 heads in 5 coin flips.
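
The coin-flip example can be worked directly from the formula. The sketch below assumes a fair coin (p = 0.5); `binomial_pmf` is an illustrative helper name.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Exactly 3 heads in 5 flips of a fair coin: C(5,3) * 0.5^3 * 0.5^2 = 10/32.
p_three_heads = binomial_pmf(3, 5, 0.5)
print(p_three_heads)
```

Summing `binomial_pmf(k, 5, 0.5)` over k = 0..5 gives 1, which is a useful sanity check in numericals.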

3. Categorical Distribution:

Generalization of binomial for more than 2 categories. Each outcome belongs to one of k categories, each with its own probability. Example: Rolling a die (6 categories), predicting weather as Sunny/Rainy/Cloudy.

Naive Bayes Variants:

Variant	Assumes Features Follow	Used For
GaussianNB	Normal (Gaussian) distribution	Continuous features
BernoulliNB	Bernoulli (binary) distribution	Binary/boolean features (0 or 1)
MultinomialNB	Multinomial distribution	Count data (text classification)

Quantifying Correlation GTU 2023W Q1c — Numerical!

Correlation

Measures the strength and direction of a linear relationship between two variables. Pearson correlation coefficient r ranges from −1 to +1.

r = +1: Perfect positive correlation | r = 0: No correlation | r = −1: Perfect negative correlation

r = [n·Σxy − Σx·Σy] / √{[n·Σx² − (Σx)²] × [n·Σy² − (Σy)²]}

GTU Winter 2023 Q1c asks you to calculate correlation coefficient. Practice this numerical thoroughly.

GTU WINTER 2023 EXACT NUMERICAL — Correlation Coefficient

Speed (x): 46, 52, 53, 54, 59, 62

Accel (y): 12, 14, 17, 18, 17, 22

n = 6

Step 1 — Build the table:

x	y	x²	y²	xy
46	12	2116	144	552
52	14	2704	196	728
53	17	2809	289	901
54	18	2916	324	972
59	17	3481	289	1003
62	22	3844	484	1364
Σx = 326	Σy = 100	Σx² = 17870	Σy² = 1726	Σxy = 5520

Step 2 — Apply formula:

Numerator = n·Σxy − Σx·Σy

= 6×5520 − 326×100

= 33120 − 32600 = 520

Denom part1 = n·Σx² − (Σx)² = 6×17870 − 326² = 107220 − 106276 = 944

Denom part2 = n·Σy² − (Σy)² = 6×1726 − 100² = 10356 − 10000 = 356

Denominator = √(944 × 356) = √(336064) = 579.71

r = 520 / 579.71 = 0.897

Conclusion: r ≈ 0.90 → Strong positive correlation between speed and acceleration.
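
To double-check the hand calculation, the same formula can be coded directly. This is a sketch of the computational formula used above, not a library routine; `pearson_r` is an illustrative name.

```python
import math

speed = [46, 52, 53, 54, 59, 62]   # x values from the question
accel = [12, 14, 17, 18, 17, 22]   # y values from the question

def pearson_r(x, y):
    """r = [n·Σxy − Σx·Σy] / sqrt([n·Σx² − (Σx)²] · [n·Σy² − (Σy)²])."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = n * sum_xy - sum_x * sum_y
    denominator = math.sqrt((n * sum_x2 - sum_x**2) * (n * sum_y2 - sum_y**2))
    return numerator / denominator

r = pearson_r(speed, accel)
print(round(r, 3))
```

The result agrees with the hand-worked value r ≈ 0.897.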

Interpretation of r:

r value	Interpretation
0.9 to 1.0	Very strong positive
0.7 to 0.9	Strong positive
0.4 to 0.7	Moderate positive
0 to 0.4	Weak positive
0	No linear correlation
−0.4 to 0	Weak negative
−0.7 to −0.4	Moderate negative
−1.0 to −0.7	Strong negative

Other Correlation Methods:

Spearman's Rank Correlation: Non-parametric; works on ranked data and measures monotonic (not only linear) relationships. Used when data is ordinal or not normally distributed.
Kendall's Tau: Based on concordant/discordant pairs. More robust for small datasets with ties.
Point-Biserial: Correlation between a continuous and a binary variable.

Detecting Outliers GTU 2023W Q1b / 2024W Q1b

Outlier

An outlier is a data point that significantly differs from other observations in a dataset. It can be caused by measurement error, data entry error, or genuine rare events.

Impact on Regression: Outliers can heavily distort regression lines, inflate standard errors, and mislead predictions. A single extreme point can pull the regression line toward it, giving a poor fit for the rest of the data.

Detection Techniques:

Z-Score Method: If |z| > 3, the point is an outlier. z = (x − μ) / σ. Works best for normally distributed data.
IQR Method (Box Plot):
- IQR = Q3 − Q1
- Lower bound = Q1 − 1.5 × IQR
- Upper bound = Q3 + 1.5 × IQR
- Any value outside these bounds = outlier.
Scatter Plot: Visually inspect data; outliers appear far from the cluster.
Box Plot: Points beyond the whiskers are outliers.
DBSCAN: Points that cannot be assigned to any cluster (noise points) are outliers.
Isolation Forest: ML method — outliers are isolated faster (fewer splits needed in a random tree).
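
The IQR method listed above can be sketched in a few lines. The data values are made up for illustration, and the quartiles come from Python's `statistics.quantiles` with `method="inclusive"`; textbooks use slightly different quartile conventions, so hand-computed Q1/Q3 may differ a little.

```python
import statistics

def iqr_outliers(values):
    """Return the values that fall outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower = q1 - 1.5 * iqr
    upper = q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]   # made-up sample with one extreme value
outliers = iqr_outliers(data)
print(outliers)
```

Note that on a sample this small the z-score rule (|z| > 3) would not flag the extreme value, because the outlier itself inflates σ; this is one reason the IQR rule is often preferred for small datasets.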

Handling Outliers: Remove them (if data entry error), cap them (winsorization), transform them (log transform), or use robust models (median instead of mean).

Reducing Data Dimensionality GTU 2023W Q2c / 2024W Q4a

Dimensionality Reduction

Reducing the number of input features (dimensions) in a dataset while retaining as much important information as possible. Reduces computation, prevents overfitting, and enables visualization.

Why needed?

High-dimensional data leads to the "curse of dimensionality" — models become slower and less accurate.
Many features may be redundant or irrelevant.
Visualization is only possible in 2D/3D.

Methods:

PCA (Principal Component Analysis):
- Linear technique. Finds new axes (principal components) that capture maximum variance.
- PC1 = direction of greatest variance. PC2 = orthogonal to PC1, next most variance.
- Reduces correlated features into uncorrelated components.
- Assumes linear relationships among features, so it can miss non-linear structure.
Factor Analysis:
- Identifies latent (hidden) factors that explain correlations among observed variables.
- E.g., "intelligence" as a hidden factor explaining scores in Math, Reading, Writing.
- Used for understanding underlying structure, not just compression.
Feature Selection vs Feature Extraction:

Feature	Feature Selection	Feature Extraction
Method	Selects a subset of original features	Creates new features from original
Interpretability	High (original features kept)	Lower (new abstract features)
Example	Select top 5 of 20 features	PCA transforms 20 features → 5 components
Techniques	Filter, Wrapper, Embedded methods	PCA, LDA, Autoencoders

Linear Algebra Approach (SVD / Matrix Decomposition): Singular Value Decomposition decomposes a matrix into U, Σ, Vᵀ. Keeping only the top-k singular values gives a lower-dimensional approximation. Used in recommendation systems, text analysis (LSA).

Regression Methods GTU 2023S Q2c / 2023W Q4c / 2024W Q2 OR

Regression

A statistical method to model the relationship between a dependent variable (output) and one or more independent variables (inputs). Used for prediction of continuous values.

1. Linear Regression:

Models relationship as a straight line. Assumes linear relationship between input and output.

Simple Linear: y = mx + b (one independent variable)
Multiple Linear: y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ

Example: Predict house price based on area. As area increases, price increases linearly.

2. Non-Linear Regression:

Models a curved (non-linear) relationship. The relationship between x and y follows a polynomial, exponential, or other non-linear function.

Example: y = ax² + bx + c (Polynomial regression). Used when data shows curves — e.g., population growth, drug dose-response curves.

3. Logistic Regression:

Despite the name, it is used for classification (predicting categorical outcomes). Outputs a probability between 0 and 1 using the sigmoid function.

Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)
Example: Predict if email is spam (1) or not (0). Predict if a patient has a disease.
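
A minimal sketch of how the sigmoid turns a score z into a class label. The 0.5 threshold and the function names are assumptions for illustration; in a fitted model z would be b₀ + b₁x₁ + ... + bₙxₙ.

```python
import math

def sigmoid(z):
    """σ(z) = 1 / (1 + e^(-z)), written to avoid overflow for large negative z."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_class(z, threshold=0.5):
    """Map the probability to class 1 (e.g. spam) or class 0 (not spam)."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0), predict_class(2.0), predict_class(-2.0))
```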

4. OLS (Ordinary Least Squares):

A method to estimate parameters in linear regression by minimizing the sum of squared residuals (differences between observed and predicted values).

Residual = actual y − predicted y. OLS minimizes Σ(residuals)².
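
For simple linear regression (y = mx + b), minimizing Σ(residuals)² has a closed-form solution: m = Σ(x − x̄)(y − ȳ) / Σ(x − x̄)² and b = ȳ − m·x̄. The sketch below applies it to made-up area/price numbers; `ols_fit` is an illustrative name.

```python
def ols_fit(x, y):
    """Return (m, b) minimizing the sum of squared residuals for y = m*x + b."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

area = [1, 2, 3, 4]     # made-up inputs
price = [3, 5, 7, 9]    # made-up outputs lying exactly on y = 2x + 1
m, b = ols_fit(area, price)
residuals = [yi - (m * xi + b) for xi, yi in zip(area, price)]
print(m, b)
```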

Method	Output	Use Case
Linear Regression	Continuous value	House price prediction
Non-linear Regression	Continuous value (curved)	Growth curves, biological models
Logistic Regression	Probability (0–1) → class	Spam detection, disease diagnosis
OLS	Regression coefficients	Parameter estimation

Time Series Analysis GTU 2023S Q2 OR / 2024W Q1c / 2025S Q4a

Time Series

A sequence of data points collected or recorded at successive time intervals. The order and time dependency of observations matters.

Examples: Daily stock prices, monthly sales, annual temperature, hourly website traffic, ECG signals.

Components of Time Series:

Trend: Long-term upward or downward movement. E.g., population growing over decades.
Seasonality: Regular repeating patterns tied to calendar time. E.g., higher ice cream sales in summer every year.
Cyclical: Long-term wave-like fluctuations (not fixed periods). E.g., economic boom-bust cycles over several years.
Irregular/Noise: Random, unpredictable variation. E.g., sudden spike in sales due to a viral social media post.

Time Series Analysis Techniques:

Moving Average (MA): Smooths data by averaging over a window of time periods. Removes short-term fluctuations to reveal trends.
Exponential Smoothing: Weighted average giving more importance to recent observations. Good for short-term forecasting.
ARIMA (Auto-Regressive Integrated Moving Average):
- AR: uses past values to predict future.
- I: differencing to make data stationary.
- MA: models error as a moving average of past errors.
Seasonal Decomposition: Separates the series into Trend + Seasonal + Residual components.
ACF/PACF: Autocorrelation and partial autocorrelation plots — used to identify the order of AR and MA components.
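
The two smoothing techniques above can be sketched as follows. The smoothing factor `alpha`, the window size, and the choice to start exponential smoothing at the first observation are common conventions assumed here for illustration.

```python
def moving_average(series, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]

def exponential_smoothing(series, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_(t-1), with s_0 = x_0."""
    smoothed = [series[0]]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

print(moving_average([1, 2, 3, 4, 5], window=3))
print(exponential_smoothing([10, 20, 30], alpha=0.5))
```

A larger `alpha` puts more weight on recent observations; a larger window gives a smoother but more delayed moving average.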

Time series is different from regular regression because observations are NOT independent — each value depends on previous ones.

Chapter 2

Using Clustering to Subdivide Data

Clustering Basics GTU Multiple Years

Cluster

A cluster is a group of data objects that are similar to each other within the group and dissimilar to objects in other groups. Clustering is an unsupervised learning method — no pre-defined labels.

Goal: Discover natural groupings in data without labels.

Applications: Customer segmentation, document grouping, gene expression analysis, anomaly detection, image segmentation.

Characteristics of Good Clustering:

High intra-cluster similarity (within group, objects are similar).
Low inter-cluster similarity (between groups, objects are dissimilar).
Scalable to large datasets.
Ability to handle noise and outliers.
Interpretable results.

Distance Measures:

Measure	Formula	Description
Euclidean	√Σ(xᵢ−yᵢ)²	Straight-line distance. Most common.
Manhattan	Σ\|xᵢ−yᵢ\|	Grid/city-block distance.
Minkowski	(Σ\|xᵢ−yᵢ\|ᵖ)^(1/p)	Generalizes Euclidean (p=2) and Manhattan (p=1).
Cosine	cos(θ) = (A·B)/(\|A\|\|B\|)	Angle between vectors. Used in text mining.
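
The four measures in the table can be written directly from their formulas. This is a small sketch with illustrative function names; cosine is returned as a similarity (1 = same direction), so a distance can be taken as 1 − similarity.

```python
import math

def minkowski(a, b, p):
    """(Σ|x_i − y_i|^p)^(1/p); p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)

def manhattan(a, b):
    return minkowski(a, b, 1)

def cosine_similarity(a, b):
    """cos(θ) = (A·B) / (|A||B|)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(euclidean((0, 0), (3, 4)), manhattan((0, 0), (3, 4)))
print(cosine_similarity((1, 0), (0, 1)))
```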

Types of Clustering:

Type	Description	Example Algorithm
Partitioning	Divide n objects into k clusters	K-Means, K-Medoids (PAM)
Hierarchical	Build tree of clusters (dendrogram)	Agglomerative, Divisive
Density-based	Clusters = dense regions separated by low-density regions	DBSCAN
Grid-based	Quantize space into grid cells	STING, CLIQUE
Model-based	Assume data fits a statistical model	EM algorithm, Gaussian Mixture

K-Means Clustering Algorithm GTU 2023W Q4c / 2025W Q5c

K-Means partitions n data points into k clusters by minimizing the sum of squared distances from each point to its cluster centroid (mean).

Steps:

Choose k (number of clusters).
Randomly initialize k centroids.
Assignment step: Assign each data point to the nearest centroid.
Update step: Recalculate each centroid as the mean of all points assigned to it.
Repeat steps 3–4 until centroids no longer change (convergence).

K-Means Numerical Example

Data points: A(1,1), B(2,1), C(4,3), D(5,4) | k=2

Initial centroids: C1=(1,1), C2=(5,4)

Iteration 1 — Assignment:

A(1,1): d(C1)=0, d(C2)=√25=5 → Cluster 1

B(2,1): d(C1)=1, d(C2)=√18=4.24 → Cluster 1

C(4,3): d(C1)=√13=3.6, d(C2)=√2=1.4 → Cluster 2

D(5,4): d(C1)=√25=5, d(C2)=0 → Cluster 2

Iteration 1 — Update centroids:

C1 = mean(A,B) = ((1+2)/2, (1+1)/2) = (1.5, 1)

C2 = mean(C,D) = ((4+5)/2, (3+4)/2) = (4.5, 3.5)

Iteration 2 — Reassign (check if same):

A(1,1): d(C1)=0.5, d(C2)=4.30 → Cluster 1 ✓

B(2,1): d(C1)=0.5, d(C2)=3.54 → Cluster 1 ✓

C(4,3): d(C1)=3.20, d(C2)=0.71 → Cluster 2 ✓

D(5,4): d(C1)=4.61, d(C2)=0.71 → Cluster 2 ✓

No change → Converged! Final: {A,B} and {C,D}
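
The worked example above can be reproduced with a short sketch of the assignment and update steps. It assumes Euclidean distance and the same fixed initial centroids (no random initialization); `kmeans` is an illustrative name and `math.dist` needs Python 3.8 or later.

```python
import math

def kmeans(points, centroids, max_iter=100):
    """Alternate assignment and update steps until the centroids stop changing."""
    for _ in range(max_iter):
        # Assignment step: put each point in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)),
                          key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: new centroid = mean of the assigned points
        # (an empty cluster keeps its old centroid).
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster))
            if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:   # convergence
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]          # A, B, C, D
final_centroids, final_clusters = kmeans(points, [(1, 1), (5, 4)])
print(final_centroids)
print(final_clusters)
```

The run ends with clusters {A, B} and {C, D} and centroids (1.5, 1) and (4.5, 3.5), matching the hand calculation.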

Limitations:

Must specify k in advance.
Sensitive to initial centroid placement.
Assumes spherical clusters of similar size.
Sensitive to outliers (outlier can pull centroid).
Not suitable for non-convex shaped clusters.

DBSCAN (Density-Based Clustering) GTU 2024W Q2b / 2025W Q3b

DBSCAN

Density-Based Spatial Clustering of Applications with Noise. Groups points that are closely packed together and marks low-density points as noise (outliers). Does NOT require specifying k in advance.

Key Parameters:

ε (epsilon): Radius of the neighborhood around a point.
MinPts: Minimum number of points required to form a dense region (core point).

Point Types:

Core Point: Has at least MinPts points within distance ε (including itself).
Border Point: Within ε of a core point, but has fewer than MinPts neighbors.
Noise Point (Outlier): Not a core point and not within ε of any core point.

Algorithm:

Start with an unvisited point.
If it's a core point, create a new cluster and expand it by adding all density-reachable points.
If it's a border point, assign it to the cluster of a core point that reaches it.
If it's a noise point, label it as noise.
Repeat for all unvisited points.
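
The steps above can be sketched in plain Python. This is a simplified O(n²) illustration under stated assumptions (Euclidean distance, a point counts as its own neighbor, labels are integers with −1 for noise); it is not how optimized libraries implement DBSCAN, and the sample points are made up.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: 0, 1, ... for clusters, -1 for noise."""
    def neighbors(i):
        # Indices of all points within eps of point i (point i itself included).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    labels = [None] * len(points)        # None means "not visited yet"
    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may be relabeled later as a border point
            continue
        cluster_id += 1                  # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id   # border point reached from a core point
            if labels[j] is not None:
                continue                 # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)   # j is also core: keep expanding
    return labels

points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11),
          (50, 50)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The two tight squares become two clusters and the isolated point (50, 50) is labeled noise, without k being specified anywhere.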

Advantages over K-Means: Discovers clusters of arbitrary shape, automatically finds number of clusters, handles noise/outliers well.

Disadvantage: Struggles with varying density clusters and high-dimensional data.

KDE — Kernel Density Estimation GTU 2023S Q3b / 2025W Q3b

KDE

A non-parametric method to estimate the probability density function of a dataset. Instead of using a histogram (hard bins), KDE places a smooth "kernel" (usually Gaussian) at each data point and sums them up.

How it works:

Place a kernel (smooth bump function) centered at each data point.
Sum all kernels — regions with many points have high density.
Peaks in the density estimate suggest cluster centers.

Bandwidth (h): Controls smoothness. Small h → spiky/overfit. Large h → over-smooth. Often chosen by cross-validation or a rule of thumb such as Silverman's.

Use in Clustering: KDE identifies modes (peaks) in the density, which correspond to cluster centers. Useful for finding natural groups without specifying k.
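
A minimal one-dimensional sketch of the idea, assuming a Gaussian kernel: f(x) = (1 / (n·h)) · Σ K((x − xᵢ) / h), where K is the standard normal density. The sample values and bandwidth are made up for illustration.

```python
import math

def gaussian_kde(data, h):
    """Return a function estimating the density of `data` with bandwidth h."""
    n = len(data)
    norm = 1.0 / (n * h * math.sqrt(2 * math.pi))

    def density(x):
        # Sum one Gaussian bump per data point, evaluated at x.
        return norm * sum(math.exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)

    return density

sample = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]   # two made-up groups, near 1 and near 5
density = gaussian_kde(sample, h=0.5)
print(density(1.0), density(3.0), density(5.0))
```

The estimate is high near 1 and near 5 and close to zero at 3, so the two modes line up with the two natural groups.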

Hierarchical Clustering GTU 2024 Q3c / 2023W Q3c

Creates a tree of clusters called a dendrogram. Does not require specifying k. User can cut the dendrogram at any level to get any number of clusters.

1. Agglomerative (Bottom-up) — Most Common:

Start: each point is its own cluster.
Find the two closest clusters and merge them.
Repeat until all points are in one cluster.
Cut the dendrogram at desired level to get clusters.

Linkage Criteria (how to measure distance between clusters):

Linkage	Distance Between Clusters	Effect
Single	Min distance between any two points	Chaining effect, elongated clusters
Complete	Max distance between any two points	Compact, spherical clusters
Average	Average of all pairwise distances	Balanced compromise
Ward's	Minimizes total within-cluster variance	Compact, equal-size clusters

2. Divisive (Top-down): Start with all points in one cluster, recursively split into smaller clusters. Less common.

Strengths: No need to specify k, produces intuitive dendrogram, can capture hierarchical structure.
Weaknesses: At least O(n²) time and memory (the naive agglomerative algorithm is O(n³)), sensitive to noise, cannot undo a wrong merge.
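
The agglomerative procedure can be sketched as below, using single linkage and stopping when k clusters remain (equivalent to cutting the dendrogram at that level). This is a naive illustration, not an efficient implementation; it reuses the four points A to D from the K-Means numerical.

```python
import math

def single_linkage(points, k):
    """Start with one cluster per point and merge the two closest until k remain."""
    clusters = [[p] for p in points]

    def cluster_distance(a, b):
        # Single linkage: minimum distance between any pair of points.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        i, j = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]   # merge j into i
        del clusters[j]                           # safe because j > i
    return clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]   # A, B, C, D
result = single_linkage(points, k=2)
print(result)
```

Swapping `min` for `max` inside `cluster_distance` would give complete linkage instead.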

Random Forest Algorithm GTU 2023W Q2c / 2024 Q2c / 2025S Q2c

Random Forest

An ensemble learning method that builds multiple decision trees during training and combines their outputs (majority vote for classification, average for regression). "Random" because each tree is built on a random subset of data and features.

How It Works:

Bootstrap Sampling (Bagging): Create n random subsets of the training data (with replacement). Train one decision tree on each subset.
Feature Randomness: At each node split, only a random subset of features is considered (typically √total_features for classification). This decorrelates the trees.
Prediction:
- Classification: Majority vote across all trees.
- Regression: Average of all tree predictions.

Training Data
     │
     ├── Bootstrap Sample 1 → Decision Tree 1 → Prediction 1
     ├── Bootstrap Sample 2 → Decision Tree 2 → Prediction 2
     ├── Bootstrap Sample 3 → Decision Tree 3 → Prediction 3
     │        ...                  ...               ...
     └── Bootstrap Sample n → Decision Tree n → Prediction n
                                                      │
                                              Majority Vote / Average
                                                      │
                                               FINAL PREDICTION
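
The three ingredients in the diagram (bootstrap sampling, random feature subsets, and voting) can be sketched on their own. Training the individual decision trees is deliberately left out, and all function names here are illustrative assumptions rather than any library's API.

```python
import random
from collections import Counter

def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) feature indices to consider at one split."""
    size = max(1, round(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), size))

def majority_vote(predictions):
    """Classification: the most common prediction across trees wins."""
    return Counter(predictions).most_common(1)[0][0]

def average(predictions):
    """Regression: the mean of all tree predictions."""
    return sum(predictions) / len(predictions)

rng = random.Random(0)                    # fixed seed so runs are repeatable
rows = list(range(10))                    # stand-in for 10 training rows
sample = bootstrap_sample(rows, rng)
features = random_feature_subset(9, rng)
print(sample, features)
print(majority_vote(["spam", "ham", "spam"]), average([2.0, 4.0, 6.0]))
```

Because sampling is with replacement, a bootstrap sample usually repeats some rows and leaves others out, which is what makes the trees differ from one another.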

Key Hyperparameters:

n_estimators: Number of trees. More trees generally give more stable predictions, with diminishing returns and higher training and prediction cost.
max_features: Number of features considered at each split.
max_depth: Maximum depth of each tree (controls overfitting).

Advantages:

Handles high-dimensional data well.
Resistant to overfitting (due to averaging).
Can compute feature importance.
Works well even without parameter tuning.
Maintains accuracy on large datasets; some implementations also handle missing values.

Disadvantages:

Not interpretable (black box).
Computationally expensive for very large forests.
Slower than a single decision tree.

GTU asks "Explain working of Random Forest" almost every year. Always include the diagram above and the 3-step process (Bootstrap → Tree → Vote).

Chapter 3

Modeling Instances (Classification & Nearest Neighbor)

Clustering vs Classification GTU 2023S Q1b / 2023W Q3b / 2024W Q2a

Feature	Clustering	Classification
Learning Type	Unsupervised (no labels)	Supervised (labeled data)
Goal	Discover natural groups	Assign data to known classes
Output	Groups/clusters (unlabeled)	Class labels
Prior Knowledge	No class labels needed	Training data with labels required
Process	Single step (no training)	Two-step: train model → predict
Example	Group customers by behavior	Identify if customer will churn (yes/no)
Algorithms	K-Means, DBSCAN, Hierarchical	KNN, Decision Tree, Random Forest

Classification — Two-Step Process:

Training (Model Construction): A classifier is built using labeled training data. The model learns patterns that distinguish between classes.
Testing (Prediction): The trained model is applied to new, unseen data to predict class labels. Accuracy is evaluated on the test set.

Overfitting — Problem & Solutions GTU 2023S Q3a / 2025S Q3a

Overfitting

Overfitting occurs when a model learns the training data too well — including its noise and random fluctuations — and performs poorly on new, unseen data. The model is too complex.

Underfitting (opposite)

Model is too simple to capture the underlying patterns. Performs poorly on both training and test data.

Signs of Overfitting: Very high training accuracy, much lower test accuracy. The gap between training error and test error is large.

Causes: Too many features, too deep a model, too small a training dataset, training for too long.

Methods to Avoid Overfitting:

Cross-Validation: Use k-fold CV to evaluate model on multiple subsets of data.
Pruning: In decision trees, remove branches that have little power to classify data.
Regularization: Add a penalty for model complexity (L1/Lasso, L2/Ridge).
Feature Selection: Remove irrelevant or redundant features.
Get More Data: More training examples help the model generalize.
Ensemble Methods: Combine many models (e.g., Random Forest) — averaging reduces variance.
Early Stopping: In neural networks, stop training when validation error starts increasing.
Dropout: In neural networks, randomly deactivate neurons during training.

Overgeneralization (opposite problem)

Model generalizes too broadly and misses important distinctions. E.g., model classifies all animals as "dog" because most training examples were dogs.

Nearest Neighbor Analysis GTU 2023W Q3c / 2024W Q3b

Nearest Neighbor Analysis

A technique that identifies the closest data points to a query point based on a distance metric. Used to analyze spatial distribution, classify new points, or find similar records.

Basic Concept: Given a new unknown point, find the k most similar (nearest) known points and use their characteristics to predict or describe the unknown.

Applications:

Geographic analysis (nearest hospital, nearest store).
Recommendation systems (users similar to you liked X).
Anomaly detection (point with no near neighbors = outlier).
Data imputation (fill missing values with nearest neighbor's value).

Types of NN Analysis:

1-NN: Find the single nearest neighbor.
K-NN: Find k nearest neighbors and vote.
Average NN: Use average of multiple neighbors.
Radius-NN: All points within a distance r.

Average Nearest Neighbor (ANN) Algorithm GTU 2023S Q3c / 2025S Q3b

ANN Algorithm

Instead of using the single nearest neighbor, ANN calculates the average distance to all neighbors (or a set number of neighbors) and uses the average for classification or prediction. Reduces the impact of noisy/outlier neighbors.

How ANN Works:

For a new query point q, find all neighbors within a radius or find k nearest neighbors.
Compute the average feature values (or average distance) of those neighbors.
Assign the class that is most represented on average.
Or, for regression, predict the average target value of the neighbors.

Example:

ANN Example

Training data (with class labels):

P1(1,2) → Class A | P2(2,3) → Class A | P3(3,1) → Class B

P4(5,4) → Class B | P5(6,5) → Class B

Query point: Q(3,3), k=3

d(Q,P1) = √(4+1) = 2.24

d(Q,P2) = √(1+0) = 1.00

d(Q,P3) = √(0+4) = 2.00

d(Q,P4) = √(4+1) = 2.24

d(Q,P5) = √(9+4) = 3.61

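3 nearest: P2(A,1.00), P3(B,2.00), P1(A,2.24). Note: P1 and P4 tie at 2.24; P1 is taken here, and taking P4 instead would flip the vote to Class B.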

Average distance = (1.00+2.00+2.24)/3 = 1.75

Class A has 2 votes, Class B has 1 vote → Classify Q as Class A
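
The example can be checked with a short sketch. One assumption matters here: P1 and P4 are equally far from Q, and the code breaks the tie by listing order (Python's sort is stable), which selects P1 as in the working above.

```python
import math
from collections import Counter

training = [
    ("P1", (1, 2), "A"),
    ("P2", (2, 3), "A"),
    ("P3", (3, 1), "B"),
    ("P4", (5, 4), "B"),
    ("P5", (6, 5), "B"),
]
query = (3, 3)
k = 3

# Sort by distance to the query; ties keep their original listing order.
ranked = sorted(training, key=lambda row: math.dist(query, row[1]))
nearest = ranked[:k]

average_distance = sum(math.dist(query, point) for _, point, _ in nearest) / k
votes = Counter(label for _, _, label in nearest)
predicted = votes.most_common(1)[0][0]

print([name for name, _, _ in nearest], round(average_distance, 2), predicted)
```

With an exact tie like this, the predicted class depends on the tie-breaking rule, so it is worth stating the rule explicitly in an exam answer.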

ANN vs Standard NN: Standard NN uses only the single closest point and is sensitive to noise. ANN uses multiple neighbors, making it more robust. However, ANN requires more computation.

K-Nearest Neighbor (KNN) Algorithm GTU 2024 Q3c / 2024W Q2c / 2025S Q2 OR / 2025W Q3c

K-Nearest Neighbor

A simple, non-parametric, lazy learning algorithm. For classification: assigns the class of the majority of k nearest training points. For regression: predicts the average value of k nearest points. No explicit training phase — it memorizes all training data.

Algorithm Steps:

Choose the value of k.
Calculate the distance from the new point to every training point.
Sort the distances and identify k nearest neighbors.
For classification: take majority vote among k neighbors.
For regression: compute average of k neighbors' values.

KNN Numerical Example (Classification, k=3)

Training dataset:

ID	Height	Weight	Class
P1	150	50	Thin
P2	160	60	Normal
P3	170	70	Normal
P4	180	90	Fat
P5	190	100	Fat

New person: Height=165, Weight=65. k=3

Euclidean distances:

d(P1) = √((165−150)²+(65−50)²) = √(225+225) = 21.2

d(P2) = √((165−160)²+(65−60)²) = √(25+25) = 7.07

d(P3) = √((165−170)²+(65−70)²) = √(25+25) = 7.07

d(P4) = √((165−180)²+(65−90)²) = √(225+625) = 29.2

d(P5) = √((165−190)²+(65−100)²) = √(625+1225)= 43.0

3 Nearest: P2(Normal,7.07), P3(Normal,7.07), P1(Thin,21.2)

Votes: Normal=2, Thin=1

Prediction: New person is → Normal
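
The same numerical can be sketched in code. It assumes Euclidean distance on the raw (unscaled) features, as in the working above; `knn_classify` is an illustrative name.

```python
import math
from collections import Counter

# (height, weight) -> class, taken from the training dataset above.
training = [
    ((150, 50), "Thin"),
    ((160, 60), "Normal"),
    ((170, 70), "Normal"),
    ((180, 90), "Fat"),
    ((190, 100), "Fat"),
]

def knn_classify(query, data, k):
    """Majority vote among the k training points closest to the query."""
    by_distance = sorted(data, key=lambda row: math.dist(query, row[0]))
    votes = Counter(label for _, label in by_distance[:k])
    return votes.most_common(1)[0][0]

prediction = knn_classify((165, 65), training, k=3)
print(prediction)
```

There is no training step: every call to `knn_classify` recomputes all distances, which is why KNN is described as a lazy learner with slow prediction.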

Choosing k:

Small k (k=1): Very sensitive to noise, complex boundary.
Large k: Smoother boundary, may include irrelevant neighbors.
Rule of thumb: k = √n. Always choose odd k for binary classification (avoid ties).
Best: Use cross-validation to find optimal k.

Feature	KNN
Type	Supervised, lazy learning
Training	None — memorizes data
Prediction	Slow (computes all distances)
Advantages	Simple, no training time, naturally handles multi-class
Disadvantages	Slow prediction, sensitive to irrelevant features, needs feature scaling
Applications	Recommendation systems, medical diagnosis, image recognition

KNN is asked in virtually every GTU exam. Always draw the distance table clearly, show the 3 nearest, and state the majority vote. Feature scaling (normalization) is important — mention it.

Feature Scaling in KNN: Since KNN is distance-based, attributes with larger ranges dominate. Always normalize features before applying KNN. Use Min-Max normalization: x' = (x − min) / (max − min).
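
A minimal sketch of Min-Max normalization applied to one feature column. Returning zeros when all values are equal is an assumption made here to avoid dividing by zero.

```python
def min_max_scale(values):
    """x' = (x - min) / (max - min), mapping the column onto [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]   # constant column: no spread to scale
    return [(v - lo) / (hi - lo) for v in values]

heights = [150, 160, 170, 180, 190]
scaled = min_max_scale(heights)
print(scaled)
```

In practice the min and max should come from the training data only and then be reused to scale any new query point.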

Real-World Problems — E-commerce Applications GTU 2023S Q1c / 2023W Q5c

Data Science for E-commerce Growth:

Customer Segmentation (Clustering): K-Means segments customers by RFM (Recency, Frequency, Monetary value). Different marketing strategies for each segment (VIP customers, discount-seekers, churned customers).
Recommendation Systems (KNN / Collaborative Filtering): "Customers like you also bought X." Netflix-style recommendations based on user similarity. Increases average order value.
Churn Prediction (Classification): Predict which customers are likely to stop buying. Logistic regression / Random Forest on behavioral data. Proactive retention: targeted discounts, emails.
Fraud Detection (Outlier Detection / Classification): Unusual transactions flagged as potential fraud. Isolation Forest, DBSCAN for anomaly detection. Saves revenue and protects customers.
Demand Forecasting (Time Series): ARIMA/LSTM models predict future demand. Optimizes inventory, reduces stockouts and overstock.
A/B Testing (Inferential Statistics): Test two versions of a webpage and determine which converts better. Statistical significance testing (p-values, confidence intervals).
Price Optimization (Regression): Predict optimal prices using regression models on competitor prices, demand elasticity, and seasonality.
AARRR Funnel (E-commerce Metrics):
- Acquisition: How do users find you?
- Activation: Do they have a good first experience?
- Retention: Do they come back?
- Referral: Do they tell others?
- Revenue: Do they pay?

For e-commerce questions, always mention: customer segmentation, churn prediction, recommendation systems, and fraud detection — these 4 cover most GTU answers.

Data Analysis & Visualization: Mid-Sem Guide
