GTU · BE SEM-VI · Subject Code 3161613

Data Analysis & Visualization: Mid-Sem Guide

Chapters 1, 2 & 3 · Theory + Numericals + GTU Paper Focus

Chapter 1

Math, Probability & Statistical Modeling

Probability & Inferential Statistics GTU 2024 Q1a / 2025S Q1a
Probability
Probability is the measure of the likelihood that a specific event will occur. It ranges from 0 (impossible) to 1 (certain).

Formula: P(A) = (Number of favorable outcomes) / (Total number of outcomes)

Key Characteristics of Probability:

  1. 0 ≤ P(A) ≤ 1 for any event A.
  2. P(sample space) = 1 (something must happen).
  3. If A and B are mutually exclusive: P(A or B) = P(A) + P(B).
  4. Complement rule: P(not A) = 1 − P(A).

Types of Probability:

  • Classical: All outcomes equally likely (fair coin, dice).
  • Empirical: Based on observed frequency. P(A) = frequency of A / total observations.
  • Subjective: Based on personal judgment or expert opinion.
  • Conditional: P(A|B) = P(A ∩ B) / P(B) — probability of A given B has occurred.
Descriptive Statistics
Describes and summarizes the main features of a dataset. Includes mean, median, mode, variance, standard deviation, range. Does NOT make predictions beyond the given data.
Inferential Statistics
Uses sample data to make inferences (generalizations) about a larger population. Involves hypothesis testing, confidence intervals, and regression. Example: Using a sample of 500 customers to predict buying behavior of 1 million customers.
Feature | Descriptive Statistics | Inferential Statistics
───────────────────────────
Purpose | Describe/summarize data | Draw conclusions about population
Scope | Only given dataset | Generalizes beyond the sample
Tools | Mean, median, mode, SD | Hypothesis testing, regression, CI
Example | Average score of 30 students | Predict nationwide average from 30 students
Probability Distributions GTU 2024 Q3a

1. Normal Distribution (Gaussian):

Bell-shaped, symmetric curve. Mean = Median = Mode. Defined by mean (μ) and standard deviation (σ). 68% of data lies within ±1σ, 95% within ±2σ, 99.7% within ±3σ. Example: heights of people, exam scores.
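
The 68 / 95 / 99.7 rule above can be checked numerically. The following is a minimal sketch using Python's standard-library `statistics.NormalDist`; the helper name `within_k_sigma` is only an illustrative choice, not part of the syllabus.

```python
from statistics import NormalDist

def within_k_sigma(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    z = NormalDist(mu=0.0, sigma=1.0)  # standard normal; the answer is the same for any mu, sigma
    return z.cdf(k) - z.cdf(-k)

for k in (1, 2, 3):
    print(f"within ±{k}σ: {within_k_sigma(k):.4f}")
```

The printed values should round to the 68%, 95% and 99.7% figures quoted in the text.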

2. Binomial Distribution:

Models the number of successes in n independent trials, each with probability p of success.
Formula: P(X=k) = C(n,k) × pᵏ × (1−p)ⁿ⁻ᵏ
Example: Probability of getting exactly 3 heads in 5 coin flips.
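
The coin-flip example can be worked directly from the formula. A short sketch, assuming a fair coin (p = 0.5); the function name `binomial_pmf` is illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Exactly 3 heads in 5 flips of a fair coin: C(5,3) x 0.5^3 x 0.5^2 = 10/32
p_three_heads = binomial_pmf(3, 5, 0.5)
print(p_three_heads)  # 0.3125
```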

3. Categorical Distribution:

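Generalization of the Bernoulli distribution to more than 2 categories: a single trial whose outcome belongs to one of k categories, each with its own probability. (Its multi-trial counterpart, which generalizes the binomial, is the multinomial distribution.) Example: Rolling a die (6 categories), predicting weather as Sunny/Rainy/Cloudy.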

Naive Bayes Variants:

Variant | Assumes Features Follow | Used For
───────────────────────────
GaussianNB | Normal (Gaussian) distribution | Continuous features
BernoulliNB | Bernoulli (binary) distribution | Binary/boolean features (0 or 1)
MultinomialNB | Multinomial distribution | Count data (text classification)
Quantifying Correlation GTU 2023W Q1c — Numerical!
Correlation
Measures the strength and direction of a linear relationship between two variables. Pearson correlation coefficient r ranges from −1 to +1.

r = +1: Perfect positive correlation | r = 0: No correlation | r = −1: Perfect negative correlation
r = [n·Σxy − Σx·Σy] / √{[n·Σx² − (Σx)²] × [n·Σy² − (Σy)²]}
GTU Winter 2023 Q1c asks you to calculate correlation coefficient. Practice this numerical thoroughly.
GTU WINTER 2023 EXACT NUMERICAL — Correlation Coefficient
Speed (x): 46, 52, 53, 54, 59, 62
Accel (y): 12, 14, 17, 18, 17, 22
n = 6
Step 1 — Build the table:
x y x² y² xy
46 12 2116 144 552
52 14 2704 196 728
53 17 2809 289 901
54 18 2916 324 972
59 17 3481 289 1003
62 22 3844 484 1364
──────────────────────────────────────
Σx = 326 Σy = 100
Σx² = 17870 Σy² = 1726 Σxy = 5520
Step 2 — Apply formula:
Numerator = n·Σxy − Σx·Σy
= 6×5520 − 326×100
= 33120 − 32600 = 520
Denom part1 = n·Σx² − (Σx)² = 6×17870 − 326² = 107220 − 106276 = 944
Denom part2 = n·Σy² − (Σy)² = 6×1726 − 100² = 10356 − 10000 = 356
Denominator = √(944 × 356) = √(336064) = 579.71
r = 520 / 579.88 = 0.897
Conclusion: r ≈ 0.90 → Strong positive correlation between speed and acceleration.
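
To double-check the hand calculation, the same formula can be coded in a few lines. This is a sketch using only the standard library; `pearson_r` is an illustrative name.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation via the computational (sum-based) formula."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    numerator = n * sxy - sx * sy
    denominator = sqrt((n * sxx - sx**2) * (n * syy - sy**2))
    return numerator / denominator

speed = [46, 52, 53, 54, 59, 62]
accel = [12, 14, 17, 18, 17, 22]
r = pearson_r(speed, accel)
print(round(r, 3))  # 0.897
```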

Interpretation of r:

r value | Interpretation
───────────────────────────
0.9 to 1.0 | Very strong positive
0.7 to 0.9 | Strong positive
0.4 to 0.7 | Moderate positive
0 to 0.4 | Weak positive
0 | No linear correlation
−0.4 to 0 | Weak negative
−0.7 to −0.4 | Moderate negative
−1.0 to −0.7 | Strong negative

Other Correlation Methods:

  • Spearman's Rank Correlation: Non-parametric; works on ranked data. Used when data is not normally distributed.
  • Kendall's Tau: Based on concordant/discordant pairs. More robust for small datasets with ties.
  • Point-Biserial: Correlation between a continuous and a binary variable.
Detecting Outliers GTU 2023W Q1b / 2024W Q1b
Outlier
An outlier is a data point that significantly differs from other observations in a dataset. It can be caused by measurement error, data entry error, or genuine rare events.

Impact on Regression: Outliers can heavily distort regression lines, inflate standard errors, and mislead predictions. A single extreme point can pull the regression line toward it, giving a poor fit for the rest of the data.

Detection Techniques:

  1. Z-Score Method: If |z| > 3, the point is an outlier. z = (x − μ) / σ. Works best for normally distributed data.
  2. IQR Method (Box Plot):
    • IQR = Q3 − Q1
    • Lower bound = Q1 − 1.5 × IQR
    • Upper bound = Q3 + 1.5 × IQR
    • Any value outside these bounds = outlier.
  3. Scatter Plot: Visually inspect data; outliers appear far from the cluster.
  4. Box Plot: Points beyond the whiskers are outliers.
  5. DBSCAN: Points that cannot be assigned to any cluster (noise points) are outliers.
  6. Isolation Forest: ML method — outliers are isolated faster (fewer splits needed in a random tree).
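
The IQR method from the list above can be sketched as follows. The data values are made up for illustration, and `statistics.quantiles` uses its default "exclusive" quartile convention; other textbooks and tools compute Q1 and Q3 slightly differently, so bounds may differ a little.

```python
from statistics import quantiles

def iqr_outliers(values):
    """Return the values lying outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    q1, _, q3 = quantiles(values, n=4)  # three quartile cut points
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lower or v > upper]

data = [10, 12, 12, 13, 12, 11, 14, 13, 15, 102]  # hypothetical measurements
print(iqr_outliers(data))  # [102]
```

For this sample the |z| > 3 rule would not flag 102, because the outlier itself inflates the standard deviation; that is one reason the IQR method is often preferred for small datasets.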

Handling Outliers: Remove them (if data entry error), cap them (winsorization), transform them (log transform), or use robust models (median instead of mean).

Reducing Data Dimensionality GTU 2023W Q2c / 2024W Q4a
Dimensionality Reduction
Reducing the number of input features (dimensions) in a dataset while retaining as much important information as possible. Reduces computation, prevents overfitting, and enables visualization.

Why needed?

  • High-dimensional data leads to the "curse of dimensionality" — models become slower and less accurate.
  • Many features may be redundant or irrelevant.
  • Visualization is only possible in 2D/3D.

Methods:

  1. PCA (Principal Component Analysis):
    • Linear technique. Finds new axes (principal components) that capture maximum variance.
    • PC1 = direction of greatest variance. PC2 = orthogonal to PC1, next most variance.
    • Reduces correlated features into uncorrelated components.
    • Assumes linear relationships among features; it cannot capture non-linear structure.
  2. Factor Analysis:
    • Identifies latent (hidden) factors that explain correlations among observed variables.
    • E.g., "intelligence" as a hidden factor explaining scores in Math, Reading, Writing.
    • Used for understanding underlying structure, not just compression.
  3. Feature Selection vs Feature Extraction:
Aspect | Feature Selection | Feature Extraction
───────────────────────────
Method | Selects a subset of original features | Creates new features from original
Interpretability | High (original features kept) | Lower (new abstract features)
Example | Select top 5 of 20 features | PCA transforms 20 features → 5 components
Techniques | Filter, Wrapper, Embedded methods | PCA, LDA, Autoencoders
  4. Linear Algebra Approach (SVD / Matrix Decomposition): Singular Value Decomposition decomposes a matrix into U, Σ, Vᵀ. Keeping only the top-k singular values gives a lower-dimensional approximation. Used in recommendation systems, text analysis (LSA).
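
As a small illustration of PCA, the two-feature case can be solved by hand: the principal-component variances are the eigenvalues of the 2×2 covariance matrix. The sketch below reuses the speed/acceleration numbers from the correlation example purely as sample data; the function name is illustrative, and real work would normally use a library routine.

```python
from math import sqrt

def pca_2d_variances(xs, ys):
    """Return (lambda1, lambda2): the variance captured by PC1 and PC2 for 2-D data."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    # Sample covariance matrix [[a, b], [b, c]]
    a = sum((x - mx) ** 2 for x in xs) / (n - 1)
    c = sum((y - my) ** 2 for y in ys) / (n - 1)
    b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (n - 1)
    # Closed-form eigenvalues of a symmetric 2x2 matrix
    mid = (a + c) / 2
    half_gap = sqrt(((a - c) / 2) ** 2 + b**2)
    return mid + half_gap, mid - half_gap

speed = [46, 52, 53, 54, 59, 62]
accel = [12, 14, 17, 18, 17, 22]
l1, l2 = pca_2d_variances(speed, accel)
print(f"PC1 explains {l1 / (l1 + l2):.1%} of the total variance")
```

Because the two features are strongly correlated, most of the variance falls on PC1, which is why a single component can replace both with little loss.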
Regression Methods GTU 2023S Q2c / 2023W Q4c / 2024W Q2 OR
Regression
A statistical method to model the relationship between a dependent variable (output) and one or more independent variables (inputs). Used for prediction of continuous values.

1. Linear Regression:

Models relationship as a straight line. Assumes linear relationship between input and output.

Simple Linear: y = mx + b (one independent variable)
Multiple Linear: y = b₀ + b₁x₁ + b₂x₂ + ... + bₙxₙ

Example: Predict house price based on area. As area increases, price increases linearly.

2. Non-Linear Regression:

Models a curved (non-linear) relationship. The relationship between x and y follows a polynomial, exponential, or other non-linear function.

Example: y = ax² + bx + c (Polynomial regression). Used when data shows curves — e.g., population growth, drug dose-response curves.

3. Logistic Regression:

Despite the name, it is used for classification (predicting categorical outcomes). Outputs a probability between 0 and 1 using the sigmoid function.

Sigmoid: σ(z) = 1 / (1 + e⁻ᶻ)
Example: Predict if email is spam (1) or not (0). Predict if a patient has a disease.
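
A minimal sketch of the sigmoid and the usual 0.5 decision threshold. The names `sigmoid` and `predict_class` and the threshold default are illustrative assumptions; `z` stands for the linear score b₀ + b₁x₁ + ... computed elsewhere.

```python
from math import exp

def sigmoid(z: float) -> float:
    """Map any real score z to a probability in (0, 1)."""
    # Fine for moderate z; extremely negative z would overflow exp() in this simple form.
    return 1.0 / (1.0 + exp(-z))

def predict_class(z: float, threshold: float = 0.5) -> int:
    """Return 1 (e.g. spam) if the predicted probability reaches the threshold, else 0."""
    return 1 if sigmoid(z) >= threshold else 0

print(sigmoid(0.0))  # 0.5
```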

4. OLS (Ordinary Least Squares):

A method to estimate parameters in linear regression by minimizing the sum of squared residuals (differences between observed and predicted values).

Residual = actual y − predicted y. OLS minimizes Σ(residuals)².
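
For simple linear regression, the OLS estimates have a closed form: m = Σ(x−x̄)(y−ȳ) / Σ(x−x̄)² and b = ȳ − m·x̄. The sketch below applies it to made-up area/price numbers that lie exactly on a line; `ols_fit` is an illustrative name.

```python
def ols_fit(xs, ys):
    """Slope m and intercept b minimizing the sum of squared residuals for y = m*x + b."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    m = sxy / sxx
    b = my - m * mx
    return m, b

area = [1.0, 2.0, 3.0, 4.0, 5.0]     # hypothetical inputs
price = [3.0, 5.0, 7.0, 9.0, 11.0]   # hypothetical outputs (exactly 2*x + 1)
m, b = ols_fit(area, price)
print(m, b)  # 2.0 1.0

# Sum of squared residuals; zero here because the points are perfectly linear
sse = sum((y - (m * x + b)) ** 2 for x, y in zip(area, price))
```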
Method | Output | Use Case
───────────────────────────
Linear Regression | Continuous value | House price prediction
Non-linear Regression | Continuous value (curved) | Growth curves, biological models
Logistic Regression | Probability (0–1) → class | Spam detection, disease diagnosis
OLS | Regression coefficients | Parameter estimation
Time Series Analysis GTU 2023S Q2 OR / 2024W Q1c / 2025S Q4a
Time Series
A sequence of data points collected or recorded at successive time intervals. The order and time dependency of observations matters.

Examples: Daily stock prices, monthly sales, annual temperature, hourly website traffic, ECG signals.

Components of Time Series:

  1. Trend: Long-term upward or downward movement. E.g., population growing over decades.
  2. Seasonality: Regular repeating patterns tied to calendar time. E.g., higher ice cream sales in summer every year.
  3. Cyclical: Long-term wave-like fluctuations (not fixed periods). E.g., economic boom-bust cycles over several years.
  4. Irregular/Noise: Random, unpredictable variation. E.g., sudden spike in sales due to a viral social media post.

Time Series Analysis Techniques:

  1. Moving Average (MA): Smooths data by averaging over a window of time periods. Removes short-term fluctuations to reveal trends.
  2. Exponential Smoothing: Weighted average giving more importance to recent observations. Good for short-term forecasting.
  3. ARIMA (Auto-Regressive Integrated Moving Average):
    • AR: uses past values to predict future.
    • I: differencing to make data stationary.
    • MA: models error as a moving average of past errors.
  4. Seasonal Decomposition: Separates the series into Trend + Seasonal + Residual components.
  5. ACF/PACF: Autocorrelation and partial autocorrelation plots — used to identify the order of AR and MA components.
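
The first two techniques in the list are simple enough to sketch directly. The series and the smoothing factor `alpha` below are made-up values; the function names are illustrative.

```python
def moving_average(series, window):
    """Simple moving average: mean of each consecutive block of `window` values."""
    return [sum(series[i:i + window]) / window
            for i in range(len(series) - window + 1)]

def exponential_smoothing(series, alpha):
    """s_t = alpha * x_t + (1 - alpha) * s_(t-1), starting from the first observation."""
    smoothed = [float(series[0])]
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed

sales = [10, 20, 30, 40, 50]  # hypothetical monthly sales
print(moving_average(sales, 3))           # [20.0, 30.0, 40.0]
print(exponential_smoothing(sales, 0.5))  # [10.0, 15.0, 22.5, 31.25, 40.625]
```

A larger `alpha` makes the smoothed series react faster to recent observations; a smaller one smooths more heavily.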
Time series is different from regular regression because observations are NOT independent — each value depends on previous ones.
Chapter 2

Using Clustering to Subdivide Data

Clustering Basics GTU Multiple Years
Cluster
A cluster is a group of data objects that are similar to each other within the group and dissimilar to objects in other groups. Clustering is an unsupervised learning method — no pre-defined labels.

Goal: Discover natural groupings in data without labels.

Applications: Customer segmentation, document grouping, gene expression analysis, anomaly detection, image segmentation.

Characteristics of Good Clustering:

  1. High intra-cluster similarity (within group, objects are similar).
  2. Low inter-cluster similarity (between groups, objects are dissimilar).
  3. Scalable to large datasets.
  4. Ability to handle noise and outliers.
  5. Interpretable results.

Distance Measures:

Measure | Formula | Description
───────────────────────────
Euclidean | √Σ(xᵢ−yᵢ)² | Straight-line distance. Most common.
Manhattan | Σ|xᵢ−yᵢ| | Grid/city-block distance.
Minkowski | (Σ|xᵢ−yᵢ|ᵖ)^(1/p) | Generalizes Euclidean (p=2) and Manhattan (p=1).
Cosine | cos(θ) = (A·B)/(|A||B|) | Angle between vectors. Used in text mining.
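
The measures in the table can be written in a few lines each. This is a sketch with illustrative function names; note that the cosine function returns a similarity (1 = same direction), not a distance.

```python
from math import sqrt

def minkowski(a, b, p):
    """Minkowski distance of order p between two equal-length vectors."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def euclidean(a, b):
    return minkowski(a, b, 2)   # p = 2

def manhattan(a, b):
    return minkowski(a, b, 1)   # p = 1

def cosine_similarity(a, b):
    """cos(theta) = (A . B) / (|A| |B|); assumes neither vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

print(euclidean((0, 0), (3, 4)))  # 5.0
print(manhattan((0, 0), (3, 4)))  # 7.0
```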

Types of Clustering:

Type | Description | Example Algorithm
───────────────────────────
Partitioning | Divide n objects into k clusters | K-Means, K-Medoids (PAM)
Hierarchical | Build tree of clusters (dendrogram) | Agglomerative, Divisive
Density-based | Clusters = dense regions separated by low-density regions | DBSCAN
Grid-based | Quantize space into grid cells | STING, CLIQUE
Model-based | Assume data fits a statistical model | EM algorithm, Gaussian Mixture
K-Means Clustering Algorithm GTU 2023W Q4c / 2025W Q5c
K-Means partitions n data points into k clusters by minimizing the sum of squared distances from each point to its cluster centroid (mean).

Steps:

  1. Choose k (number of clusters).
  2. Randomly initialize k centroids.
  3. Assignment step: Assign each data point to the nearest centroid.
  4. Update step: Recalculate each centroid as the mean of all points assigned to it.
  5. Repeat steps 3–4 until centroids no longer change (convergence).
K-Means Numerical Example
Data points: A(1,1), B(2,1), C(4,3), D(5,4) | k=2
Initial centroids: C1=(1,1), C2=(5,4)
Iteration 1 — Assignment:
A(1,1): d(C1)=0, d(C2)=√25=5 → Cluster 1
B(2,1): d(C1)=1, d(C2)=√18=4.24 → Cluster 1
C(4,3): d(C1)=√13=3.6, d(C2)=√2=1.4 → Cluster 2
D(5,4): d(C1)=√25=5, d(C2)=0 → Cluster 2
Iteration 1 — Update centroids:
C1 = mean(A,B) = ((1+2)/2, (1+1)/2) = (1.5, 1)
C2 = mean(C,D) = ((4+5)/2, (3+4)/2) = (4.5, 3.5)
Iteration 2 — Reassign (check if same):
A(1,1): d(C1)=0.5, d(C2)=4.30 → Cluster 1 ✓
B(2,1): d(C1)=0.5, d(C2)=3.54 → Cluster 1 ✓
C(4,3): d(C1)=3.20, d(C2)=0.71 → Cluster 2 ✓
D(5,4): d(C1)=4.61, d(C2)=0.71 → Cluster 2 ✓
No change → Converged! Final: {A,B} and {C,D}
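
The worked example can be reproduced with a short implementation of the assignment and update steps. This is a sketch that takes the initial centroids as input (as in the numerical example) instead of choosing them randomly; `kmeans` and `max_iter` are illustrative names.

```python
from math import dist

def kmeans(points, centroids, max_iter=100):
    """Lloyd's algorithm: alternate assignment and update until centroids stop moving."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            nearest = min(range(len(centroids)), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its points
        new_centroids = [
            tuple(sum(coord) / len(cluster) for coord in zip(*cluster)) if cluster else centroids[i]
            for i, cluster in enumerate(clusters)
        ]
        if new_centroids == centroids:  # converged
            break
        centroids = new_centroids
    return centroids, clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]  # A, B, C, D
centroids, clusters = kmeans(points, [(1, 1), (5, 4)])
print(centroids)  # [(1.5, 1.0), (4.5, 3.5)]
print(clusters)   # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
```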

Limitations:

  • Must specify k in advance.
  • Sensitive to initial centroid placement.
  • Assumes spherical clusters of similar size.
  • Sensitive to outliers (outlier can pull centroid).
  • Not suitable for non-convex shaped clusters.
DBSCAN (Density-Based Clustering) GTU 2024W Q2b / 2025W Q3b
DBSCAN
Density-Based Spatial Clustering of Applications with Noise. Groups points that are closely packed together and marks low-density points as noise (outliers). Does NOT require specifying k in advance.

Key Parameters:

  • ε (epsilon): Radius of the neighborhood around a point.
  • MinPts: Minimum number of points required to form a dense region (core point).

Point Types:

  1. Core Point: Has at least MinPts points within distance ε (including itself).
  2. Border Point: Within ε of a core point, but has fewer than MinPts neighbors.
  3. Noise Point (Outlier): Not a core point and not within ε of any core point.

Algorithm:

  1. Start with an unvisited point.
  2. If it's a core point, create a new cluster and expand it by adding all density-reachable points.
  3. If it's a border point, assign to nearest cluster.
  4. If it's a noise point, label it as noise.
  5. Repeat for all unvisited points.
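
The algorithm above can be sketched in plain Python as follows. This is a simplified, brute-force version (every neighborhood query scans all points); the label value `NOISE = -1`, the function name, and the sample points are illustrative assumptions.

```python
from math import dist

NOISE = -1

def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster_id = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue                      # already visited
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE             # may later be relabelled as a border point
            continue
        cluster_id += 1                   # i is a core point: start a new cluster
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id    # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(j_neighbors)
    return labels

pts = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9), (5, 5)]
print(dbscan(pts, eps=1.5, min_pts=3))  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

The isolated point (5, 5) has no neighbors within ε and ends up as noise, while the two dense squares become separate clusters without k being specified.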

Advantages over K-Means: Discovers clusters of arbitrary shape, automatically finds number of clusters, handles noise/outliers well.

Disadvantage: Struggles with varying density clusters and high-dimensional data.

KDE — Kernel Density Estimation GTU 2023S Q3b / 2025W Q3b
KDE
A non-parametric method to estimate the probability density function of a dataset. Instead of using a histogram (hard bins), KDE places a smooth "kernel" (usually Gaussian) at each data point and sums them up.

How it works:

  1. Place a kernel (smooth bump function) centered at each data point.
  2. Sum all kernels — regions with many points have high density.
  3. Peaks in the density estimate suggest cluster centers.

Bandwidth (h): Controls smoothness. Small h → spiky/overfit. Large h → over-smooth. Chosen by cross-validation.

Use in Clustering: KDE identifies modes (peaks) in the density, which correspond to cluster centers. Useful for finding natural groups without specifying k.
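
A one-dimensional Gaussian KDE is short enough to write out: f(x) = (1 / (n·h·√(2π))) · Σ exp(−(x − xᵢ)² / (2h²)). The sketch below uses made-up samples with two obvious groups and a hand-picked bandwidth; the function name is illustrative.

```python
from math import exp, pi, sqrt

def gaussian_kde(data, h):
    """Return a function f(x) that estimates the density using Gaussian kernels of bandwidth h."""
    n = len(data)

    def density(x):
        total = sum(exp(-0.5 * ((x - xi) / h) ** 2) for xi in data)
        return total / (n * h * sqrt(2 * pi))

    return density

samples = [1.0, 1.2, 0.8, 5.0, 5.1, 4.9]  # hypothetical data with two groups
f = gaussian_kde(samples, h=0.5)
print(f(1.0) > f(3.0), f(5.0) > f(3.0))   # True True
```

The density is high near 1 and near 5 and low in between, so the two peaks (modes) mark the two candidate cluster centers.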

Hierarchical Clustering GTU 2024 Q3c / 2023W Q3c
Creates a tree of clusters called a dendrogram. Does not require specifying k. User can cut the dendrogram at any level to get any number of clusters.

1. Agglomerative (Bottom-up) — Most Common:

  1. Start: each point is its own cluster.
  2. Find the two closest clusters and merge them.
  3. Repeat until all points are in one cluster.
  4. Cut the dendrogram at desired level to get clusters.

Linkage Criteria (how to measure distance between clusters):

Linkage | Distance Between Clusters | Effect
───────────────────────────
Single | Min distance between any two points | Chaining effect, elongated clusters
Complete | Max distance between any two points | Compact, spherical clusters
Average | Average of all pairwise distances | Balanced compromise
Ward's | Minimizes total within-cluster variance | Compact, equal-size clusters
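
A naive agglomerative procedure with single linkage can be sketched as below. Instead of building a full dendrogram it simply stops merging once k clusters remain, which is equivalent to cutting the tree at that level; the function name and the reuse of the K-Means example points are illustrative choices.

```python
from math import dist

def single_linkage(points, k):
    """Merge the two closest clusters (single linkage) until only k clusters remain."""
    clusters = [[p] for p in points]          # start: every point is its own cluster
    while len(clusters) > k:
        best = None                           # (distance, i, j) of the closest pair so far
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Single linkage: minimum distance between any two members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))   # j > i, so index i stays valid
    return clusters

points = [(1, 1), (2, 1), (4, 3), (5, 4)]
print(single_linkage(points, 2))  # [[(1, 1), (2, 1)], [(4, 3), (5, 4)]]
```

Swapping `min` for `max` in the inner distance gives complete linkage.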

2. Divisive (Top-down): Start with all points in one cluster, recursively split into smaller clusters. Less common.

Strengths: No need to specify k, produces intuitive dendrogram, can capture hierarchical structure.
Weaknesses: Needs O(n²) memory and at least O(n²) time (O(n³) for the naive algorithm), so it scales poorly to large datasets; sensitive to noise; cannot undo a wrong merge.

Random Forest Algorithm GTU 2023W Q2c / 2024 Q2c / 2025S Q2c
Random Forest
An ensemble learning method that builds multiple decision trees during training and combines their outputs (majority vote for classification, average for regression). "Random" because each tree is built on a random subset of data and features.

How It Works:

  1. Bootstrap Sampling (Bagging): Create n random subsets of the training data (with replacement). Train one decision tree on each subset.
  2. Feature Randomness: At each node split, only a random subset of features is considered (typically √total_features for classification). This decorrelates the trees.
  3. Prediction:
    • Classification: Majority vote across all trees.
    • Regression: Average of all tree predictions.
Training Data
     │
     ├── Bootstrap Sample 1 → Decision Tree 1 → Prediction 1
     ├── Bootstrap Sample 2 → Decision Tree 2 → Prediction 2
     ├── Bootstrap Sample 3 → Decision Tree 3 → Prediction 3
     │        ...                  ...               ...
     └── Bootstrap Sample n → Decision Tree n → Prediction n
                                                      │
                                              Majority Vote / Average
                                                      │
                                               FINAL PREDICTION
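
The three ingredients in the diagram (bootstrap sampling, random feature subsets, and vote/average aggregation) can be sketched without a tree learner. All function names here are illustrative, and the tree-training step itself is deliberately left out.

```python
import random
from collections import Counter
from statistics import mean

def bootstrap_sample(rows, rng):
    """Bagging: draw len(rows) rows with replacement."""
    return [rng.choice(rows) for _ in rows]

def random_feature_subset(n_features, rng):
    """Pick about sqrt(n_features) distinct feature indices to consider at a split."""
    size = max(1, round(n_features ** 0.5))
    return sorted(rng.sample(range(n_features), size))

def majority_vote(predictions):
    """Classification: the most common class among the trees' predictions."""
    return Counter(predictions).most_common(1)[0][0]

def average_prediction(predictions):
    """Regression: the mean of the trees' predictions."""
    return mean(predictions)

rng = random.Random(0)                       # seeded only so runs are repeatable
sample = bootstrap_sample(list(range(10)), rng)
features = random_feature_subset(16, rng)    # 4 of 16 features
print(majority_vote(["spam", "ham", "spam"]))  # spam
```

Because sampling is with replacement, a bootstrap sample usually repeats some rows and omits others, which is what makes the individual trees differ.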

Key Hyperparameters:

  • n_estimators: Number of trees. More trees = better but slower.
  • max_features: Number of features considered at each split.
  • max_depth: Maximum depth of each tree (controls overfitting).

Advantages:

  • Handles high-dimensional data well.
  • Resistant to overfitting (due to averaging).
  • Can compute feature importance.
  • Works well even without parameter tuning.
  • Handles missing values and maintains accuracy for large datasets.

Disadvantages:

  • Not interpretable (black box).
  • Computationally expensive for very large forests.
  • Slower than a single decision tree.
GTU asks "Explain working of Random Forest" almost every year. Always include the diagram above and the 3-step process (Bootstrap → Tree → Vote).
Chapter 3

Modeling Instances (Classification & Nearest Neighbor)

Clustering vs Classification GTU 2023S Q1b / 2023W Q3b / 2024W Q2a
Feature | Clustering | Classification
───────────────────────────
Learning Type | Unsupervised (no labels) | Supervised (labeled data)
Goal | Discover natural groups | Assign data to known classes
Output | Groups/clusters (unlabeled) | Class labels
Prior Knowledge | No class labels needed | Training data with labels required
Process | Single step (no training) | Two-step: train model → predict
Example | Group customers by behavior | Identify if customer will churn (yes/no)
Algorithms | K-Means, DBSCAN, Hierarchical | KNN, Decision Tree, Random Forest

Classification — Two-Step Process:

  1. Training (Model Construction): A classifier is built using labeled training data. The model learns patterns that distinguish between classes.
  2. Testing (Prediction): The trained model is applied to new, unseen data to predict class labels. Accuracy is evaluated on the test set.
Overfitting — Problem & Solutions GTU 2023S Q3a / 2025S Q3a
Overfitting
Overfitting occurs when a model learns the training data too well — including its noise and random fluctuations — and performs poorly on new, unseen data. The model is too complex.
Underfitting (opposite)
Model is too simple to capture the underlying patterns. Performs poorly on both training and test data.

Signs of Overfitting: Very high training accuracy, much lower test accuracy. The gap between training error and test error is large.

Causes: Too many features, too deep a model, too small a training dataset, training for too long.

Methods to Avoid Overfitting:

  1. Cross-Validation: Use k-fold CV to evaluate model on multiple subsets of data.
  2. Pruning: In decision trees, remove branches that have little power to classify data.
  3. Regularization: Add a penalty for model complexity (L1/Lasso, L2/Ridge).
  4. Feature Selection: Remove irrelevant or redundant features.
  5. Get More Data: More training examples help the model generalize.
  6. Ensemble Methods: Combine many models (e.g., Random Forest) — averaging reduces variance.
  7. Early Stopping: In neural networks, stop training when validation error starts increasing.
  8. Dropout: In neural networks, randomly deactivate neurons during training.
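
For item 1, the core of k-fold cross-validation is just index bookkeeping: split the sample indices into k folds and let each fold serve once as the test set. A minimal sketch without shuffling; the generator name is illustrative.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs; each sample appears in exactly one test fold."""
    # Spread any remainder over the first folds so sizes differ by at most 1
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train_idx, test_idx
        start += size

folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print(len(train_idx), len(test_idx))
```

The model is trained k times, once per pair, and the k test scores are averaged; a large gap between training and averaged test score signals overfitting.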
Overgeneralization (opposite problem)
Model generalizes too broadly and misses important distinctions. E.g., model classifies all animals as "dog" because most training examples were dogs.
Nearest Neighbor Analysis GTU 2023W Q3c / 2024W Q3b
Nearest Neighbor Analysis
A technique that identifies the closest data points to a query point based on a distance metric. Used to analyze spatial distribution, classify new points, or find similar records.

Basic Concept: Given a new unknown point, find the k most similar (nearest) known points and use their characteristics to predict or describe the unknown.

Applications:

  • Geographic analysis (nearest hospital, nearest store).
  • Recommendation systems (users similar to you liked X).
  • Anomaly detection (point with no near neighbors = outlier).
  • Data imputation (fill missing values with nearest neighbor's value).

Types of NN Analysis:

  1. 1-NN: Find the single nearest neighbor.
  2. K-NN: Find k nearest neighbors and vote.
  3. Average NN: Use average of multiple neighbors.
  4. Radius-NN: All points within a distance r.
Average Nearest Neighbor (ANN) Algorithm GTU 2023S Q3c / 2025S Q3b
ANN Algorithm
Instead of using the single nearest neighbor, ANN calculates the average distance to all neighbors (or a set number of neighbors) and uses the average for classification or prediction. Reduces the impact of noisy/outlier neighbors.

How ANN Works:

  1. For a new query point q, find all neighbors within a radius or find k nearest neighbors.
  2. Compute the average feature values (or average distance) of those neighbors.
  3. Assign the class that is most represented on average.
  4. Or, for regression, predict the average target value of the neighbors.

Example:

ANN Example
Training data (with class labels):
P1(1,2) → Class A | P2(2,3) → Class A | P3(3,1) → Class B
P4(5,4) → Class B | P5(6,5) → Class B
Query point: Q(3,3), k=3
d(Q,P1) = √(4+1) = 2.24
d(Q,P2) = √(1+0) = 1.00
d(Q,P3) = √(0+4) = 2.00
d(Q,P4) = √(4+1) = 2.24
d(Q,P5) = √(9+4) = 3.61
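3 nearest: P2(A,1.00), P3(B,2.00), P1(A,2.24)
Note: P1 and P4 tie at 2.24 for the third slot; P1 is taken here (first in dataset order). Choosing P4 instead would give Class B 2 votes, so state your tie-breaking rule in the answer.
Average distance = (1.00+2.00+2.24)/3 = 1.75
Class A has 2 votes, Class B has 1 vote → Classify Q as Class A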

ANN vs Standard NN: Standard NN uses only the single closest point and is sensitive to noise. ANN uses multiple neighbors, making it more robust. However, ANN requires more computation.

K-Nearest Neighbor (KNN) Algorithm GTU 2024 Q3c / 2024W Q2c / 2025S Q2 OR / 2025W Q3c
K-Nearest Neighbor
A simple, non-parametric, lazy learning algorithm. For classification: assigns the class of the majority of k nearest training points. For regression: predicts the average value of k nearest points. No explicit training phase — it memorizes all training data.

Algorithm Steps:

  1. Choose the value of k.
  2. Calculate the distance from the new point to every training point.
  3. Sort the distances and identify k nearest neighbors.
  4. For classification: take majority vote among k neighbors.
  5. For regression: compute average of k neighbors' values.
KNN Numerical Example (Classification, k=3)
Training dataset:
ID | Height | Weight | Class
───────────────────────────
P1 | 150 | 50 | Thin
P2 | 160 | 60 | Normal
P3 | 170 | 70 | Normal
P4 | 180 | 90 | Fat
P5 | 190 | 100 | Fat
New person: Height=165, Weight=65. k=3
Euclidean distances:
d(P1) = √((165−150)²+(65−50)²) = √(225+225) = 21.2
d(P2) = √((165−160)²+(65−60)²) = √(25+25) = 7.07
d(P3) = √((165−170)²+(65−70)²) = √(25+25) = 7.07
d(P4) = √((165−180)²+(65−90)²) = √(225+625) = 29.2
d(P5) = √((165−190)²+(65−100)²) = √(625+1225)= 43.0
3 Nearest: P2(Normal,7.07), P3(Normal,7.07), P1(Thin,21.2)
Votes: Normal=2, Thin=1
Prediction: New person is → Normal
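
The numerical example maps directly onto the algorithm steps. A minimal sketch on the same five training rows, using unscaled features exactly as in the hand calculation; `knn_classify` is an illustrative name.

```python
from collections import Counter
from math import dist

def knn_classify(train, query, k):
    """Majority vote among the k training rows closest to `query` (Euclidean distance)."""
    neighbors = sorted(train, key=lambda row: dist(row[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

train = [
    ((150, 50), "Thin"),
    ((160, 60), "Normal"),
    ((170, 70), "Normal"),
    ((180, 90), "Fat"),
    ((190, 100), "Fat"),
]
print(knn_classify(train, (165, 65), k=3))  # Normal
```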

Choosing k:

  • Small k (k=1): Very sensitive to noise, complex boundary.
  • Large k: Smoother boundary, may include irrelevant neighbors.
  • Rule of thumb: k = √n. Always choose odd k for binary classification (avoid ties).
  • Best: Use cross-validation to find optimal k.
Feature | KNN
───────────────────────────
Type | Supervised, lazy learning
Training | None (memorizes data)
Prediction | Slow (computes all distances)
Advantages | Simple, no training time, naturally handles multi-class
Disadvantages | Slow prediction, sensitive to irrelevant features, needs feature scaling
Applications | Recommendation systems, medical diagnosis, image recognition
KNN is asked in virtually every GTU exam. Always draw the distance table clearly, show the 3 nearest, and state the majority vote. Feature scaling (normalization) is important — mention it.

Feature Scaling in KNN: Since KNN is distance-based, attributes with larger ranges dominate. Always normalize features before applying KNN. Use Min-Max normalization: x' = (x − min) / (max − min).
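
Min-Max normalization is a one-liner per feature. The sketch below scales the Height column from the KNN example; the guard for a constant column is an added assumption, since the formula is undefined when max = min.

```python
def min_max_scale(values):
    """Rescale values to [0, 1] using x' = (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant column: formula undefined, map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

heights = [150, 160, 170, 180, 190]
print(min_max_scale(heights))  # [0.0, 0.25, 0.5, 0.75, 1.0]
```

Each feature is scaled separately, and the min and max learned from the training data should also be used to scale new query points.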

Real-World Problems — E-commerce Applications GTU 2023S Q1c / 2023W Q5c

Data Science for E-commerce Growth:

  1. Customer Segmentation (Clustering): K-Means segments customers by RFM (Recency, Frequency, Monetary value). Different marketing strategies for each segment (VIP customers, discount-seekers, churned customers).
  2. Recommendation Systems (KNN / Collaborative Filtering): "Customers like you also bought X." Netflix-style recommendations based on user similarity. Increases average order value.
  3. Churn Prediction (Classification): Predict which customers are likely to stop buying. Logistic regression / Random Forest on behavioral data. Proactive retention: targeted discounts, emails.
  4. Fraud Detection (Outlier Detection / Classification): Unusual transactions flagged as potential fraud. Isolation Forest, DBSCAN for anomaly detection. Saves revenue and protects customers.
  5. Demand Forecasting (Time Series): ARIMA/LSTM models predict future demand. Optimizes inventory, reduces stockouts and overstock.
  6. A/B Testing (Inferential Statistics): Test two versions of a webpage and determine which converts better. Statistical significance testing (p-values, confidence intervals).
  7. Price Optimization (Regression): Predict optimal prices using regression models on competitor prices, demand elasticity, and seasonality.
  8. AARRR Funnel (E-commerce Metrics):
    • Acquisition: How do users find you?
    • Activation: Do they have a good first experience?
    • Retention: Do they come back?
    • Referral: Do they tell others?
    • Revenue: Do they pay?
For e-commerce questions, always mention: customer segmentation, churn prediction, recommendation systems, and fraud detection — these 4 cover most GTU answers.