K-Nearest Neighbors (KNN) on the Diabetes Dataset
This notebook demonstrates how to build a KNN model on the Pima Indians Diabetes Dataset using Scikit-learn. We'll go through data cleaning, preprocessing, model building, and evaluation.
1. Import Libraries

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report
```
2. Load the Dataset

```python
df = pd.read_csv('diabetes.csv')
df.head()
```
3. Missing Data Handling

```python
# Treat 0 as missing only in the columns where 0 is medically invalid.
# A blanket df.replace(0, np.nan) would also wipe out legitimate zeros
# in Pregnancies and in the Outcome label.
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)
```
After replacing 0s with NaN for medically invalid entries in columns like Glucose, BloodPressure, SkinThickness, Insulin, BMI:
Observation & Decision
- Insulin had more than 300 missing values out of 768 rows (~40% missing!).
- The other columns had only minor missing entries.

Decision: we fill the missing values with each column's median.
```python
# List of columns where 0 is medically invalid
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace invalid 0s with NaNs in those columns only
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

# Fill missing values with the median (safe for skewed medical data)
df[cols_with_zeros] = df[cols_with_zeros].fillna(df[cols_with_zeros].median())

# Confirm again -- the missing counts should all be zero
print(df.isnull().sum())
```
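The choice of the median over the mean matters here: a skewed column like Insulin contains large outliers that drag the mean upward. A small self-contained sketch (the values below are made up for illustration, not taken from the dataset):

```python
import numpy as np
import pandas as pd

# Hypothetical insulin-like sample with a right skew and one missing value
insulin_sample = pd.Series([15, 76, 83, 94, 105, 130, 480, 846, np.nan])

mean_val = insulin_sample.mean()      # pulled upward by the outliers: 228.625
median_val = insulin_sample.median()  # robust to the extremes: 99.5

# Filling with the median keeps the imputed value near the bulk of the data
filled = insulin_sample.fillna(median_val)
print(f"mean={mean_val:.1f}, median={median_val:.1f}")
```

Because the median sits near the typical values, imputing with it avoids inflating the distribution the way mean imputation would.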
4. Feature Scaling
Before feeding the data into our model, we need to normalize the feature values. That's because K-Nearest Neighbors is a distance-based algorithm: it calculates the distance (usually Euclidean) between data points to classify them.
But imagine this:
- Insulin values range from 0 to 800+
- BMI is around 20–50
- Age ranges from 20 to 80
- Glucose might be around 70–200

Without scaling, features with larger ranges dominate the distance calculation, and that's not fair to the smaller-scale features.
Solution: StandardScaler. We scale all features to have:
- Mean = 0
- Standard deviation = 1
This ensures every feature contributes equally to the KNN distance calculation.
```python
X = df.drop('Outcome', axis=1)
y = df['Outcome']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
```
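As a quick sanity check, we can confirm what StandardScaler does on data with ranges like those above. A self-contained sketch using synthetic stand-ins for the Insulin and BMI columns (not the actual dataset):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
X_demo = np.column_stack([
    rng.uniform(0, 800, size=200),   # "Insulin"-like: range 0-800
    rng.uniform(20, 50, size=200),   # "BMI"-like: range 20-50
])

X_demo_scaled = StandardScaler().fit_transform(X_demo)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so neither feature dominates a Euclidean distance.
print("means:", X_demo_scaled.mean(axis=0).round(6))
print("stds: ", X_demo_scaled.std(axis=0).round(6))
```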
5. Train-Test Split

```python
X_train, X_test, y_train, y_test = train_test_split(
    X_scaled, y, test_size=0.2, random_state=42
)
```
6. KNN Model Training and Evaluation
We try each value of k from 1 to 20 and keep the one with the highest test accuracy.
```python
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print(f"k={k}, accuracy={score:.3f}")
```
Best Result: k=11 with the highest accuracy, ~0.759
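One caveat: choosing k by test-set accuracy lets information from the test set leak into model selection. A more robust alternative (not part of the original notebook) is to cross-validate on the training data only. A self-contained sketch on synthetic classification data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data (hypothetical),
# so this sketch runs on its own.
X_train_demo, y_train_demo = make_classification(
    n_samples=300, n_features=8, random_state=42
)

# Mean 5-fold cross-validated accuracy for each candidate k
cv_scores = {
    k: cross_val_score(
        KNeighborsClassifier(n_neighbors=k),
        X_train_demo, y_train_demo, cv=5,
    ).mean()
    for k in range(1, 21)
}

best_k = max(cv_scores, key=cv_scores.get)
print("best k by cross-validation:", best_k)
```

The test set is then touched only once, to report the final score of the chosen model.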
7. Final Model with Best k

```python
model = KNeighborsClassifier(n_neighbors=11)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
```
8. Evaluation Metrics

```python
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))
```
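Since seaborn is already imported, the confusion matrix can also be visualized as a heatmap. A sketch with hypothetical labels standing in for y_test and y_pred:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import confusion_matrix

# Hypothetical true/predicted labels, for illustration only
y_true_demo = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred_demo = [0, 1, 1, 1, 0, 0, 0, 1]

cm = confusion_matrix(y_true_demo, y_pred_demo)

# annot=True writes the count in each cell; fmt="d" keeps them as integers
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
            xticklabels=["No Diabetes", "Diabetes"],
            yticklabels=["No Diabetes", "Diabetes"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
```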
9. Accuracy Plot for k values

```python
scores = []
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

plt.plot(range(1, 21), scores)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Accuracy vs K value")
plt.grid()
plt.show()
```
Summary
- Preprocessed the data and handled the medically invalid zeros
- Replaced the missing data with the column medians
- Scaled the features and split the dataset
- Tuned the KNN model to find the best k
- Achieved ~76% accuracy on the test set