
🧠 K-Nearest Neighbors (KNN) on the Diabetes Dataset

This notebook demonstrates how to build a KNN model on the Pima Indians Diabetes Dataset using scikit-learn. We'll go through data cleaning, preprocessing, model building, and evaluation.


📊 1. Import Libraries

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

📥 2. Load the Dataset

df = pd.read_csv('diabetes.csv')
df.head()

🔍 3. Missing Data Handling

cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

After replacing medically invalid 0s with NaN in Glucose, BloodPressure, SkinThickness, Insulin, and BMI (a blanket df.replace(0, np.nan) would also wipe out legitimate zeros in Pregnancies and the Outcome label, so we restrict the replacement to those columns):

df.isnull().sum()

📌 Observation & Decision

  • Insulin had more than 300 missing values out of 768 rows, roughly 40% of the data!
  • The other columns had only minor missing counts.

💡 Decision:
We decided to fill the missing values with the column median.
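A quick toy illustration of why the median is the safer choice here: skewed features like Insulin contain extreme values that drag the mean far away from a typical patient, while the median stays put. The numbers below are invented for the demonstration.

```python
import pandas as pd

# Hypothetical skewed sample: typical insulin readings plus one extreme outlier
s = pd.Series([90, 100, 110, 120, 846])

print(s.mean())    # 253.2 -- dragged upward by the single outlier
print(s.median())  # 110.0 -- still a typical value
```

Imputing with the mean would inject an implausibly high value into every missing row; the median keeps imputed patients looking like typical patients.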

# Columns where a 0 is not a medically valid value
cols_with_zeros = ['Glucose', 'BloodPressure', 'SkinThickness', 'Insulin', 'BMI']

# Replace invalid 0s with NaN in those columns only
df[cols_with_zeros] = df[cols_with_zeros].replace(0, np.nan)

# Fill missing values with the median (safe for skewed medical data)
df[cols_with_zeros] = df[cols_with_zeros].fillna(df[cols_with_zeros].median())

# Confirm again -- the missing counts should all be zero
print(df.isnull().sum())

🧼 4. Feature Scaling

Before feeding the data into our model, we need to standardize the feature values. That's because K-Nearest Neighbors is a distance-based algorithm: it classifies a point by computing the distance (usually Euclidean) to other data points.

But imagine this:

  • "Insulin" values range from 0 to 800+
  • "BMI" is around 20–50
  • "Age" ranges from 20 to 80
  • "Glucose" might be around 70–200

➡️ Without scaling, features with larger ranges dominate the distance calculation, and that's not fair.
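A small numeric sketch of that domination effect, with made-up patient values: the Insulin gap swamps the BMI and Age gaps in the raw Euclidean distance.

```python
import numpy as np

# Two hypothetical patients, feature order: [Insulin, BMI, Age]
a = np.array([80.0, 25.0, 30.0])
b = np.array([480.0, 27.0, 33.0])

diff = a - b
print(np.sqrt((diff ** 2).sum()))          # ~400.02, essentially just the Insulin gap
print((diff[0] ** 2) / (diff ** 2).sum())  # Insulin's share of the squared distance
```

Insulin alone contributes over 99.9% of the squared distance here, so BMI and Age would be effectively invisible to KNN.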

✅ Solution: StandardScaler. We scale all features to have:

  • Mean = 0
  • Standard deviation = 1

This ensures every feature contributes equally to the KNN distance calculation.

X = df.drop('Outcome', axis=1)
y = df['Outcome']

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
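On a toy feature matrix with wildly different ranges (values invented just for the check), we can verify what fit_transform produces: every column ends up with mean 0 and standard deviation 1.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical mini feature matrix: column 0 spans hundreds, column 1 spans tens
X = np.array([[80.0, 25.0],
              [200.0, 30.0],
              [480.0, 45.0]])

X_scaled = StandardScaler().fit_transform(X)
print(X_scaled.mean(axis=0))  # ~[0. 0.]
print(X_scaled.std(axis=0))   # [1. 1.]
```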

🧪 5. Train-Test Split

X_train, X_test, y_train, y_test = train_test_split(X_scaled, y, test_size=0.2, random_state=42)
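One optional refinement, since only roughly a third of the patients in this dataset are diabetic: passing stratify=y to train_test_split keeps the class ratio identical in both splits, so the test set is not accidentally easier or harder than the training set. A toy sketch with made-up labels:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 70 negatives, 30 positives
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 70 + [1] * 30)

# stratify=y preserves the 30% positive rate in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y)

print(y_te.mean())  # 0.3 -- same positive rate as the full label set
```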

🤖 6. KNN Model Training and Evaluation

We try values of k from 1 to 20 and pick the one with the best test accuracy.

for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    score = model.score(X_test, y_test)
    print("k=%d, accuracy=%f" % (k, score))

📈 Best Result: k=11 with the highest accuracy, ~0.759
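One caveat: picking k by test-set accuracy quietly tunes the model on the test set. A more robust variant selects k by cross-validated accuracy on the training data and touches the test set only once at the end. Sketched below on synthetic data (a stand-in, since it only illustrates the pattern):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the scaled training data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)

# Mean 5-fold CV accuracy for each candidate k
cv_means = {
    k: cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5).mean()
    for k in range(1, 21)
}
best_k = max(cv_means, key=cv_means.get)
print(best_k, round(cv_means[best_k], 3))
```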


📌 7. Final Model with Best k

model = KNeighborsClassifier(n_neighbors=11)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

📉 8. Evaluation Metrics

print("Accuracy:", accuracy_score(y_test, y_pred))
print("Classification Report:\n", classification_report(y_test, y_pred))
print("Confusion Matrix:\n", confusion_matrix(y_test, y_pred))

📊 9. Accuracy Plot for k Values

scores = []
for k in range(1, 21):
    model = KNeighborsClassifier(n_neighbors=k)
    model.fit(X_train, y_train)
    scores.append(model.score(X_test, y_test))

plt.plot(range(1, 21), scores)
plt.xlabel("k")
plt.ylabel("Accuracy")
plt.title("Accuracy vs K value")
plt.grid()
plt.show()

✅ Summary

  • Preprocessed the data and flagged medically invalid zeros as missing
  • Filled the missing values with the column median
  • Scaled the features and split the dataset
  • Tuned the KNN model to find the best k
  • Achieved ~76% accuracy on the test set
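One subtlety worth flagging for a future iteration: above, the scaler was fit on the full dataset before splitting, so the test rows leaked their statistics into the scaling. Wrapping the scaler and the classifier in a sklearn Pipeline fits the scaler on the training data only. A sketch on synthetic stand-in data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the cleaned diabetes data
X, y = make_classification(n_samples=300, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# StandardScaler is fit on X_train only, inside the pipeline
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=11))
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))
```

The pipeline also plays nicely with cross-validation, since each fold gets its own freshly fit scaler.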

This post is licensed under CC BY 4.0 by the author.