Classification Algorithms in Machine Learning: Comparing ML Models

As we know, machine learning algorithms are divided into supervised and unsupervised learning. Classification algorithms are a part of supervised learning, where we train the machine learning model on labeled output data and then test the model's performance against held-out test data. There are many classification algorithms available in machine learning libraries, but for this article we will use Logistic Regression, KNN, SVM, Decision Tree, Random Forest, and XGBoost in Python 3+ and compare the results. A classification output means the output is either True or False, Yes or No, Positive or Negative, or 0 or 1.

For this article, we are using a sample fertility dataset:

https://www.kaggle.com/gabbygab/fertility-data-set

Dataset description:-

100 volunteers provided a semen sample analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.

The input features (X) are the season, age, childish diseases, accident or serious trauma, surgical intervention, high fevers in the last year, frequency of alcohol consumption, smoking habit, and number of hours spent sitting per day; the predicted output is the diagnosis, Normal or Altered.

Let’s build a machine learning model that predicts how these conditions affect the sperm diagnosis result.

Let’s import the necessary libraries: pandas for importing data, numpy for mathematical calculations, sklearn for the ML models, and matplotlib and seaborn for plotting.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline
import warnings
warnings.filterwarnings('ignore')

Import Fertility dataset

Let’s import the dataset by reading fertility.csv with the help of the pandas library, store it in an object df, and have a look at the top 5 rows of the dataset.

df = pd.read_csv('fertility.csv')
df.head()

Machine learning feature engineering

Now that we have seen a few rows of the dataset, let’s find the unique values in each column to understand which columns hold categorical values that we need to convert during the feature engineering phase.

for col in df.columns.values:
    print(col, df[col].unique())
Season ['spring' 'fall' 'winter' 'summer']
Age [30 35 27 32 36 29 33 28 31 34]
Childish diseases ['no' 'yes']
Accident or serious trauma ['yes' 'no']
Surgical intervention ['yes' 'no']
High fevers in the last year ['more than 3 months ago' 'less than 3 months ago' 'no']
Frequency of alcohol consumption ['once a week' 'hardly ever or never' 'several times a week'
 'several times a day' 'every day']
Smoking habit ['occasional' 'daily' 'never']
Number of hours spent sitting per day [ 16   6   9   7   8   5   2  11   3 342  14  18  10   1]
Diagnosis ['Normal' 'Altered']

By running the info command we can find out which columns have missing values and each column's data type (int64, object, float, etc.). Fortunately we don’t have any missing or null values in our dataset.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
Season                                   100 non-null object
Age                                      100 non-null int64
Childish diseases                        100 non-null object
Accident or serious trauma               100 non-null object
Surgical intervention                    100 non-null object
High fevers in the last year             100 non-null object
Frequency of alcohol consumption         100 non-null object
Smoking habit                            100 non-null object
Number of hours spent sitting per day    100 non-null int64
Diagnosis                                100 non-null object
dtypes: int64(2), object(8)
memory usage: 7.9+ KB

Now assign all input feature columns from df, by dropping the Diagnosis column, to the object x, and the output Diagnosis column to the object y.

x = df.drop('Diagnosis',axis=1)
y = df['Diagnosis']

Let’s see the shape of x and y to get the number of rows and columns.

x.shape,y.shape
((100, 9), (100,))

Now we need to convert the categorical values to numerical values, since the classification algorithms require numerical input to fit and predict. There are two popular ways to achieve this: a one-hot encoder or the get_dummies method. We are using the pd.get_dummies method for this conversion since it is simple, and we drop the first dummy column of each feature to avoid the dummy variable trap.

# Convert the binary yes/no columns to a single 0/1 dummy column each
x['Childish diseases'] = pd.get_dummies(x['Childish diseases'], drop_first=True)
x['Accident or serious trauma'] = pd.get_dummies(x['Accident or serious trauma'], drop_first=True)
x['Surgical intervention'] = pd.get_dummies(x['Surgical intervention'], drop_first=True)
# One-hot encode the multi-category columns, dropping the first level of each
Season = pd.get_dummies(x['Season'],drop_first=True)
High_fevers = pd.get_dummies(x['High fevers in the last year'],drop_first=True)
alcohol_consumption = pd.get_dummies(x['Frequency of alcohol consumption'],drop_first=True)
Smoking = pd.get_dummies(x['Smoking habit'],drop_first=True)

# Add the Season, High_fevers, alcohol_consumption and Smoking dummy columns to the x dataframe
df_new = pd.concat([x,Season,High_fevers,alcohol_consumption,Smoking],axis=1)

# Show top 5 rows with new columns
df_new.head()

Since we have already added the dummy columns for Season, High_fevers, alcohol_consumption, and Smoking to the dataframe df_new, it’s time to remove the original columns by dropping them with axis=1.

# Drop original columns

df_new = df_new.drop('High fevers in the last year',axis=1)
df_new = df_new.drop('Frequency of alcohol consumption',axis=1)
df_new = df_new.drop('Smoking habit',axis=1)
df_new = df_new.drop('Number of hours spent sitting per day',axis=1)
x = df_new.drop('Season',axis=1)

# Let's look at the new rows and columns
x.head()


Since our x dataframe has been fully converted from categorical to numerical values, it’s time to convert the y column. We could use the same get_dummies method, but for simplicity I am using the replace function, replacing 'Normal' with 0 and 'Altered' with 1.

# Convert categorical values to numerical values
#y = pd.get_dummies(y,drop_first=True)

y = y.replace('Normal',0)
y = y.replace('Altered',1)

After converting the y column we can clearly see that our output is imbalanced: 88 Normal (0) and only 12 Altered (1). In this condition the ML model has plenty of data points to learn the Normal class but very few data points to learn the Altered class.
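The counting and plotting code is not shown here; a minimal sketch, reusing the pandas, seaborn and matplotlib imports from above, could look like this:

# Check the class balance of the target column (88 Normal vs 12 Altered, per the text)
print(y.value_counts())

# Visualise the imbalance with a simple count plot
sns.countplot(x=y)
plt.title('Diagnosis: 0 = Normal, 1 = Altered')
plt.show()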


Split data into train and test

Now it’s time to split our dataframe into train and test sets, since the ML model will learn from the training dataset and then predict the output for the unseen test dataset. Since our dataset has relatively few rows, we are using 80% of the data (randomly selected) for training and 20% for testing; we can accomplish this with sklearn's train_test_split.

# Splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split
x_train,x_test,y_train,y_test = train_test_split(x,y, test_size = 0.2,random_state = 26)

# Keep a copy of x_test (before scaling) for later comparison
x_test_data=x_test

Since our data contains parameters on different scales, with Age measured in years while the other parameters are just 0 and 1, it is possible that the Age parameter dominates the other values, so we need to bring everything to a common scale by applying StandardScaler to all columns.
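The scaling code itself is not shown in the original; a minimal sketch using sklearn's StandardScaler (fitted on the training split only, then applied to both splits) could look like this:

# Scale all columns so that Age does not dominate the 0/1 dummy features
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
x_train = scaler.fit_transform(x_train)   # fit the scaler on training data only
x_test = scaler.transform(x_test)         # apply the same scaling to the test data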

(Plot: box plot of all feature columns after scaling)

SMOTE Oversampling 

Once our data points are scaled, it’s time to handle the class imbalance. For that we are using the SMOTE module from the imblearn.over_sampling package, which synthetically oversamples the minority class, so that after applying SMOTE the total counts of 0 and 1 are equal.
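The SMOTE step itself is not shown; a minimal sketch, assuming the imbalanced-learn package is installed (pip install imbalanced-learn) and using an illustrative random_state, could be:

# Synthetically oversample the minority class (Altered) in the training set only
from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=26)
x_train, y_train = sm.fit_resample(x_train, y_train)

# After resampling, both classes should have the same count
print(pd.Series(y_train).value_counts())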


Machine learning classification models

Finally, after all the data cleaning, scaling, and other steps, we reach the point where we test the performance of our machine learning algorithms by fitting them to the training data.

# import ML models from sklearn.
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier 
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For XgBoost first install xgboost using pip
import xgboost as xgb

# Assign ML model to object
lr = LogisticRegression(solver='lbfgs',C=10,max_iter=10)
clf = SVC() 
knn =  KNeighborsClassifier(n_neighbors=20)
dtree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
rfc = RandomForestClassifier(max_depth=5,n_estimators=2)
xg = xgb.XGBClassifier(max_depth=3)

Now let’s fit all the ML models to the x_train and y_train datasets, calculate the total time taken to fit each model, and compute the F1 score for every model. Since our dataset is very small, the ML models do not take much time to fit, but for a large dataset it can take hours or even days to fit a model and compute its F1 score.

# Let's fit each ML model to x_train and y_train and measure the total time taken to fit it
import time

def scorer(*models):
    for model in models:
        start = time.time()

        model.fit(x_train,y_train)

        print (model.__class__.__name__, 'F1 score =', f1_score(y_test,model.predict(x_test)))
        end = time.time()
        temp = end-start
        hours = temp//3600
        temp = temp - 3600*hours
        minutes = temp//60
        seconds = temp - 60*minutes
        print (model.__class__.__name__, 'Total time taken to Fit =','%d:%d:%d' %(hours,minutes,seconds))

scorer(lr,clf,knn,dtree,rfc,xg)
LogisticRegression F1 score = 0.25
LogisticRegression Total time taken to Fit = 0:0:0
SVC F1 score = 0.0
SVC Total time taken to Fit = 0:0:0
KNeighborsClassifier F1 score = 0.375
KNeighborsClassifier Total time taken to Fit = 0:0:0
DecisionTreeClassifier F1 score = 0.39999999999999997
DecisionTreeClassifier Total time taken to Fit = 0:0:0
RandomForestClassifier F1 score = 0.2
RandomForestClassifier Total time taken to Fit = 0:0:0
XGBClassifier F1 score = 0.3333333333333333
XGBClassifier Total time taken to Fit = 0:0:0

Now that all our ML models have learned from the training data points, it’s time to evaluate model performance by predicting the model output and comparing it with the actual test output. For that we can use the sklearn metrics library to calculate Accuracy, Precision, Recall, and RMSE.

#Model predictions
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix

from sklearn import metrics
for i in (lr,clf,knn,dtree,rfc,xg):
    
    y_pred = i.predict(x_test)
    
    print('\n')
    print(i.__class__.__name__)
    print(confusion_matrix(y_test, y_pred))
    print('accuracy:',accuracy_score(y_test, y_pred))
    print('Precision:',precision_score(y_test, y_pred))
    print('Recall:',recall_score(y_test, y_pred))
    print('RMSE:',np.sqrt(mean_squared_error(y_test, y_pred)))
    print('#############################################################')
LogisticRegression
[[13  3]
 [ 3  1]]
accuracy: 0.7
Precision: 0.25
Recall: 0.25
RMSE: 0.5477225575051661
#############################################################

SVC
[[15  1]
 [ 4  0]]
accuracy: 0.75
Precision: 0.0
Recall: 0.0
RMSE: 0.5
#############################################################

KNeighborsClassifier
[[7 9]
 [1 3]]
accuracy: 0.5
Precision: 0.25
Recall: 0.75
RMSE: 0.7071067811865476
#############################################################

DecisionTreeClassifier
[[8 8]
 [1 3]]
accuracy: 0.55
Precision: 0.2727272727272727
Recall: 0.75
RMSE: 0.6708203932499369
#############################################################

RandomForestClassifier
[[11  5]
 [ 3  1]]
accuracy: 0.6
Precision: 0.16666666666666666
Recall: 0.25
RMSE: 0.6324555320336759
#############################################################

XGBClassifier
[[ 5 11]
 [ 1  3]]
accuracy: 0.4
Precision: 0.21428571428571427
Recall: 0.75
RMSE: 0.7745966692414834
#############################################################

As we can see from the model output, Logistic Regression and Random Forest give decent accuracy along with non-zero precision and recall, and if we look at the confusion matrices the true positive and true negative counts of the Logistic Regression and Random Forest models support this.

LogisticRegression
[[13 3]
[ 3 1]]

RandomForestClassifier
[[11 5]
[ 3 1]]

So, calculating manually, out of the 20 test points there are 16 Normal (0) and 4 Altered (1). Logistic Regression predicts 13 true negatives (Normal) and 1 true positive (Altered); similarly, Random Forest predicts 11 true negatives (Normal) and 1 true positive (Altered), but both models also misclassify some points.
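For example, the four cells of the Logistic Regression confusion matrix can be unpacked directly; a small illustrative snippet, reusing the lr model and the test split from above:

# Unpack the confusion matrix, which sklearn lays out as [[TN, FP], [FN, TP]]
tn, fp, fn, tp = confusion_matrix(y_test, lr.predict(x_test)).ravel()
print('True negatives (Normal):', tn)    # 13 for Logistic Regression
print('True positives (Altered):', tp)   # 1 for Logistic Regression
print('Misclassified points:', fp + fn)  # 6 for Logistic Regression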


Compare Actual Vs Model prediction

We can do a side-by-side comparison between the actual values and the model's predictions.

comp = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
comp

Index Actual  Predicted
84	1	0
26	1	1
99	0	0
14	0	0
40	0	0
75	0	0
5	0	0
2	0	0
60	0	1
9	0	0
45	0	0
69	0	1
50	0	0
1	1	0
43	0	0
32	0	0
4	1	0
87	0	0
44	0	0
33	0	1
actual vs prediction model output

As we can see from the comparison above, at index 26 both the actual value and the model prediction are 1 (Altered), i.e. a correctly predicted true positive.

We can combine x_test_data and y_test using pd.concat and save the result to a data variable, then compare it against the y_pred results to understand which rows the model fails on, and how we can improve model performance so that the model learns the conditions behind the true positive (Altered) cases.
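The combination code is not shown in the original; a minimal sketch (the 'Predicted' column name and the choice of the Logistic Regression model are just illustrative) could be:

# Combine the original (unscaled) test features with the actual labels
data = pd.concat([x_test_data, y_test], axis=1)

# Add the predictions of the model being inspected (here Logistic Regression, as an example)
data['Predicted'] = lr.predict(x_test)
data.head()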


Practice Exercise:- It’s time to do some hands-on projects on machine learning classification problems using the datasets at:

https://archive.ics.uci.edu/ml/datasets.php

Wrapping up:- In this tutorial we saw how classification algorithms in machine learning work and compared several machine learning models; we can further use parameter tuning to get better results.
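As a starting point for that tuning, a minimal GridSearchCV sketch for the Random Forest model (the parameter grid and random_state are just illustrative) could look like this:

# Example of hyperparameter tuning with GridSearchCV
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [10, 50, 100], 'max_depth': [3, 5, 7]}
grid = GridSearchCV(RandomForestClassifier(random_state=26), param_grid, scoring='f1', cv=5)
grid.fit(x_train, y_train)
print('Best parameters:', grid.best_params_)
print('Best cross-validated F1 score:', grid.best_score_)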

Hope you enjoyed this article. Please share and comment.
