Machine learning algorithms are broadly divided into supervised and unsupervised learning. Classification is a part of supervised learning, where we train a model on labeled data and evaluate its performance by comparing its predictions with held-out test data. Many classification algorithms are available in machine learning libraries; in this article we use Logistic Regression, KNN, SVM, Decision Tree, Random Forest, and XGBoost with Python 3 and compare the results. In a classification problem the output is a discrete label, here a binary one: True or False, Yes or No, Positive or Negative, or 0 or 1.

For this article, we are using a sample dataset of fertility.

https://www.kaggle.com/gabbygab/fertility-data-set

Dataset description:

100 volunteers provide a semen sample analyzed according to the WHO 2010 criteria. Sperm concentration is related to socio-demographic data, environmental factors, health status, and life habits.

The input features X are season, age, childish diseases, accident or serious trauma, surgical intervention, high fevers in the last year, frequency of alcohol consumption, smoking habit, and number of hours spent sitting per day. The predicted output is the diagnosis: Normal or Altered.

Let’s build a machine learning model that predicts the sperm diagnosis result from these conditions.

Let’s import the necessary libraries: pandas for loading the data, numpy for mathematical calculations, sklearn for the ML models, and matplotlib and seaborn for plotting.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

import warnings
warnings.filterwarnings('ignore')
```

### Import Fertility dataset

Let’s import the dataset by reading fertility.csv with the pandas library, store it in an object named df, and look at the top 5 rows of the dataset.

```python
df = pd.read_csv('fertility.csv')
df.head()
```

### Machine learning feature engineering

Having seen the dataset rows, let’s list the unique values in each column to find out which columns hold categorical values that we need to convert in the data engineering phase.

```python
# Print every column name together with its distinct values
for col in df.columns.values:
    print(col, df[col].unique())
```

```
Season ['spring' 'fall' 'winter' 'summer']
Age [30 35 27 32 36 29 33 28 31 34]
Childish diseases ['no' 'yes']
Accident or serious trauma ['yes' 'no']
Surgical intervention ['yes' 'no']
High fevers in the last year ['more than 3 months ago' 'less than 3 months ago' 'no']
Frequency of alcohol consumption ['once a week' 'hardly ever or never' 'several times a week'
'several times a day' 'every day']
Smoking habit ['occasional' 'daily' 'never']
Number of hours spent sitting per day [ 16 6 9 7 8 5 2 11 3 342 14 18 10 1]
Diagnosis ['Normal' 'Altered']
```

By running the info command we can find the columns with missing values and each column's data type (int64, object, float, etc.). Fortunately, we don’t have any missing or null values in our dataset.

```python
df.info()
```

```
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 100 entries, 0 to 99
Data columns (total 10 columns):
Season                                   100 non-null object
Age                                      100 non-null int64
Childish diseases                        100 non-null object
Accident or serious trauma               100 non-null object
Surgical intervention                    100 non-null object
High fevers in the last year             100 non-null object
Frequency of alcohol consumption         100 non-null object
Smoking habit                            100 non-null object
Number of hours spent sitting per day    100 non-null int64
Diagnosis                                100 non-null object
dtypes: int64(2), object(8)
memory usage: 7.9+ KB
```
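As a complementary check to `info()`, the null counts per column can also be listed directly. The snippet below is a minimal sketch on a tiny made-up frame (the `toy` data is an assumption for illustration, not the fertility dataset); on the real data the same call would be `df.isnull().sum()`.

```python
import pandas as pd

# Hypothetical miniature frame standing in for the fertility data
toy = pd.DataFrame({
    'Age': [30, 35, None],
    'Season': ['spring', 'fall', 'winter'],
})

# Count missing values per column; all zeros would mean nothing is missing
missing_per_column = toy.isnull().sum()
print(missing_per_column)
```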

Now assign all input feature columns from df to the x object by dropping the Diagnosis column, and assign the output Diagnosis column to the y object.

```python
x = df.drop('Diagnosis', axis=1)
y = df['Diagnosis']
```

Let’s look at the shape of x and y to get the number of rows and columns.

```python
x.shape, y.shape
```

`((100, 9), (100,))`

Next we need to convert the categorical values to numerical values, because the classification algorithms require numerical input to fit and predict. Two popular ways to do this are a one-hot encoder and the pandas get_dummies method. We use pd.get_dummies because it is simple, and we drop the first dummy column of each multi-category feature (drop_first=True) to avoid the dummy variable trap.

```python
# Convert categorical values to numerical values.
# Binary yes/no columns: keep a single 0/1 indicator column (1 = 'yes').
# Assigning the full two-column get_dummies result to one column is not
# reliable across pandas versions, so only the 'yes' column is kept.
for col in ['Childish diseases', 'Accident or serious trauma', 'Surgical intervention']:
    x[col] = pd.get_dummies(x[col], drop_first=True, dtype=int)['yes']

# Multi-category columns: one dummy column per category, minus the first
Season = pd.get_dummies(x['Season'], drop_first=True)
High_fevers = pd.get_dummies(x['High fevers in the last year'], drop_first=True)
alcohol_consumption = pd.get_dummies(x['Frequency of alcohol consumption'], drop_first=True)
Smoking = pd.get_dummies(x['Smoking habit'], drop_first=True)

# Add the Season, High_fevers, alcohol_consumption and Smoking columns to the x dataframe
df_new = pd.concat([x, Season, High_fevers, alcohol_consumption, Smoking], axis=1)

# Show the top 5 rows with the new columns
df_new.head()
```
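To see what `drop_first=True` does, here is a small sketch on a made-up three-level column (the `smoking` series is an assumption for illustration, loosely modeled on the Smoking habit column). pandas orders the categories alphabetically, so the dropped category is the first one in sorted order.

```python
import pandas as pd

# Hypothetical three-level column, similar in spirit to 'Smoking habit'
smoking = pd.Series(['never', 'daily', 'occasional', 'never'], name='Smoking habit')

# Full one-hot encoding: one column per category
full = pd.get_dummies(smoking, dtype=int)

# drop_first=True removes the first category in sorted order ('daily'),
# which is then represented implicitly by all remaining columns being 0
reduced = pd.get_dummies(smoking, drop_first=True, dtype=int)

print(list(full.columns))     # ['daily', 'never', 'occasional']
print(list(reduced.columns))  # ['never', 'occasional']
```

With the full encoding, the three columns always sum to 1, so any one of them can be derived from the other two. Dropping one column removes that redundancy.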

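Now that the dummy columns for Season, High_fevers, alcohol_consumption, and Smoking have been added to df_new, it's time to remove the original categorical columns with drop, using axis=1 to drop columns rather than rows. Note that the code also drops the numeric column Number of hours spent sitting per day, which contains an implausible value of 342 in the listing above, so it is not used as a model feature.

```python
# Drop the original categorical columns
df_new = df_new.drop('High fevers in the last year', axis=1)
df_new = df_new.drop('Frequency of alcohol consumption', axis=1)
df_new = df_new.drop('Smoking habit', axis=1)
# This numeric column is dropped too, so it is not used as a feature
df_new = df_new.drop('Number of hours spent sitting per day', axis=1)
x = df_new.drop('Season', axis=1)

# Look at the new rows and columns
x.head()
```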

Now that the x dataframe is fully converted from categorical to numerical values, it's time to convert the y column. We could use the same get_dummies method, but for simplicity I use the replace function, replacing Normal with 0 and Altered with 1.

```python
# Convert categorical values to numerical values
# y = pd.get_dummies(y, drop_first=True)
y = y.replace('Normal', 0)
y = y.replace('Altered', 1)
```

After converting the y column we can clearly see that our output is imbalanced: 88 Normal (0) and only 12 Altered (1). The ML model therefore has plenty of data points from which to learn the Normal condition but very few from which to learn the Altered condition.
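The class counts can be checked with `value_counts`. The sketch below uses a synthetic series (`y_demo`, an assumption for illustration) built with the same 88/12 split reported above; on the real data the call would be `y.value_counts()`.

```python
import pandas as pd

# Synthetic target with the class counts reported above: 88 Normal (0), 12 Altered (1)
y_demo = pd.Series([0] * 88 + [1] * 12, name='Diagnosis')

counts = y_demo.value_counts()                # absolute counts per class
shares = y_demo.value_counts(normalize=True)  # fraction of rows per class

print(counts)
print(shares)
```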

### Split data to train and Test

Now it's time to split our dataframe into train and test sets: the ML model learns from the training set and then predicts the output for the unseen test set. Since our dataset is small, we use a random 80% of the data for training and the remaining 20% for testing. We can accomplish this with the sklearn train_test_split function.

```python
# Split the dataset into train and test sets
from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=26)

# Keep a reference to x_test for further use
x_test_data = x_test
```

Our features are on different scales: Age is measured in years, while the other features are just 0 and 1, so Age could dominate the rest. To prevent this, we bring all columns onto a common scale by applying StandardScaler.
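The article does not show the scaling code, so the following is only a sketch of how this step is commonly written with sklearn's `StandardScaler`. The toy arrays `x_train_demo` and `x_test_demo` are assumptions standing in for `x_train` and `x_test`. The scaler is fitted on the training rows only and then reused on the test rows, so that no test-set statistics leak into training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical stand-ins for x_train / x_test: an Age column and a 0/1 column
x_train_demo = np.array([[30, 0], [35, 1], [27, 0], [36, 1]], dtype=float)
x_test_demo = np.array([[32, 1], [29, 0]], dtype=float)

scaler = StandardScaler()
# Learn the mean and standard deviation from the training rows only
x_train_scaled = scaler.fit_transform(x_train_demo)
# Reuse the training statistics for the test rows
x_test_scaled = scaler.transform(x_test_demo)

print(x_train_scaled.mean(axis=0))  # approximately 0 for every column
print(x_train_scaled.std(axis=0))   # approximately 1 for every column
```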

### SMOTE Oversampling

Once our data points are scaled, it's time to handle the class imbalance. For that we use the SMOTE class from the imblearn.over_sampling module, which synthetically oversamples the minority class, so that after applying SMOTE the counts of 0 and 1 are equal.
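The article does not show the oversampling code. With imbalanced-learn installed, the usual call is `SMOTE().fit_resample(x_train, y_train)`, applied to the training data only. The block below is not that library's implementation. It is a simplified NumPy sketch of the idea behind SMOTE, on made-up data, and the helper name `smote_sketch` is hypothetical: each synthetic row is placed on the line segment between a minority sample and one of its nearest minority neighbours.

```python
import numpy as np

def smote_sketch(x_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority neighbours."""
    rng = np.random.default_rng(seed)
    synthetic = []
    for _ in range(n_new):
        # Pick a random minority sample as the starting point
        base = x_minority[rng.integers(len(x_minority))]
        # Find its k nearest minority neighbours (position 0 is the sample itself)
        distances = np.linalg.norm(x_minority - base, axis=1)
        neighbours = np.argsort(distances)[1:k + 1]
        neighbour = x_minority[rng.choice(neighbours)]
        # Place the new point somewhere on the segment between the two
        gap = rng.random()
        synthetic.append(base + gap * (neighbour - base))
    return np.array(synthetic)

# Hypothetical imbalanced data: 8 majority rows (0) and 4 minority rows (1)
x_major = np.array([[0.0, 0.1], [0.2, 0.0], [0.1, 0.3], [0.3, 0.2],
                    [0.0, 0.4], [0.4, 0.1], [0.2, 0.2], [0.1, 0.0]])
x_minor = np.array([[1.0, 1.0], [1.2, 0.9], [0.9, 1.1], [1.1, 1.3]])

# Generate enough synthetic minority rows to match the majority count
x_new = smote_sketch(x_minor, n_new=len(x_major) - len(x_minor))
x_balanced = np.vstack([x_major, x_minor, x_new])
y_balanced = np.array([0] * len(x_major) + [1] * (len(x_minor) + len(x_new)))

print(np.bincount(y_balanced))  # [8 8]
```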

### Machine learning classification models

Finally, after data cleaning, scaling, and the other steps, we reach the point where we test the performance of our machine learning algorithms by fitting them to the training data.

```python
# Import ML models from sklearn
from sklearn.metrics import f1_score
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
# For XGBoost, first install xgboost using pip
import xgboost as xgb

# Assign each ML model to an object
lr = LogisticRegression(solver='lbfgs', C=10, max_iter=10)
clf = SVC()
knn = KNeighborsClassifier(n_neighbors=20)
dtree = DecisionTreeClassifier(criterion="entropy", max_depth=4)
rfc = RandomForestClassifier(max_depth=5, n_estimators=2)
xg = xgb.XGBClassifier(max_depth=3)
```

Now let’s fit all ML models to the x_train and y_train dataset, measure the time each takes to fit, and compute the F1 score of every model. Since our dataset is very small, the models take almost no time to fit, but on a large dataset fitting and scoring can take hours or even days.

```python
# Fit each ML model to x_train and y_train and measure the time taken to fit
import time

def scorer(*models):
    for model in models:
        start = time.time()
        model.fit(x_train, y_train)
        end = time.time()  # stop the clock right after fitting
        print(model.__class__.__name__, 'F1 score =', f1_score(y_test, model.predict(x_test)))
        # Break the elapsed seconds into hours:minutes:seconds
        temp = end - start
        hours = temp // 3600
        temp = temp - 3600 * hours
        minutes = temp // 60
        seconds = temp - 60 * minutes
        print(model.__class__.__name__, 'Total time taken to Fit =', '%d:%d:%d' % (hours, minutes, seconds))

scorer(lr, clf, knn, dtree, rfc, xg)
```

```
LogisticRegression F1 score = 0.25
LogisticRegression Total time taken to Fit = 0:0:0
SVC F1 score = 0.0
SVC Total time taken to Fit = 0:0:0
KNeighborsClassifier F1 score = 0.375
KNeighborsClassifier Total time taken to Fit = 0:0:0
DecisionTreeClassifier F1 score = 0.39999999999999997
DecisionTreeClassifier Total time taken to Fit = 0:0:0
RandomForestClassifier F1 score = 0.2
RandomForestClassifier Total time taken to Fit = 0:0:0
XGBClassifier F1 score = 0.3333333333333333
XGBClassifier Total time taken to Fit = 0:0:0
```

Now that all the ML models have learned from the training data points, it’s time to measure model performance by predicting the output for the test set and comparing it with the actual test output. For that we can use the sklearn library to calculate accuracy, precision, recall, and RMSE.

```python
# Model predictions
from sklearn.metrics import mean_squared_error
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn import metrics

for i in (lr, clf, knn, dtree, rfc, xg):
    y_pred = i.predict(x_test)
    print('\n')
    print(i.__class__.__name__)
    print(confusion_matrix(y_test, y_pred))
    print('accuracy:', accuracy_score(y_test, y_pred))
    print('Precision:', precision_score(y_test, y_pred))
    print('Recall:', recall_score(y_test, y_pred))
    print('RMSE:', np.sqrt(mean_squared_error(y_test, y_pred)))
    print('#############################################################')
```

```
LogisticRegression
[[13 3]
[ 3 1]]
accuracy: 0.7
Precision: 0.25
Recall: 0.25
RMSE: 0.5477225575051661
#############################################################
SVC
[[15 1]
[ 4 0]]
accuracy: 0.75
Precision: 0.0
Recall: 0.0
RMSE: 0.5
#############################################################
KNeighborsClassifier
[[7 9]
[1 3]]
accuracy: 0.5
Precision: 0.25
Recall: 0.75
RMSE: 0.7071067811865476
#############################################################
DecisionTreeClassifier
[[8 8]
[1 3]]
accuracy: 0.55
Precision: 0.2727272727272727
Recall: 0.75
RMSE: 0.6708203932499369
#############################################################
RandomForestClassifier
[[11 5]
[ 3 1]]
accuracy: 0.6
Precision: 0.16666666666666666
Recall: 0.25
RMSE: 0.6324555320336759
#############################################################
XGBClassifier
[[ 5 11]
[ 1 3]]
accuracy: 0.4
Precision: 0.21428571428571427
Recall: 0.75
RMSE: 0.7745966692414834
#############################################################
```

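As the model output shows, SVC (0.75), Logistic Regression (0.70), and Random Forest (0.60) give the highest accuracy, but on an imbalanced test set accuracy alone is misleading: SVC identifies none of the four Altered cases, and Logistic Regression and Random Forest each identify only one (recall 0.25). KNN, Decision Tree, and XGBoost find three of the four Altered cases (recall 0.75) at the cost of many more false positives. The confusion matrices for Logistic Regression and Random Forest, which have the most true negatives after SVC, are: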

```
LogisticRegression
[[13 3]
[ 3 1]]
RandomForestClassifier
[[11 5]
[ 3 1]]
```

Of the 20 test points, 16 are Normal (0) and 4 are Altered (1). Logistic Regression correctly predicts 13 true negatives (Normal) and 1 true positive (Altered); similarly, Random Forest correctly predicts 11 true negatives (Normal) and 1 true positive (Altered). Both models also misclassify some points: Logistic Regression produces 3 false positives and 3 false negatives, and Random Forest 5 false positives and 3 false negatives.
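The manual calculation can be written out in a few lines. The sketch below takes the Logistic Regression confusion matrix printed earlier and derives accuracy, precision, recall, and F1 by hand. The variable names are illustrative, and the layout assumed is sklearn's (rows are actual classes, columns are predicted classes).

```python
# Confusion matrix for Logistic Regression as printed above,
# in sklearn's layout: rows = actual, columns = predicted
tn, fp = 13, 3   # actual Normal (0): predicted Normal, predicted Altered
fn, tp = 3, 1    # actual Altered (1): predicted Normal, predicted Altered

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the predicted Altered cases, how many are right
recall = tp / (tp + fn)      # of the actual Altered cases, how many are found
f1 = 2 * precision * recall / (precision + recall)

print(accuracy, precision, recall, f1)  # 0.7 0.25 0.25 0.25
```

These values match the accuracy, precision, recall, and F1 score reported for Logistic Regression in the outputs above.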

### Compare Actual Vs Model prediction

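We can do a side-by-side comparison between the actual values and the model predictions.

```python
# y_pred from the loop above holds the last model's predictions (XGBoost);
# recompute it for Logistic Regression, which the table below matches
y_pred = lr.predict(x_test)
comp = pd.DataFrame({'Actual': y_test, 'Predicted': y_pred})
comp
```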

```
Index Actual Predicted
84 1 0
26 1 1
99 0 0
14 0 0
40 0 0
75 0 0
5 0 0
2 0 0
60 0 1
9 0 0
45 0 0
69 0 1
50 0 0
1 1 0
43 0 0
32 0 0
4 1 0
87 0 0
44 0 0
33 0 1
```

As we can see from the table above, at index 26 the actual value and the model prediction are both 1 (Altered).

We can combine x_test_data and y_test using pd.concat, save the result to a variable named data, and then compare it with y_pred to see which rows the model fails to classify correctly. This helps us understand how to improve model performance and which conditions the model needs in order to recognize true positive (Altered) cases.
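A minimal sketch of that comparison follows, using small made-up stand-ins for `x_test_data`, `y_test`, and `y_pred` (all `_demo` names and values are assumptions). In the article's flow `y_pred` is a NumPy array, so it would first need to be wrapped as `pd.Series(y_pred, index=y_test.index)` so that the rows align on the index.

```python
import pandas as pd

# Hypothetical stand-ins for x_test_data, y_test and y_pred
x_test_demo = pd.DataFrame({'Age': [30, 35, 27, 32]}, index=[84, 26, 99, 60])
y_test_demo = pd.Series([1, 1, 0, 0], index=[84, 26, 99, 60], name='Actual')
y_pred_demo = pd.Series([0, 1, 0, 1], index=[84, 26, 99, 60], name='Predicted')

# Put features, true labels and predictions side by side (aligned on the index)
data = pd.concat([x_test_demo, y_test_demo, y_pred_demo], axis=1)

# Keep only the rows the model got wrong
misclassified = data[data['Actual'] != data['Predicted']]
print(misclassified)
```

Inspecting the feature values of the misclassified rows is one way to look for conditions the model has not learned.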

## Practice Exercise

It’s time to do some hands-on projects on machine learning classification problems using the datasets at:

https://archive.ics.uci.edu/ml/datasets.php

Wrapping up: in this tutorial we saw how machine learning classification algorithms work and how to compare them. We can further use parameter tuning to get better results.

Hope you enjoyed this article. Please share and comment.