Simple linear regression involves two variables: one independent (the X column) and one dependent (the Y column). In this article, we try to predict the percentage of marks a student scores based on the number of hours of study. We implement this in Python with the help of the Scikit-Learn machine learning library. All code was executed in a Jupyter Notebook running Python 3.5.
Most machine learning code can be split into the six steps below.
Import the necessary Python libraries.
Store the data (CSV) in a variable.
Split the data into training and test data.
Fit the regression model to the training data and predict for the test data.
Visualize the training/test data against the regression line.
Calculate R-squared and root mean square error.
Let’s start by importing the necessary libraries.
Pandas is a data analysis library that covers file handling tasks (such as reading and writing files).
NumPy is a numerical library that covers the mathematical calculations.
Matplotlib covers plots and graphs.
In this cell, we import the student_scores CSV file, which has one independent feature and one dependent feature. The data relates the number of hours a student studies to the percentage scored in the exam. With the help of the Pandas library, we store the data in the data variable. Next, we store the independent feature in the x variable and the dependent feature in the y variable. iloc in Pandas performs integer-location-based indexing and takes [rows, columns]. Indexing in Python starts at 0. In [:, :-1], the first : selects all rows and :-1 selects every column except the last one.
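The loading and slicing step can be sketched as follows. The rows are made up for illustration and are not the real student_scores data, the Hours and Scores column names are taken from the description later in the article, and selecting y with [:, -1] and calling .values are assumptions about how the original cell was written.

```python
import io

import pandas as pd

# Made-up rows standing in for the student_scores CSV file (not the real data).
csv_text = """Hours,Scores
1.0,15
2.0,25
3.0,35
4.0,45
5.0,55
"""

# In a notebook you would pass the path of the CSV file instead of a StringIO.
data = pd.read_csv(io.StringIO(csv_text))

# [:, :-1] -> all rows, every column except the last (the independent feature).
x = data.iloc[:, :-1].values
# [:, -1] -> all rows, only the last column (the dependent feature).
y = data.iloc[:, -1].values

print(x.shape)  # (5, 1): x stays two-dimensional
print(y.shape)  # (5,): y is one-dimensional
```

Keeping x two-dimensional matters later, because Scikit-Learn estimators expect the features as a table of rows and columns even when there is only one feature.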
A good regression model performs well on both seen and unseen data, so we need to evaluate our model on data it was not trained on (test data). In this cell, we split the overall data into training and test data using the Scikit-Learn function train_test_split(). 70% of the data is used for training and the remaining 30% for testing.
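A minimal sketch of the 70/30 split, assuming made-up data with ten rows. The random_state value is an assumption added here so the split is repeatable; the original cell may not have set it.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data for illustration: 10 students, score = 10 * hours + 5.
x = np.arange(1, 11, dtype=float).reshape(-1, 1)
y = 10 * x.ravel() + 5

# test_size=0.3 keeps 30% of the rows for testing and 70% for training.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.3, random_state=0
)

print(x_train.shape, x_test.shape)  # (7, 1) (3, 1)
```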
x.shape and y.shape give the number of rows and columns, and describe() gives an overall summary of the data (such as count, mean, and max). The hist() function shows histograms of Hours and Scores.
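A short sketch of these inspection calls on a made-up DataFrame (the values are illustrative, not the article's dataset):

```python
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows standing in for the student data.
data = pd.DataFrame(
    {"Hours": [1.0, 2.0, 3.0, 4.0, 5.0], "Scores": [15, 25, 35, 45, 55]}
)
x = data.iloc[:, :-1].values
y = data.iloc[:, -1].values

print(x.shape, y.shape)  # (5, 1) (5,)

# describe() returns count, mean, std, min, quartiles and max per column.
summary = data.describe()
print(summary)

# hist() draws one histogram per numeric column and returns the axes.
axes = data.hist()
```

In a Jupyter notebook the histograms render at the end of the cell; in a plain script, call plt.show() afterwards.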
In this cell, we draw a scatter plot with a title, an x label, and a y label.
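A scatter plot of this kind might look like the sketch below. The data points and the title and label strings are assumptions chosen for illustration.

```python
import matplotlib.pyplot as plt

# Made-up points for illustration.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
scores = [14, 27, 33, 46, 55]

fig, ax = plt.subplots()
ax.scatter(hours, scores)
ax.set_title("Hours vs Percentage")
ax.set_xlabel("Hours Studied")
ax.set_ylabel("Percentage Score")
```

In a plain script rather than a notebook, add plt.show() to display the figure.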
We need to import a library that performs the background calculations involved in model building. Scikit-Learn contains several modules and tools for building machine learning models; for simple regression we use its linear_model module. From this module we import the LinearRegression class, create an object of this class named regressor, and fit this object to the training set (x_train, y_train).
To fit it to the training set, we use the fit method. The LinearRegression class contains several methods, and fit is the one that fits the regressor object we created to the training set.
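The fitting step can be sketched like this. The training data is made up so that it lies exactly on the line score = 10 * hours + 5, which makes the learned slope and intercept easy to check.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data that lies exactly on score = 10 * hours + 5.
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = 10 * x_train.ravel() + 5

# Create the regressor object and fit it to the training set.
regressor = LinearRegression()
regressor.fit(x_train, y_train)

# coef_ holds the slope and intercept_ the intercept of the fitted line.
print(regressor.coef_[0], regressor.intercept_)

# predict() applies the fitted line to new inputs, here 6 hours of study.
y_pred = regressor.predict(np.array([[6.0]]))
print(y_pred)
```

Because the made-up data is perfectly linear, the slope comes out close to 10 and the intercept close to 5; real data will not fit this cleanly.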
In this cell, we use a Pandas DataFrame to compare the actual values with the values predicted by our regression model. From the result, we can see how close the model's predictions are to the actual values.
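One way to build such a comparison table is sketched below. All numbers are made up, and the Actual, Predicted, and Difference column names are assumptions rather than the names used in the original notebook.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up training and test data for illustration.
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([14.0, 27.0, 33.0, 46.0, 55.0])
x_test = np.array([[2.5], [4.5]])
y_test = np.array([30.0, 49.0])

regressor = LinearRegression().fit(x_train, y_train)
y_pred = regressor.predict(x_test)

# Put actual and predicted values side by side, plus their difference.
comparison = pd.DataFrame({"Actual": y_test, "Predicted": y_pred})
comparison["Difference"] = comparison["Actual"] - comparison["Predicted"]
print(comparison)
```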
In this cell, we plot the training data together with the regression model's predictions for it. For better understanding, we add a title and axis labels.
Next, we test our regression model on unseen data (the test data points). The scatter function shows the test data points on the graph, and the plot function draws the regression line that was fitted on the training data. As the graph shows, the model also gives good predictions for unseen data.
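The training and test plots described above can be sketched side by side. The data, colors, titles, and the choice of a single figure with two panels are all assumptions for illustration; the original notebook used separate cells.

```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training and test data for illustration.
x_train = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_train = np.array([14.0, 27.0, 33.0, 46.0, 55.0])
x_test = np.array([[2.5], [4.5]])
y_test = np.array([30.0, 49.0])

regressor = LinearRegression().fit(x_train, y_train)
line_y = regressor.predict(x_train)  # points on the fitted regression line

fig, (ax_train, ax_test) = plt.subplots(1, 2, figsize=(10, 4))

# Left panel: training points with the regression line.
ax_train.scatter(x_train.ravel(), y_train, color="red")
ax_train.plot(x_train.ravel(), line_y, color="blue")
ax_train.set_title("Hours vs Percentage (training set)")

# Right panel: test points with the same line fitted on the training data.
ax_test.scatter(x_test.ravel(), y_test, color="red")
ax_test.plot(x_train.ravel(), line_y, color="blue")
ax_test.set_title("Hours vs Percentage (test set)")

for ax in (ax_train, ax_test):
    ax.set_xlabel("Hours Studied")
    ax.set_ylabel("Percentage Score")
```

The line in the right panel is deliberately the one learned from the training set, so the plot shows how well that line generalizes to points the model has not seen.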
In this cell, we calculate the R-squared and root mean square error values for our model. R-squared for this model is 95%, which is quite good: the model explains about 95% of the variation in the scores, which suggests it can provide good predictions for unseen data.
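Both metrics can be computed with Scikit-Learn's metrics module, as in the sketch below. The actual and predicted values are hypothetical numbers chosen so the arithmetic is easy to check by hand; they do not reproduce the article's result.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Hypothetical actual and predicted scores; every prediction is off by 10.
y_test = np.array([20.0, 40.0, 60.0, 80.0])
y_pred = np.array([30.0, 30.0, 70.0, 70.0])

# R-squared = 1 - SS_res / SS_tot = 1 - 400 / 2000.
r2 = r2_score(y_test, y_pred)

# RMSE is the square root of the mean squared error (here sqrt(100)).
rmse = np.sqrt(mean_squared_error(y_test, y_pred))

print(r2, rmse)  # approximately 0.8 and 10.0
```

RMSE is expressed in the same units as the target (percentage points here), which makes it easier to interpret than the squared error.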
Exercise 1:- Let's create a regression model that can predict the salary of an employee based on total years of experience. Dataset: Salary_Data CSV.
Exercise 2:- Now, if you were able to find a good regression model for the previous dataset, let's take another case study: TV ad marketing with respect to sales. Dataset: Tv-Salse.csv.
Wrapping Up:- So this is what a machine learning model is: we created a machine named regressor (the object name) and fitted it to the training set so it could learn the correlation between the exam percentage and the number of hours studied. Based on that learning experience, the machine can then predict the percentage from the hours studied.