The R-square value shows how close the data points are to the fitted regression line. It is also known as the coefficient of determination, or the coefficient of multiple determination when there is more than one independent variable. R-square reflects the proportion of the variance of one variable that is predictable from another variable: it is the ratio of explained variation to total variation.

Let’s try to understand how the **R-square formula works with a mathematical calculation**. Take the data points from the previous article on Ordinary Least Squares: X-axis values (1, 2, 3, 4, 5) and Y-axis values (3, 4, 2, 4, 5). To simplify the calculation, make a table.

## R squared formula

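$$R^2 = \frac{\sum (Y_p - \bar{Y})^2}{\sum (Y - \bar{Y})^2}$$

where $Y_p$ is the predicted value of Y and $\bar{Y}$ is the mean of Y.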

Let us calculate the value using the formula above. Mean of Y = 3.6.

SSe = Σ(Yp – mean of Y)² = 1.6 and SST = Σ(Y – mean of Y)² = 5.2, so

**R-square = SSe / SST = 1.6 / 5.2 ≈ 0.31**

The R-square value ranges from 0 to 1. A value close to 0 means the regression model fits poorly, and a value close to 1 means it fits well. R-square = 1 means every actual (X, Y) point coincides with its predicted point, which rarely happens with real data because of noise or other factors in the input data. If R-square = 0.85, the model accounts for 85% of the total variation in Y; the remaining 15% is unexplained and not reflected by the model.

R-square can be increased by adding independent features such as X1, X2, X3 that contribute to predicting the dependent value Y, but at the same time we need to watch for overfitting and underfitting of the regression model.

Every point that does not fall exactly on the line carries some error. It is important to understand these errors in order to judge the goodness of fit of the model, i.e. how representative the model is likely to be in general.

- P1 – Original Y data point for a given X.
- P2 – Estimated Y value for a given X.
- Y bar – Average of all Y values in the data set.
- SST – Total sum of squares: the variation of P1 from Y bar, Σ(P1 – Y bar)^2.
- SSE – Explained sum of squares, Σ(P2 – Y bar)^2 (the portion of SST captured by the regression model).
- SSR – Residual sum of squares, Σ(P1 – P2)^2.

**Residual error = Actual value – Predicted value**

Our goal is to reduce the residual error.

SSr = Residual error (sum of squared residuals)

SSe = Explained error (sum of squared explained deviations)
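The quantities defined above can be computed directly for the five example points (X = 1..5, Y = 3, 4, 2, 4, 5). The following is a minimal sketch in plain Python; the variable names (`ss_total`, `ss_explained`, `ss_residual`) are illustrative choices, and the line is fitted with the standard closed-form OLS slope and intercept for a single feature.

```python
# Example data from the article (X and Y values of the five points).
x = [1, 2, 3, 4, 5]
y = [3, 4, 2, 4, 5]

mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)  # Y bar

# Ordinary least squares line through the points: y_pred = intercept + slope * x.
slope = (sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
         / sum((xi - mean_x) ** 2 for xi in x))
intercept = mean_y - slope * mean_x
y_pred = [intercept + slope * xi for xi in x]

# One table row per point: X, Y, predicted Y and the three squared deviations.
print(" X  Y   Yp  (Y-Ybar)^2  (Yp-Ybar)^2  (Y-Yp)^2")
for xi, yi, yp in zip(x, y, y_pred):
    print(f"{xi:2d} {yi:2d} {yp:4.1f} {(yi - mean_y) ** 2:11.2f} "
          f"{(yp - mean_y) ** 2:12.2f} {(yi - yp) ** 2:9.2f}")

ss_total = sum((yi - mean_y) ** 2 for yi in y)                  # SST
ss_explained = sum((yp - mean_y) ** 2 for yp in y_pred)         # SSe
ss_residual = sum((yi - yp) ** 2 for yi, yp in zip(y, y_pred))  # SSr

r_square = ss_explained / ss_total
print(f"SST={ss_total:.2f}  SSe={ss_explained:.2f}  "
      f"SSr={ss_residual:.2f}  R-square={r_square:.3f}")
```

For these points the sums come out as SST = 5.2, SSe = 1.6 and SSr = 3.6, so R-square = 1.6 / 5.2 ≈ 0.308, and the explained and residual parts add up to the total.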

## R squared formula

- The best-fitting model is one where every data point lies on the line, i.e. SSR = 0.
- In that case SSE equals SST, so SSE/SST = 1.
- A poor fit means a large SSR (since the points do not fall on the line), so SSE is close to 0 and SSE/SST is close to 0.
- SSE/SST is called R-square, or the coefficient of determination.
- For an OLS model with an intercept, R-square is always between 0 and 1 and is a measure of the utility of the regression model.
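The bullets above rely on the total variation splitting into an explained part and a residual part. As a sketch in this article's notation (assuming an OLS fit that includes an intercept, with $Y_p$ the predicted value and $\bar{Y}$ the mean of Y), the relationship can be written as:

```latex
\underbrace{\sum (Y - \bar{Y})^2}_{SST}
  = \underbrace{\sum (Y_p - \bar{Y})^2}_{SSE}
  + \underbrace{\sum (Y - Y_p)^2}_{SSR}
\qquad\Longrightarrow\qquad
R^2 = \frac{SSE}{SST} = 1 - \frac{SSR}{SST}
```

With SSR = 0 the ratio is 1, and when SSR grows towards SST the ratio falls towards 0, which matches the two extreme cases in the list.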

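```python
# Python code to fit an OLS model and print its summary.
# x_train (a DataFrame with columns X1, X2, X3) and y_train are assumed to exist already.
import statsmodels.api as sm

# Note: sm.OLS does not add an intercept by itself; wrap the features in
# sm.add_constant(...) if the model should include one.
model = sm.OLS(y_train, x_train[['X1', 'X2', 'X3']])
result = model.fit()
print(result.summary())
```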

## Adjusted R-Square

Adjusted R-square corrects R-square for the number of predictors in the model. It is never greater than R-square, and it can be negative. It helps when tuning the model. In the example above, we saw that increasing the number of features (Xn) also increases the R-square value. So at what point should we stop adding more artificial features?

If the difference between R-square and Adjusted R-square grows, it means the model is starting to overfit (we are not adding rows, but we are trying to learn more features); in short, the ratio n:m is not right.

n = number of independent features.

m = number of rows (total values for a given feature).
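With n and m defined this way, the commonly used formula is Adjusted R-square = 1 – (1 – R-square) × (m – 1) / (m – n – 1). The helper below is a small sketch of that formula; the function name `adjusted_r_square` and its parameter names are illustrative, not taken from any library.

```python
def adjusted_r_square(r_square, n_features, n_rows):
    """Adjusted R-square for n_features independent features (n) and n_rows rows (m)."""
    degrees_of_freedom = n_rows - n_features - 1
    if degrees_of_freedom <= 0:
        raise ValueError("need more rows than features + 1")
    return 1.0 - (1.0 - r_square) * (n_rows - 1) / degrees_of_freedom


# The five-point example from earlier: R-square = 1.6 / 5.2, one feature, five rows.
example = adjusted_r_square(1.6 / 5.2, n_features=1, n_rows=5)
print(round(example, 3))  # 0.077
```

For the small example the adjusted value (about 0.077) is far below the R-square of about 0.308, because five rows leave very few degrees of freedom once a feature and an intercept are fitted.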

**So we can consider a model good when the difference between its R-square and Adjusted R-square values is small.**
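To see why the gap matters, here is a hedged sketch on synthetic data (not the data set used in this article): one informative feature plus a growing number of pure-noise features. It assumes NumPy is available; the helper `r_square_and_adjusted`, the seed and the sample sizes are arbitrary illustrative choices.

```python
import numpy as np


def r_square_and_adjusted(features, y):
    """Fit OLS with an intercept via least squares; return (R-square, Adjusted R-square)."""
    m, n = features.shape  # m rows, n independent features
    design = np.column_stack([np.ones(m), features])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residual = y - design @ coef
    ss_residual = float(residual @ residual)
    ss_total = float(((y - y.mean()) ** 2).sum())
    r2 = 1.0 - ss_residual / ss_total
    adjusted = 1.0 - (1.0 - r2) * (m - 1) / (m - n - 1)
    return r2, adjusted


rng = np.random.default_rng(0)
m = 30
x_real = rng.normal(size=(m, 1))             # one feature that really drives Y
y = 2.0 * x_real[:, 0] + rng.normal(size=m)  # Y = 2 * X + noise
noise = rng.normal(size=(m, 10))             # features unrelated to Y

results = []
for extra in (0, 5, 10):
    features = np.column_stack([x_real, noise[:, :extra]])
    r2, adjusted = r_square_and_adjusted(features, y)
    results.append((features.shape[1], r2, adjusted))
    print(f"n={features.shape[1]:2d}  R-square={r2:.3f}  "
          f"adjusted={adjusted:.3f}  gap={r2 - adjusted:.3f}")
```

R-square cannot go down when features are added to an OLS model with an intercept, even if the new features are noise, while Adjusted R-square is penalised for each extra feature, so comparing the two helps spot features that add nothing.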

For example, consider a model with three independent features (Xn), CRIM, ZN and CHAS, that predict the Y values. Fitting OLS with the Python code above gives an R-square value of .380, or 38%, which is close to zero. So this is not an optimal model: it cannot predict the dependent variable Y well with this limited set of independent features. The data points are scattered and the regression line does not capture them efficiently.

This model is not the best, because the difference between R-square and Adjusted R-square is large. To get a better model we need to reduce that difference, and to do so we need to add independent features that help predict Y.

So to increase the model efficiency, we ask for more features from the data source or otherwise artificially produce features (Xn). As the screenshot below shows, the R-square value increased from .380 to .936, or about 93%, which is considered a good model: most of the data points lie near the regression line. In other words, this model is good because the difference between R-square and Adjusted R-square is small.

**Wrapping up:-** The R-square value tells how well the regression model fits the given data points; the Adjusted R-square value, on the other hand, tells how important the features (X1, X2…Xn) are to your model.