R Square formula value shows how close data point is to the fitted regression line, it also known as the coefficient of determination or coefficient of multiple determination. The coefficient of determination R-square reflects the proportion of variance of one variable which is predictable from another variable. It is the ratio of explained variation to total variation.Let’s try to understand the working of R square formula with some mathematical calculation. Let take previous article Ordinary Least Squares datapoint example X-axis value (1,2,3,4,5) and Y-axis value (3,4,2,4,5). To simplify the mathematical calculation make a table.
R squared formula
Where Yp is Y predicted value.Let calculate the value based on the above formula. Mean of Y value = 3.6
SSe = Mean of = 1.6
- P1 – Original Y data point for given X.
- P2 – Estimated Y value for given X.
- Y bar – Average of all Y values in the data set.
- SST- Sum of square error Total(SST) variance of P1 from Y bar (Y – Y bar)^2.
- SSE – Explained error (p2 – Ybar)^2 (Portion SST captured by regression model).
- SSR – Residual error (P1 -P2)^2.
SSr = Residual error
SSe = Explained error
R squared formula
- That model is most fit where every data point lies on the line i.e SSR = 0 for all data points.
- Hence SSE should be equal to SST i.e SSE/SST should be 1.
- A poor fit will mean large SSR (since points do not fall on the line) hence SSE =0 therefor SSE/SST =0
- SSE/SST is called as R-Square or coefficient of determination
- R-square is always between 0 to 1 and is a measure of the utility of the regression model
#Python code to find OLS. import statsmodels.api as sm model=sm.OLS(y_train,x_train[['X1','X2','X3']]) result=model.fit() print(result.summary())
Adjusted R-square shows the number of an active predictor in the model. Adjusted R-square is always less then R-square. Its value can be -Ve but not always. It is required to optimize the model accuracy. So In the above example, we understand by increasing the no of features (Xn) R-square value also increased. So at what point we should stop adding more artificial features.
If the difference between R-Square and Adjusted R-square increase it means our model start overfitting stage (Not increasing no of the row but try to learn more feature) in short n:m is not correct.
n = No of an independent feature.
m = No of row (total value for given feature).
So we can consider a good Model which has less difference between R-square and Adjusted R-square value.
For example :- Consider a model which has three independent features(Xn) CRIM, ZN, CHAS which predict the Y values and if we try to find the OLS values by using above python code it results in R-square value .380 or 38% which is close to zero, so it is not an optimal model, mean this model is not capable to predict dependent variable value Y with limited independent features of Xn. Data points are scattered and model regression line not able to capture efficiently.
This model is not best because the difference between R-square and Adjusted R-Square is more. So to get the best-predicted model we need to reduce the difference between R-square and Adjusted R-square to do so we need to increase the no of independent features.
So to increase the model efficiency we ask more features from a data source or otherwise artificially produce features(Xn). As we can see below screenshot R-square value increased from .380 to .936 or 93% which consider as a good model. Which mean most of the datapoint lay down near to regression line. In other words, This model is good because the difference between R-square and Adjusted R-square is less.
Wrapping up:- R-square value tell how well the regression model fits given data points in other hand Adjusted R-square value tells how features (X1, X2…Xn) important to your model.