Best Regression Model
Problem statement -
In the attached dataset, there are six input (but not necessarily independent) variables:
X_1, X_2, and X_3 are boolean; X_4, X_5, and X_6 are positive real.
The noisy output variable is Y.
Your task is to build a regression model to predict the output Y based on the inputs {X_1 ... X_6}.
Choose a suitable modeling approach for the dataset.
Steps -
We check for correlations between the variables using matrix of correlations and Variance Inflation factor.
Remove the variables that are very highly correlated, and have least correlation with the dependent variable.
Run, multiple machine learning models on the data through a pipeline and calculate Residual Mean Squares for all of them. The One with least RSME will probably perform the best.
Implemented XGBoost, Random Forest, KNN for regression. KNN gave an accuracy of 98.72% which was calculated using K fold cross validation for each model.