Regression Based Predictive Machine Learning Model for Pervasive Data Analysis in Power Systems

░ ABSTRACT - The main aim of this paper is to highlight the benefits of Machine Learning (ML) in power system applications. A regression-based machine learning model is used to predict the results of power system analysis and economic analysis. Predictive ML models are proposed and developed for two modified test systems, the IEEE 14-bus and IEEE 30-bus systems, integrated with renewable energy sources and reactive power compensation devices. The model features are the hour of the day, solar irradiation, wind velocity, dynamic grid price, and system load. An hour-wise input database for model development is generated from monthly average data and hour-wise daily curves with normally distributed deviations. An important validation technique, K-fold cross-validation, is explained. The correlation between input and output variables is examined using Spearman's correlation analysis and presented as heat maps. Multiple linear regression based training and testing of the modified IEEE 14-bus and IEEE 30-bus systems is then presented for the base load case and for 10% and 20% load increments, with 5-fold cross-validation. A comparative analysis is performed to find the best-fit ML model for this research.

Learning is defined as the process of converting experience into expertise or knowledge. Humans are error-prone when attempting to establish correlations between several variables and assessments. Machine learning techniques can solve some of these problems with higher accuracy and robustness [1]. This paper explains machine learning techniques, data selection, feature selection using correlation, and machine learning model development. In conventional techniques, the program logic is written explicitly and applied to the input data to produce the output. In machine learning, by contrast, the algorithm is trained with the input data and the corresponding output data. The machine learns the relationship between input and output and develops its own logic from the input and output data sets provided to it. The learned logic is then applied to the test system to check the effectiveness of the machine learning algorithm developed from the training data [2][3]. When a machine learning technique relies on labeled data sets, it is known as supervised learning. When the computed outputs, referred to as dependent parameters, depend on the independent parameters, prediction can be achieved using supervised machine learning techniques. The algorithm is built on the training data sets and becomes efficient after many iterations. Regression and classification are the two main types of supervised learning techniques [4][5].
A regression-type supervised learning algorithm uses the training data set to forecast a single-valued output. The output value is called the dependent variable, while the inputs are the independent variables. Supervised learning offers different types of regression: linear regression, multiple regression, and polynomial regression. In simple linear regression, only one input variable is used to predict the outcome. In a multiple regression model, the output is predicted from many inputs. In polynomial regression, the relationship between the input and output variables is modeled as a polynomial of a chosen order [6]; an example is the relationship between solar radiation and the time of day. Both linear and polynomial regression based machine learning techniques are used in this work, and it is concluded that polynomial regression of order 4 was best suited to our problem. Regression-based ML was chosen over other artificial intelligence techniques such as neural networks because it provides better control and leads to a better analysis of the correlation between the features and to better feature selection.
The rest of the paper is organized in the following manner.
Linear regression is an ML algorithm used for supervised learning. It predicts a dependent variable (the target) from the given independent variable(s). The technique therefore finds a linear relationship between the dependent variable and the given independent variables.
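As an illustration of the linear relationship described above, the following minimal sketch fits a multiple linear regression by ordinary least squares. The data are synthetic and the coefficients in `true_coef` are made up for the example; the five columns merely stand in for the five input features and are not the dataset used in this paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical inputs: 200 samples of five features (stand-ins for hour,
# irradiation, wind speed, grid price, and load), scaled to [0, 1].
X = rng.uniform(0.0, 1.0, size=(200, 5))
true_coef = np.array([0.5, -1.2, 0.8, 2.0, 1.5])  # made-up coefficients
y = 3.0 + X @ true_coef + rng.normal(0.0, 0.01, size=200)

# Append a column of ones so the intercept is estimated with the slopes.
A = np.column_stack([np.ones(len(X)), X])
beta, *_ = np.linalg.lstsq(A, y, rcond=None)

intercept, coef = beta[0], beta[1:]
y_hat = A @ beta  # predictions of the fitted linear model
```

With low noise, the estimated `intercept` and `coef` should land close to the values used to generate the data, which is a quick sanity check for any regression code.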

Predictive Techno Economic Analysis
The input-output data set is generated by gathering data from the NASA website and by analyzing voltage stability and operating price in the MATLAB coding environment. The generated input data set is checked for feasibility using load flow computations and the power rating of the RES devices. A regression-based ML model is selected for our analysis because the regression model gives accurate statistical results. Multiple linear regression and polynomial regression (3rd, 4th, and 5th order) are compared with the help of the K-fold cross-validation technique. The best-fit model is shortlisted and the predictive analysis is carried out for the 24-hour test data [7][8][9][10].
The complete numerical data set for forecasting is developed in three stages.
The proposed algorithm is based on a case study of the Zaragoza region in Spain. The wind velocity and solar irradiation data of that location are used as input data [11][12].
Solar irradiation data: The average solar irradiation for each month of one year, as provided by Jayakumar et al. [13][14], is used. This data is acquired from the National Aeronautics and Space Administration (NASA). The monthly average data is then converted to an 8760-hour dataset.

International Journal of Electrical and Electronics Research (IJEER)
Wind velocity data: The monthly average wind velocity data is also acquired from the National Aeronautics and Space Administration. The monthly average data is then converted to hourly data for 8760 hours. Load demand, dynamic price, wind velocity, solar irradiation, and hour of the day are the inputs considered for predicting the voltage stability and operating cost of the IEEE 14-bus and 30-bus systems using ML techniques.
The resulting input database, which contains daily profile information of load demand, solar irradiation, power prices, and wind speed over 8760 hours, is checked for feasibility to ensure that every value lies within the feasible range of its variable.
The feasibility of the electric grid load data is assessed by comparing the energy output with the rated capacity of the installed farm and by checking the convergence of the power flow with the renewable inputs. To estimate the system's price and voltage stability, a machine-learning-based algorithm is constructed. The two output parameters under study are calculated for every hour of input data, since ML models are formed by training and testing a base model with a large input-output dataset. One output parameter is the L-index, used as the voltage stability index, and the other is the price of power consumed from the electric grid.
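The conversion from monthly averages to an hourly database can be sketched as follows. This is only an illustration under stated assumptions: the monthly values, the daily curve, the relative standard deviation, and the feasible range are invented for the example (they are not the NASA data), the helper `hourly_series` is hypothetical, and clipping to the range is one possible way to enforce feasibility, not necessarily the check applied in this work.

```python
import numpy as np

DAYS_PER_MONTH = [31, 28, 31, 30, 31, 30, 31, 31, 30, 31, 30, 31]  # non-leap year


def hourly_series(monthly_avg, daily_shape, rel_std, lower, upper, rng):
    """Expand 12 monthly averages into an 8760-value hourly series."""
    daily_shape = np.asarray(daily_shape, dtype=float)
    shape = daily_shape / daily_shape.mean()  # mean of 1 preserves the monthly average
    values = []
    for avg, days in zip(monthly_avg, DAYS_PER_MONTH):
        base = np.tile(avg * shape, days)  # repeat the 24-hour curve for each day
        noise = rng.normal(0.0, rel_std * avg, size=base.size)  # Gaussian deviation
        values.append(base + noise)
    series = np.concatenate(values)
    # Assumed feasibility step: force every sample into [lower, upper].
    return np.clip(series, lower, upper)


rng = np.random.default_rng(42)
# Illustrative monthly mean wind speeds in m/s (made-up values).
monthly_wind = [5.1, 5.4, 5.6, 5.8, 5.2, 5.0, 5.3, 5.0, 4.6, 4.7, 5.0, 5.2]
# Illustrative daily curve: a smooth 24-hour oscillation around 1.
daily_wind_shape = 1.0 + 0.2 * np.sin(2 * np.pi * (np.arange(24) - 9) / 24)

wind = hourly_series(monthly_wind, daily_wind_shape, 0.1, 0.0, 25.0, rng)
hour_of_day = np.tile(np.arange(1, 25), 365)  # 1..24 repeated for 365 days
```

The same helper could be reused for solar irradiation, load, and price by swapping in the corresponding monthly averages, daily curves, and feasible ranges.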

Development of Predictive ML Model
Two independent algorithms are created to maximize accuracy in forecasting the two different output responses.
Spearman's Correlation Coefficient: Spearman's correlation, also called "Spearman's rank-order correlation", is applied when the variables in the set are not normally distributed. In the absence of tied ranks it is given by
$$\rho = 1 - \frac{6 \sum_{i=1}^{n} d_i^{2}}{n (n^{2} - 1)} \qquad (1)$$
where $d_i$ is the difference between the two ranks of observation $i$ and $n$ is the number of observations.
A heat map presents data in a statistical graph format that is both attractive and informative. Seaborn is a Python data visualization library based on Matplotlib. A heat map is one of the plot types supported by Seaborn, in which variation in related data is depicted with the help of a color palette. Figures 3-6 illustrate the heat maps that were created. Perfect correlation is represented by a value of +/-1, whereas no association is represented by a value of 0. The heat maps plotted from the correlation analysis of the input features show that no two variables are strongly correlated with each other. As a result, all five recommended variables are taken into account while training and testing the machine learning algorithms for predictive modeling. Considering one day of data, the hour of the day increases steadily from 1 to 24, while both solar irradiation and wind speed rise and fall during the period.
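The correlation analysis can be sketched as below. Spearman's coefficient equals the Pearson correlation of the rank-transformed variables, which is what the hypothetical helpers `rank` and `spearman_matrix` compute. The data are synthetic, chosen so that the expected coefficients are known, and the optional `plot_heatmap` function assumes Seaborn and Matplotlib are installed.

```python
import numpy as np


def rank(values):
    """Ranks 1..n of a 1-D array (assumes no ties, as for continuous data)."""
    order = np.argsort(values)
    ranks = np.empty(len(values), dtype=float)
    ranks[order] = np.arange(1, len(values) + 1)
    return ranks


def spearman_matrix(data):
    """Spearman correlation matrix of the columns of a 2-D array."""
    ranked = np.column_stack([rank(col) for col in data.T])
    return np.corrcoef(ranked, rowvar=False)


rng = np.random.default_rng(1)
n = 500
x = rng.normal(size=n)
data = np.column_stack([
    x,
    np.exp(x),            # monotonically increasing in x: rho = +1
    -(x ** 3),            # monotonically decreasing in x: rho = -1
    rng.normal(size=n),   # independent of x: rho near 0
])
rho = spearman_matrix(data)


def plot_heatmap(matrix, labels):
    """Optional: draw the matrix as a heat map (needs seaborn and matplotlib)."""
    import matplotlib.pyplot as plt
    import seaborn as sns

    sns.heatmap(matrix, annot=True, vmin=-1, vmax=1, cmap="coolwarm",
                xticklabels=labels, yticklabels=labels)
    plt.show()
```

Calling `plot_heatmap(rho, ["x", "exp(x)", "-x^3", "noise"])` would render the matrix; the nonlinear but monotone columns illustrate why a rank-based coefficient is useful when the variables are not normally distributed.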

Training and Testing of ML Model Development
Using the acquired dataset, an ML model is constructed through a training phase and a testing phase. ML techniques for system design may be divided into two types: supervised and unsupervised learning algorithms. R-squared (R²) denotes the proportion of the variance of the dependent variable y that is explained by the independent variables X. In other words, R² expresses the extent to which the variance of one variable explains the variance of the other. Hence, when the R² of a model is 0.75, the model's features explain approximately 75% of the observed variation. R² is computed as one minus the sum of squares of residuals divided by the total sum of squares.
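$$R^{2} = 1 - \frac{SS_{res}}{SS_{tot}} = 1 - \frac{\sum_{i} (y_i - \hat{y}_i)^{2}}{\sum_{i} (y_i - \bar{y})^{2}} \qquad (2)$$
where $y_i$ is the observed value, $\hat{y}_i$ is the predicted value, and $\bar{y}$ is the mean of the observed values. R² compares the fit of the proposed model with a horizontal line, the mean of the observed data, that serves as a reference. When the proposed model fits worse than this horizontal line, that is, when it does not reflect the trend of the data, R² is negative. Despite the "squared" in its name, the R² formula can yield a negative number without violating any mathematical rule.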
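A small numerical check makes the sign behavior of R² concrete. The sketch below uses a hypothetical helper `r_squared` that implements one minus the ratio of the residual sum of squares to the total sum of squares, applied to a toy vector rather than to the paper's data.

```python
import numpy as np


def r_squared(y_true, y_pred):
    """One minus residual sum of squares over total sum of squares."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


y = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

perfect = r_squared(y, y)                       # exact predictions
baseline = r_squared(y, np.full(5, y.mean()))   # horizontal line at the mean
worse = r_squared(y, y[::-1])                   # predictions with the reversed trend
```

Here `perfect` is 1, `baseline` is 0, and `worse` is -3 (residual sum of squares 40 against a total sum of squares 10), showing how a model that contradicts the data trend scores below the horizontal reference line.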
The first and most important phase of ML algorithm construction is training, in which both input and output data are fed to the ML algorithm to train and develop the model. The complete dataset is divided into two parts for training and testing: the model is trained on four parts of the data and tested on the remaining one part. The training data is used to create linear regression models and polynomial regression models of orders 3, 4, and 5. Polynomial regression models are built from a transformation step (normalization and polynomial transformation) followed by a prediction step (linear regression). A pipeline mechanism can be used to chain these steps together, which simplifies the computation.
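The chain of normalization, polynomial transformation, and linear regression can be sketched with NumPy alone, as below. The helpers `polynomial_features` and `fit_polynomial` are hypothetical names, the data are synthetic with a cubic target, and the 80/20 split is drawn from shuffled indices; libraries such as scikit-learn offer equivalent building blocks, and this sketch is not the code used in this work.

```python
from itertools import combinations_with_replacement

import numpy as np


def polynomial_features(X, degree):
    """Bias column plus all monomials of the columns of X up to `degree`."""
    columns = [np.ones(len(X))]
    for d in range(1, degree + 1):
        for idx in combinations_with_replacement(range(X.shape[1]), d):
            columns.append(np.prod(X[:, list(idx)], axis=1))
    return np.column_stack(columns)


def fit_polynomial(X_train, y_train, degree):
    """Chain normalization, polynomial expansion, and least squares."""
    mean, std = X_train.mean(axis=0), X_train.std(axis=0)

    def transform(X):
        # Normalize with the training statistics, then expand.
        return polynomial_features((X - mean) / std, degree)

    beta, *_ = np.linalg.lstsq(transform(X_train), y_train, rcond=None)
    return lambda X: transform(X) @ beta


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    return np.sqrt(np.mean((y_true - y_pred) ** 2))


rng = np.random.default_rng(7)
# Synthetic stand-in data: five features and a cubic target.
X = rng.uniform(-1.0, 1.0, size=(1000, 5))
y = (X[:, 0] ** 3 - 2.0 * X[:, 1] * X[:, 2] + 0.5 * X[:, 3]
     + rng.normal(0.0, 0.05, size=1000))

# 80/20 hold-out split on shuffled indices.
order = rng.permutation(len(X))
train, test = order[:800], order[800:]

scores = {}  # degree -> (R^2, RMSE) on the test part
for degree in (1, 3, 4, 5):
    predict = fit_polynomial(X[train], y[train], degree)
    y_pred = predict(X[test])
    scores[degree] = (r_squared(y[test], y_pred), rmse(y[test], y_pred))
```

Because the synthetic target is cubic, the degree-1 model scores poorly while orders 3 and above fit well; on real data the ranking of orders has to be read from the test scores in the same way.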
Testing is the phase of ML model construction in which the created model is validated or evaluated. In line with the model structure, the remaining 20% of the data is used as testing data to predict and estimate the relevant outcomes. The root-mean-square error (RMSE) and the R² score are two metrics that can be used to evaluate the created models. The R² value represents the proportion of the variance in the output that the developed model explains; a value of 1 indicates a perfect fit. R² values are therefore used to assess the model's fitness on the testing data. Cross-validation is also used, because it ensures that every observation from the original dataset has the chance of appearing in both the training and the test set. This is one of the best approaches when the input data is limited.

Validation of the ML Model
Putting a model to use and relying on its results without checking the accuracy of its output can be damaging. Forecasts can be analyzed and validated with several ML model evaluation strategies. Cross-validation is a crucial approach that practitioners employ frequently. The difficulty with machine learning techniques is that their quality is not known until they are evaluated on a large dataset. Validation helps in estimating the performance of the model. K-fold cross-validation is one of several types of cross-validation.
K-fold cross-validation is a common form of validation used in machine learning. The procedure for performing K-fold cross-validation is as follows:
1. The entire training data set is divided into k equal subsets, each of which is referred to as a fold. Let the folds be f1, f2, ..., fk.
2. For i = 1 to k, fold fi is kept as the validation set, whereas the remaining k-1 folds form the cross-validation training set.
3. The machine learning model is trained on the cross-validation training set, and the model's accuracy is determined by checking the predicted outcomes against the validation set.
In the k-fold cross-validation approach, all of the items in the original training data set are used for both training and validation. Furthermore, each item is used for validation exactly once. For linear and polynomial regression without cross-validation, the entire dataset was split into training and testing data in the ratio 4:1 (i.e., 80% training and 20% testing). A similar ratio was maintained in cross-validation to allow a better comparative analysis and validation of the ML models. Hence, the dataset was divided into 5 folds, 4 of which are used for training and 1 for testing.
In this validation approach the total data is divided into 5 folds, and the different 4:1 train-test combinations are drawn from the folds. In cross-validation, the combination that provides the best fit is chosen. The final model is selected based on the best fit in both the single hold-out train-test split and cross-validation. The next section presents the results and how the best-fit model is chosen among the models produced.
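The 5-fold procedure described in this section can be sketched as follows. The sketch assumes a plain least-squares linear model and synthetic data; `fit_linear`, `r_squared`, and `k_fold_scores` are hypothetical helper names, and the made-up coefficients do not come from the paper.

```python
import numpy as np


def fit_linear(X_train, y_train):
    """Least-squares multiple linear regression; returns a predict function."""
    A = np.column_stack([np.ones(len(X_train)), X_train])
    beta, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return lambda X: np.column_stack([np.ones(len(X)), X]) @ beta


def r_squared(y_true, y_pred):
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def k_fold_scores(X, y, k, rng):
    """Train on k-1 folds, score on the held-out fold, for each of the k folds."""
    folds = np.array_split(rng.permutation(len(X)), k)
    scores = []
    for i, val_idx in enumerate(folds):
        train_idx = np.concatenate([f for j, f in enumerate(folds) if j != i])
        predict = fit_linear(X[train_idx], y[train_idx])
        scores.append(r_squared(y[val_idx], predict(X[val_idx])))
    return folds, np.array(scores)


rng = np.random.default_rng(3)
# Synthetic stand-in data with five features and made-up coefficients.
X = rng.uniform(0.0, 1.0, size=(500, 5))
y = 1.0 + X @ np.array([2.0, -1.0, 0.5, 3.0, 1.5]) + rng.normal(0.0, 0.1, size=500)

folds, scores = k_fold_scores(X, y, k=5, rng=rng)
best_fold = int(np.argmax(scores))  # the 4:1 combination with the best fit
mean_score = scores.mean()
```

Each sample lands in exactly one validation fold, so every observation is used for training four times and for validation once, matching the 4:1 ratio of the hold-out split.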

░ 3. RESULTS AND DISCUSSION
The 8760-hour database created for this research is used to construct all of the machine learning models for the prediction analysis. Various training methods such as polynomial regression, 5-fold cross-
The machine learning model for the modified 6-bus system did not converge, i.e., the 1-year data taken for the 6-bus system was insufficient to develop a reliable ML model. This will be considered in future work. Table 3 gives the R² values of the models developed for forecasting the voltage stability. The best models, selected by comparing R² scores, can be used to forecast the system's voltage stability for any future input data. For forecasting the cost of energy bought from the grid, the best-fit model is picked by comparing the R² values of the developed models (multi-variable linear regression, polynomial regression, and 5-fold cross-validation). In

░ 4. CONCLUSION
In this paper, a detailed explanation of the machine learning technique is given. The step-by-step methodology of prediction by a machine learning model is described. The analysis and selection of the developed ML model are also discussed. An important validation technique, K-fold cross-validation, is explained. The correlation between input and output variables is examined using Spearman's correlation analysis and presented as heat maps. Multiple linear regression based training and testing of the modified IEEE 14-bus and IEEE 30-bus systems is presented for the base load case and for 10% and 20% load increments, with 5-fold cross-validation. A comparative analysis is performed to find the best-fit ML model for our research. This model can be applied to any larger power network for which a dataset is available. Both the MATLAB and machine learning codes are generalized, i.e., they can be used for any system by replacing the input data in the developed MATLAB codes for dataset generation and then using the resulting dataset to train the ML models.