Knime Data Mining Pdf Download

In this paper, data cleaning was performed on the data set. The cleaning process removed unnecessary data points and discarded data that was invalid, erroneous, irrelevant, or did not fit the criteria of the data set [41, 42]. Data sorting and extraction were carried out with Tukey's bi-window box plot, which estimates the median and the 25th to 75th percentile range of the data set [43]. The box plot gives a visual representation of the data, which makes it possible to identify outliers and to identify data ranges with a similar pattern for consideration in the data normalization process. Data normalization is the process of converting the data set into a form that is fit for the learning algorithm. It is essential because most learning algorithms are sensitive to data skew, which can be a result of the data transformation [44]. Normalization ensures that the data set is represented in a uniform manner, so that the learning algorithms can use the data without bias. This was achieved with the Box-Cox transformation technique, which was chosen to accommodate data outliers [45]. Normalization was also deployed to improve the data accuracy and to ensure the validity of the data set, because the learning algorithm can only process data of known accuracy and validity [46]. The data is considered valid once the outliers have been identified and removed, as outliers can skew the model performance [47].
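The box plot and Box-Cox steps described above can be sketched roughly as follows. This is a minimal illustration and not the workflow used in the paper: the 1.5 × IQR fence rule, the fixed value of lambda, the function names, and the sample readings are all assumptions made for the example.

```python
import math
import statistics


def tukey_fences(values, k=1.5):
    """Return (q1, median, q3, lower_fence, upper_fence) for a box plot.

    k=1.5 is the conventional fence multiplier; the text does not state
    which value was used, so this is an assumption.
    """
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1  # the 25th to 75th percentile range
    return q1, median, q3, q1 - k * iqr, q3 + k * iqr


def remove_outliers(values, k=1.5):
    """Keep only the points that lie inside the Tukey fences."""
    _, _, _, lower, upper = tukey_fences(values, k)
    return [v for v in values if lower <= v <= upper]


def box_cox(values, lam):
    """Apply the Box-Cox transform with a fixed lambda.

    The transform is only defined for strictly positive values.
    """
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if lam == 0:
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


# Hypothetical sample with one obvious outlier.
readings = [12.0, 14.0, 15.0, 15.5, 16.0, 17.0, 18.0, 95.0]
cleaned = remove_outliers(readings)
transformed = box_cox(cleaned, lam=0.5)  # lam=0.5 is an arbitrary choice
```

In practice lambda is usually estimated from the data instead of being fixed by hand; the fixed value here only keeps the sketch free of external dependencies.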
The transformed data was coded for partitioning into sets. For that purpose, the Day of the Week, the Week, and the Month were used as categorical variables, and the data was partitioned into training and testing sets. The training set was used to train the classifier and the testing set was used to evaluate its performance. The choice of data partitioning was based on the size of the data set. The default partition size was 50% of the data set, and this was increased to 80% to lend more validity to the results.
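The coding and partitioning step can be illustrated with a short sketch. It assumes that the 80% figure refers to the training partition and that rows are assigned at random; neither detail is confirmed by the text, and the helper names and date range are hypothetical.

```python
import random
from datetime import date, timedelta

DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]


def calendar_features(day):
    """Derive the three categorical variables named in the text from a date."""
    return {
        "day_of_week": DAY_NAMES[day.weekday()],
        "week": day.isocalendar()[1],  # ISO week number
        "month": day.month,
    }


def train_test_split(rows, train_fraction=0.8, seed=0):
    """Shuffle a copy of the rows and cut it into training and testing sets."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # seeded for repeatability
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]


# Hypothetical data: one row per day for 100 consecutive days.
start = date(2021, 1, 1)
rows = [calendar_features(start + timedelta(days=i)) for i in range(100)]
train, test = train_test_split(rows, train_fraction=0.8)
```

Passing `train_fraction=0.5` would reproduce the default split mentioned above.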