## Introduction

**To build the strongest model, the choice of training dataset matters greatly.**
Machine learning models generally work well under the assumption that the training data, **validation data**, and test data come from the same distribution. A naive stacking approach therefore trains the base learners on the entire original training set, takes each base learner's predictions on that same training set as the super-learner's training data, and takes each base learner's predictions on the original test set as the super-learner's test data.
A problem now arises: the super-learner's training data comes from predictions the base learners made on data whose labels they had already seen, while its test data comes from predictions made on data whose labels they had never seen. Because the base learners produce training and test predictions under different conditions, the two effectively come from different distributions, and may be unsuitable for training a single model. Before building a super-learner, readers must therefore first grasp how its training dataset is constructed, to avoid misuse that degrades the model's performance.
The book *Ensemble Learning: Python in Practice! Integrating All Techniques to Create the Strongest Model* not only gives hands-on introductions to common ensemble learning algorithms, such as hard voting, soft voting, stacking, bagging (bootstrap aggregating), AdaBoost (adaptive boosting), gradient boosting, random forests, and extremely randomized trees, but also provides very detailed conceptual explanations, helping readers quickly grasp the theoretical foundations and implementations of these ensemble techniques. The following summarizes the book's key points:
### A flawed training dataset
We want to train the super-learner on meta-features, so we need meta-features from which labels can be predicted. The intuitive idea is to train the base learners on the entire original training set, have each base learner output predictions for that same training set, and then train the super-learner on those predictions together with the original training labels. In general, however, this approach is not very effective: the super-learner can only see the strengths and weaknesses the base learners have already exhibited on the data they were fit to, so it tends merely to reinforce those known patterns. To obtain a more effective ensemble, we must avoid this training method.
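The flaw above can be sketched in code. This is a minimal illustration, assuming scikit-learn and synthetic data; the model choices and names are not from the book:

```python
# Naive stacking sketch: base learners are fit on the full training set
# and then asked to predict on that SAME set, so their predictions
# already reflect labels they have seen (the flaw discussed above).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X_train, y_train = make_classification(n_samples=200, random_state=0)

base_learners = [DecisionTreeClassifier(random_state=0),
                 KNeighborsClassifier()]

# Fit each base learner on the full training set, then predict on it.
meta_features = []
for learner in base_learners:
    learner.fit(X_train, y_train)
    meta_features.append(learner.predict(X_train))

# One column per base learner: the super-learner's training data.
Z_train = np.column_stack(meta_features)
super_learner = LogisticRegression().fit(Z_train, y_train)

# An unrestricted decision tree memorizes its own training data, so its
# column of Z_train is over-optimistic compared to what it will produce
# on unseen test data.
```

Here the tree's near-perfect training predictions make `Z_train` look far more informative than the corresponding meta-features will be at test time.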
Notice that in step 3, when the base learners predict on the super-learner dataset, they do not need its labels, so they make their predictions without knowing the labels, and those predictions are used as training data for the super-learner. The disadvantage of this hold-out approach is that both the base-learner dataset and the super-learner dataset are smaller than the original training set. With enough records, this method is feasible; with too little data, especially when many layers are stacked, the data becomes insufficient and the model's error grows too large.
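The hold-out procedure can be sketched as follows. Again this is a minimal illustration assuming scikit-learn; the split ratio and models are placeholder choices, not the book's:

```python
# Hold-out stacking sketch: split the original training data into a
# base-learner set and a super-learner set, so the base learners
# predict on records whose labels they never saw.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=300, random_state=0)

# Step 1: carve the original training data into two disjoint parts.
X_base, X_super, y_base, y_super = train_test_split(
    X, y, test_size=0.5, random_state=0)

base_learners = [DecisionTreeClassifier(random_state=0),
                 KNeighborsClassifier()]

# Step 2: fit the base learners only on the base-learner set.
for learner in base_learners:
    learner.fit(X_base, y_base)

# Step 3: predict on the held-out super-learner set; no label is used
# to produce these predictions.
Z_super = np.column_stack([m.predict(X_super) for m in base_learners])

# The super-learner trains on label-blind predictions, matching the
# conditions it will face on the original test set.
super_learner = LogisticRegression().fit(Z_super, y_super)
```

The cost is visible in the shapes: each learner now sees only half of the 300 original records, which is the data-shrinkage drawback described above.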
In addition, suppose the base-learner dataset has N records and the super-learner dataset has M records, each record has P features, and there are Q base learners. Then the base-learner dataset is a matrix with N rows and P columns, and the super-learner dataset is a matrix with M rows and Q columns.
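These dimensions can be checked with a quick sketch; the concrete numbers below are arbitrary placeholders:

```python
# Shape check for the stacking datasets described above:
#   N base-learner records, M super-learner records,
#   P original features, Q base learners.
import numpy as np

N, M, P, Q = 200, 100, 10, 3

X_base = np.zeros((N, P))   # base-learner dataset: N rows, P columns
Z_super = np.zeros((M, Q))  # super-learner dataset: M rows, Q columns
                            # (one column of predictions per base learner)

assert X_base.shape == (N, P)
assert Z_super.shape == (M, Q)
```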
