The goal of splitting is to get homogeneous groups.
How to define homogeneous for a classification problem?
$\text{Gini Impurity} = p_1(1-p_1) + p_2(1-p_2)$
$\text{Entropy} = -p\log_2 p - (1-p)\log_2 (1-p)$
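The two binary impurity measures can be sketched directly in code (the function names are mine, not from the book); both are zero for a pure node and largest at $p = 0.5$:

```python
import math

def gini(p):
    """Binary Gini impurity p1(1-p1) + p2(1-p2), with p2 = 1 - p1."""
    return p * (1 - p) + (1 - p) * p

def entropy(p):
    """Binary entropy in bits: -p*log2(p) - (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):  # a pure node has zero entropy by convention
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)
```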
How to define homogeneous for a regression problem?
$SSE = \sum_{i \in S_1}(y_i - \bar{y}_1)^2 + \sum_{i \in S_2}(y_i - \bar{y}_2)^2$
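As a minimal sketch (helper names are mine), the SSE of a candidate split scores each child node against its own mean and adds the two scores:

```python
def split_sse(y_left, y_right):
    """SSE of a split: sum of squared deviations from each child's mean."""
    def sse(ys):
        ybar = sum(ys) / len(ys)
        return sum((y - ybar) ** 2 for y in ys)
    return sse(y_left) + sse(y_right)
```

A split whose children are constant has SSE 0, the regression analogue of a pure node.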
The Gini impurity for the node “Gender” is the following weighted average of the above two scores:
$\frac{3}{5} \times \frac{5}{18} + \frac{2}{5} \times 0 = \frac{1}{6}$
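As a quick arithmetic check of that weighted average (weights 3/5 and 2/5, child Gini scores 5/18 and 0), exact fractions avoid any rounding:

```python
from fractions import Fraction

# weighted average of the two child Gini scores, in exact arithmetic
weighted_gini = Fraction(3, 5) * Fraction(5, 18) + Fraction(2, 5) * 0
print(weighted_gini)  # 1/6
```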
The entropy of the split using variable “gender” can be calculated in three steps:
Information gain is biased towards choosing attributes with many values, because splitting into many small partitions tends to produce purer children even when the attribute has no real predictive value.
Information Gain Ratio (IGR):
$\text{Gain Ratio} = \frac{\text{Information Gain}}{\text{Split Information}}$
where split information is:
$\text{Split Information} = -\sum_{c=1}^{C} p_c \log(p_c)$, where $p_c$ is the proportion of samples in category $c$.
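A small sketch of the penalty at work (function names are mine; I use base-2 logs to match the entropy formula above). An attribute that shatters $n$ samples into singletons has split information $\log_2 n$, so its gain ratio is divided down accordingly:

```python
import math

def split_information(counts):
    """-sum(p_c * log2(p_c)) over the category sample counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)

def gain_ratio(information_gain, counts):
    """Information gain normalized by the split information."""
    return information_gain / split_information(counts)
```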
Pruning is the process that reduces the size of decision trees. It reduces the risk of overfitting by limiting the size of the tree or removing sections of the tree that provide little predictive power.
Refer to this section of the book for more detail.
A single tree is unstable, and bootstrap aggregation (bagging) is an ensemble method that can effectively stabilize the model.
$\hat{f}_{avg}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat{f}^{b}(x)$
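The averaging step is simple to sketch. Here the $B$ fitted models are stand-ins (plain functions rather than trees fit on bootstrap samples), and `bagged_predict` is my name, not the book's:

```python
def bagged_predict(models, x):
    """f_avg(x) = (1/B) * sum_b f_b(x): average the B base learners."""
    return sum(f(x) for f in models) / len(models)

# toy base learners standing in for trees fit on bootstrap samples
models = [lambda x: x + 1, lambda x: x - 1, lambda x: x]
print(bagged_predict(models, 10.0))  # 10.0
```

Averaging cancels much of the base learners' variance, which is why bagging stabilizes an unstable tree.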
To address one of the disadvantages of bagged trees (i.e., the correlation among the trees), random forest was introduced.
Boosting is perhaps the most commonly used non-deep-learning method among Kaggle competition winners. It uses a sequence of weak learners to build a strong learner.
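A minimal numeric sketch of that sequence-of-weak-learners idea (not the book's code, and deliberately oversimplified): each round fits a weak learner to the current residuals and adds a shrunken copy of it to the ensemble. Here the weak learner is just a constant, the mean of the residuals, where a real booster would fit a small tree:

```python
def boost(y, n_rounds=10, lr=0.5):
    """Toy boosting: repeatedly fit the residuals with a constant learner,
    accumulating shrunken corrections into the ensemble prediction."""
    pred = [0.0] * len(y)
    for _ in range(n_rounds):
        residuals = [yi - pi for yi, pi in zip(y, pred)]
        weak = sum(residuals) / len(residuals)  # weak learner: a constant
        pred = [pi + lr * weak for pi in pred]
    return pred
```

Because the weak learner here can only output one constant, the ensemble converges to the overall mean of `y`; swapping in shallow trees lets successive rounds correct the structure the earlier rounds missed.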