Hui Lin and Ming Li
The goal of splitting is to get homogeneous groups.
How to define homogeneity for a classification problem?
\[Gini\ impurity = p_{1}(1-p_{1})+p_{2}(1-p_{2})\]
\[Entropy=-p\log_{2}p-(1-p)\log_{2}(1-p)\]
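As a quick sanity check, here is a minimal Python sketch (not from the book) of both measures for a binary node, where the argument `p` is the proportion of one class in the node:

```python
import numpy as np

def gini(p):
    # Gini impurity of a two-class node with class proportions p and 1 - p
    return p * (1 - p) + (1 - p) * p

def entropy(p):
    # Entropy of a two-class node; by convention 0 * log2(0) = 0
    if p in (0.0, 1.0):
        return 0.0
    return -p * np.log2(p) - (1 - p) * np.log2(1 - p)

print(gini(0.5), entropy(0.5))  # most impure node: 0.5 and 1.0
print(gini(1.0), entropy(1.0))  # pure node: 0.0 and 0.0
```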
How to define homogeneity for a regression problem?
\[SSE=\sum_{i\in S_{1}}(y_{i}-\bar{y}_{1})^{2}+\sum_{i\in S_{2}}(y_{i}-\bar{y}_{2})^{2}\]
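A minimal sketch, assuming NumPy (the function name `split_sse` is just for illustration), of the SSE of a candidate split that puts the responses into two groups:

```python
import numpy as np

def split_sse(y_s1, y_s2):
    # Sum of squared errors around each group's own mean
    sse_1 = np.sum((y_s1 - y_s1.mean()) ** 2)
    sse_2 = np.sum((y_s2 - y_s2.mean()) ** 2)
    return sse_1 + sse_2

y = np.array([1.0, 1.2, 0.9, 5.0, 5.3])
print(split_sse(y[:3], y[3:]))  # a good split leaves little within-group variation
```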
The Gini impurity for the node “Gender” is the following weighted average of the above two scores:
\[\frac{3}{5}\times\frac{5}{18}+\frac{2}{5}\times 0=\frac{1}{6}\]
The entropy of the split using the variable "gender" can be calculated in three steps: (1) compute the entropy of each child node; (2) combine them into a weighted average, weighting each child node by its share of the samples; (3) subtract this weighted entropy from the entropy of the parent node to obtain the information gain of the split.
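The following Python sketch walks through the same three steps; it is for illustration only and uses made-up counts rather than the gender example above:

```python
import numpy as np

def entropy(p):
    # Binary entropy; by convention 0 * log2(0) = 0
    probs = np.array([p, 1 - p])
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Hypothetical split: child 1 has 8 samples with p = 0.75,
# child 2 has 4 samples and is pure (p = 1.0).
n1, p1 = 8, 0.75
n2, p2 = 4, 1.0
n = n1 + n2

h1, h2 = entropy(p1), entropy(p2)                   # step 1: child entropies
h_split = (n1 / n) * h1 + (n2 / n) * h2             # step 2: weighted average
h_parent = entropy((n1 * p1 + n2 * p2) / n)         # parent node entropy
info_gain = h_parent - h_split                      # step 3: information gain
print(h_split, info_gain)
```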
Information gain is biased towards choosing attributes with many distinct values.
The Information Gain Ratio (IGR) corrects for this bias:
\[Gain\ Ratio = \frac{Information\ Gain}{Split\ Information}\]
where split information is:
\[Split\ Information = -\sum_{c=1}^{C}p_{c}\log(p_{c})\] \(p_c\) is the proportion of samples in category \(c\).
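For illustration only (not the book's code), a small sketch of the gain ratio; base-2 logarithms are assumed here for consistency with the entropy formula above:

```python
import numpy as np

def split_information(proportions):
    # Split information of a candidate split with the given branch proportions
    p = np.asarray(proportions, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def gain_ratio(information_gain, proportions):
    return information_gain / split_information(proportions)

# The same information gain is penalized more for a many-valued attribute.
print(gain_ratio(0.3, [0.5, 0.5]))   # 2 branches: split information = 1
print(gain_ratio(0.3, [0.1] * 10))   # 10 branches: split information ~ 3.32
```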
Pruning is the process of reducing the size of a decision tree. It reduces the risk of overfitting by limiting the size of the tree or by removing sections of the tree that provide little predictive power.
Refer to this section of the book for more detail.
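As one concrete illustration (an assumption on our part, not the book's approach), scikit-learn exposes cost-complexity pruning through the `ccp_alpha` parameter; the dataset and value of `ccp_alpha` below are arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# An unpruned tree versus a tree pruned with cost-complexity pruning
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=42).fit(X_train, y_train)

print(full_tree.tree_.node_count, pruned_tree.tree_.node_count)  # pruned tree is smaller
print(full_tree.score(X_test, y_test), pruned_tree.score(X_test, y_test))
```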
A single tree is unstable; bootstrap aggregation (bagging) is an ensemble method that can effectively stabilize the model.
\[\hat{f}_{avg}(x)=\frac{1}{B}\sum^{B}_{b=1}\hat{f}^{b}(x)\]
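A minimal sketch, assuming scikit-learn and NumPy, of bagged regression trees: each tree is fit to a bootstrap sample and the \(B\) predictions are averaged as in the formula above:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

def bagged_predict(X_train, y_train, X_new, B=100, seed=0):
    rng = np.random.default_rng(seed)
    predictions = []
    for _ in range(B):
        # Draw a bootstrap sample and fit one tree to it
        idx = rng.integers(0, len(y_train), size=len(y_train))
        tree = DecisionTreeRegressor().fit(X_train[idx], y_train[idx])
        predictions.append(tree.predict(X_new))
    # Average the B predictions: f_avg(x) = (1/B) * sum_b f^b(x)
    return np.mean(predictions, axis=0)

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=42)
print(bagged_predict(X[:150], y[:150], X[150:155]))
```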
To address one of the disadvantages of bagged trees (i.e., the correlation among the trees), the random forest was introduced.
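For example (a sketch assuming scikit-learn; the dataset and parameter values are arbitrary), a random forest additionally restricts each split to a random subset of predictors, which decorrelates the trees being averaged:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

rf = RandomForestClassifier(
    n_estimators=500,      # number of bootstrapped trees
    max_features="sqrt",   # random subset of predictors considered at each split
    random_state=42,
).fit(X, y)
```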
Boosting is the most commonly used non-deep-learning method for winning Kaggle competitions. It uses a sequence of weak learners to build a strong learner.
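A minimal sketch of one widely used boosting variant, gradient boosted trees, assuming scikit-learn; the dataset and parameter values are illustrative only:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=42)

gbm = GradientBoostingClassifier(
    n_estimators=200,     # number of weak learners added in sequence
    learning_rate=0.05,   # shrinkage applied to each learner's contribution
    max_depth=3,          # each weak learner is a shallow tree
    random_state=42,
).fit(X, y)
print(gbm.score(X, y))
```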