Understanding the C4.5 Decision Tree Algorithm
The C4.5 algorithm is an improvement over the ID3 algorithm, where "C" indicates that the algorithm was written in C and 4.5 specifies the version of the algorithm. The splitting criterion used by C4.5 is the normalized information gain: the reduction in entropy produced by a split, normalized by the entropy of the split itself. The attribute with the highest normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses on the partitioned sublists.
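To make the splitting criterion concrete, here is a minimal sketch in Python. It assumes a dataset represented as a list of dicts (`rows`, mapping attribute names to categorical values) plus a parallel list of class labels; the names `entropy` and `gain_ratio` are illustrative, not part of any library.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, attr):
    """Normalized information gain (gain ratio) for splitting on attr."""
    n = len(labels)
    # Group the class labels by the value of attr in each row.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr], []).append(label)
    # Information gain: entropy before the split minus the
    # weighted entropy of the partitions after the split.
    gain = entropy(labels) - sum(
        len(part) / n * entropy(part) for part in partitions.values()
    )
    # Split information: entropy of the partition sizes, used to
    # normalize the gain so many-valued attributes are not favored.
    split_info = -sum(
        (len(part) / n) * log2(len(part) / n) for part in partitions.values()
    )
    return gain / split_info if split_info > 0 else 0.0
```

The normalization is what distinguishes C4.5's criterion from ID3's plain information gain, which tends to favor attributes with many distinct values.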
In-Depth Understanding of the Algorithm:
This algorithm has a few base cases:
· All the samples in the list belong to the same class. When this happens, C4.5 simply creates a leaf node for the decision tree that chooses that class.
· None of the features provide any information gain. In this case, C4.5 creates a decision node higher up the tree using the expected value of the class.
· An instance of a previously unseen class is encountered. Again, C4.5 creates a decision node higher up the tree using the expected value.
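Building on the sketch above, the base cases might be checked along these lines; `majority_class` stands in for the "expected value of the class", and all helper names are assumptions for illustration.

```python
def majority_class(labels):
    """The 'expected value' of the class: its most common label."""
    return Counter(labels).most_common(1)[0][0]

def check_base_cases(rows, labels, attributes):
    """Return a leaf label if a base case applies, otherwise None."""
    # Base case 1: every sample belongs to the same class.
    if len(set(labels)) == 1:
        return labels[0]
    # Base cases 2 and 3: no attributes remain, or no attribute
    # yields any information gain; fall back to the expected class.
    if not attributes or all(gain_ratio(rows, labels, a) <= 0 for a in attributes):
        return majority_class(labels)
    return None
```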
Steps in the Algorithm:
· Check for the above base cases.
· For each attribute a, compute the normalized information gain (gain ratio) from splitting on a.
· Let a_best be the attribute with the highest normalized information gain.
· Create a decision node that splits on a_best.
· Recurse on the sublists obtained by splitting on a_best, and add the resulting nodes as children of the decision node.
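Putting the pieces together, the recursive procedure might look like the following sketch, which reuses the hypothetical helpers defined above. It returns either a class label (a leaf) or a pair of the chosen attribute and a dict of branches:

```python
def build_tree(rows, labels, attributes):
    """Recursively build a C4.5-style decision tree."""
    leaf = check_base_cases(rows, labels, attributes)
    if leaf is not None:
        return leaf
    # a_best: the attribute with the highest normalized information gain.
    a_best = max(attributes, key=lambda a: gain_ratio(rows, labels, a))
    # Partition the data on a_best and recurse on each sublist.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[a_best], []).append((row, label))
    remaining = [a for a in attributes if a != a_best]
    branches = {
        value: build_tree([r for r, _ in pairs], [l for _, l in pairs], remaining)
        for value, pairs in partitions.items()
    }
    return (a_best, branches)

# Toy usage on a made-up weather dataset:
rows = [
    {"outlook": "sunny", "windy": "false"},
    {"outlook": "sunny", "windy": "true"},
    {"outlook": "rain", "windy": "false"},
]
labels = ["no", "yes", "yes"]
print(build_tree(rows, labels, ["outlook", "windy"]))
# ('outlook', {'sunny': ('windy', {'false': 'no', 'true': 'yes'}), 'rain': 'yes'})
```

This sketch covers only categorical attributes; the full C4.5 algorithm also handles continuous attributes by choosing a threshold, tolerates missing values, and prunes the tree after construction.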