OVERVIEW
Decision Trees (DTs) are supervised learning algorithms used for classification and regression tasks. The name reflects the way the algorithm imitates human decision-making: it divides the data into branches at specific decision points, producing a tree-like structure of decisions and outcomes.
How Decision Trees Work
1. Starting at the Root: The algorithm begins with the complete dataset at the root node.
2. Splitting: The data is divided into subsets using the feature that best separates the classes according to a chosen criterion (such as GINI impurity or entropy). This process is repeated recursively, creating a new branch for each split.
3. Leaf Nodes: The process ends at leaf (terminal) nodes, each of which represents a class prediction (classification) or a continuous value (regression). A minimal sketch of this structure follows the list.
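As a rough, library-agnostic sketch of this structure, the code below hand-builds a small tree and classifies one sample by walking from the root to a leaf. The features ("age", "income"), thresholds, and labels are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A single node: either an internal split or a leaf prediction."""
    feature: Optional[str] = None      # feature to test (internal nodes only)
    threshold: Optional[float] = None  # split threshold (internal nodes only)
    left: Optional["Node"] = None      # branch taken when value <= threshold
    right: Optional["Node"] = None     # branch taken when value > threshold
    prediction: Optional[str] = None   # class label (leaf nodes only)


def predict(node: Node, sample: dict) -> str:
    """Walk from the root to a leaf, following the split at each node."""
    while node.prediction is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction


# Hand-built tree with made-up splits: the root tests "age", one child tests "income".
root = Node(
    feature="age", threshold=30,
    left=Node(prediction="No"),
    right=Node(
        feature="income", threshold=50_000,
        left=Node(prediction="No"),
        right=Node(prediction="Yes"),
    ),
)

print(predict(root, {"age": 42, "income": 65_000}))  # -> "Yes"
```

Each internal node stores the question it asks; each leaf stores the answer, which is all that is needed to follow a decision path from root to leaf.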
Criteria for Splitting: GINI, Entropy, and Information Gain
GINI Impurity:
GINI impurity measures how often a randomly chosen element from a subset would be misclassified if it were labeled randomly according to the distribution of labels in that subset. The CART algorithm uses GINI impurity, and each split aims to reduce it.
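As a minimal sketch (not CART's internal implementation), GINI impurity can be computed as 1 minus the sum of squared class proportions over a set of labels:

```python
from collections import Counter


def gini_impurity(labels):
    """GINI impurity: 1 minus the sum of squared class proportions.
    0.0 means the set is pure; 0.5 is the maximum for two balanced classes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["Yes", "Yes", "Yes"]))       # 0.0   (pure set)
print(gini_impurity(["Yes", "Yes", "No"]))        # ~0.444
print(gini_impurity(["Yes", "No", "Yes", "No"]))  # 0.5   (maximally mixed, two classes)
```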
Entropy:
Entropy measures the randomness or disorder in a set of labels. In decision trees, higher entropy means a more mixed set of classes, while lower entropy means a more homogeneous set dominated by one class.
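A similarly minimal sketch of the entropy calculation, using the base-2 logarithm so that a 50/50 binary split has entropy 1.0:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(entropy(["Yes", "Yes", "Yes"]))                    # 0.0 (homogeneous set)
print(entropy(["Yes", "Yes", "Yes", "No", "No", "No"]))  # 1.0 (maximally mixed)
```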
Information Gain:
Information Gain measures the reduction in entropy achieved by a split. Decision trees aim to maximize information gain by selecting the split that leaves the resulting subsets as pure as possible, thus increasing the organization of the dataset.
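Continuing the sketch above (and reusing the entropy helper defined there), information gain can be computed as the parent's entropy minus the size-weighted entropy of the child subsets; the same formula works with GINI impurity substituted for entropy:

```python
def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        (len(group) / n) * entropy(group) for group in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy
```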
An Example: GINI, Entropy, and Information Gain
Consider a dataset of six instances with two features and a binary target, "Yes" or "No", split evenly (three of each). To choose a split, we evaluate the GINI impurity or entropy of the subsets produced by each feature and compute the resulting Information Gain.
Suppose splitting on Feature 1 produces two groups: one with 2 Yes and 1 No, and another with 1 Yes and 2 No.
Splitting on Feature 2 instead produces two pure groups: one with 3 Yes and one with 3 No.
Computing the Information Gain for each split tells us which feature reduces uncertainty the most, and the split with the highest Information Gain is chosen; here that is Feature 2, as the calculation below confirms.
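To check the numbers (again reusing the entropy and information_gain helpers sketched above, and assuming the six-instance dataset with three Yes and three No labels described here):

```python
parent = ["Yes"] * 3 + ["No"] * 3          # 3 Yes, 3 No -> entropy 1.0

# Feature 1: groups of (2 Yes, 1 No) and (1 Yes, 2 No)
split_1 = [["Yes", "Yes", "No"], ["Yes", "No", "No"]]

# Feature 2: pure groups of (3 Yes) and (3 No)
split_2 = [["Yes", "Yes", "Yes"], ["No", "No", "No"]]

print(information_gain(parent, split_1))   # ~0.082 bits
print(information_gain(parent, split_2))   # 1.0 bit -> Feature 2 is chosen
```

Feature 2's split produces two pure groups, so it removes all of the parent's uncertainty (a full 1.0 bit of gain) and is preferred over Feature 1.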
Infinite Trees:
In this context, "Infinite Trees" refers to the fact that, without constraints, there is no inherent limit to how many branches and nodes a decision tree can grow, or to how many distinct trees could be built from the same data.
In theory, an essentially unlimited number of trees can be constructed, for the following reasons:
There are many possible ways to split the data at each decision point, especially with continuous features, where every threshold between observed values is a candidate split.
Without limits, a decision tree can keep branching until every distinct value (or even every training instance) sits in its own leaf. Such a tree matches the training data exactly but generalizes poorly to new data (overfitting), as the sketch below illustrates.
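One way to see this in practice is a sketch using scikit-learn's DecisionTreeClassifier on purely random labels, where there is nothing real to learn; the exact numbers depend on the random seed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # 100 samples, 2 continuous features
y = rng.integers(0, 2, size=100)     # random labels: no real signal to learn

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
limited = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The unconstrained tree memorizes the noise (100% training accuracy);
# the depth-limited tree cannot, which is the desired behavior on noise.
print(unconstrained.score(X, y))               # 1.0
print(limited.score(X, y))                     # well below 1.0
print(unconstrained.get_depth(), limited.get_depth())
```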
Training and Prediction
1. Training: During training, the algorithm repeatedly splits the data, choosing the best split at each node according to a criterion such as GINI impurity or entropy, until a stopping condition is reached, such as a maximum depth or a minimum number of samples per leaf.
2. Prediction: To predict a new instance, the algorithm follows the decision path defined by the learned splits from the root down to a leaf. The prediction is the majority class (classification) or the mean target value (regression) of the training samples that ended up in that leaf; see the sketch below.
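Putting both steps together, a minimal scikit-learn sketch (using the bundled Iris dataset and illustrative stopping parameters) trains a tree and then follows the learned splits to predict a sample:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Stopping conditions keep the tree small: criterion, max_depth, min_samples_leaf.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_leaf=5, random_state=0)
clf.fit(X, y)

print(export_text(clf))          # the learned splits, printed as text
print(clf.predict(X[:1]))        # majority class at the leaf reached by the first sample
print(clf.decision_path(X[:1]))  # sparse indicator of the nodes on its root-to-leaf path
```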
Visual examples of decision tree and entropy calculations: