OVERVIEW
Decision Trees (DTs) are supervised learning algorithms used for classification and regression tasks. The name reflects the way the algorithm imitates human decision-making: it divides the data into branches at specific decision points, producing a tree-like structure of decisions and outcomes.
How Decision Trees Work
1. Starting at the Root: The algorithm begins with the complete dataset at the root node.
2. Splitting: The data is divided into subsets using the feature that best separates the classes according to a chosen criterion (such as GINI impurity or entropy). This process is repeated recursively, creating a new branch for each split.
3. Leaf Nodes: The process ends at leaf (terminal) nodes, each of which represents a class prediction (classification) or a continuous value (regression). A minimal sketch of this structure follows the list.
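As a rough, library-agnostic sketch of this structure, the code below hand-builds a small tree and classifies one sample by walking from the root to a leaf. The features ("age", "income"), thresholds, and labels are made up purely for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A single node: either an internal split or a leaf prediction."""
    feature: Optional[str] = None      # feature to test (internal nodes only)
    threshold: Optional[float] = None  # split threshold (internal nodes only)
    left: Optional["Node"] = None      # branch taken when value <= threshold
    right: Optional["Node"] = None     # branch taken when value > threshold
    prediction: Optional[str] = None   # class label (leaf nodes only)


def predict(node: Node, sample: dict) -> str:
    """Walk from the root to a leaf, following the split at each node."""
    while node.prediction is None:
        node = node.left if sample[node.feature] <= node.threshold else node.right
    return node.prediction


# Hand-built tree with made-up splits: the root tests "age", one child tests "income".
root = Node(
    feature="age", threshold=30,
    left=Node(prediction="No"),
    right=Node(
        feature="income", threshold=50_000,
        left=Node(prediction="No"),
        right=Node(prediction="Yes"),
    ),
)

print(predict(root, {"age": 42, "income": 65_000}))  # -> "Yes"
```

Each internal node stores the question it asks; each leaf stores the answer, which is all that is needed to follow a decision path from root to leaf.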
Criteria for Splitting: GINI, Entropy, and Information Gain
GINI Impurity:
GINI impurity measures how often a randomly chosen element from a subset would be misclassified if it were labeled randomly according to the distribution of labels in that subset. The CART algorithm uses GINI impurity, and each split aims to reduce it.
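As a minimal sketch (not CART's internal implementation), GINI impurity can be computed as 1 minus the sum of squared class proportions over a set of labels:

```python
from collections import Counter


def gini_impurity(labels):
    """GINI impurity: 1 minus the sum of squared class proportions.
    0.0 means the set is pure; 0.5 is the maximum for two balanced classes."""
    n = len(labels)
    counts = Counter(labels)
    return 1.0 - sum((count / n) ** 2 for count in counts.values())


print(gini_impurity(["Yes", "Yes", "Yes"]))       # 0.0   (pure set)
print(gini_impurity(["Yes", "Yes", "No"]))        # ~0.444
print(gini_impurity(["Yes", "No", "Yes", "No"]))  # 0.5   (maximally mixed, two classes)
```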
Entropy:
Entropy measures the randomness or disorder in a set of labels. In decision trees, higher entropy means a more mixed set of classes, while lower entropy means a more homogeneous set dominated by one class.
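A similarly minimal sketch of the entropy calculation, using the base-2 logarithm so that a 50/50 binary split has entropy 1.0:

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy in bits: -sum(p * log2(p)) over the class proportions."""
    n = len(labels)
    counts = Counter(labels)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


print(entropy(["Yes", "Yes", "Yes"]))                    # 0.0 (homogeneous set)
print(entropy(["Yes", "Yes", "Yes", "No", "No", "No"]))  # 1.0 (maximally mixed)
```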
Information Gain:
Information Gain measures the reduction in entropy achieved by a split. Decision trees aim to maximize information gain by selecting the split that leaves the resulting subsets as pure as possible, thus increasing the organization of the dataset.
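Continuing the sketch above (and reusing the entropy helper defined there), information gain can be computed as the parent's entropy minus the size-weighted entropy of the child subsets; the same formula works with GINI impurity substituted for entropy:

```python
def information_gain(parent_labels, child_label_groups):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent_labels)
    weighted_child_entropy = sum(
        (len(group) / n) * entropy(group) for group in child_label_groups
    )
    return entropy(parent_labels) - weighted_child_entropy
```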
An Example: GINI, Entropy, and Information Gain
Consider a dataset of six instances with two features and a binary target, "Yes" or "No", split evenly (three of each). To choose a split, we evaluate the GINI impurity or entropy of the subsets produced by each feature and compute the resulting Information Gain.
Suppose splitting on Feature 1 produces two groups: one with 2 Yes and 1 No, and another with 1 Yes and 2 No.
Splitting on Feature 2 instead produces two pure groups: one with 3 Yes and one with 3 No.
Computing the Information Gain for each split tells us which feature reduces uncertainty the most, and the split with the highest Information Gain is chosen; here that is Feature 2, as the calculation below confirms.
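To check the numbers (again reusing the entropy and information_gain helpers sketched above, and assuming the six-instance dataset with three Yes and three No labels described here):

```python
parent = ["Yes"] * 3 + ["No"] * 3          # 3 Yes, 3 No -> entropy 1.0

# Feature 1: groups of (2 Yes, 1 No) and (1 Yes, 2 No)
split_1 = [["Yes", "Yes", "No"], ["Yes", "No", "No"]]

# Feature 2: pure groups of (3 Yes) and (3 No)
split_2 = [["Yes", "Yes", "Yes"], ["No", "No", "No"]]

print(information_gain(parent, split_1))   # ~0.082 bits
print(information_gain(parent, split_2))   # 1.0 bit -> Feature 2 is chosen
```

Feature 2's split produces two pure groups, so it removes all of the parent's uncertainty (a full 1.0 bit of gain) and is preferred over Feature 1.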
Infinite Trees:
In this context, "Infinite Trees" refers to the fact that, without constraints, there is no inherent limit to how many branches and nodes a decision tree can grow, or to how many distinct trees could be built from the same data.
In theory, an essentially unlimited number of trees can be constructed, for the following reasons:
There are many possible ways to split the data at each decision point, especially with continuous features, where every threshold between observed values is a candidate split.
Without limits, a decision tree can keep branching until every distinct value (or even every training instance) sits in its own leaf. Such a tree matches the training data exactly but generalizes poorly to new data (overfitting), as the sketch below illustrates.
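One way to see this in practice is a sketch using scikit-learn's DecisionTreeClassifier on purely random labels, where there is nothing real to learn; the exact numbers depend on the random seed:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))        # 100 samples, 2 continuous features
y = rng.integers(0, 2, size=100)     # random labels: no real signal to learn

unconstrained = DecisionTreeClassifier(random_state=0).fit(X, y)
limited = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The unconstrained tree memorizes the noise (100% training accuracy);
# the depth-limited tree cannot, which is the desired behavior on noise.
print(unconstrained.score(X, y))               # 1.0
print(limited.score(X, y))                     # well below 1.0
print(unconstrained.get_depth(), limited.get_depth())
```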
Training and Prediction
1. Training: During training, the algorithm repeatedly splits the data, choosing the best split at each node according to a criterion such as GINI impurity or entropy, until a stopping condition is reached, such as a maximum depth or a minimum number of samples per leaf.
2. Prediction: To predict a new instance, the algorithm follows the decision path defined by the learned splits from the root down to a leaf. The prediction is the majority class (classification) or the mean target value (regression) of the training samples that ended up in that leaf; see the sketch below.
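Putting both steps together, a minimal scikit-learn sketch (using the bundled Iris dataset and illustrative stopping parameters) trains a tree and then follows the learned splits to predict a sample:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# Stopping conditions keep the tree small: criterion, max_depth, min_samples_leaf.
clf = DecisionTreeClassifier(criterion="gini", max_depth=3,
                             min_samples_leaf=5, random_state=0)
clf.fit(X, y)

print(export_text(clf))          # the learned splits, printed as text
print(clf.predict(X[:1]))        # majority class at the leaf reached by the first sample
print(clf.decision_path(X[:1]))  # sparse indicator of the nodes on its root-to-leaf path
```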
Visual examples of decision tree and entropy calculations: