CLUSTERING
OVERVIEW:
Clustering is an unsupervised learning method that groups data points based on their similarities. There are two primary forms of clustering:
Hierarchical Clustering:
Hierarchical clustering is a method used to group similar objects into clusters. The process can be either agglomerative, which is a bottom-up approach starting with individual elements and merging them into clusters, or divisive, which is a top-down approach splitting a single cluster into finer sub-clusters. The agglomerative method is more commonly used and involves initially treating each data point as a single cluster, then successively merging pairs of clusters until all points have been merged into a single remaining cluster. The process relies on a proximity matrix to measure the closeness of clusters and uses a variety of linkage criteria, such as the minimum, maximum, or average distances between clusters, to determine which clusters to merge.
The output of hierarchical clustering can be represented in a dendrogram, a tree-like diagram that illustrates the arrangement of the clusters produced by the process. The dendrogram provides a visual summary of the clustering process, showing the sequence of cluster mergings and the distance at which each merging occurred. This method does not require specifying the number of clusters beforehand, making it particularly useful for exploratory data analysis. Hierarchical clustering is widely applicable, from identifying natural groupings in genetics or linguistics research to market segmentation in business applications.
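As a minimal sketch of how agglomerative clustering and its dendrogram might look in Python (scipy and matplotlib are one common choice; X here is an illustrative placeholder for a numeric feature matrix):

import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram

X = np.random.rand(50, 4)  # placeholder: 50 points with 4 numeric features

# 'single', 'complete', and 'average' linkage correspond to the minimum,
# maximum, and average inter-cluster distances mentioned above; 'ward'
# instead merges the pair that least increases within-cluster variance.
Z = linkage(X, method="average")

# The dendrogram records every merge and the distance at which it occurred.
dendrogram(Z)
plt.xlabel("Data points")
plt.ylabel("Merge distance")
plt.show()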
Partitional Clustering (K-means):
Partitional clustering divides the data into k non-overlapping clusters, with each cluster represented by a centroid. K-means, the most widely used partitional algorithm, requires the number of clusters k to be specified in advance. It starts from k initial centroids, assigns each data point to its nearest centroid, and then recomputes each centroid as the mean of the points assigned to it. The assignment and update steps repeat until the assignments stabilize, yielding clusters that minimize the within-cluster variance from the cluster means. Because points can be re-assigned between clusters at every iteration, k-means scales well to large datasets, though its results depend on the choice of k and on the initial centroids.
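A minimal k-means sketch with scikit-learn under the same assumptions (X is a placeholder numeric feature matrix; the parameter values are illustrative):

import numpy as np
from sklearn.cluster import KMeans

X = np.random.rand(50, 4)  # placeholder numeric data

# k must be chosen in advance; n_init re-runs the algorithm from several
# random initializations and keeps the best (lowest-inertia) result.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)        # cluster index for each point
centroids = kmeans.cluster_centers_   # one centroid per cluster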
HIERARCHICAL CLUSTERING vs PARTITIONAL CLUSTERING (K-MEANS)
5 key differences between hierarchical clustering and k-means clustering:
Hierarchical clustering builds a hierarchy of clusters structured as a tree based on distance, while k-means partitions the data into k non-overlapping, distinct clusters.
Hierarchical clustering does not require specifying the number of clusters in advance, while k-means requires choosing k before the algorithm is run.
Hierarchical clustering merges clusters using distance metrics and never re-assigns a point once merged, while k-means iteratively re-assigns points between clusters to minimize the within-cluster variance from the cluster means.
Hierarchical clustering offers an interpretable visualization of the hierarchy (the dendrogram) but has scalability issues on large datasets, while k-means scales well but produces a flat, less interpretable partition and is sensitive to the choice of k and to initialization.
Agglomerative hierarchical clustering takes at least O(n^2) time (O(n^3) for naive implementations), whereas each k-means iteration is roughly linear in the number of points, making k-means far more efficient for large datasets.
DISTANCE METRICS USED IN THESE CLUSTERING METHODS:
Hierarchical Clustering:
Hierarchical clustering utilizes measures of similarity or distance between data points to determine clusters. The dissimilarity between clusters is quantitatively represented using distance metrics such as:
Euclidean distance
Manhattan distance
Cosine similarity
Jaccard similarity
Euclidean distance is the most commonly used metric and measures the straight-line geometric distance between two points (the L2 norm). Mathematically, the Euclidean distance between two data points X and Y with features indexed i = 1 to n is represented as:
d(X,Y) = sqrt(∑(xi - yi)^2)
Using this distance, hierarchical clustering merges the nearest clusters sequentially to build a hierarchy. The distance metrics directly determine how the clustering hierarchy is formed.
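To make the listed metrics concrete, here is a small illustrative computation of each one (using numpy and scipy; the vectors are made up):

import numpy as np
from scipy.spatial import distance

x = np.array([1.0, 0.0, 2.0])
y = np.array([0.0, 1.0, 2.0])

euclidean  = distance.euclidean(x, y)     # sqrt(sum((xi - yi)^2)), the L2 norm
manhattan  = distance.cityblock(x, y)     # sum(|xi - yi|), the L1 norm
cosine_sim = 1.0 - distance.cosine(x, y)  # scipy returns cosine *distance*

# Jaccard similarity is defined on binary (set-membership) vectors.
a = np.array([1, 0, 1, 1], dtype=bool)
b = np.array([1, 1, 1, 0], dtype=bool)
jaccard_sim = 1.0 - distance.jaccard(a, b)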
K-Means Clustering:
In K-means, Euclidean distance is generally used to assign data points to their closest cluster centroid. The standard algorithm utilizes Euclidean distances between cluster centers and data points for cluster assignment and centroid updates.
Mathematically, a data point X is assigned to cluster j if its Euclidean distance to centroid μj satisfies d(X, μj) <= d(X, μk) for all other centroids μk.
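This assignment rule is straightforward to express directly; a sketch assuming X holds the data points and centroids holds the current cluster centers (both placeholders):

import numpy as np

X = np.random.rand(100, 4)        # placeholder: 100 points, 4 features
centroids = np.random.rand(3, 4)  # placeholder: k = 3 current centers

# Euclidean distance from every point to every centroid: shape (100, 3).
dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)

# Each point is assigned to the cluster with the nearest centroid.
assignments = dists.argmin(axis=1)

# The update step recomputes each centroid as the mean of its points
# (assuming every cluster received at least one point).
new_centroids = np.array([X[assignments == j].mean(axis=0) for j in range(3)])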
HOW CLUSTERING IS USED IN THIS PROJECT:
I plan to leverage clustering techniques for exploratory analysis of the Airbnb listings data. Finding patterns in how properties group based on attributes like location, room type, and reviews can provide insight into the key drivers of pricing and demand.
For example, applying k-means to the listings based on number of reviews, availability, and location attributes can group them into high-, medium-, and low-demand segments. Analyzing the clusters could reveal natural price ranges and seasonal effects for different property groupings.
Additionally, hierarchical clustering can find similarity groupings based on host attributes like experience, number of properties, and cancellation policy. These groupings can be used to characterize hosts' pricing patterns and behaviors to better predict prices for new listings.
By incorporating cluster assignments, or using resultant centroids and densities as features, the unsupervised clustering output could help regression models better estimate current and forecast future daily/monthly prices. Features derived from clustering groups could capture latent characteristics.
In summary, using clusters and cluster-based features as inputs for predictive pricing models can help improve performance. The exploratory clustering analysis is aimed at providing data-driven insights to inform better future price predictions in a nonlinear way.
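As a hedged sketch of that idea, not the project's actual pipeline (the data, cluster count, and model are placeholders): cluster distances and memberships can be appended to the feature matrix before fitting a price regression.

import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
X = rng.random((200, 5))   # placeholder listing features
y = rng.random(200) * 300  # placeholder nightly prices

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

# Distances to each centroid capture latent segment structure ...
centroid_dists = kmeans.transform(X)        # shape (200, 3)
# ... and one-hot cluster membership adds the hard assignment itself.
cluster_onehot = np.eye(3)[kmeans.labels_]  # shape (200, 3)

X_augmented = np.hstack([X, centroid_dists, cluster_onehot])
model = LinearRegression().fit(X_augmented, y)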
WHY ONLY NUMERIC DATA?
The key goal of clustering is to uncover intrinsic groupings within data in an unsupervised manner, without using any label information that could guide or influence the grouping. Techniques like k-means clustering work by minimizing within-cluster variance to cluster centers. As such, these algorithms rely completely on distances and similarities between data points in feature space to discover clusters.
Introducing any categorical features or labeled classes would disrupt the mathematical formulation of clustering. For instance, k-means would not be able to compute cluster means and variances accurately. Elements like categories, strings, or pre-defined classes cannot be directly factored into variance minimization or geometric distance metrics.
Therefore, effective data preprocessing for clustering involves fully numericizing the dataset - whether via label encoding, dummy variable creation, or feature binning. Common conventions also transform any skewed numeric attributes by scaling or normalization to limit bias from varying magnitudes.
The resulting dataset will contain numeric feature vectors for each observation, potentially identified by row ids or keys. But the algorithms strictly use similarities of unlabeled feature sets to generate the clusters. Any leakage of labels could greatly influence the outcomes rather than allowing the model to find natural groupings itself.
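A minimal preprocessing sketch along these lines, assuming a pandas DataFrame with illustrative column names:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Placeholder frame with one categorical and two numeric columns.
df = pd.DataFrame({
    "price":     [120.0, 85.0, 300.0, 95.0],
    "reviews":   [10, 250, 3, 48],
    "room_type": ["Entire home", "Private room", "Entire home", "Shared room"],
})

# One-hot encode the categorical column into 0/1 dummy variables.
df_numeric = pd.get_dummies(df, columns=["room_type"])

# Standardize so features with large magnitudes do not dominate distances.
X = StandardScaler().fit_transform(df_numeric)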
WHAT DATA I USED:
The dataset originally contained both numeric and categorical columns, so categorical attributes were converted into numeric data by using one-hot encoding.
Here is a sample image of my Dataset before and after encoding:
Here you can see many columns with categorical variables.
All the columns have been transformed into numeric columns.
LINKS TO DATASET:
Link To Python Code:
Result:
The analysis of the Airbnb data through clustering provides insights into natural groupings within the listings. Hierarchical clustering produces a dendrogram that visually represents the clusters as branches of a tree, and the number of clusters that naturally exist can be suggested by finding the longest vertical span that can be cut without crossing any horizontal merge line. K-means clustering was applied with different values of k, producing a partitioned clustering in which each cluster is represented by a centroid. Comparing the results from both methods can validate the structure of the data and inform the most meaningful way to organize the listings.
Illustrating with a Dendrogram:
The dendrogram from hierarchical clustering offers a visual interpretation of the data's structure. It shows the multi-level hierarchy where clusters have sub-clusters, all based on the Airbnb listings' similarities. The dendrogram is particularly useful for understanding the data's granularity and can suggest an optimal number of clusters by identifying significant 'jumps' in dissimilarity (height of the dendrogram links), which could indicate a natural partition.
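Given a linkage matrix like the Z from the earlier sketch, the tree can be cut into flat clusters either at a distance threshold chosen from such a 'jump' or at a fixed cluster count; the values below are purely illustrative:

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

X = np.random.rand(50, 4)      # placeholder numeric data
Z = linkage(X, method="ward")  # linkage matrix, as in the earlier sketch

# Cut where the dendrogram shows a large jump in merge distance
# (the threshold 5.0 is illustrative) ...
labels_by_distance = fcluster(Z, t=5.0, criterion="distance")

# ... or ask directly for a fixed number of flat clusters.
labels_by_count = fcluster(Z, t=4, criterion="maxclust")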
Describing K-Means Clustering:
For k-means clustering, three different values of k were used to explore possible partitions: for example, k=2, k=3, and k=4. Each k value represents a different clustering solution, where listings are grouped based on their feature similarity. The choice of k affects the granularity of the clustering, with higher values of k creating more, but smaller, clusters. Clustering images for each k value would show how listings are distributed across different clusters, typically visualized in a 2D space for interpretability.
Visualizing Silhouette Results:
The silhouette method assesses the quality of the clusters formed by k-means. The silhouette score measures how similar an object is to its own cluster compared to other clusters. A higher silhouette score indicates better-defined clusters.
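A sketch of how the average score could be computed for the k values examined below (assuming X is the encoded feature matrix; the data here is a placeholder):

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(200, 5)  # placeholder encoded features

for k in (2, 3, 4):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)  # mean over all points, in [-1, 1]
    print(f"k={k}: average silhouette = {score:.3f}")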
Silhouette for k=2:
This plot would illustrate the silhouette scores for a clustering solution where k=2. The silhouette score for each point indicates how similar it is to its own cluster compared to other clusters. A high average score would suggest that two clusters are appropriate, but if many points have low or negative scores, it might indicate that more clusters are needed to adequately describe the data.
Silhouette for k=3:
The silhouette plot for k=3 would show whether adding another cluster improves the clarity of the division among listings. Ideally, each cluster will have silhouette scores that are predominantly positive and higher than those for k=2, suggesting better cluster cohesion and separation.
Silhouette for k=4:
The silhouette visualization for k=4 would reveal whether the additional cluster provides a more meaningful separation of the data or if it starts to fragment natural groupings, which would be evidenced by lower silhouette scores.
Average Silhouette Score (for all points)
The "best k" is indicated by the silhouette plot whose scores are consistently higher than those for other k values, with a clear separation between the silhouette scores of different clusters. Here the analysis suggests that k=3 yields clusters that are well separated and internally cohesive. This optimal k from the silhouette analysis should be compared with the hierarchical clustering results to check for consistency or to explore reasons for any discrepancy.
Comparing Hierarchical Clustering and K-Means
The Airbnb clusters plot and the dendrogram show the results of the two different clustering approaches: hierarchical and k-means. Taken together, they describe how the Airbnb data is organized. The dendrogram shows how the listings group into clusters at different merge distances, and those distances reflect how similar the listings are. To read the dendrogram, look for large vertical gaps that no horizontal merge line crosses; those gaps mark the natural boundaries between clusters in the data.
In the k-means plot, the listings are grouped around the centroids, shown as "X"s, and each listing is assigned to the cluster whose centroid is nearest. The number of centroids equals the chosen value k=4, which determines how finely the data is partitioned.
To compare hierarchical and k-means clustering, we can look at how many groups each method suggests. If a large gap in the dendrogram points to a particular number of clusters, the k-means plot should show the same number of distinct groups. We can also compare how the clusters are composed: if listings that sit close together in the k-means plot also merge early in the dendrogram's hierarchy, the two groupings are very similar.
For the value k=4 indicated by the hierarchical clustering, look for the corresponding gaps in the dendrogram above, and compare the number of clusters the dendrogram suggests with the number used in k-means. Agreement makes the clustering solution more credible; disagreement could reflect differences between the methods (for example, the linkage criteria of hierarchical clustering versus the centroid proximity of k-means) or a lack of clear cluster structure in the data.
Since both clusterings pointed to the same value of k, we can be more confident that the clustering solution is meaningful.
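Beyond comparing the plots visually, one supplementary way to quantify how closely the two labelings agree is the adjusted Rand index; this is an additional check, not part of the original analysis (the data below is a placeholder):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X = np.random.rand(200, 5)  # placeholder encoded features

hier_labels = fcluster(linkage(X, method="ward"), t=4, criterion="maxclust")
km_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

# 1.0 means identical partitions; values near 0 mean chance-level agreement.
print(adjusted_rand_score(hier_labels, km_labels))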
CONCLUSION:
By applying hierarchical and k-means clustering methods to Airbnb listings, we gain insights into how properties can be categorized based on different characteristics, which can relate to a variety of subjects depending on the analysis's focus.
When discussing market segmentation, these techniques demonstrate how Airbnb properties can be classified into specific segments or categories, aiding in focused marketing and tailored customer service. They can provide hosts with insights into the competitive environment they are operating in and help them set prices and position their listings strategically.
When discussing user experience, the analysis emphasizes the variety of features offered and how they meet various traveler requirements and preferences. It can assist Airbnb in enhancing their recommendation system to ensure users are paired with properties that align with their distinct preferences and needs.
For urban planning or tourism studies, clustering methods can unveil distribution patterns of properties in a city or region, highlighting tourist hotspots and emerging attractions. This information is crucial for local authorities and tourism boards to effectively manage and develop infrastructure.
Overall, the insights gained from clustering Airbnb listings involve comprehending the fundamental organization and dispersion of properties, which can influence various strategic decisions for hosts, the company, and city planners. It showcases how data analysis can reveal hidden insights, laying the groundwork for well-informed decision-making in the hospitality and tourism sectors.