Preprocessing and Clustering Analysis Exploration: Obesity Level Analysis

Cluster analysis using the K-Means algorithm is performed on the full dataset. To perform the cluster analysis, the class label, NObeyesdad, is first removed from the dataset. The dataset is then transformed to ensure the correct data type exists for each feature, and dummy variables are created for the categorical features. The resulting numeric dataset contains 2,111 rows and 43 columns.
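The preprocessing steps described above can be sketched roughly as follows. This is a minimal illustration that assumes pandas; the tiny stand-in frame, the Weight column, and the helper name prepare_features are hypothetical and are not taken from the original notebook.

```python
import pandas as pd

# Tiny stand-in for the real data (2,111 rows in the report). Only the
# NObeyesdad, Age and Gender names come from the report; the rest is assumed.
raw = pd.DataFrame({
    "Gender": ["Male", "Female", "Female", "Male"],
    "Age": ["21", "23", "35", "44"],   # pretend Age was read in as text
    "Weight": [64.0, 56.0, 77.0, 87.0],
    "NObeyesdad": ["Normal_Weight", "Normal_Weight",
                   "Overweight_Level_I", "Obesity_Type_I"],
})

def prepare_features(df, label="NObeyesdad"):
    """Drop the class label, fix data types, and one-hot encode categoricals."""
    labels = df[label]
    features = df.drop(columns=[label]).copy()
    # Make sure numeric features really are numeric before encoding.
    features["Age"] = pd.to_numeric(features["Age"])
    # Remaining text/categorical columns become 0/1 dummy variables.
    features = pd.get_dummies(features, dtype=float)
    return features, labels

X, y = prepare_features(raw)
print(X.columns.tolist())
```

The label is returned separately rather than discarded so that it remains available later for label-based cluster evaluation.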

Clustering Exploration with K-Means:

Below is the exploration of clustering using K-means with the normalized data. Various values of K were tested, and for each value the cluster centroids were examined to determine whether any pattern appears in the data. A silhouette analysis was performed to evaluate the separation between the resulting clusters and determine their quality. The silhouette plots display a measure of how close each point in one cluster is to points in the neighboring clusters. The mean silhouette value is calculated and used as a threshold when judging cluster quality. Clusters with most of their coefficients above the mean silhouette value are considered better quality, which means they are further away from the neighboring clusters. Clusters with most of their coefficients below the mean silhouette value contain samples that are very close to the decision boundary between two neighboring clusters, and negative coefficients indicate samples that may have been assigned to the wrong cluster. When the silhouette plot displays no negative coefficients and the silhouettes are thick and extend above the mean silhouette value, K is likely a good choice.
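The silhouette checks described above can be summarized numerically as well as visually. The sketch below assumes scikit-learn and uses synthetic stand-in data; the helper name silhouette_summary and its output keys are hypothetical. For each K it reports the mean silhouette value, the share of each cluster's coefficients above that mean, and the number of negative coefficients.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score
from sklearn.preprocessing import MinMaxScaler

rng = np.random.default_rng(0)
# Synthetic stand-in for the real features: two well-separated blobs.
X = np.vstack([rng.normal(0, 0.3, (50, 3)), rng.normal(5, 0.3, (50, 3))])
X_scaled = MinMaxScaler().fit_transform(X)   # min-max normalization

def silhouette_summary(data, k, seed=0):
    """Fit K-means and summarize per-cluster silhouette coefficients."""
    model = KMeans(n_clusters=k, n_init=10, random_state=seed).fit(data)
    labels = model.labels_
    coeffs = silhouette_samples(data, labels)   # one coefficient per sample
    mean = silhouette_score(data, labels)       # mean silhouette value
    per_cluster = {}
    for c in range(k):
        c_coeffs = coeffs[labels == c]
        per_cluster[c] = {
            "share_above_mean": float((c_coeffs > mean).mean()),
            "n_negative": int((c_coeffs < 0).sum()),
        }
    return mean, per_cluster, model.cluster_centers_

for k in (2, 3, 5):
    mean, per_cluster, _ = silhouette_summary(X_scaled, k)
    print(k, round(mean, 3), per_cluster)
```

A cluster with a low share_above_mean or a nonzero n_negative corresponds to the thin or negative silhouettes discussed in the text.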

Above, the silhouette plot for K = 5 shows that cluster 0 outperformed the other clusters, with all of its coefficients above the mean silhouette value. Cluster 4 also performed well, with many of its coefficients above the mean silhouette value. The remaining three clusters did not perform as well, since most of their coefficients are below the mean silhouette value. Four of the clusters display negative values, with cluster 3 having the most negative coefficients, which indicates that five clusters is too many for the dataset.

The plot above shows the results of the silhouette analysis for K = 3, which reveals that the algorithm performed neither better nor worse than at K = 5. Cluster 2 outperformed the other clusters, with all of its coefficients above the mean silhouette value. Cluster 1 performed the worst: none of its coefficients are above the mean silhouette value, and some are negative. When evaluating the centroids, cluster 0 has Gender_Male with a value of 1.00 and Gender_Female with a value of 0, so cluster 0 most likely represents males. Clusters 1 and 2 both have a value of 0.99 for Gender_Female and 0.01 for Gender_Male. Since both of these clusters are almost entirely female, cluster 1 is most likely not a distinct group. It appears to be pulling in samples that belong elsewhere and is too close to cluster 0 to be its own cluster. We can conclude from the silhouette plots that three clusters is likely still too many and that two clusters may be sufficient.

The plot above shows the results of the silhouette analysis for K = 2, which achieved the best silhouette plot compared to the previous plots at K = 5 and K = 3. Both cluster 0 and cluster 1 have coefficients above the mean silhouette value, and none of the coefficients are negative. Neither cluster is thick or full, although cluster 0 appears thicker than cluster 1, but of the clustering results so far this one is the most successful. When looking at the centroids, the two features that stand out from all others as most likely defining the clusters are Gender_Male and Gender_Female. In cluster 0, Gender_Male has a value of 1.00 while Gender_Female has a value of -0.00 (effectively zero, up to rounding), and in cluster 1, Gender_Female has a value of 1.00 while Gender_Male has a value of 0.00. From the centroids and the silhouette plots above, we can conclude that cluster 0 likely represents males and cluster 1 likely represents females. This evaluation shows that a pattern exists by gender and that gender may play a role in the dataset and in determining the classification of obesity levels.
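One way to find the features that "stand out" in the centroids is to tabulate them by feature name and rank features by how much their centroid values differ across clusters. The sketch below assumes pandas and scikit-learn; the six-row frame and the Age_scaled column are made-up stand-ins, and only the Gender_Male and Gender_Female names come from the analysis above.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Made-up stand-in data: two gender dummies plus one scaled numeric feature.
X = pd.DataFrame({
    "Gender_Male":   [1, 1, 1, 0, 0, 0],
    "Gender_Female": [0, 0, 0, 1, 1, 1],
    "Age_scaled":    [0.2, 0.5, 0.4, 0.3, 0.5, 0.4],
}, dtype=float)

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# One row per cluster, one column per feature, rounded as in the report.
centroids = pd.DataFrame(model.cluster_centers_, columns=X.columns).round(2)

# Features whose centroid values differ most between clusters come first.
spread = (centroids.max() - centroids.min()).sort_values(ascending=False)

print(centroids)
print(spread)
```

Features with a spread near 1.00 on normalized data, such as the gender dummies here, are the ones that separate the clusters, while features with a spread near zero contribute little.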

Next, we will create age groups and separate the individuals by generation. Exploring age groups will allow us to re-evaluate the clusters and determine whether a pattern also exists within age groups for classification.

Discretize the Age attribute into 4 separate age groups and re-run K-Means Clustering:

Gen-Z (1997 – 2012), Age: 9 – 24

Millennials (1981 – 1996), Age: 25 – 40

Gen-X (1965 – 1980), Age: 41 – 56

Boomers (1955 – 1964), Age: 57 – 66

K-means is run with three generational age groups: Gen-Z, Millennials, and a combined Gen-X and Boomers group. This exploration is performed to see whether a pattern exists based on age range, which the cluster analysis of the full dataset did not evaluate since ages were not grouped into categories. The youngest age in the dataset is 14 and the oldest is 61. The age groups are created by binning the Age attribute and then transforming the age group attribute into dummy variables. For exploratory purposes, K-means is performed at K = 3 twice: first without min-max normalization and then with min-max normalization.
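The binning step can be sketched with pandas as shown below. The bin edges follow the age ranges listed above, with Gen-X and Boomers combined into one group; the column name AgeGroup, the dummy prefix, and the choice of right-inclusive bin edges are assumptions made for illustration.

```python
import pandas as pd

# Small stand-in sample that includes the boundary ages of each range.
ages = pd.DataFrame({"Age": [14, 24, 25, 40, 41, 61]})

# Right-inclusive bins: (8, 24] Gen-Z, (24, 40] Millennials, (40, 66] Gen-X and Boomers.
bins = [8, 24, 40, 66]
labels = ["Gen-Z", "Millennials", "Gen-X and Boomers"]
ages["AgeGroup"] = pd.cut(ages["Age"], bins=bins, labels=labels)

# Turn the age group attribute into 0/1 dummy variables.
dummies = pd.get_dummies(ages["AgeGroup"], prefix="AgeGroup", dtype=float)

print(ages)
print(dummies)
```

If ages are stored as fractional values, the right-inclusive edges place an age such as 24.5 with the Millennials, so the boundary convention is worth checking against the intended ranges.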

The results of the cluster analysis without normalization show a very healthy silhouette plot, with all three clusters full, thick, and with coefficients above the mean silhouette value. Figure 2.1 below confirms that clusters form when age is grouped by range. When looking at the centroids, cluster 2 shows Gen-Z at 0.9, Millennials at 0.10, and Gen-X and Boomers at 0.00, so Gen-Z is most likely represented by cluster 2.

Since the class labels exist, the completeness and homogeneity scores were calculated to further examine the cluster quality. The completeness score was 0.70, which indicates that members of a given class are largely assigned to the same cluster. (Completeness is an entropy-based score between 0 and 1 rather than a literal percentage of samples.) This relatively high score confirms that the clusters captured most of each class. The homogeneity score was much lower at 0.39, which shows that the clusters are not pure: each cluster mixes several classes. These results may indicate that age group is a factor in forming the clusters, but it may not be the main factor that affects obesity level for classification. The silhouette plots above show that a pattern exists, but we must take into consideration that the data was not scaled. As such, we will next perform K-means again with the data normalized to validate the results.
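To illustrate how completeness can be high while homogeneity is low, the sketch below uses scikit-learn's completeness_score and homogeneity_score on a hand-made labeling. The labels are invented for illustration and are not the report's data: four classes are merged pairwise into two clusters, so every class stays together (complete) but each cluster mixes two classes (not homogeneous).

```python
from sklearn.metrics import completeness_score, homogeneity_score

# Invented example: 4 true classes, 2 clusters that each absorb 2 whole classes.
true_classes = [0, 0, 1, 1, 2, 2, 3, 3]
cluster_ids  = [0, 0, 0, 0, 1, 1, 1, 1]

completeness = completeness_score(true_classes, cluster_ids)
homogeneity = homogeneity_score(true_classes, cluster_ids)

# Every class lands in a single cluster, so completeness is perfect,
# but each cluster contains two classes, so homogeneity is only partial.
print(round(completeness, 2), round(homogeneity, 2))
```

This pattern is expected whenever there are fewer clusters than classes, as with K = 3 clusters against a larger number of obesity levels.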

Perform K-Means with Normalized Data on Age Groups for Comparison:

Above, the results are drastically different from those for the non-normalized data. Cluster 0 outperformed all other clusters, with all of its coefficients above the mean silhouette value. Cluster 2 performed adequately, with many of its coefficients above the mean silhouette value and only a few negative. Cluster 1 did not perform as well: many of its coefficients are negative and none are above the mean silhouette value. When looking at the centroids, the values of the age group features do not directly correspond to the silhouette plots.

These results show that with the normalized data, a pattern does not necessarily appear in the age groups. Moreover, this comparison shows how not scaling the data before K-means clustering may suggest patterns that do not necessarily exist. This is supported by the completeness and homogeneity scores, which were both low: the completeness score was around 0.34 and the homogeneity score was lower still at 0.18. These scores suggest that grouping by age is not the main determining factor for the classification of obesity levels. Age may still play a role as a key feature, but the clustering exploration does not necessarily reveal a significant pattern in the age groupings. By building the classification models and performing feature selection, we will be able to obtain a better picture of age and age groupings and their role in classifying obesity levels.


Save Output of Data-Set (non-normalized) based on Age-Groups for Classifier Use:
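A minimal sketch of this save step, assuming pandas, is shown below. The frame contents, the file name obesity_age_groups.csv, and the temporary output directory are hypothetical placeholders. Keeping the NObeyesdad label in the saved file is also an assumption, made so that the classifiers have a target.

```python
import os
import tempfile

import pandas as pd

# Hypothetical stand-in for the non-normalized dataset with age group dummies.
age_grouped = pd.DataFrame({
    "Weight": [64.0, 77.0, 87.0],
    "AgeGroup_Gen-Z": [1.0, 0.0, 0.0],
    "AgeGroup_Millennials": [0.0, 1.0, 0.0],
    "AgeGroup_Gen-X and Boomers": [0.0, 0.0, 1.0],
    "NObeyesdad": ["Normal_Weight", "Overweight_Level_I", "Obesity_Type_I"],
})

# Hypothetical output location; replace with the project's data directory.
out_path = os.path.join(tempfile.mkdtemp(), "obesity_age_groups.csv")

# index=False avoids writing the row index as an extra unnamed column.
age_grouped.to_csv(out_path, index=False)

# Read the file back to confirm the round trip preserved the layout.
reloaded = pd.read_csv(out_path)
print(reloaded.shape)
```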