Feature Selection with Decision Tree: Obesity Level Analysis

Since Decision Tree was the best classifier for the dataset, we will use it and evaluate it on the full dataset and on each age-group dataset. Feature selection will be performed to identify the features that contribute most to the classification; these top features indicate which attributes are most strongly associated with obesity levels. In addition, running the classification on the different age groups lets us compare whether the attributes tied to a given obesity level affect one age group more than another.

Decision Tree and Feature Selection with Full Dataset:

Above, the Decision Tree classifier performed well on the full dataset, classifying the classes with about 94.1% accuracy. Class 4: 'Obesity_Type_III' was predicted with 100% accuracy. Class 0: 'Insufficient_Weight' and Class 3: 'Obesity_Type_II' achieved above 95% accuracy. Class 6: 'Overweight_Level_II' had the lowest accuracy at 91%. Below we calculate the accuracy for both the training and the test sets. The accuracy for the training set is 100% and the accuracy for the test set is 94.09%. A training accuracy of 100% is common for an unpruned decision tree; because the test accuracy is within about six percentage points of it, the model generalizes well and does not show signs of severe overfitting or high variance.
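The per-class figures quoted above can be reproduced from the true and predicted labels. The sketch below is one possible reading, assuming "accuracy for a class" means the fraction of that class's samples that were predicted correctly (per-class recall). The function names and the toy labels are illustrative and are not the study's data.

```python
from collections import Counter

def per_class_accuracy(y_true, y_pred):
    """Return {class: fraction of that class's samples predicted correctly}."""
    totals = Counter(y_true)
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    return {c: correct[c] / totals[c] for c in sorted(totals)}

def overall_accuracy(y_true, y_pred):
    """Fraction of all samples whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy labels (made up for illustration): classes 0, 1, 4 and 6 only.
y_true = [0, 0, 1, 1, 1, 4, 4, 6, 6, 6]
y_pred = [0, 0, 1, 6, 1, 4, 4, 6, 1, 6]

class_acc = per_class_accuracy(y_true, y_pred)
total_acc = overall_accuracy(y_true, y_pred)

for label, acc in class_acc.items():
    print(f"class {label}: {acc:.2f}")
print(f"overall: {total_acc:.2f}")
```

Reporting the per-class values next to the overall accuracy makes it easier to see when a high overall score hides a weak class.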

Above, feature selection using the top 15% of features left the classifier still able to predict with an accuracy of 86.3%. Although accuracy dropped relative to the original feature set, the reduced feature set contains only seven features and still achieves a high level of accuracy. Class 1: 'Normal_Weight' and Class 6: 'Overweight_Level_II' had the lowest accuracy at 74% and 75% respectively. Class 4: 'Obesity_Type_III' achieved 100% accuracy and Class 3: 'Obesity_Type_II' still maintained over 95% accuracy.

For the full dataset, the top features associated with obesity levels are age, weight, gender, family history, FCVC and CAEC. Both the male and the female gender attributes are salient for classifying obesity levels, which is consistent with the cluster exploration, where the data split into two clusters representing the male and female genders. In addition to age and weight, family history, specifically individuals indicating no history of obesity in their family, is an important feature. This suggests that hereditary, family, or environmental factors associated with a family history of obesity play a role in an individual's obesity level. Lastly, two eating-habit features round out the top features: always eating vegetables with meals (FCVC) and frequently eating food between meals (CAEC). With the full dataset, physical activity features were not included in the top 15% of features; instead, biological factors and eating habits took precedence in determining obesity levels.
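For readers who want to see the mechanics, the following is a minimal sketch of keeping the top 15% of features and retraining a decision tree, using scikit-learn. It makes several assumptions: the data is a synthetic stand-in generated with `make_classification`, the scoring function is the ANOVA F-score (`f_classif`), and the split ratio and random seeds are arbitrary. The text does not state which scoring function or settings were actually used.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the encoded obesity data (assumption, not the real dataset).
X, y = make_classification(
    n_samples=600, n_features=40, n_informative=8, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Keep the top 15% of features; the ANOVA F-score is an assumed scoring function.
selector = SelectPercentile(score_func=f_classif, percentile=15)
X_train_sel = selector.fit_transform(X_train, y_train)  # fit on training data only
X_test_sel = selector.transform(X_test)

# Train one tree on all features and one on the reduced feature set.
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
reduced_tree = DecisionTreeClassifier(random_state=0).fit(X_train_sel, y_train)

n_selected = X_train_sel.shape[1]
full_acc = full_tree.score(X_test, y_test)
reduced_acc = reduced_tree.score(X_test_sel, y_test)

print(f"{n_selected} of {X.shape[1]} features kept")
print(f"test accuracy, all features: {full_acc:.3f}")
print(f"test accuracy, top 15%:      {reduced_acc:.3f}")
```

Fitting the selector on the training split only, and then applying it to the test split, keeps information from the test set out of the feature ranking.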

Decision Tree and Feature Selection with Gen-Z Dataset:

Above, the Decision Tree classifier for the Gen-Z dataset performed well, with accuracy slightly lower than the full dataset at 91.9%. Class 0: 'Insufficient_Weight' and Class 2: 'Obesity_Type_I' achieved above 95% accuracy, so for the Gen-Z dataset the model predicted these two classes best. Class 3: 'Obesity_Type_II' had the lowest accuracy at 84%. Class 1: 'Normal_Weight' and Class 6: 'Overweight_Level_II' had the next lowest accuracy at 88% and 89% respectively. In both the full dataset and the Gen-Z dataset, Class 6: 'Overweight_Level_II' was among the least accurately predicted classes.

With the feature selection above, using the top 15% of features, the classifier's accuracy dropped to 80.8%. The model does not perform as well as the model using the full dataset. Class 4 had the highest accuracy at 98%, which is comparable to the full dataset, where Class 4 was predicted at 100%. Class 6: 'Overweight_Level_II' had the lowest accuracy at 58%. This shows that for the Gen-Z age group, the model struggles to classify 'Overweight_Level_II' using only the top 15% of features, which likely means that other attributes are required to classify this obesity level accurately. The model also does not classify Class 1: 'Normal_Weight' or Class 3: 'Obesity_Type_II' as well as the other classes. This is similar to the model using the full dataset, which also had lower accuracy for Class 1: 'Normal_Weight' than for other classes.

The top 15% of features include weight, gender, family history with obesity, always eating vegetables with meals (FCVC), and frequently eating food between meals. These overlap with the top features from the model using the full dataset, except that only the male gender attribute is included, which is interesting since both genders were included for the full dataset. Two additional eating-habit features appear among the top features for the Gen-Z age group: not eating high-calorie foods frequently and the number of meals consumed daily. Eating-habit features are therefore the features most strongly associated with obesity level for the Gen-Z age group, along with biological and hereditary features. As with the full-dataset model, physical activity features were not included in the top features for the classification of obesity levels.

Decision Tree and Feature Selection with Millennials Dataset:

Above, the Decision Tree classifier for the Millennials dataset did not perform as well as the model for the Gen-Z or the full dataset. The model achieved an accuracy of 89.6%. As in the two previous models, Class 4: 'Obesity_Type_III' had a prediction accuracy of 100%. Unlike the two previous models, Class 6: 'Overweight_Level_II' and Class 3: 'Obesity_Type_II' performed better in this model, with an accuracy of 95%. Class 0: 'Insufficient_Weight' had an accuracy of 67%, which is starkly lower than in the previous two models. Class 1: 'Normal_Weight' also had a low accuracy at 71%. This aligns with the two previous models, which also had comparatively low accuracy in predicting Class 1: 'Normal_Weight'.

With the feature selection above, using the top 15% of features, the classifier's accuracy dropped to 79.9%. The model performs slightly worse than the model for the Gen-Z age group. Class 4 again had the highest accuracy at 100%, which matches the full dataset, where Class 4 was also predicted at 100%. Class 2: 'Obesity_Type_I' and Class 5: 'Overweight_Level_I' had the lowest accuracy at 65%. Class 6: 'Overweight_Level_II' shows a significant drop in accuracy, from 95% before feature selection to 68% after. This shows that the features necessary to predict Class 6 are not included in the top 15% of features. The model also does not classify Class 1: 'Normal_Weight' as well as the other classes, which is a consistent pattern among all the models. In contrast, the model was able to predict Class 3: 'Obesity_Type_II' better than the model for the Gen-Z age group.

The top 15% of features include weight, gender (both male and female), family history with obesity, always eating vegetables with meals (FCVC), and frequently eating food between meals. These are the same top features as in the model using the full dataset. Unlike the previous two models, this model includes one additional top feature related to physical activity: automobile as the means of transportation. This is interesting since the previous models did not include a physical activity feature. The models using the top 15% of features for the Millennials and Gen-Z age groups yielded similar classification accuracy. The main difference is that a physical activity feature is included in the top features for Millennials but not for Gen-Z.
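The overlap between the three feature lists described so far can be made explicit with set operations. In the sketch below, the feature labels are shorthand invented for illustration from the descriptions in the text (only FCVC and CAEC are names that appear in the text); they are not the actual column names of the dataset.

```python
# Shorthand labels for the top features named in the text (illustrative names).
full_dataset = {"age", "weight", "gender_male", "gender_female",
                "family_history", "FCVC", "CAEC"}
gen_z = {"weight", "gender_male", "family_history", "FCVC", "CAEC",
         "high_calorie_food_no", "meals_per_day"}
millennials = {"weight", "gender_male", "gender_female", "family_history",
               "FCVC", "CAEC", "transport_automobile"}

# Features common to all three models.
shared = full_dataset & gen_z & millennials

# Features that appear for one age group only.
gen_z_only = gen_z - full_dataset - millennials
millennials_only = millennials - full_dataset - gen_z

print("shared by all three:", sorted(shared))
print("Gen-Z only:", sorted(gen_z_only))
print("Millennials only:", sorted(millennials_only))
```

Laid out this way, the eating-habit additions for Gen-Z and the transportation feature for Millennials stand out as the only group-specific features.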

Decision Tree and Feature Selection with Gen-X & Boomers Dataset:

Above, the Decision Tree classifier for the Gen-X and Boomers dataset performed the worst of all the models so far, achieving an accuracy of 66.7%. This dataset is significantly smaller than the previous datasets. As a result, not all classes are represented, and with so few entries the classifier has much less data to train on than in the previous three models. This model was able to predict Class 1: 'Normal_Weight' at 89% accuracy, which is higher than in all previous models. It was unable to predict Class 0: 'Insufficient_Weight' or Class 3: 'Obesity_Type_II'.

With the feature selection above, using the top 15% of features, the classifier reached an accuracy of 66.7%. This model underperformed compared to all previous models, with every class having an accuracy of 75% or lower. Again, the model was unable to predict Class 0: 'Insufficient_Weight'. Since some classes are not represented in this dataset and there is less data for training, it is not unexpected that the model was unable to classify obesity levels as well as the previous models.
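Before training on a small subset, it can help to check which classes are absent or thinly represented. The sketch below assumes the seven classes are numbered 0 through 6 as in the text. The `min_count` threshold and the toy label list are hypothetical and are not taken from the Gen-X and Boomers data.

```python
from collections import Counter

ALL_CLASSES = range(7)  # classes 0..6, as numbered in the text

def class_coverage(labels, min_count=5):
    """Count samples per class and flag missing or sparsely populated classes.

    min_count is a hypothetical threshold for calling a class "sparse".
    """
    counts = Counter(labels)
    missing = [c for c in ALL_CLASSES if counts[c] == 0]
    sparse = [c for c in ALL_CLASSES if 0 < counts[c] < min_count]
    return counts, missing, sparse

# Toy labels (made up): a small subset in which classes 0 and 3 never occur.
toy_labels = [1] * 9 + [2] * 6 + [4] * 5 + [5] * 3 + [6] * 4

counts, missing, sparse = class_coverage(toy_labels)
print("samples per class:", dict(sorted(counts.items())))
print("missing classes:", missing)
print("sparse classes:", sparse)
```

A class that never appears in the training data cannot be predicted at all, and a class with only a handful of samples gives unstable per-class accuracy, so a check like this helps put low scores in context.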

The top 15% of features include weight and always eating vegetables with meals (FCVC), two features that are also top features for the full dataset, the Gen-Z dataset, and the Millennials dataset. Additional eating-habit features are included as top features: water intake of more than 2 liters per day and daily monitoring of calorie intake. In addition, the physical activity features include direct physical activity on 1 to 2 days or 3 to 4 days and public transit as the means of transportation. This is interesting since the previous models did not include eating-habit features such as water intake, or direct physical activity. The results are drastically different from the Gen-Z and Millennials datasets, but since the sample size is significantly smaller, more data would be needed for this population to validate these top features and to determine which key features affect the classification of obesity for the Gen-X and Boomers age group.

Comparison of Results:

The model with the best accuracy from the Decision Tree classifier is the one trained on the full dataset. The top 15% of features for this model include age, weight, gender, family history with obesity, always eating vegetables with meals (FCVC) and frequently eating food between meals (CAEC). With these top features, the model still performed well, with an accuracy of 86.3%. Biological features and family history with obesity are the top features associated with classifying obesity; with the full dataset, only two additional eating-habit features were top features. The models for the Gen-Z and Millennials age groups also included weight, gender, family history with obesity, always eating vegetables with meals (FCVC), and frequently eating food between meals (CAEC) as top features. Gen-Z adds more eating-habit features, namely not eating high-calorie foods frequently and the number of meals consumed daily, while Millennials adds a physical activity feature, transportation by automobile.

The accuracy for the Gen-Z model is 91.8% whereas the accuracy for the Millennials model is 89.5%. When evaluating performance with the top 15% of features, the Millennials model performs slightly better at 80.6% whereas the Gen-Z model had an accuracy of 80.1%. The model for the Gen-X and Boomers age group performed the worst, at an accuracy of 77.8%. Gen-X and Boomers had different top features compared to all other models. The top features still included weight but no longer included gender, and instead included both eating-habit and physical activity features: water intake of 2 liters or more, daily calorie monitoring, and direct physical activity. This model had significantly less data than the previous models, which may contribute to its lower accuracy.

In conclusion, the Gen-Z age group is overrepresented in the full dataset compared to Millennials and to Gen-X and Boomers. The classifier performed best on the full dataset and performed comparably well on the Gen-Z and Millennials datasets. By looking at the top features and evaluating the models using only those features, we can see which features are most important in classifying obesity levels. For the Gen-Z and Millennials age groups, biological and hereditary features are most strongly associated with obesity levels, with eating habits as additional top features, specifically eating vegetables with meals and eating between meals. Gender appears to play a role for the Gen-Z and Millennials age groups. This is expected due to biological factors such as differences in weight, height, and calorie intake. For the Gen-X and Boomers age group, the top features are weight, eating habits such as water intake and calorie monitoring, direct physical activity, and mode of transportation. More data is needed for Gen-X and Boomers in order to analyze and evaluate the models and determine which features best support the classification of obesity levels.