Analysis of the CDC-NHANES Database to Identify Predictors of Obesity in a Multiple Linear and Logistic Regression Model



What are the most important determinants of weight control and obesity prevention? These questions are probably more relevant now than they have ever been, as the obesity epidemic in our country has received a great deal of attention, and for good reason. Obesity influences the development of many disease processes, including diabetes mellitus, metabolic syndrome, hypertension, dyslipidemia, heart disease, cancer, stroke, and chronic lung disease. These are the leading causes of death and disability in the United States, as in much of the developed world, and they are all linked to obesity. The CDC reports that over half of U.S. adults are overweight (BMI of 25-30) and that 20.9% of U.S. adults are obese, as defined by a Body Mass Index (BMI) greater than 30 kg/m². 1 2 3 Most tragically, we are now seeing more type 2 diabetes (i.e., maturity-onset diabetes mellitus) in children as the epidemic has extended to the younger members of our society. 3 4 5 Researchers have recognized obesity as a pro-inflammatory state, and there is even a theory that inflammation causes obesity. 6 7 Genetics contributes to the development of obesity and lies outside our direct control; lifestyle choices, however, are within our control. 8 Exercise, diet, and general activity level are all worthy of attention because they are within our control and empower us to combat obesity, a deadly disease state. Many have claimed to have the answer: the popular low-carbohydrate, low-fat, and high-protein diets with which our society is incessantly bombarded, innovative exercise equipment, weight loss supplements, and more invasive choices such as bariatric surgery. Even hormonal manipulation and other “anti-aging” interventions have been promoted in order to increase basal metabolic rate, which varies from one individual to another, and to reduce the pro-inflammatory state believed to be associated with both obesity and aging. There are individuals who have tried one, some, or all of the above and are still plagued by poor self-image, social ostracism, and poor health.

The National Health and Nutrition Examination Survey (NHANES) database has made it possible to study a number of these lifestyle factors and their impact on obesity. The survey, conducted through the Centers for Disease Control and Prevention (CDC), entails a comprehensive questionnaire, an expanded physical examination, and an extensive array of laboratory biomarkers. The ongoing NHANES project data studied here include approximately 6000 participants, which makes the database a very useful tool for studying a variety of predictor and outcome variables. 9 This database was selected to test a number of hypotheses regarding these lifestyle factors. For the purposes of this study, the independent variables of interest are age, race, exercise, percent of protein and carbohydrates in the daily diet, and educational level, and the dependent variables are BMI and obesity. For the details of how these variables were obtained, the reader is referred to the CDC’s extensive documentation of the survey data collection methods. 9


The hypothesis for this analysis states that exercise should lower the risk of elevated BMI and obesity. This is well studied and widely supported in our society and in the medical community; some may argue that for these very reasons it does not warrant further dispute, but it is certainly worthy of study. The effects of age may not be as clear, as obesity permeates all age groups and, as previously stated, a recently growing segment of the very young. Race is expected to be associated with obesity, for example through its link to diabetes, which is strongly associated with certain ethnic and racial groups, including African Americans. Gender was also expected to show a positive association based on findings in other reports. Carbohydrate and protein percentages of total daily calories were selected for study because many diets, including respected dietary guidelines used by the American Diabetes Association, lower the carbohydrate portion of the standard American dietary recommendations in favor of protein. 10 This is biologically plausible, as carbohydrates are converted biochemically into lipids in the liver. 11 Therefore our hypothesis states that high-protein and low-carbohydrate diets should be negatively associated with the risk of obesity. Lastly, educational level is hypothesized to correlate with obesity. Obesity is a more prevalent problem in the developed than in the developing world, a difference that is largely the result of poverty. However, in the U.S. diabetes is more strongly associated with minority groups, who also have a higher risk of poverty. Affluence therefore favors a positive correlation with BMI and obesity on an international scale, but possibly an inverse relationship on a U.S. national scale.

Also considered was the role educational level would play in making an individual more aware of the factors that contribute to poor health and to wellness; it was therefore feasible that a negative relationship would be observed between educational level and the dependent variables BMI and obesity. We anticipated a direct correlation in the U.S. between educational level and the “income level” independent variable that was among the competing choices in the national survey. Educational level was felt to be a good surrogate for income and was included in hypothesis testing. Income was excluded from our analysis out of concern that collinearity could develop if it were included along with education. This assumption, based on data showing that affluence increases with one’s educational level, seemed a reasonable one to make for the purposes of this study. 12


The NHANES dataset was comprehensively searched for variables with the fewest missing values. Frequency and descriptive statistics were generated in the SPSS statistical software program to examine the variables prior to their inclusion in the model. 13 Of the six independent variables selected for study, only the carbohydrate and protein variables had a large number of missing values, 1125 each, and these two variables warranted further analysis. A scatterplot was generated for percent of total calories from carbohydrates against percent of total calories from protein, and no significant discontinuity in the graph was noted. It was further observed that many of the macro-nutritional variables available for study had the identical number of missing values, 1125. It is therefore very likely that these represent the same observations or subjects. Further examination of the data showed that the missing responses coincided for both macro-nutritional variables in question. However, further analysis of these missing observations and of the reasons for their absence would be warranted in order to determine whether bias was introduced by their exclusion. The total sample size was 5962 subjects, so the missing values constitute 19% of the total observations for the nutritional variables selected. Nevertheless, the importance of these variables to the questions previously posed warranted their inclusion in our analysis. A multiple linear regression analysis was performed with BMI as the continuous dependent variable and the six independent variables indicated above. For comparison, a logistic regression analysis was performed with the dichotomous outcome variable obese (yes or no) as the dependent variable and the same six independent variables. All analyses were performed utilizing SPSS software. 13 The stepwise regression option was used for variable selection in the linear model with BMI as the dependent variable, whereas the forward stepwise likelihood ratio method was used in the logistic model with the dichotomous obese dependent variable. The output for the analysis is summarized below (refer to Tables 1 and 2).
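
The missing-value check described above can be illustrated with a short sketch. It is written in Python rather than SPSS and is not the procedure actually used in this analysis; the records are invented, and the field name `pcalcarb` is a hypothetical stand-in for the carbohydrate variable. Only the counts 1125 and 5962 are taken from the text.

```python
# Hypothetical records; None marks a missing value. The field name
# "pcalcarb" is an assumption and is not taken from the NHANES codebook.
records = [
    {"id": 1, "pcalcarb": 52.1, "pcalprot": 14.9},
    {"id": 2, "pcalcarb": None, "pcalprot": None},
    {"id": 3, "pcalcarb": 47.3, "pcalprot": 17.2},
    {"id": 4, "pcalcarb": None, "pcalprot": None},
]

def missing_ids(rows, field):
    """Return the set of subject ids whose value for `field` is missing."""
    return {row["id"] for row in rows if row[field] is None}

carb_missing = missing_ids(records, "pcalcarb")
prot_missing = missing_ids(records, "pcalprot")

# True when both variables are missing for exactly the same subjects.
same_subjects = carb_missing == prot_missing

# Share of the sample affected, using the counts reported in the text.
missing_share = 1125 / 5962

print(same_subjects, f"{100 * missing_share:.1f}%")
```

Comparing the sets of subject identifiers, rather than only the counts, is what distinguishes "the same 1125 subjects" from "two different groups of 1125 subjects".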

Table 1: Linear Regression Models

Table 2: Logistic Regression Models

The stepwise variable selection method selected the variables percent of calories from protein (pcalprot), age, and race for men, while not finding percent of calories from carbohydrates, educational level, or exercise to be statistically significant at the alpha 0.05 level for association with BMI (see Table 1 above). For women, the method selected percent of calories from protein (pcalprot), race, educational level, and exercise as statistically significant at the alpha 0.05 level for association with BMI. The results were rather intriguing, supporting some areas of the hypothesis while refuting others. Percent of protein was associated, although weakly, with a higher BMI, while age and race (belonging to a minority group) were weakly associated with a lower BMI in men. In women, the race variable showed a positive relationship between high BMI and belonging to a minority group that was stronger than the association found in men. Women also showed a negative association between high BMI and education, which supported our initial hypothesis, although the relationship was weak. Exercise was not selected as having a significant relationship in men, whereas the relationship was weakly positive in women, suggesting that exercise was associated with higher BMI. The associations were weaker than initially expected.

R-square values (coefficients of determination) were very small, suggesting that most of the variance in the data could not be predicted by the model despite the statistical significance of the coefficient parameters. The R-square for the final model was 0.021 (2.1%) for males and 0.061 (6.1%) for females; that is, the final models could explain only 2.1% and 6.1% of the variability in BMI in men and women respectively. F ratio values were high and statistically significant in our multiple regression models, indicating that the selected predictors jointly explain more variance than would be expected by chance in both sexes, although F ratio values were higher among females.
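
The coexistence of a very small R-square with a large F ratio follows from the sample size. The sketch below, in Python, recomputes an overall F ratio from R-square using the standard identity F = (R²/k) / ((1 - R²)/(n - k - 1)). It assumes that the n and k appearing in the leverage calculation of this paper (n = 2262, k = 3 for men; n = 2530, k = 4 for women) are the sample sizes and predictor counts of the final models, and it uses the rounded R-square values, so the results are approximations and need not match Table 1 exactly.

```python
def f_from_r2(r2, k, n):
    """Overall F ratio of a linear model with k predictors fitted to n cases."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))

# Assumed inputs: rounded R-square values and the n, k used in the
# leverage thresholds reported in this paper.
f_men = f_from_r2(0.021, k=3, n=2262)
f_women = f_from_r2(0.061, k=4, n=2530)

print(f"men: F ~ {f_men:.1f}, women: F ~ {f_women:.1f}")
```

With more than two thousand cases per model, even an R-square of a few percent yields an F ratio far from 1, which is why statistical significance here says little about predictive strength.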

Collinearity was not apparent in our model, as the correlation (R) matrix did not include r-values above our threshold of 0.8. The variance inflation factor (VIF) was below our threshold of 10 for all variables in both models, and tolerance was high, indicating no significant redundancy among the independent variables. All condition indexes were well below the threshold of 100 for both models, further supporting a lack of significant multicollinearity.
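
The relationship between the diagnostics named above can be made concrete. Tolerance for predictor j is 1 - R_j², where R_j² comes from regressing predictor j on the other predictors, and VIF is its reciprocal. The following Python sketch is illustrative only and uses the special case of two predictors, in which R_j² equals the squared pairwise correlation; the function names are chosen here for illustration.

```python
def tolerance(r2_j):
    """Share of predictor j's variance not explained by the other predictors."""
    return 1.0 - r2_j

def vif(r2_j):
    """Variance inflation factor, the reciprocal of tolerance."""
    return 1.0 / tolerance(r2_j)

# Two-predictor case: R_j^2 is the squared pairwise correlation r.
vif_at_r_threshold = vif(0.8 ** 2)        # VIF implied by r = 0.8
r2_at_vif_threshold = 1.0 - 1.0 / 10.0    # R_j^2 implied by VIF = 10

print(round(vif_at_r_threshold, 2), round(r2_at_vif_threshold, 2))
```

In this special case the pairwise threshold of 0.8 is the stricter of the two, since a VIF of 10 corresponds to R_j² = 0.9. With more than two predictors, however, VIF can be high even when no single pairwise correlation is, which is why both diagnostics are reported.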

In reviewing the residual diagnostics, few outliers could be identified: the raw, standardized, studentized, and studentized deleted residuals rarely exceeded a threshold of 2-3. For those cases that were identified, Cook’s distances were below the threshold of 1, which argues against any serious outliers. The leverage threshold for the multiple linear regression models was calculated as 2(k+1)/n, or 2(3+1)/2262 = 0.0035 in men and 2(4+1)/2530 = 0.0040 in women. Some leverage values exceeded these thresholds; however, they were not accompanied by high Cook’s distances or by correspondingly large residuals, and the cases with higher leverage did not correspond to the cases with residuals above threshold. This was expected with a large dataset, in which it is difficult for any single observation to exert undue influence on the fit.
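
The leverage thresholds quoted above can be reproduced directly, and the screening logic (high leverage alone versus high leverage together with a large Cook’s distance) can be sketched as follows. This is a Python illustration, not the SPSS output; the per-case leverage and Cook’s distance values are invented for the example.

```python
def leverage_cutoff(k, n):
    """Rule-of-thumb leverage threshold 2(k+1)/n for k predictors and n cases."""
    return 2 * (k + 1) / n

cutoff_men = leverage_cutoff(k=3, n=2262)
cutoff_women = leverage_cutoff(k=4, n=2530)

# Hypothetical diagnostics for three cases in the male model.
leverages = [0.001, 0.006, 0.004]
cooks_d = [0.01, 0.03, 0.20]

# Cases above the leverage threshold.
high_leverage = [i for i, h in enumerate(leverages) if h > cutoff_men]

# Cases that are both high-leverage and influential (Cook's distance > 1).
influential = [i for i in high_leverage if cooks_d[i] > 1.0]

print(round(cutoff_men, 4), round(cutoff_women, 4), high_leverage, influential)
```

In the invented data two cases exceed the leverage threshold but none has a Cook’s distance above 1, which mirrors the pattern described in the text.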

For the logistic regression models the following was observed. The likelihood ratio statistic (analogous to the F-ratio in the linear regression model) was high in both sexes, and the Hosmer and Lemeshow goodness of fit test was not statistically significant. These findings support the assertion that the model fits the data adequately. Residual diagnostics suggested a megaphone appearance and a possible nonlinear relationship (see graphs below). Collinearity diagnostics showed neither variance inflation factors above threshold nor low tolerance values. Raw, standardized, and studentized residuals were not above the threshold of 2-3 in either model. Cook’s distances were not over the threshold of 1 in either model. Leverage values did not exceed a threshold of 2p/n in the logistic models generated. The graphs of the residuals also did not support the presence of significant influential outliers. Lastly, compared with the linear regression model, the logistic stepwise regression model for males selected the same variables except that educat and exerec were also selected while race was excluded. The variables selected for females were identical in both models.
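
The Hosmer and Lemeshow statistic referred to above compares observed and expected counts of obese and non-obese subjects within groups ordered by predicted probability. The following is a minimal Python sketch of that idea, not the SPSS routine used for this analysis; the function name, the equal-size grouping rule, and the toy data are assumptions made for illustration. Ten groups is the customary choice; two are used here only to keep the example small.

```python
def hosmer_lemeshow(y, p, groups=10):
    """Hosmer-Lemeshow chi-square statistic.

    y: observed outcomes (0 or 1); p: predicted probabilities strictly
    between 0 and 1. Cases are sorted by p and split into `groups`
    groups of nearly equal size.
    """
    pairs = sorted(zip(p, y))
    n = len(pairs)
    stat = 0.0
    for g in range(groups):
        chunk = pairs[g * n // groups:(g + 1) * n // groups]
        if not chunk:
            continue
        size = len(chunk)
        expected_1 = sum(prob for prob, _ in chunk)   # expected events
        observed_1 = sum(obs for _, obs in chunk)     # observed events
        expected_0 = size - expected_1
        observed_0 = size - observed_1
        stat += (observed_1 - expected_1) ** 2 / expected_1
        stat += (observed_0 - expected_0) ** 2 / expected_0
    return stat

# Toy predicted probabilities of obesity and two sets of outcomes.
p = [0.25] * 4 + [0.75] * 4
y_calibrated = [1, 0, 0, 0, 1, 1, 1, 0]   # observed counts match expected
y_off = [1, 1, 0, 0, 1, 1, 1, 0]          # too many events in the low group

hl_calibrated = hosmer_lemeshow(y_calibrated, p, groups=2)
hl_off = hosmer_lemeshow(y_off, p, groups=2)
print(hl_calibrated, round(hl_off, 3))
```

A statistic near zero means that observed and expected counts agree within each group, which corresponds to the non-significant result reported above; the statistic is referred to a chi-square distribution with the number of groups minus 2 degrees of freedom.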

Each of the independent variables was plotted against the dependent variable, and the plots suggest a megaphone pattern (refer to graphs below). Such a pattern violates the assumption of homogeneity of variance and raises the possibility that a nonlinear relationship between the independent and dependent variables would serve as a better model for the data. To explore this possibility further, each of the independent variables was plotted against the raw residuals in the linear regression model and against the standardized residuals in the logistic regression model. A normal probability plot of the expected against the observed dependent variable in the logistic model also suggested that the relationship might be nonlinear. A model transformation might therefore be in order and might be more predictive of the data. BMI was also plotted against percent body fat to confirm a known relationship and thereby further support the internal validity and reliability of the NHANES data. Refer to the graphs below.

Image of Graph 1

Image of Graph 2

Image of Graph 3

Image of Graph 4

Image of Graph 5

Image of Graph 6

Image of Graph 7

Image of Graph 8

Image of Graph 9

Image of Graph 10

Image of Graph 11

Image of Graph 12

Image of Graph 13

Several other options were also considered, including adding interaction terms, although this would likely have created a problem with collinearity. Among the six variables tested, the percent of total calories from protein and the percent of total calories from carbohydrates would feasibly interact, albeit in an inverse relationship, since an increase in one would usually lower the fraction of the other, as is the case in many of the popular diets circulating among the public. The variable for percent of total calories from lipids was not selected for the original analysis because it was felt that it would correlate highly with carbohydrates: individuals who consume high-fat diets are also likely to consume more high glycemic index simple carbohydrates and less of the more desirable lower glycemic index complex carbohydrates and fiber. Educational level was also felt to interact with race. Lack of education correlates with poverty, and minority groups are disproportionately represented among lower socioeconomic groups, so it was felt that these terms might interact inversely as well. This justified the inclusion of both interaction terms. As already stated, income level was not chosen for reasons similar to those for lipids, as educational level was expected to correlate with the income variable.

Upon adding the product interaction terms (% protein)*(% carbohydrate) and (Race)*(educat4), the previous set of variables together with the newly included interaction terms was analyzed and hypothesis tested. The following changes in the model summary statistics were observed. For men, the linear regression model selected the identical variables. For women only, the product interaction term race*educational level was included in lieu of the race variable selected in the first linear regression model. R-square values remained nearly identical to those of the first model. The parameter coefficients were all statistically significant, including the interaction term. The F-ratio values remained relatively unchanged. The collinearity diagnostics did not indicate correlated independent variables based on the threshold previously discussed, nor did variance inflation factors, tolerance, or condition indexes indicate multicollinearity despite the addition of the interaction term to the female model, although these values were not as favorable as in the earlier model. The coefficients also changed in the female model: the race*educational level term exhibited a positive association, while the negative association of the educational level variable increased in magnitude. The coefficients in the male model did not change, nor were any interaction variables selected for the final model. The protein-carbohydrate interaction term was excluded. Residual diagnostics did not suggest any serious outliers exerting undue influence on our model. Therefore, despite the addition of the interaction terms, there was no significant improvement in our model.
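
The collinearity concern raised for product interaction terms can be illustrated with a small Python sketch. The dietary percentages below are invented, and mean-centering before forming the product is a common practice shown here for comparison only; it was not part of the analysis described in this paper.

```python
from math import sqrt

def pearson_r(a, b):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    s_ab = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    s_aa = sum((x - mean_a) ** 2 for x in a)
    s_bb = sum((y - mean_b) ** 2 for y in b)
    return s_ab / sqrt(s_aa * s_bb)

def center(values):
    """Subtract the mean from every value."""
    m = sum(values) / len(values)
    return [v - m for v in values]

# Hypothetical percent of calories from protein and carbohydrate,
# inversely related as the text anticipates.
pct_protein = [10.0, 12.0, 14.0, 16.0, 18.0]
pct_carb = [60.0, 55.0, 50.0, 45.0, 40.0]

# Product term as entered in the models, and a mean-centered variant.
raw_product = [p * c for p, c in zip(pct_protein, pct_carb)]
centered_product = [p * c for p, c in zip(center(pct_protein), center(pct_carb))]

r_raw = pearson_r(pct_protein, raw_product)
r_centered = pearson_r(pct_protein, centered_product)
print(round(r_raw, 2), round(r_centered, 2))
```

In this toy data the raw product term is strongly correlated with one of its components, which is the mechanism by which interaction terms can degrade VIF, tolerance, and condition indexes, while the centered product is not.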

In the logistic model for males, none of the variables initially selected changed upon adding the interaction terms; that is, the interaction terms were excluded from the male model. In the female logistic regression model, however, the Race*Educat interaction variable was included in lieu of the race variable. This interaction was positively associated with the obese dependent variable.

The last modifications made to the models consisted of an inverse-exponential transformation utilizing the newly defined dependent variables “lnBMI” and “lnobese” for the linear and logistic regression models respectively. The inverse of each independent variable, including the interaction terms, was computed, consistent with established model transformation methods, and a stepwise regression analysis was again performed. The logistic analysis could not be executed because of the limitations imposed by the missing values. The linear regression model selected the same variables for males as in the previous analyses, while the female model selected the same variables with one notable exception: the reciprocal race variable was included along with the interaction variable reciprocal race*educat. However, the coefficient of determination for this model decreased, so the transformed model explained no more of the variability in the data than our initial model without the transformation.
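
The transformation step described above can be sketched as follows. This Python fragment is an illustration under stated assumptions, not the SPSS syntax used: the helper names, the example record, and the convention of returning None where a logarithm or reciprocal is undefined are all choices made here.

```python
import math

def safe_ln(x):
    """Natural log, or None where it is undefined (missing or non-positive x)."""
    return math.log(x) if x is not None and x > 0 else None

def safe_reciprocal(x):
    """Reciprocal, or None where it is undefined (missing or zero x)."""
    return 1.0 / x if x not in (None, 0) else None

# Hypothetical subject record; "obese" is the 0/1 outcome indicator.
row = {"bmi": 27.4, "pcalprot": 15.0, "age": 42, "obese": 0}

transformed = {
    "lnBMI": safe_ln(row["bmi"]),
    "inv_pcalprot": safe_reciprocal(row["pcalprot"]),
    "inv_age": safe_reciprocal(row["age"]),
    "lnobese": safe_ln(row["obese"]),
}
print(transformed)
```

Any case for which a transformed value is undefined becomes a missing value in the transformed dataset, so transformations of this kind can reduce the number of usable cases.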


Although some of the findings seem counterintuitive, several conclusions can be drawn. First, the exercise variable was worded to reflect the degree of exercise one gets while participating in recreation. A more explicit wording reflecting regular exercise, combining progressive resistance weight training and cardiovascular aerobic exercise at the currently recommended frequency, may have resulted in more accurate survey responses, and possibly in the strong negative association with the BMI and obese variables that would be expected, in lieu of the weaker and converse association actually observed. The current wording may have elicited a more generous, though inaccurate, acknowledgement of participation in exercise than would have been observed had the question been worded differently. The carbohydrate variable was excluded from both models, suggesting a lack of correlation in either direction, which was surprising, and the protein variable had the opposite of the expected relationship, suggesting that these factors may not influence obesity to the degree expected. This fails to support the presumed effectiveness of the popular diets that substitute protein for carbohydrates, as previously discussed. However, there are other studies that conflict with these findings and support the opposite contention, albeit with smaller sample sizes than the one used in this study. The question of macronutrient proportions and weight reduction remains a very interesting area of research. Lastly, educational level seemed to be associated with increased BMI and obesity in men but with decreased risk in women. Race had a stronger positive relationship with obesity and BMI among minority women than among Caucasian women, but the relationship was reversed among men.

It is difficult to conclude definitively, based on these findings, what role exercise, protein and carbohydrate intake, race, age, or educational level plays in BMI and obesity, as there is conflicting evidence in other studies. However, some of the patterns seen in our study suggest that further transformations could produce a more predictive model, as the graphical analysis performed here seemed to support a nonlinear relationship. It could very well be that many other interacting factors were not addressed, or that the factors studied here were incorrectly measured in the NHANES survey, as previously discussed. One clear conclusion is that the factors responsible may be far more complex than could be accounted for by the data and may include a stronger role for factors not included in the analysis or in NHANES. The presence or absence of diabetes was included in the survey, but this variable was not used in this analysis because of the high number of missing values. Other factors that might support a more predictive model in further study include the genetic component, basal metabolic rate, and comorbid disease states, which could help illuminate the complex relationships among obesity predictors. Including these additional variables might have improved our model, making it more predictive of the obesity and BMI data in the CDC-NHANES observational survey. These added variables, combined with further model transformation to account for the nonlinearity in the data, may well prove much more predictive. Our study used NHANES 2 data; NHANES 3 data have since been released, and reanalysis of the new data with the same methods may also prove more predictive.


  1. Centers for Disease Control and Prevention, Department of Health and Human Services. Overweight and Obesity Health Consequences. Retrieved December 12, 2005 from the World Wide Web:

  2. Prevalence Of Overweight and Obesity Among Adults. National Health and Nutrition Examination Survey-NHANES Research Database. Centers For Disease Control and Prevention-CDC, National Center For Health Statistics. Hyattsville, MD, 2005. Retrieved December 12, 2005 from the World Wide Web:

  3. Surgeon General’s Report: Overweight and Obesity Health Consequences. The primary concern of overweight and obesity is one of health and not appearance. Washington, D.C.: U.S. Department of Health and Human Services. Retrieved December 12, 2005 from the World Wide Web:

  4. Prevalence of Overweight Among Children and Adolescents: United States, 1999-2002. National Health and Nutrition Examination Survey-NHANES Research Database. Centers For Disease Control and Prevention-CDC, National Center For Health Statistics. Hyattsville, MD, 2005. Retrieved December 12, 2005 from the World Wide Web:

  5. Grunbaum JA, Kann L, Kinchen SA, Williams B, Ross JG, Lowry R, et al. Youth Risk Behavior Surveillance-United States, 2001. Morbidity and Mortality Weekly Report. Surveillance summaries. 2002; 51: 1-62.

  6. Rogge MM. The case for an immunologic cause of obesity. Biological Research for Nursing. 2002, July; 4(1): 43-53.

  7. Nicklas BJ, Ambrosius W, Messier SP, et al. Diet-induced weight loss, exercise and chronic inflammation in older, obese adults: a randomized controlled clinical trial. American Journal of Clinical Nutrition. 2004, April; 79(4): 544-551.

  8. Lyon HN, Hirschhorn JN. Genetics of common forms of obesity: a brief overview. American Journal of Clinical Nutrition. 2005, July; 82: 215-217.

  9. National Health and Nutrition Examination Survey-NHANES Research Database. Centers For Disease Control and Prevention-CDC, National Center For Health Statistics. Hyattsville, MD, 2005.

  10. Yancy WS, Olsen MK, Guyton JR, et al. A Low-Carbohydrate, Ketogenic Diet versus a Low-Fat Diet To Treat Obesity and Hyperlipidemia: A randomized, controlled trial. Annals of Internal Medicine. 2004, May 18; 140(10): 769-777.

  11. King MW. Fatty Acid Synthesis. The Medical Biochemistry Page. Retrieved December 12, 2005 from the World Wide Web:

  12. Gordon-Larsen P, Adair LS, Popkin BM. The relationship of ethnicity socioeconomic factors, and overweight in U.S. adolescents. Obesity Research 2003; 11:121-129.

  13. SPSS, Inc. SPSS for Windows, Release 12.0.0. [Software and documentation]. Chicago, IL: SPSS, Inc., 2004.