Correlation (Coefficient, Partial, and Spearman Rank) and Regression Analysis
Summary / Explanation
Correlation and regression analysis are fundamental statistical techniques used to explore relationships between variables. Correlation analysis identifies the strength and direction of the association between 2 or more variables. In contrast, regression analysis is used to predict and understand the relationship between a dependent variable and 1 or more independent variables. These methods provide crucial insights into the patterns and interactions within data, aiding decision-making across various fields.
Correlation Analysis
In a statistical context, correlation is a broad assessment of the possibility and strength of association between 2 variables. The choice of correlation analysis depends on the type of data that is available. This topic covers methods to assess correlation for 2 or more continuous or ranked variables.[1][2]
Interval or continuous data
- Pearson correlation (r): The most commonly used type of correlation analysis is the Pearson correlation, often denoted as Pearson's r. This method is appropriate for comparing continuous variables, specifically interval or ratio data, such as scores on a test ranging from 0 to 100.[1][2] Pearson's r analysis returns a value between −1 and +1, where 0 indicates no linear relationship. As the value moves away from 0, the linear relationship between the 2 variables strengthens: negative as it approaches −1 and positive as it approaches +1.[1]
- Partial correlation (ρ): Partial correlation measures the linear relationship between 2 continuous variables while controlling for other continuous variables.[3] As with Pearson correlations, partial correlations also fall within the range of −1 and +1.
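As a minimal sketch of how these 2 coefficients are computed, the following uses only the Python standard library; the function names and data values are illustrative, not from any cited source. The partial correlation here uses the standard first-order formula, which controls for a single third variable.

```python
import math

def pearson_r(x, y):
    """Pearson product-moment correlation between 2 equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def partial_r(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson_r(x, y), pearson_r(x, z), pearson_r(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))

# Hypothetical test scores with a strong positive linear relationship
x = [60, 70, 75, 80, 90]
y = [65, 72, 78, 85, 95]
print(round(pearson_r(x, y), 3))  # ≈ 0.993, a strong positive correlation
```

Both functions return values in [−1, +1]; a partial correlation close to the raw Pearson r suggests the controlled variable explains little of the shared variance.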
Ordinal or ranked data
- Spearman rank correlation [ρ (rho) or r]: This measures the strength and direction of the association between 2 ranked variables.[4] For ordinal, ranked, or ordered data—data that have a defined order to them, such as when treatments are ranked in effectiveness from 1 to 5—the Spearman rank correlation is used.[1] This type of correlation analysis is also known as Spearman rank order correlation or Spearman rank correlation coefficient. Like Pearson and partial correlations, Spearman rank correlation values range between −1 and +1.[1]
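The Spearman coefficient can be obtained by ranking each variable and applying the classic formula ρ = 1 − 6·Σd² / (n·(n² − 1)), where d is the difference between paired ranks. The sketch below assumes no tied values (tie handling would require average ranks) and uses hypothetical reviewer rankings.

```python
def spearman_rho(x, y):
    """Spearman rank correlation via the classic formula.
    Valid only when there are no tied values within a sequence."""
    n = len(x)

    def ranks(v):
        # Rank positions 1..n, where 1 = smallest value (no tie handling)
        order = sorted(range(n), key=lambda i: v[i])
        r = [0] * n
        for rank, idx in enumerate(order, start=1):
            r[idx] = rank
        return r

    rx, ry = ranks(x), ranks(y)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical example: 5 treatments ranked for effectiveness by 2 reviewers
print(spearman_rho([1, 2, 3, 4, 5], [2, 1, 4, 3, 5]))  # 0.8
```

A value of 0.8 indicates the 2 reviewers' rankings agree strongly but not perfectly.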
Regression Analysis
Regression analysis determines whether a relationship exists between a dependent variable and 1 or more independent variables.[1][5] Unlike correlation, regression analysis is employed for prediction and for establishing causality.[2]
- Independent variable definition: Independent variables, as the name implies, are the variables in the relationship whose values do not depend on the other variables.[1][5][2] For example, common independent variables include demographic factors such as gender and race or ethnicity. These factors are generally considered immutable and, thus, do not change based on the other variables in the relationship.
- Dependent variable definition: Conversely, the dependent variable is the variable in the relationship that is influenced by the other variables, including the independent variables and possible confounding variables (those related to both the independent and dependent variables) or control variables.[1][2]
Types of regression: Similar to the discussion on correlation above, different types of regression analysis are used when different variable types are present.
- Simple linear regression: Linear regression is used when the dependent variable is continuous, such as weight or length of stay, and there is 1 dependent variable and 1 continuous independent variable.[1]
- Multivariable linear regression: In multivariable linear regression, there is 1 continuous dependent variable and 2 or more continuous or categorical independent variables.
- Logistic regression: Logistic regression is used when the dependent variable is categorical, encompassing either nominal or ordinal data (where data are grouped into mutually exclusive categories, such as blood type or zip code) or dichotomous data with only 2 possible values, such as whether a patient smokes (yes or no) or whether a disease is present or absent.[1] The independent variables can be continuous or categorical. For example, when comparing 2 drugs (a categorical independent variable) and A1c level (a continuous independent variable) against diabetic improvement, the outcome is whether improvement is achieved (yes or no, a categorical dependent variable).
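The simple linear regression case above has a closed-form least-squares solution: the slope is the covariance of x and y divided by the variance of x, and the intercept follows from the means. The sketch below uses only the standard library with hypothetical dose and length-of-stay data.

```python
def simple_linear_regression(x, y):
    """Ordinary least-squares fit of y ≈ b0 + b1 * x (closed form)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b1 = sxy / sxx       # slope: covariance(x, y) / variance(x)
    b0 = my - b1 * mx    # intercept: line passes through (mean x, mean y)
    return b0, b1

# Hypothetical data: drug dose (mg) vs length of stay (days)
dose = [1, 2, 3, 4, 5]
stay = [2.1, 4.2, 5.9, 8.1, 9.8]
b0, b1 = simple_linear_regression(dose, stay)
print(f"stay ≈ {b0:.2f} + {b1:.2f} * dose")
```

The fitted slope estimates how many additional days of stay are predicted per unit increase in dose; logistic regression, by contrast, has no closed form and is typically fit by iterative maximum likelihood.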
Correlation and Regression Statistical Output Qualifiers
Statistical significance: Statistical significance, the context by which statistical output is evaluated, should always be included in the presentation of any correlation or regression analysis output.[6] Statistical significance is commonly assessed through the P value. The American Statistical Association defines the P value as "the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value."[7] However, relying solely on P values for interpretation can lead to incorrect conclusions, as the P value is prone to various issues.[8][9][10][11][12][13]
Clinical significance: Clinical significance refers to the degree to which a treatment or therapy has a practical and substantial effect on the patient.[10] Although a treatment effect may be statistically significant, assessing whether it has a clinically substantial impact on patient outcomes is essential.
Effect size is a valuable metric for indicating clinical significance. Effect size quantifies the magnitude of the difference or the strength of the association between variables.[14] Instead of relying solely on P values to determine significance, combining P values with effect sizes is encouraged to evaluate results more fully, both in terms of statistical significance (P value) and the magnitude of the difference between variables of interest (effect size).[8]
For correlation, the effect size is expressed as the correlation coefficient (r). In simple linear regression, the effect size is expressed as the coefficient of determination (R²).[14] In multivariable linear regression, the effect size is the partial eta-squared (ηp²). For logistic regression, the effect size is the odds ratio in the simple case or the adjusted odds ratio in the multivariable case.
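Two of these effect sizes are simple to compute by hand: in simple linear regression, R² is the square of Pearson's r, and for a dichotomous outcome, the odds ratio comes directly from a 2 × 2 table. The sketch below uses hypothetical counts, not data from any cited study.

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table:
                 outcome+   outcome-
    exposed         a          b
    unexposed       c          d
    """
    return (a / b) / (c / d)  # equivalently (a * d) / (b * c)

# Effect size for correlation vs simple linear regression:
r = 0.9            # hypothetical correlation coefficient
r_squared = r ** 2  # coefficient of determination, R² = 0.81

# Hypothetical trial: 30/10 improved vs not on drug A; 15/25 on drug B
print(odds_ratio(30, 10, 15, 25))  # 5.0
```

An odds ratio of 5.0 means the odds of improvement on drug A are 5 times those on drug B, a magnitude a P value alone cannot convey.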
Issues of Concern
The purposes of employing correlation and regression analysis are distinct. Correlation determines the strength and direction of a possible linear relationship between 2 variables. In contrast, regression analysis is used to estimate parameters for a linear equation to predict the values of a variable relative to the other or to explore an association between the outcome and the independent variables while controlling for the effects of the different variables.[1]
References
Bewick V, Cheek L, Ball J. Statistics review 7: Correlation and regression. Critical care (London, England). 2003 Dec:7(6):451-9 [PubMed PMID: 14624685]
Mata DA, Milner DA Jr. Statistical Methods in Experimental Pathology: A Review and Primer. The American journal of pathology. 2021 May:191(5):784-794. doi: 10.1016/j.ajpath.2021.02.009. Epub 2021 Feb 27 [PubMed PMID: 33652018]
van Aert RCM. Meta-analyzing partial correlation coefficients using Fisher's z transformation. Research synthesis methods. 2023 Sep:14(5):768-773. doi: 10.1002/jrsm.1654. Epub 2023 Jul 8 [PubMed PMID: 37421188]
Schober P, Boer C, Schwarte LA. Correlation Coefficients: Appropriate Use and Interpretation. Anesthesia and analgesia. 2018 May:126(5):1763-1768. doi: 10.1213/ANE.0000000000002864. Epub [PubMed PMID: 29481436]
Bzovsky S, Phillips MR, Guymer RH, Wykoff CC, Thabane L, Bhandari M, Chaudhary V, R.E.T.I.N.A. study group. The clinician's guide to interpreting a regression analysis. Eye (London, England). 2022 Sep:36(9):1715-1717. doi: 10.1038/s41433-022-01949-z. Epub 2022 Jan 31 [PubMed PMID: 35102247]
Tenny S, Abdelgawad I. Statistical Significance. StatPearls. 2024 Jan:(): [PubMed PMID: 29083828]
Concato J, Hartigan JA. P values: from suggestion to superstition. Journal of investigative medicine : the official publication of the American Federation for Clinical Research. 2016 Oct:64(7):1166-71. doi: 10.1136/jim-2016-000206. Epub 2016 Aug 3 [PubMed PMID: 27489256]
Sullivan GM, Feinn R. Using Effect Size-or Why the P Value Is Not Enough. Journal of graduate medical education. 2012 Sep:4(3):279-82. doi: 10.4300/JGME-D-12-00156.1. Epub [PubMed PMID: 23997866]
Alifieris CE, Souferi Chronopoulou E, Trafalis DT, Arvelakis A. The arbitrary magic of p<0.05: Beyond statistics. Journal of B.U.ON. : official journal of the Balkan Union of Oncology. 2020 Mar-Apr:25(2):588-593 [PubMed PMID: 32521838]
Kim J, Bang H. Three common misuses of P values. Dental hypotheses. 2016 Jul-Sep:7(3):73-80 [PubMed PMID: 27695640]
Cohen HW. P values: use and misuse in medical literature. American journal of hypertension. 2011 Jan:24(1):18-23. doi: 10.1038/ajh.2010.205. Epub 2010 Oct 21 [PubMed PMID: 20966898]
Tanha K, Mohammadi N, Janani L. P-value: What is and what is not. Medical journal of the Islamic Republic of Iran. 2017:31():65. doi: 10.14196/mjiri.31.65. Epub 2017 Sep 25 [PubMed PMID: 29445694]
Greenland S, Senn SJ, Rothman KJ, Carlin JB, Poole C, Goodman SN, Altman DG. Statistical tests, P values, confidence intervals, and power: a guide to misinterpretations. European journal of epidemiology. 2016 Apr:31(4):337-50. doi: 10.1007/s10654-016-0149-3. Epub 2016 May 21 [PubMed PMID: 27209009]
Kallogjeri D, Piccirillo JF. A Simple Guide to Effect Size Measures. JAMA otolaryngology-- head & neck surgery. 2023 May 1:149(5):447-451. doi: 10.1001/jamaoto.2023.0159. Epub [PubMed PMID: 36951858]