Hypothesis Testing, P Values, Confidence Intervals, and Significance

Jacob Shreffler; Martin Huecker

Definition/Introduction

Medical providers often rely on evidence-based medicine to guide decision-making in practice. A research hypothesis is typically tested and the results reported with p values, confidence intervals, or both. Additionally, the investigators estimate or determine the statistical or research significance of the findings. Unfortunately, healthcare providers may have different comfort levels in interpreting these findings, which may affect the adequate application of the data.

Issues of Concern

Without a foundational understanding of hypothesis testing, p values, confidence intervals, and the difference between statistical and clinical significance, healthcare providers may be less able to make clinical decisions without relying purely on the level of significance deemed by the research investigators. Therefore, an overview of these concepts is provided to allow medical professionals to use their expertise to determine if results are reported sufficiently and if the study outcomes are clinically appropriate for healthcare practice.

Hypothesis Testing

Investigators conducting studies need research questions and hypotheses to guide analyses. Starting with broad research questions (RQs), investigators identify a gap in current clinical practice or research. Any research problem or statement is grounded in a better understanding of the relationships between 2 or more variables. For this topic, we use the following research question example:

Research Question: Is Drug 23 an effective treatment for Disease A?

Research questions do not directly imply specific guesses or predictions; we must formulate research hypotheses. A hypothesis is a predetermined declaration regarding the research question in which the investigator(s) makes a precise, educated guess about a study outcome. This is sometimes called the alternative hypothesis and ultimately allows the researcher to take a stance based on experience or insight from medical literature. An example of a hypothesis is below.

Research Hypothesis: Drug 23 significantly reduces symptoms associated with Disease A compared to Drug 22.

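The null hypothesis states that there is no statistical difference between groups based on the stated research hypothesis. For the example above, the null hypothesis is that Drug 23 and Drug 22 do not differ in reducing symptoms associated with Disease A.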

Researchers should be aware of journal recommendations when reporting p values, and manuscripts should remain internally consistent.

Regarding p values, the likelihood of finding a statistically significant effect increases as the number of individuals enrolled in a study (the sample size) increases. With very large sample sizes, the p-value can be very low even when the difference in symptom reduction for Disease A between Drug 23 and Drug 22 is small. The null hypothesis is deemed true until a study presents significant data to support rejecting it. Based on the results, the investigators either reject the null hypothesis (if they found significant differences or associations) or fail to reject the null hypothesis (if they could not demonstrate significant differences or associations).
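The relationship between sample size and p-value described above can be illustrated with a minimal sketch. It assumes a two-sided z-test for a difference in means with a common, known standard deviation and equal group sizes; the function name and the numbers (a difference of 0.1 symptom points, SD of 1.0) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_sided_p_value(mean_diff, sd, n_per_group):
    """Two-sided p-value for a difference in means between two equal-sized
    groups, using a normal (z) approximation with a common, known SD."""
    standard_error = sd * sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Hypothetical numbers: the same small difference (0.1) and SD (1.0)
# evaluated at increasing sample sizes per group.
for n in (50, 500, 5000):
    print(n, two_sided_p_value(mean_diff=0.1, sd=1.0, n_per_group=n))
```

Under these assumptions, the same small difference is not statistically significant at 50 patients per group but becomes highly significant at 5000 per group, even though its clinical relevance has not changed.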

To test a hypothesis, researchers obtain data on a representative sample to determine whether to reject or fail to reject a null hypothesis. In most research studies, obtaining data for an entire population is not feasible. Using a sampling procedure allows for statistical inference, though this involves a certain possibility of error.[1] When determining whether to reject or fail to reject the null hypothesis, 2 kinds of mistakes can be made: a Type I error (rejecting a null hypothesis that is true) and a Type II error (failing to reject a null hypothesis that is false). Though it is impossible to ensure that these errors have not occurred, researchers should limit the possibility of making them.[2]
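A small simulation can make the 2 error types concrete. The sketch below is illustrative only: it assumes normally distributed outcomes, a two-sided z-test with a known standard deviation, and hypothetical values for the group size (30), the true difference (0.5), and the significance threshold (0.05). None of these figures come from a real study.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def rejection_rate(true_diff, n_per_group=30, sd=1.0, alpha=0.05,
                   n_studies=2000, seed=0):
    """Fraction of simulated two-group studies that reject the null hypothesis.

    Each study uses a two-sided z-test and treats the SD as known
    (a simplifying assumption made for this illustration)."""
    rng = random.Random(seed)
    standard_error = sd * sqrt(2.0 / n_per_group)
    rejections = 0
    for _ in range(n_studies):
        group_a = [rng.gauss(0.0, sd) for _ in range(n_per_group)]
        group_b = [rng.gauss(true_diff, sd) for _ in range(n_per_group)]
        z = (fmean(group_b) - fmean(group_a)) / standard_error
        p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
        if p_value < alpha:
            rejections += 1
    return rejections / n_studies


# Null hypothesis true (no difference): every rejection is a Type I error.
type_i_rate = rejection_rate(true_diff=0.0)
# Null hypothesis false (true difference of 0.5): every non-rejection is a Type II error.
power = rejection_rate(true_diff=0.5)
type_ii_rate = 1.0 - power
print(type_i_rate, power, type_ii_rate)
```

In this setup, the Type I error rate should land near the chosen threshold of 0.05, while the Type II error rate depends on the sample size and the size of the true difference; enrolling more patients per group lowers it.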

Significance

Significance is a term used to describe the substantive importance of medical research. Statistical significance refers to the likelihood that results are due to chance.[3] Healthcare providers should always distinguish statistical significance from clinical significance; conflating the two is a common error when reviewing biomedical research.[4] When conceptualizing findings reported as significant or not significant, healthcare providers should not simply accept researchers' results or conclusions without considering the clinical significance. Healthcare professionals should consider the clinical importance of findings and understand p values and confidence intervals so they do not rely on the researchers to determine the significance level.[5] One criterion often used to determine statistical significance is the p-value.

P Values

P values are used in research to determine whether the sample estimate significantly differs from a hypothesized value. The p-value is the probability that an effect at least as large as the one observed in the study would have occurred by chance if, in reality, there was no true effect. Conventionally, data yielding a p<0.05 or p<0.01 is considered statistically significant. While some have argued that the 0.05 level should be lowered, it is still widely used.[6] Hypothesis testing allows us to determine whether an effect is likely to be present, but not the size of the effect.

Two examples of findings reported with p values are below:

Statement: Drug 23 reduced patients' symptoms compared to Drug 22. Patients who received Drug 23 (n = 100) were 2.1 times less likely than patients who received Drug 22 (n = 100) to experience symptoms of Disease A, p<0.05.
Statement: Individuals prescribed Drug 23 experienced fewer symptoms (M = 1.3, SD = 0.7) than those prescribed Drug 22 (M = 5.3, SD = 1.9). This finding was statistically significant, p = 0.02.

For either statement, if the threshold had been set at 0.05, the null hypothesis (that there was no relationship) should be rejected, and we should conclude that there are significant differences. As the 2 statements above show, some researchers report findings with < or >, and others provide an exact p-value (eg, 0.000001), but never 0.[6] When examining research, readers should understand how p values are reported. The best practice is to report all p values for all variables within a study design rather than only providing p values for variables with significant findings.[7] Including all p values provides evidence for study validity and limits suspicion of selective reporting or data mining.

While researchers have historically used p values, experts who find p values problematic encourage using confidence intervals.[8] P-values alone do not allow us to understand the size or the extent of the differences or associations.[3] In March 2016, the American Statistical Association (ASA) released a statement on p-values, noting that scientific decision-making and conclusions should not be based on a fixed p-value threshold (eg, 0.05). They recommend focusing on the significance of results in the context of study design, quality of measurements, and validity of data. Ultimately, the ASA statement noted that a p-value does not provide strong evidence in isolation.[9]

When conceptualizing clinical work, healthcare professionals should consider p values alongside a concurrent appraisal of study design validity. For example, a p-value from a double-blinded randomized clinical trial (designed to minimize bias) should be weighted higher than one from a retrospective observational study.[7] The p-value debate has smoldered since the 1950s,[10] and replacement with confidence intervals has been suggested since the 1980s.[11]

Confidence Intervals

A confidence interval provides a range of values that, with a given level of confidence (eg, 95%), includes the true value of the statistical parameter within a targeted population.[12] Most research uses a 95% CI, but investigators can set any level (eg, 90% CI, 99% CI).[13] A CI provides a range with the lower bound and upper bound limits of a difference or association that would be plausible for a population.[14] Therefore, a CI of 95% indicates that if a study were to be carried out 100 times, the range would contain the true value in about 95 of them.[15] Confidence intervals provide more evidence regarding the precision of an estimate compared to p-values.[6]
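The repeated-study interpretation of a 95% CI can be checked with a short simulation. This is a sketch under stated assumptions: a normally distributed outcome, a known standard deviation (so a z-based interval applies), and hypothetical values for the true mean, SD, and sample size. The function name is illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def ci_coverage(true_mean=10.0, sd=2.0, n=50, level=0.95,
                n_studies=2000, seed=1):
    """Fraction of simulated studies whose CI for the mean contains the true mean.

    Uses a z-based interval with a known SD (a simplifying assumption)."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    margin = z_crit * sd / sqrt(n)
    hits = 0
    for _ in range(n_studies):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        sample_mean = fmean(sample)
        if sample_mean - margin <= true_mean <= sample_mean + margin:
            hits += 1
    return hits / n_studies


print(ci_coverage())            # expected to be close to 0.95
print(ci_coverage(level=0.99))  # a higher confidence level covers more often
```

Any single interval either contains the true value or does not; the 95% figure describes how often the procedure succeeds across many repetitions of the study.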

In consideration of the similar research example provided above, one could make the following statement with 95% CI:

Statement: Individuals who were prescribed Drug 23 had no symptoms after 3 days, which was significantly faster than those prescribed Drug 22; the mean difference between the 2 groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

It is important to note that the width of the CI is affected by the standard error and the sample size; reducing the sample size makes the CI less precise (increases its width).[14] A larger width indicates a smaller sample size or a larger variability.[16] A researcher would want to increase the precision of the CI. For example, a 95% CI of 1.43 – 1.47 is much more precise than the one provided in the example above. In research and clinical practice, CIs provide valuable information on whether the interval includes or excludes clinically significant values.[14]
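The dependence of CI width on sample size and variability can be sketched with the usual normal-approximation formula for a mean, width = 2 × z × SD / √n. The function name and the example values (SD of 10, sample sizes of 100 and 400) are hypothetical and serve only to illustrate the relationship.

```python
from math import sqrt
from statistics import NormalDist


def ci_width(sd, n, level=0.95):
    """Width of a z-based confidence interval for a mean with a known SD."""
    z_crit = NormalDist().inv_cdf(0.5 + level / 2.0)
    return 2.0 * z_crit * sd / sqrt(n)


# Hypothetical numbers: same variability, increasing sample size.
print(ci_width(sd=10.0, n=100))
print(ci_width(sd=10.0, n=400))   # quadrupling n halves the width
# Same sample size, larger variability gives a wider interval.
print(ci_width(sd=20.0, n=100))
```

Because the width shrinks with the square root of the sample size, halving the width of an interval requires roughly 4 times as many participants.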

Null values are sometimes used to interpret CIs (0 for differences and 1 for ratios): an interval that includes the null value corresponds to a result that is not statistically significant. However, CIs provide more information than that.[15] Consider this example: A hospital implemented a new protocol that reduced wait time for patients in the emergency department by an average of 25 minutes (95% CI: -2.5 – 41 minutes). Because the range crosses 0, implementing this protocol in different populations could result in longer wait times; however, most of the range lies on the side of reduced wait times. Thus, while the p-value used to detect statistical significance may result in "not significant" findings, individuals should examine this range, consider the study design, and weigh whether or not it is still worth piloting in their workplace.
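Checking a CI against its null value is a simple comparison, sketched below with a hypothetical helper function. The intervals are the ones quoted in this article: the wait-time example (-2.5 to 41 minutes) and the Drug 23 recovery example (1.9 to 7.8 days), both differences with a null value of 0.

```python
def excludes_null(lower, upper, null_value=0.0):
    """Return True when the interval [lower, upper] does not contain the null value.

    Use null_value=0.0 for differences and null_value=1.0 for ratios."""
    return not (lower <= null_value <= upper)


print(excludes_null(-2.5, 41.0))  # wait-time example: False, the interval crosses 0
print(excludes_null(1.9, 7.8))    # Drug 23 recovery example: True, 0 lies outside
```

As the surrounding discussion stresses, this check only reproduces the significant or not significant verdict; the position and width of the interval still need to be judged clinically.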

Like p-values, 95% CIs cannot control for researchers' errors (eg, study bias or improper data analysis).[14] Regarding whether to report p-values or CIs, researchers should examine journal preferences. When in doubt, reporting both may be beneficial.[13] An example is below:

Reporting both: Individuals prescribed Drug 23 had no symptoms after 3 days, which was significantly faster than those prescribed Drug 22, p = 0.009. The mean difference between the 2 groups in days to recovery was 4.2 days (95% CI: 1.9 – 7.8).

Clinical Significance

Recall that clinical significance and statistical significance are 2 different concepts. Healthcare providers should remember that a study with statistically significant differences and a large sample size may be of no interest to clinicians, whereas a study with a smaller sample size and statistically non-significant results could impact clinical practice.[14] Additionally, as previously mentioned, a non-significant finding may reflect the study design itself rather than relationships between variables.

Healthcare providers using evidence-based medicine to inform practice should use clinical judgment to determine the practical importance of studies through careful evaluation of the design, sample size, power, likelihood of type I and type II errors, data analysis, and reporting of statistical findings (p values, 95% CI, or both).[4] Interestingly, some experts have called for "statistically significant" or "not significant" to be excluded from work, as statistical significance never has been and never will be equivalent to clinical significance.[17]

The decision on what is clinically significant can be challenging, depending on the providers' experience and especially the severity of the disease. Providers should use their knowledge and experiences to determine the meaningfulness of study results and make inferences based not only on results that researchers report as significant or not significant but also on their understanding of study limitations and practical implications.

Nursing, Allied Health, and Interprofessional Team Interventions

All physicians, nurses, pharmacists, and other healthcare professionals should strive to understand the concepts in this chapter. These individuals should maintain the ability to review and incorporate new literature for evidence-based and safe care.

References

[1]

Jones M, Gebski V, Onslow M, Packman A. Statistical power in stuttering research: a tutorial. Journal of speech, language, and hearing research : JSLHR. 2002 Apr:45(2):243-55 [PubMed PMID: 12003508]

[2]

Sedgwick P. Pitfalls of statistical hypothesis testing: type I and type II errors. BMJ (Clinical research ed.). 2014 Jul 3:349:g4287. doi: 10.1136/bmj.g4287. Epub 2014 Jul 3 [PubMed PMID: 24994622]

[3]

Fethney J. Statistical and clinical significance, and how to use confidence intervals to help interpret both. Australian critical care : official journal of the Confederation of Australian Critical Care Nurses. 2010 May:23(2):93-7. doi: 10.1016/j.aucc.2010.03.001. Epub 2010 Mar 29 [PubMed PMID: 20347326]

[4]

Hayat MJ. Understanding statistical significance. Nursing research. 2010 May-Jun:59(3):219-23. doi: 10.1097/NNR.0b013e3181dbb2cc [PubMed PMID: 20445438]

Level 3 (low-level) evidence

[5]

Ferrill MJ, Brown DA, Kyle JA. Clinical versus statistical significance: interpreting P values and confidence intervals related to measures of association to guide decision making. Journal of pharmacy practice. 2010 Aug:23(4):344-51. doi: 10.1177/0897190009358774. Epub 2010 Apr 13 [PubMed PMID: 21507834]

[6]

Infanger D, Schmidt-Trucksäss A. P value functions: An underused method to present research results and to promote quantitative reasoning. Statistics in medicine. 2019 Sep 20:38(21):4189-4197. doi: 10.1002/sim.8293. Epub 2019 Jul 3 [PubMed PMID: 31270842]

[7]

Dorey F. Statistics in brief: Interpretation and use of p values: all p values are not equal. Clinical orthopaedics and related research. 2011 Nov:469(11):3259-61. doi: 10.1007/s11999-011-2053-1 [PubMed PMID: 21918804]

[8]

Liu XS. Implications of statistical power for confidence intervals. The British journal of mathematical and statistical psychology. 2012 Nov:65(3):427-37. doi: 10.1111/j.2044-8317.2011.02035.x. Epub 2011 Oct 25 [PubMed PMID: 22026811]

[9]

Tijssen JG, Kolm P. Demystifying the New Statistical Recommendations: The Use and Reporting of p Values. Journal of the American College of Cardiology. 2016 Jul 12:68(2):231-3. doi: 10.1016/j.jacc.2016.05.026 [PubMed PMID: 27386779]

[10]

Spanos A. Recurring controversies about P values and confidence intervals revisited. Ecology. 2014 Mar:95(3):645-51 [PubMed PMID: 24804448]

[11]

Freire APCF, Elkins MR, Ramos EMC, Moseley AM. Use of 95% confidence intervals in the reporting of between-group differences in randomized controlled trials: analysis of a representative sample of 200 physical therapy trials. Brazilian journal of physical therapy. 2019 Jul-Aug:23(4):302-310. doi: 10.1016/j.bjpt.2018.10.004. Epub 2018 Oct 16 [PubMed PMID: 30366845]

Level 1 (high-level) evidence

[12]

Dorey FJ. In brief: statistics in brief: Confidence intervals: what is the real result in the target population? Clinical orthopaedics and related research. 2010 Nov:468(11):3137-8. doi: 10.1007/s11999-010-1407-4 [PubMed PMID: 20532716]

[13]

Porcher R. Reporting results of orthopaedic research: confidence intervals and p values. Clinical orthopaedics and related research. 2009 Oct:467(10):2736-7. doi: 10.1007/s11999-009-0952-1. Epub 2009 Jun 30 [PubMed PMID: 19565303]

[14]

Gardner MJ, Altman DG. Confidence intervals rather than P values: estimation rather than hypothesis testing. British medical journal (Clinical research ed.). 1986 Mar 15:292(6522):746-50 [PubMed PMID: 3082422]

[15]

Cooper RJ, Wears RL, Schriger DL. Reporting research results: recommendations for improving communication. Annals of emergency medicine. 2003 Apr:41(4):561-4 [PubMed PMID: 12658257]

[16]

Doll H, Carney S. Statistical approaches to uncertainty: P values and confidence intervals unpacked. Equine veterinary journal. 2007 May:39(3):275-6 [PubMed PMID: 17520981]

[17]

Colquhoun D. The reproducibility of research and the misinterpretation of p-values. Royal Society open science. 2017 Dec:4(12):171085. doi: 10.1098/rsos.171085. Epub 2017 Dec 6 [PubMed PMID: 29308247]