The median is the middle value of a set of numbers. The median is the same as the 50th percentile for the set of numbers. In other words, the median is the middle of a set of numbers with half of the values less than the median and half the values greater than the median.
A short example will help clarify this point. Take the set of seven numbers (8, 6, 9, 5, 8, 23, 4). To find the median we first sort the numbers from lowest to highest (in ascending order). The numbers sorted are (4, 5, 6, 8, 8, 9, 23). The median is then the middle number. As there are seven numbers in the set, the fourth number is the median which is '8' in our example. Note the median is different from the mean. The mean is the "average" value in the set which would be 9 as calculated by:
What does one do if there is an even set of numbers? Then the median is the average of the two middle numbers. Use the second set of numbers to show this principle. The second set of numbers is (7, 3, 10, 2, 9, 2, 1, 4). Then, order the set of numbers and get (1, 2, 2, 3, 4, 7, 9, 10). The second set of numbers has eight numbers in it with the two middle numbers being 3 and 4. The median would then be the average of these two numbers or 3.5 (calculated as (3+4)/2 = 3.5). Note this median of 3.5 is again different than the mean or average of the set of numbers which 4.75 as calculated by:
Start with a set of numbers with a total of n members. That is there are n numbers in our set of numbers. Once the set of numbers is sorted (either ascending or descending) then the median can be calculated as follows:
For sets with an odd number of members, n is odd:
For sets with an even number of members, n is even:
Use our simple sets from above to work an example for both cases.
For a set where n is odd, we use or first set (8, 6, 9, 5, 8, 23, 4).
The set once sorted is (4, 5, 6, 8, 8, 9, 23).
In this set n = 7.
Median = value of the ((n+1)/2)th item in the sorted set
= value of ((7+1)/2)th item in sorted set
= value of ((8)/2)th item in sorted set
= value of 4th item in sorted set
For a set where n is even, use the second set (7, 3, 10, 2, 9, 2, 1, 4).
The set once sorted is (1, 2, 2, 3, 4, 7, 9, 10).
In this set n = 8.
Median = value of the [(n/2)th item + (n/2 + 1)th item] / 2 in the sorted set
= value of the [(8/2)th item + (8/2 + 1)th item] / 2 in the sorted set
= value of the [4th item + (4+1)th item] / 2 in the sorted set
= value of the [ 4th item + 5th item] / 2 in the sorted set = value of [3 + 4] / 2
= value of [7] / 2
= 3.5
The most common mistake is distinguishing between the mean, median, and mode. The mean is the "average" of the numbers in the set. The simple mean is calculated by adding all of the numbers in the set together then dividing by the number of items in the set.
The median is the topic of this discussion and is calculated as stated above in the function section.
The mode is the most common element in the set or the value that happens the most often. In the first data set (4, 5, 6, 8, 8, 9, 23), the number 8 occurs twice which is more common than any of the other numbers, so the mode of the data set is 8. One can have more than one mode as well as no mode for a dataset. In the data set (2, 3, 3, 5, 5, 7, 7) there are three modes as 3, 5 and seven all appear twice which is more than two which only appears once. This dataset would be said to be trimodal. If there were two modes, the dataset would be bimodal, and if there were more than three modes to the dataset, then the dataset would be multimodal. In a different dataset of (6, 7, 8, 9) no number appears more frequently than the others, so there is no mode to the data set.
The median is a commonly reported number in the scientific literature for a good reason. Many times in science and medicine we are interested in the time to a certain event, such as decay time of a radioisotope or survival time of a patient with specific cancer. If we start out with 11 radioisotopes with decay times of (1, 1, 1, 2, 4, 5, 5, 5, 6, 11, 41) seconds then within one minute we can calculate both the median of 5 seconds, which is the middle value of the set, as well as the mean of 7.45 seconds.
If however the same dataset of {1, 1, 1, 2, 4, 5, 5, 5, 6, 11, 41} represents survival in years for patients with specific cancer the median may be more useful. We can calculate the median after the sixth patient passes away from cancer at 5 years as half of the patients have died and half are still living. To calculate the mean, we must have the total survival of all patients. Thus we have to wait 41 years for the final patient to die before we can calculate the mean. Waiting for the event to occur in all patients to allow us to calculate the mean thus becomes a lengthy process and makes the median a more palatable number to calculate.
A second desirable characteristic of the median versus the mean is that the median is less affected by outliers or extreme values. If we return to our dataset of (1, 1, 1, 2, 4, 5, 5, 5, 6, 11, 41), we can see that 41 seems to be well away from the mean of 7.45 of the dataset. If we eliminate this outlier, we can get a new dataset of (1, 1, 1, 2, 4, 5, 5, 5, 6, 11) with a median of 4.5 and mean of 4.1. We see that the median only changed by 10% (went down to 4.5 from 5) while the mean decreased by 45% (went down to 4.1 from 7.45). Thus the median may be a better indicator of the realistic middle of the data compared to the mean in skewed data. Skewed data is any data set where there are one or more outliers or values at either the extreme high or extreme low end of the dataset.
If we go back to our dataset of (1, 1, 1, 2, 4, 5, 5, 5, 6, 11, 41) for years of survival for patients with a specific cancer is it more realistic to report the median survival is 5 years, that is that half of patients are no longer alive at 5 years? Or, is it better to report mean survival is 7.45 years, that is on average for our 11 patients they lived 7.45 years? With the mean, we should note only two lived longer than the mean. We may all want to be the outlier in this example, living 41 years, but this may not provide a realistic picture of the data.
Although median provides a measure of the middle of the data it may not always be the best measure. Sometimes the mean may be the more desirable attribute to have. Statistical analysis tends to be more robust and may be simpler to compute with the mean value.
There are cases where neither the median nor the mean is a good measure of the middle of the data. The median and mean are better measures when the data has one prominent peak, that is it is unimodal. When the data has two or more prominent peaks (e.g., bimodal, trimodal, multimodal), then the median may not provide the best information. Imagine a curve that looks like two hills on either side of a valley with a bump on the left of the data at the small values and a second bump at the right of the data with large values (a bimodal dataset). The median may mathematically be in the valley if this is where the middle value is, but the median, in this case, may fail to provide the information we find useful, namely where the peaks are and the data clusters.