One of the main objectives of statistical analysis is to determine various numerical measures that describe the fundamental characteristics of a frequency distribution. Among the first and most widely used of these measures is the average. The term “average” is frequently used in everyday language to express a general or typical value. In statistics, the central goal is to summarize a dataset by identifying a single value that best represents the entire dataset. This single value is referred to as an average or a measure of central value.
Definition of Average
Over time, various statisticians have offered their definitions of the average, each emphasizing the purpose and utility of this statistical concept.
Clark defined average as an attempt to find one single figure to describe the whole of the figures.
According to A. L. Bowley, averages are statistical constants which enable us to comprehend in a single effort the significance of the whole.
Croxton and Cowden described an average as a single value within the range of the data that is used to represent all the values in the series. Since an average is situated somewhere within the range of the data, it is sometimes called a measure of central value.
Properties of a Good Average
A reliable and effective average should possess certain desirable characteristics that enhance its utility and relevance in statistical analysis.
It should be rigidly defined. This means that its definition must be unambiguous, leading to one and only one interpretation under all circumstances.
It should be simple to understand and easy to calculate. The process of finding an average should not require advanced mathematical skills, making it accessible even to those without a technical background.
It should be based on all the observations in the dataset. A good average utilizes the entire data set in its computation, ensuring that no information is lost or ignored.
It should be capable of further algebraic treatment. A useful average should allow further mathematical manipulation and analysis, thus increasing its value in extended statistical work.
It should not be unduly influenced by extreme values. A good average is resistant to distortion caused by outliers or unusually large or small values in the dataset.
It should demonstrate sampling stability. This means that if different random samples of the same size are taken from a large population, the average of each sample should be approximately the same, indicating consistency and reliability.
Various Measures of Central Tendency
There are several commonly used measures of central tendency in statistics, each serving specific purposes and applicable in different contexts. The most prominent among these include the arithmetic mean, the median, and the mode.
Arithmetic Mean
The arithmetic mean, often simply called the mean, is the most commonly used measure of central tendency. It is calculated by dividing the sum of all values in a dataset by the number of values. The arithmetic mean is of two types: simple arithmetic mean and weighted arithmetic mean.
Simple Arithmetic Mean or Mean
In Case of Ungrouped Data
Individual Observations
Let X₁, X₂, …, Xₙ be the observations in a dataset. The arithmetic mean of these observations, usually denoted by X̄, is calculated using the formula:
X̄ = (X₁ + X₂ + … + Xₙ) / n
Where n is the number of observations.
Short-cut Method
This method is useful when the data values are large or complex. It is calculated using the formula:
X̄ = A + Σ(X – A) / n
Where A is the assumed mean and X represents each observation.
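As a minimal sketch of the two formulas above, the following Python functions compute the mean directly and by the short-cut method. The function names, the sample observations, and the assumed mean of 1010 are illustrative choices, not part of the original text.

```python
def mean_direct(values):
    """Arithmetic mean: sum of the observations divided by their count."""
    return sum(values) / len(values)


def mean_shortcut(values, assumed_mean):
    """Short-cut method: A + sum(X - A) / n."""
    deviations = [x - assumed_mean for x in values]
    return assumed_mean + sum(deviations) / len(values)


# Illustrative observations (hypothetical data).
observations = [1012, 1015, 1009, 1020, 1014]

direct = mean_direct(observations)                         # 5070 / 5
shortcut = mean_shortcut(observations, assumed_mean=1010)  # 1010 + 20 / 5
print(direct, shortcut)
```

Both calls return the same value whatever assumed mean is chosen; a value of A near the data simply keeps the deviations small.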
In Case of Discrete Frequency Distribution
Direct Method
When data is presented as a frequency distribution, the mean is calculated using the formula:
X̄ = ΣfX / Σf
Where f is the frequency of each observation and X represents the values.
Short-cut Method
In the shortcut method, the formula becomes:
X̄ = A + (Σfd / Σf)
Where A is the assumed mean and d = X – A.
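The direct and short-cut formulas for a discrete frequency distribution can be sketched in Python as follows. The values, frequencies, and assumed mean are hypothetical and chosen only to keep the arithmetic easy to follow.

```python
def frequency_mean_direct(values, frequencies):
    """Direct method: sum(f * X) / sum(f)."""
    total = sum(frequencies)
    return sum(f * x for x, f in zip(values, frequencies)) / total


def frequency_mean_shortcut(values, frequencies, assumed_mean):
    """Short-cut method: A + sum(f * d) / sum(f), with d = X - A."""
    total = sum(frequencies)
    weighted_deviations = sum(
        f * (x - assumed_mean) for x, f in zip(values, frequencies)
    )
    return assumed_mean + weighted_deviations / total


# Hypothetical distribution: value X with frequency f.
values = [10, 20, 30, 40]
frequencies = [2, 3, 4, 1]

direct = frequency_mean_direct(values, frequencies)                         # 240 / 10
shortcut = frequency_mean_shortcut(values, frequencies, assumed_mean=20)    # 20 + 40 / 10
print(direct, shortcut)
```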
In Case of Grouped Frequency Distribution or Continuous Series
Direct Method
For grouped or continuous data, the mean is computed using the mid-values of each class interval. The formula is:
X̄ = Σfm / Σf
Where m is the mid-value of each class interval.
Short-cut Method
In this method, the formula becomes:
X̄ = A + (Σfd / Σf)
Where d = m – A, m is the mid-value of each class, and A is the assumed mean.
Step-Deviation Method or Coding Method
This method simplifies computation when class intervals are uniform. The formula used is:
X̄ = A + (Σfu / Σf) × i
Where A is the assumed mean, u = (m – A) / i, and i is the class interval size.
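A short Python sketch can confirm that the direct and step-deviation methods agree for grouped data. The class limits, frequencies, and assumed mean below are hypothetical, and the function names are illustrative.

```python
def grouped_mean_direct(midpoints, frequencies):
    """Direct method: sum(f * m) / sum(f)."""
    return sum(f * m for m, f in zip(midpoints, frequencies)) / sum(frequencies)


def grouped_mean_step_deviation(midpoints, frequencies, assumed_mean, width):
    """Step-deviation method: A + (sum(f * u) / sum(f)) * i, with u = (m - A) / i."""
    steps = [(m - assumed_mean) / width for m in midpoints]
    total = sum(frequencies)
    return assumed_mean + sum(f * u for u, f in zip(steps, frequencies)) / total * width


# Hypothetical grouped distribution with uniform class width 10.
class_limits = [(0, 10), (10, 20), (20, 30), (30, 40), (40, 50)]
frequencies = [5, 8, 12, 10, 5]
midpoints = [(lower + upper) / 2 for lower, upper in class_limits]

direct = grouped_mean_direct(midpoints, frequencies)  # 1020 / 40
coded = grouped_mean_step_deviation(midpoints, frequencies, assumed_mean=25, width=10)
print(direct, coded)
```

With A = 25 and i = 10 the coded values u are -2, -1, 0, 1, 2, which is what makes the hand computation short.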
Properties of The Arithmetic Mean
The arithmetic mean has several important mathematical properties that distinguish it from other measures of central tendency.
The sum of deviations of the items from the mean is always zero. This means that if the mean is subtracted from each value in a dataset and the results are summed, the total will be zero.
The sum of the squared deviations of the items from the mean is smaller than the sum of squared deviations from any other value. This property makes the arithmetic mean the natural central value when the goal is to minimize squared error.
If each value in a dataset is increased or decreased by a constant k, the arithmetic mean also increases or decreases by k. This shows the mean’s linearity.
If each value is multiplied by a constant k, the arithmetic mean is also multiplied by k.
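These four properties can be checked numerically. The sketch below uses a small hypothetical dataset and an illustrative helper, sum_squared_deviations; it demonstrates the properties on one example rather than proving them.

```python
data = [4, 8, 15, 16, 23, 42]  # hypothetical observations
mean = sum(data) / len(data)   # 108 / 6


def sum_squared_deviations(center):
    """Sum of squared deviations of the data from an arbitrary center."""
    return sum((x - center) ** 2 for x in data)


# Property 1: deviations from the mean sum to zero.
sum_of_deviations = sum(x - mean for x in data)

# Property 2: squared deviations are smallest about the mean
# (compared here with the median, 15.5).
about_mean = sum_squared_deviations(mean)
about_median = sum_squared_deviations(15.5)

# Properties 3 and 4: adding k shifts the mean by k, multiplying by k scales it by k.
shifted_mean = sum(x + 5 for x in data) / len(data)
scaled_mean = sum(3 * x for x in data) / len(data)

print(mean, sum_of_deviations, about_mean, about_median, shifted_mean, scaled_mean)
```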
Combined Arithmetic Mean
When dealing with two or more groups with different means and sizes, the combined mean can be calculated using the formula:
Combined Mean = (N₁X̄₁ + N₂X̄₂) / (N₁ + N₂)
Where N₁ and N₂ are the number of observations in each group and X̄₁ and X̄₂ are their respective means. This formula can be extended to more groups as needed.
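The combined-mean formula extends naturally to any number of groups. The following Python sketch assumes each group is given as a (size, mean) pair; the function name and the group figures are hypothetical.

```python
def combined_mean(groups):
    """Combined mean of several groups given as (size, mean) pairs."""
    groups = list(groups)
    total_size = sum(size for size, _ in groups)
    return sum(size * mean for size, mean in groups) / total_size


# Two groups: 40 observations with mean 50 and 60 observations with mean 60.
two_groups = combined_mean([(40, 50), (60, 60)])  # (2000 + 3600) / 100

# The same function handles three or more groups.
three_groups = combined_mean([(10, 20), (20, 30), (30, 40)])  # 2000 / 60
print(two_groups, three_groups)
```

Note that the combined mean (56) is pulled toward the mean of the larger group rather than sitting midway between 50 and 60.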
Merits of the Arithmetic Mean
The arithmetic mean is widely used because of its several advantages.
It is easy to compute and understand, making it accessible to users at all levels.
It considers every observation in the dataset, ensuring a complete analysis.
It is stable across different samples, making it reliable for inferential statistics.
It does not depend on the position of the data values but on their actual numerical values.
It is suitable for further mathematical operations, which adds to its analytical power.
It is rigidly defined so that different people using the same formula on the same data will get the same result.
Demerits of the Arithmetic Mean
Despite its many advantages, the arithmetic mean has some limitations.
It is highly sensitive to extreme values. A single very high or very low value can distort the mean significantly. For example, the mean of 55, 54, 49, 50, and 5 is 42.6. The single value 5 drastically reduces the average and misrepresents the dataset.
It cannot be determined by visual inspection like the mode, nor can it be located graphically.
In datasets with open-end class intervals where the lower or upper limits are not known, the mean may be inaccurate unless assumptions are made.
It is not suitable for qualitative data such as ratings of honesty, appearance, or personality traits. In such cases, other measures like median or rank-based methods are preferred.
The mean is not a suitable central measure in U-shaped distributions or in datasets that deviate significantly from normality.
Weighted Arithmetic Mean
In some cases, not all observations carry equal importance. When different values in a dataset are assigned varying levels of significance or frequency, a weighted arithmetic mean is used instead of a simple mean. This is especially useful in situations like calculating average marks where subjects have different credit hours or calculating average price with different quantities.
The formula for weighted mean is:
X̄ = ΣWX / ΣW
Where X denotes the variable, W is the weight associated with each value, and Σ represents the summation over all values.
Weighted mean gives a more accurate average when dealing with data points of unequal importance. If all weights are equal, the weighted mean becomes the same as the simple arithmetic mean.
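As a rough illustration of the credit-hours case mentioned above, the Python sketch below computes a weighted mean and confirms that equal weights reproduce the simple mean. The marks and credit hours are hypothetical.

```python
def weighted_mean(values, weights):
    """Weighted arithmetic mean: sum(W * X) / sum(W)."""
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)


# Hypothetical marks in three subjects with different credit hours.
marks = [80, 70, 90]
credit_hours = [3, 4, 2]

weighted = weighted_mean(marks, credit_hours)  # 700 / 9
equal_weights = weighted_mean(marks, [1, 1, 1])  # same as the simple mean
print(weighted, equal_weights)
```

The weighted result is below the simple mean of 80 because the lowest mark carries the most credit hours.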
Median
The median is a positional average. It refers to the middle value in a dataset when the values are arranged in either ascending or descending order. It divides the dataset into two equal parts such that half of the observations are less than the median and the other half are greater.
Median for Individual Observations
For an ungrouped dataset, the median is found by arranging the values in ascending order and identifying the middle term.
If the number of observations n is odd, the median is the value at the (n + 1)/2 position.
If n is even, the median is the average of the values at the n/2 and (n/2) + 1 positions.
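The odd and even cases can be sketched in Python as follows and compared with the standard library's statistics.median. The function name median_of and the sample values are illustrative.

```python
import statistics


def median_of(values):
    """Median of ungrouped data using the (n + 1)/2 and n/2 position rules."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("median of an empty dataset is undefined")
    middle = n // 2
    if n % 2 == 1:
        # Odd n: the value at 1-based position (n + 1) / 2.
        return ordered[middle]
    # Even n: average of the values at 1-based positions n/2 and n/2 + 1.
    return (ordered[middle - 1] + ordered[middle]) / 2


odd_case = median_of([7, 3, 9, 1, 5])   # sorted: 1, 3, 5, 7, 9
even_case = median_of([7, 3, 9, 1])     # sorted: 1, 3, 7, 9
print(odd_case, even_case)
print(statistics.median([7, 3, 9, 1, 5]), statistics.median([7, 3, 9, 1]))
```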
Median for Discrete Frequency Distribution
In a discrete frequency distribution, the cumulative frequency is used to locate the median value.
Steps to calculate the median:
Arrange the values of the variable in ascending order along with their corresponding frequencies.
Calculate cumulative frequencies.
Find the position of the median using the formula (N + 1)/2, where N is the total number of observations.
Locate the cumulative frequency just greater than or equal to this position. The corresponding value is the median.
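The steps above can be sketched in Python as shown below. The function name and the distribution are hypothetical, and the code follows the textbook rule exactly as stated.

```python
from itertools import accumulate


def median_discrete(values, frequencies):
    """Median of a discrete frequency distribution via cumulative frequencies."""
    # Step 1: arrange the values in ascending order with their frequencies.
    pairs = sorted(zip(values, frequencies))
    total = sum(f for _, f in pairs)
    # Step 3: position of the median.
    position = (total + 1) / 2
    # Steps 2 and 4: first cumulative frequency >= position.
    cumulative = accumulate(f for _, f in pairs)
    for (value, _), running_total in zip(pairs, cumulative):
        if running_total >= position:
            return value
    raise ValueError("frequencies must sum to a positive number")


# Hypothetical distribution: cumulative frequencies are 3, 8, 16, 22, 24.
values = [1, 2, 3, 4, 5]
frequencies = [3, 5, 8, 6, 2]

median_value = median_discrete(values, frequencies)  # position 12.5 falls at value 3
print(median_value)
```

When N is even and the two middle observations differ, this rule returns the larger of the two, whereas the averaging rule for individual observations would return a value between them.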
Median for Grouped Frequency Distribution
In the case of a continuous series or grouped frequency distribution, the median is calculated using the formula:
Median = L + [(N/2 − F) / f] × h
Where
L = lower boundary of the median class
N = total frequency
F = cumulative frequency preceding the median class
f = frequency of the median class
h = width of the class interval
This method is useful in large datasets, where identifying the central position directly is not feasible.
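A minimal Python sketch of the grouped-median formula follows. It assumes classes of equal width listed in ascending order by their lower boundaries; the function name and the distribution are hypothetical.

```python
def median_grouped(lower_bounds, frequencies, width):
    """Median = L + ((N/2 - F) / f) * h for a grouped frequency distribution."""
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("total frequency must be positive")
    half = total / 2
    cumulative_before = 0  # F: cumulative frequency preceding the current class
    for lower, freq in zip(lower_bounds, frequencies):
        if cumulative_before + freq >= half:
            # This is the median class.
            return lower + (half - cumulative_before) / freq * width
        cumulative_before += freq
    raise ValueError("median class not found")


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_bounds = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 10, 5]

median_value = median_grouped(lower_bounds, frequencies, width=10)
# N/2 = 20, median class 20-30, F = 13, f = 12: 20 + (7 / 12) * 10
print(median_value)
```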
Properties of Median
The median is unaffected by extreme values. Since it depends only on the position of the values and not their actual magnitude, it remains stable even if the dataset contains very large or very small observations.
It can be determined by graphical methods using ogives. The point of intersection of less-than and more-than ogives gives the value of the median.
The median is suitable for qualitative data that can be ranked. It can be used for ordinal variables such as rankings, satisfaction ratings, or income brackets.
The median can usually be calculated for open-end distributions, since it depends only on the cumulative frequencies and on the boundaries of the median class, provided the median does not fall in an open-ended class.
Merits of Median
The median is easy to compute and understand, particularly in small datasets.
It is not affected by extreme values or outliers, making it more representative of the central location in skewed distributions.
It can be located graphically, offering a visual understanding of the data distribution.
It is suitable for ordinal data and qualitative characteristics.
In case of open-ended class intervals, the median can be calculated reliably.
Demerits of Median
It is not based on all the values of the dataset. Since only the middle value is considered, the median may ignore important variations in the dataset.
It is not suitable for further algebraic treatment. Unlike the mean, the median cannot be used in mathematical formulas for additional statistical analysis.
In small datasets with an even number of observations, the median might not correspond to an actual observation in the dataset.
Its calculation is less precise in complex grouped frequency distributions if class intervals are wide or not uniform.
Mode
The mode is the value that occurs most frequently in a dataset. It represents the most typical or common value. A dataset may have one mode (unimodal), two modes (bimodal), or more (multimodal).
Mode for Individual Observations
To identify the mode, simply find the value that appears most frequently. In cases where no value repeats, the dataset is said to have no mode.
Mode for Discrete Frequency Distribution
In this case, the value of the variable with the highest frequency is taken as the mode. If two or more values share the highest frequency, the distribution is bimodal or multimodal.
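Both cases, individual observations and a discrete frequency table, can be sketched in Python. The function names are illustrative, and returning an empty list when no value repeats is a choice made here to match the "no mode" convention in the text.

```python
from collections import Counter


def modes(values):
    """All most-frequent values; an empty list when no value repeats."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    if highest == 1:
        return []  # no value repeats, so the dataset has no mode
    return sorted(v for v, c in counts.items() if c == highest)


def modes_from_frequencies(table):
    """Modes of a discrete frequency distribution given as {value: frequency}."""
    highest = max(table.values())
    return sorted(v for v, f in table.items() if f == highest)


unimodal = modes([2, 2, 3, 4, 4, 4, 5, 5, 6, 7])
bimodal = modes([1, 1, 2, 2, 3])
no_mode = modes([1, 2, 3])
table_modes = modes_from_frequencies({10: 3, 20: 7, 30: 7})
print(unimodal, bimodal, no_mode, table_modes)
```

Python's statistics.multimode behaves differently in the last situation: when no value repeats it returns every value rather than an empty list.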
Mode for Grouped Frequency Distribution
When the data is in the form of a continuous frequency distribution, the mode is calculated using the formula:
Mode = L + [(f₁ − f₀) / (2f₁ − f₀ − f₂)] × h
Where
L = lower boundary of the modal class
f₁ = frequency of the modal class
f₀ = frequency of the class preceding the modal class
f₂ = frequency of the class succeeding the modal class
h = width of the class interval
This formula assumes that the modal class is the class with the highest frequency.
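The grouped-mode formula can be sketched in Python as follows. The distribution is hypothetical; treating a missing neighbouring class as having frequency zero and picking the first class in case of a tie are assumptions of this sketch, not rules stated in the text.

```python
def mode_grouped(lower_bounds, frequencies, width):
    """Mode = L + ((f1 - f0) / (2*f1 - f0 - f2)) * h."""
    # Modal class: the class with the highest frequency (first one on ties).
    k = max(range(len(frequencies)), key=frequencies.__getitem__)
    f1 = frequencies[k]
    # Assumption: a missing neighbouring class counts as frequency 0.
    f0 = frequencies[k - 1] if k > 0 else 0
    f2 = frequencies[k + 1] if k + 1 < len(frequencies) else 0
    denominator = 2 * f1 - f0 - f2
    if denominator == 0:
        raise ValueError("modal class is not distinct from its neighbours")
    return lower_bounds[k] + (f1 - f0) / denominator * width


# Hypothetical classes 0-10, 10-20, 20-30, 30-40, 40-50.
lower_bounds = [0, 10, 20, 30, 40]
frequencies = [5, 8, 12, 10, 5]

mode_value = mode_grouped(lower_bounds, frequencies, width=10)
# Modal class 20-30, f1 = 12, f0 = 8, f2 = 10: 20 + (4 / 6) * 10
print(mode_value)
```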
Properties of Mode
Mode is the only average that can be used with nominal data. It can be applied to categories like color, brand preference, or types of defects.
It is simple to locate, especially in small datasets where the most frequent value is easily visible.
It can be used when the data is qualitative or categorical.
Merits of Mode
The mode is easy to understand and identify in small datasets.
It is not affected by extreme values, making it useful in skewed distributions.
It is the only average applicable to nominal (unordered categorical) data.
The mode can be used for decision-making in practical fields like market analysis, business strategy, and consumer behavior.
Demerits of Mode
The mode may not exist in a dataset, or there may be more than one mode, making interpretation difficult.
It is not based on all the observations in the dataset.
It cannot be used for further algebraic treatment or advanced statistical analysis.
In grouped data, the mode can be misleading if the frequencies are close in value or if the modal class is not clearly defined.
Empirical Relationship Between Mean, Median, and Mode
In a moderately skewed distribution, there exists a commonly used empirical relationship among mean, median, and mode:
Mode = 3 × Median − 2 × Mean
This formula is not mathematically exact, but it is useful for estimating one of the measures when the other two are known. It also helps assess the degree of skewness in the distribution.
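As a small illustration, the relationship can be rearranged to estimate any one measure from the other two. The function names and the input figures in this Python sketch are hypothetical, and the results are approximations, not exact values.

```python
def estimate_mode(mean, median):
    """Empirical estimate: Mode = 3 * Median - 2 * Mean."""
    return 3 * median - 2 * mean


def estimate_median(mean, mode):
    """The same relation solved for the median: (Mode + 2 * Mean) / 3."""
    return (mode + 2 * mean) / 3


# Hypothetical moderately skewed distribution.
mode_estimate = estimate_mode(mean=50, median=52)       # 156 - 100
median_estimate = estimate_median(mean=50, mode=56)     # (56 + 100) / 3
print(mode_estimate, median_estimate)
```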
Comparison of Mean, Median, and Mode
Understanding the differences between mean, median, and mode is essential for selecting the most appropriate measure of central tendency based on the nature of the data and the purpose of the analysis.
Basis of Calculation
The mean is calculated using all values in the dataset, by taking their sum and dividing it by the number of observations. The median relies only on the position of values once they are arranged in order, and the mode depends on the frequency of repetition of values.
Use of All Observations
The mean uses every data point, making it highly sensitive to changes in any observation. The median only considers the middle value(s), and the mode focuses solely on the most frequent value(s), which can lead to significant differences in outcomes, especially in skewed distributions.
Effect of Extreme Values
The mean is easily affected by outliers or extreme values. Even one large or small value can shift the mean significantly. The median, being a positional measure, remains stable in the presence of extreme values. The mode is unaffected by extremes since it is based on frequency.
Algebraic Treatment
The mean allows further mathematical operations such as addition, multiplication, differentiation, and integration, which is why it is commonly used in theoretical and applied statistics. The median and mode do not support such operations and are limited in their mathematical applicability.
Graphic Representation
The median and mode can be located graphically: the median from ogives, and the mode from the highest point of a histogram. The mean has no comparable graphical construction, although it can be marked on a histogram or frequency polygon once it has been computed. Visual tools help in identifying skewness and distribution symmetry.
Suitability for Different Types of Data
The mean is most suitable for quantitative, symmetrical distributions without extreme values. The median is ideal for ordinal data, skewed distributions, or data with open-end class intervals. The mode is best used for nominal or categorical data where numerical calculations are not meaningful.
Examples Illustrating the Differences
Consider the dataset: 10, 12, 14, 15, 16, 18, 100. The mean (185/7, approximately 26.4) is distorted by the extreme value 100 and is higher than six of the seven values. In contrast, the median (15) is unaffected by this outlier and gives a better sense of central tendency. The mode does not exist in this example since no value repeats.
In another dataset: 2, 2, 3, 4, 4, 4, 5, 5, 6, 7, the mode is 4, which appears most frequently, the mean is 4.2 (42/10), and the median is also 4. These values are close, indicating a fairly symmetrical distribution.
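The two datasets above can be checked with Python's standard statistics module. The variable names are illustrative.

```python
import statistics

with_outlier = [10, 12, 14, 15, 16, 18, 100]
balanced = [2, 2, 3, 4, 4, 4, 5, 5, 6, 7]

outlier_mean = statistics.mean(with_outlier)        # 185 / 7, about 26.43
outlier_median = statistics.median(with_outlier)    # 15
# multimode returns every value here because each occurs exactly once,
# which corresponds to "no mode" in the sense used in the text.
outlier_modes = statistics.multimode(with_outlier)

balanced_mean = statistics.mean(balanced)           # 42 / 10
balanced_median = statistics.median(balanced)       # average of the 5th and 6th values
balanced_modes = statistics.multimode(balanced)     # [4]

print(outlier_mean, outlier_median, outlier_modes)
print(balanced_mean, balanced_median, balanced_modes)
```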
Selection of Appropriate Average
Choosing the correct measure of central tendency depends on several factors such as the nature of the data, the presence of outliers, the level of measurement, and the purpose of analysis.
Nature of Data
If the data is numerical and symmetric, the mean is generally preferred. If the data is ordinal or skewed, the median is more reliable. For categorical data, where arithmetic operations are not applicable, the mode is the only suitable measure.
Presence of Outliers
In datasets with extreme values or outliers, the mean may give a misleading picture. For instance, income data often includes a few individuals with extremely high earnings, which inflates the mean. In such cases, the median provides a better central value.
Data with Open-End Intervals
When the first or last class intervals of a frequency distribution are open-ended, the mean cannot be accurately calculated. The median can still be determined based on cumulative frequencies, making it the preferred measure in such cases.
Skewed Distributions
In a positively skewed distribution, the mean is greater than the median, which in turn is greater than the mode. In negatively skewed distributions, the order reverses. In both cases, the median provides a better representation of the center.
Graphical Representation of Central Tendency
Graphs can be powerful tools to visualize the location and relation of the mean, median, and mode in a dataset.
Histogram
In a histogram, the mode can be identified as the class with the highest frequency. In symmetrical distributions, the mean, median, and mode lie at the center. In skewed distributions, their relative positions shift depending on the direction of skewness.
Ogive
Ogives are cumulative frequency graphs that are used to locate the median. A horizontal line drawn from the 50 percent point on the cumulative frequency axis meets the ogive at a point, and a vertical line dropped from that point intersects the x-axis at the median value.
Frequency Polygon
A frequency polygon can illustrate how data is distributed across the range. If the distribution is bell-shaped, the mean, median, and mode coincide at the peak. If the polygon is skewed, the three measures diverge.
Relationship Among the Measures
The empirical relationship among mean, median, and mode is often observed in moderately skewed distributions:
Mode = 3 × Median − 2 × Mean
This relationship helps estimate one measure when the others are known and also provides insight into the skewness of the data. If the values deviate significantly from this relation, it suggests a highly skewed distribution or anomalies in data collection.
Use of Averages in Real-Life Applications
Averages are not limited to theoretical or academic purposes. They play a crucial role in various real-world applications across different fields.
Education
In education, averages are used to calculate grade point averages, compare student performance, and set academic standards. Mean scores help administrators evaluate overall class performance.
Economics
Economists use averages to analyze inflation rates, income levels, unemployment figures, and GDP growth. Averages provide insights into economic health and trends.
Business and Industry
Businesses use averages to assess production levels, employee performance, and customer satisfaction. For example, average sales figures help in setting targets and forecasting.
Health and Medicine
In health statistics, averages are used to determine normal ranges for blood pressure, cholesterol, and other vital signs. Median survival times are often reported in clinical trials to indicate the effectiveness of treatment.
Sports
Sports analysts rely on averages such as batting average, goal average, or time averages to compare players’ performances and rank them.
Government and Policy
Governments use averages in census data, labor statistics, and policy assessments. For instance, average household income is used to determine eligibility for subsidies or social programs.
Limitations of Central Tendency Measures
While measures of central tendency are widely used in statistics for summarizing data, they also come with certain limitations that must be understood for their effective application.
Not Sufficient to Describe the Data
A single average cannot capture the spread or variability in data. For example, two datasets may have the same mean but very different ranges or dispersions. Hence, measures of central tendency must often be used alongside measures of dispersion such as standard deviation, variance, or interquartile range.
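To make the point concrete, the Python sketch below compares two hypothetical datasets that share the same mean but differ sharply in spread. It uses the population standard deviation from the standard statistics module.

```python
import statistics

tight = [48, 49, 50, 51, 52]   # hypothetical data clustered near 50
spread = [10, 30, 50, 70, 90]  # hypothetical data scattered widely around 50

tight_mean = statistics.mean(tight)
spread_mean = statistics.mean(spread)

tight_sd = statistics.pstdev(tight)     # square root of 2
spread_sd = statistics.pstdev(spread)   # square root of 800

print(tight_mean, spread_mean)
print(tight_sd, spread_sd)
```

The means are identical, so only the standard deviations reveal that the second dataset is far more variable.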
May Be Misleading in Skewed Distributions
In heavily skewed distributions, relying solely on the mean may give a distorted view of the dataset. Median or mode might provide a better representation in such cases. Blind application of averages without considering the shape of the distribution can lead to incorrect conclusions.
Difficulty in Interpretation
Certain datasets may contain multiple modes or no mode at all. Similarly, calculating the median in complex or grouped datasets may require assumptions and estimations. Interpretation becomes challenging when the data is incomplete, inconsistent, or includes open-ended intervals.
Not Applicable to All Types of Data
The mean requires numerical data and is meaningless for nominal or qualitative variables. The median requires data that can at least be ranked, so it applies to ordinal, interval, and ratio data but not to nominal data. Only the mode can be applied to nominal data, but it may not always exist or be useful in analysis.
Sensitive to Methodological Errors
Inaccurate data collection, incorrect class intervals, or computational errors can significantly affect averages. Especially in grouped data, errors in mid-value calculation or incorrect frequency classification can lead to wrong results.
Misuse and Misinterpretation of Averages
Averages are often misused or misinterpreted due to a lack of understanding or intentional manipulation. Recognizing common issues helps avoid such mistakes.
Using the Wrong Measure
Using the mean instead of the median in a highly skewed dataset can misrepresent the data. For instance, reporting the mean income in a population where a few individuals earn excessively high salaries may not reflect the true earning condition of the majority.
Ignoring Data Distribution
When averages are used without examining the underlying distribution, important patterns may be overlooked. For example, a dataset with two distinct peaks (bimodal) might be poorly represented by any single average.
Lack of Context
An average presented without sufficient context can be misleading. A mean test score of 65 might appear low or high depending on the difficulty of the test, class size, or historical performance levels.
Overgeneralization
Using a single average to generalize about diverse groups can hide meaningful differences. For instance, combining male and female height data into a single mean height may obscure gender-based variations.
Best Practices for Using Measures of Central Tendency
To make effective and accurate use of averages, certain best practices should be followed.
Understand the Nature of Data
Before selecting an average, determine whether the data is numerical, categorical, or ordinal. Consider the presence of outliers, the symmetry of the distribution, and the purpose of the analysis.
Use in Combination with Other Measures
Averages should be used alongside measures of dispersion and graphical representations to gain a complete understanding of the data. For example, pairing the mean with standard deviation gives insight into both central location and variability.
Check for Consistency and Accuracy
Ensure that data is accurately collected, properly classified, and correctly processed before computing averages. Double-check formulas and computational steps, especially in grouped data.
Present with Context
When reporting an average, always include relevant context such as sample size, range, or any known anomalies. Clearly state the type of average used to avoid confusion.
Modern Applications and Extensions
In modern statistical analysis, the basic principles of central tendency are extended and adapted for complex datasets and advanced computations.
Weighted Averages in Economics
Economic indicators like price indices, stock market averages, and cost-of-living measures often use weighted means where different components have varying levels of influence.
Moving Averages in Time Series
In time series analysis, moving averages are used to smooth out fluctuations and identify long-term trends. Simple, weighted, and exponential moving averages are used in forecasting and financial modeling.
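A minimal Python sketch of simple and exponential moving averages is given below. The function names, the series, the window of 3, and the smoothing factor of 0.5 are hypothetical; seeding the exponential average with the first observation is one common convention and is an assumption here.

```python
def simple_moving_average(series, window):
    """Mean of each consecutive run of `window` observations."""
    if window < 1 or window > len(series):
        raise ValueError("window must be between 1 and the series length")
    return [
        sum(series[i:i + window]) / window
        for i in range(len(series) - window + 1)
    ]


def exponential_moving_average(series, alpha):
    """Each smoothed value is alpha * current + (1 - alpha) * previous smoothed value."""
    if not series:
        raise ValueError("series must not be empty")
    if not 0 < alpha <= 1:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]  # assumption: seed with the first observation
    for value in series[1:]:
        smoothed.append(alpha * value + (1 - alpha) * smoothed[-1])
    return smoothed


series = [10, 20, 30, 40, 50]  # hypothetical time series
sma = simple_moving_average(series, window=3)
ema = exponential_moving_average(series, alpha=0.5)
print(sma)
print(ema)
```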
Central Tendency in Machine Learning
In clustering algorithms like k-means, centroids representing the mean of data points in a cluster are used to classify and group data. Central tendency plays a key role in data preprocessing and feature engineering.
Robust Statistics
When datasets contain significant outliers or are not normally distributed, robust measures like trimmed mean or Winsorized mean are used to reduce the effect of anomalies while maintaining the utility of the average.
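The two robust measures named above can be sketched in Python as follows. Here k is the number of observations handled at each end of the sorted data; the function names, the value of k, and the dataset are illustrative.

```python
def trimmed_mean(values, k):
    """Mean after discarding the k smallest and k largest observations."""
    ordered = sorted(values)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must leave at least one observation")
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)


def winsorized_mean(values, k):
    """Mean after replacing the k most extreme values at each end
    with the nearest remaining value."""
    ordered = sorted(values)
    if k < 0 or 2 * k >= len(ordered):
        raise ValueError("k must leave at least one observation")
    if k:
        ordered[:k] = [ordered[k]] * k
        ordered[-k:] = [ordered[-k - 1]] * k
    return sum(ordered) / len(ordered)


data = [10, 12, 14, 15, 16, 18, 100]  # contains one extreme value

plain = sum(data) / len(data)          # 185 / 7, pulled up by 100
trimmed = trimmed_mean(data, k=1)      # mean of 12, 14, 15, 16, 18
winsorized = winsorized_mean(data, k=1)  # mean of 12, 12, 14, 15, 16, 18, 18
print(plain, trimmed, winsorized)
```

Both robust versions land near the bulk of the data, while the plain mean is dragged toward the outlier.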
Summary of Key Concepts
Measures of central tendency are essential tools in summarizing data. The three main types are the mean, median, and mode. Each has unique properties and is suited for specific types of data.
The mean is arithmetic in nature and affected by all data points. It is widely used due to its simplicity and mathematical properties. However, it is sensitive to outliers.
The median represents the central position in an ordered dataset and is preferred when the distribution is skewed or when extreme values are present.
The mode identifies the most frequently occurring value and is suitable for categorical data.
Understanding when and how to use each measure helps ensure accurate analysis and interpretation. In addition, averages should not be used in isolation but supplemented with measures of variability and visualizations.
Final Thoughts
The concept of average and the broader category of measures of central tendency are fundamental to both theoretical and applied statistics. They provide a starting point for understanding complex data and are integral in decision-making processes across disciplines.
However, the misuse or misinterpretation of averages can lead to incorrect conclusions. Choosing the appropriate measure based on the characteristics of the data, considering accompanying measures of dispersion, and providing context are essential for effective statistical communication.