Statistics for Data Science

In today’s world, the technological revolution, the Internet of Things, information technology, and gadgets such as mobile phones generate huge amounts of data. Organizations need to analyze this data to create insights for better performance and profitability. This analysis is done by discovering patterns and trends in these large data sets, which can be achieved by using Statistics. Using statistics, we can get clear, fine-grained insights into how our data is structured and, based on that structure, how we can best apply data science techniques to get even more information. We can also say that Statistics is the heart of Data Science. With cloud computing and parallel computing, statistical methods can be applied to very large amounts of data.

Structure of Statistics

Figure: Structure of Statistics (source: image by the author)

Descriptive Statistics

Descriptive statistics summarizes or describes the features of a data set to make the data easier to understand. It is simply a concise representation of the information (data) available to us.

Measures used:

  1. Measures of Central Tendency
  2. Measures of Dispersion (or Variability)

Measures of Central Tendency

Whenever we measure things of the same kind, a fairly large number of the measurements tend to cluster around a middle value. Such a value is called a measure of “central tendency”.

  1. Mean: The mean is the ratio of the sum of all the values in the data set to the total number of observations. It is not a reliable measure when outliers are present, because extreme values pull it toward them.
  2. Median: The median is the point that divides the sorted data into two equal halves. It is the measure least influenced by outliers and by skewness of the data.
  3. Mode: The mode is the value that occurs most frequently. It is also resistant to outliers.

“An outlier is an observation that lies at an abnormal distance from other data points”.
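To make the effect of an outlier concrete, here is a minimal sketch using Python's standard `statistics` module. The salary figures are hypothetical and chosen only for illustration; the last value plays the role of the outlier.

```python
import statistics

# Hypothetical monthly salaries (in thousands); 400 is the outlier.
salaries = [30, 32, 32, 35, 38, 40, 400]

mean_value = statistics.mean(salaries)      # pulled upward by the outlier
median_value = statistics.median(salaries)  # middle value of the sorted data
mode_value = statistics.mode(salaries)      # most frequent value

# The same data with the outlier removed, for comparison.
without_outlier = salaries[:-1]
mean_without = statistics.mean(without_outlier)
median_without = statistics.median(without_outlier)

print(f"With outlier:    mean={mean_value:.2f}, median={median_value}, mode={mode_value}")
print(f"Without outlier: mean={mean_without:.2f}, median={median_without}")
```

In this sketch the mean changes a lot when the outlier is removed, while the median barely moves, which is the behavior described in the list above.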

Measures of Dispersion

Measures of dispersion indicate how widely the values of a distribution are spread around its central tendency.

  1. Range: The difference between the maximum and minimum values in a dataset.
  2. Variance: It indicates how far the data points are spread out from the mean of the observations (it is the average of the squared deviations from the mean). High variance indicates that the data points are spread widely, while low variance indicates that the data points are closer to the mean. The greater the variance, the greater the inconsistency in the data.
  3. Standard deviation: It is the square root of the variance. It is also used as a measure of risk and volatility in organizations.
  4. Coefficient of variation: Also called relative dispersion, it is defined as the ratio of the standard deviation to the mean.
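The four measures above can be sketched in a few lines of Python. The data values are hypothetical, and the sketch assumes the data is the whole population, so it uses `pvariance` and `pstdev`; for a sample you would use `statistics.variance` and `statistics.stdev` instead.

```python
import statistics

# Hypothetical observations, treated here as a complete population.
data = [2, 4, 4, 4, 5, 5, 7, 9]

data_range = max(data) - min(data)      # maximum minus minimum
variance = statistics.pvariance(data)   # mean of squared deviations from the mean
std_dev = statistics.pstdev(data)       # square root of the variance
coeff_var = std_dev / statistics.mean(data)  # standard deviation relative to the mean

print(f"range={data_range}, variance={variance}, "
      f"std_dev={std_dev}, coefficient_of_variation={coeff_var}")
```

Because the coefficient of variation is a ratio, it has no unit, which makes it handy for comparing the spread of data sets measured on different scales.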


Probability

Probability is a measure of uncertainty. It applies to data science because, in the real world, we need to arrive at decisions with incomplete information. Hence, we require a mechanism to quantify uncertainty, which probability provides. With probability, we can model elements of uncertainty, such as risk in financial transactions and many other business processes. It quantifies the likelihood, or degree of belief, that an event will occur, as a number between 0 and 1.

“An event is a set of outcomes of a random experiment to which a probability is assigned”.

Two important relationships between events are:

  1. Mutually Exclusive Events: If one event occurs, the other cannot occur at the same time. For example, when a single card is picked at random from a shuffled deck, it cannot be both a king and a queen.
  2. Independent Events: Two events are independent if the occurrence of one does not influence the probability of the other. For example, the outcomes of tossing a coin two times.
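The two examples above can be worked through numerically. The sketch below assumes a standard 52-card deck with four kings and four queens, and a fair coin. It uses the addition rule for mutually exclusive events and the multiplication rule for independent events, and checks the coin result by listing every outcome.

```python
from fractions import Fraction
from itertools import product

# Mutually exclusive events: one card drawn from a standard 52-card deck.
p_king = Fraction(4, 52)
p_queen = Fraction(4, 52)
# A single card cannot be both, so the probabilities simply add.
p_king_or_queen = p_king + p_queen

# Independent events: two tosses of a fair coin.
p_heads = Fraction(1, 2)
# The first toss does not affect the second, so the probabilities multiply.
p_two_heads = p_heads * p_heads

# Cross-check by enumerating all equally likely outcomes of two tosses.
outcomes = list(product("HT", repeat=2))
p_two_heads_enumerated = Fraction(outcomes.count(("H", "H")), len(outcomes))

print(f"P(king or queen) = {p_king_or_queen}")
print(f"P(two heads)     = {p_two_heads} (enumerated: {p_two_heads_enumerated})")
```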

Probability Distributions 

A probability distribution is a complete listing of the values a random variable can take, along with the corresponding probability of each value.

Types of Probability Distribution

  • Normal Distribution

The normal distribution, also known as the Gaussian distribution, is a distribution in which values near the mean occur more frequently and values far from the mean occur less frequently. The distribution is symmetric about the mean.

Figure: Probability Distribution (source: Wikimedia Commons)
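As a rough illustration of the normal distribution, the sketch below uses `NormalDist` from Python's standard `statistics` module. The mean of 170 and standard deviation of 10 (say, heights in centimetres) are hypothetical values chosen for the example.

```python
from statistics import NormalDist

# Hypothetical normal distribution: mean 170, standard deviation 10.
heights = NormalDist(mu=170, sigma=10)

# Values near the mean are more likely than values far from it.
density_at_mean = heights.pdf(170)
density_far = heights.pdf(200)

# The curve is symmetric about the mean.
density_left = heights.pdf(160)
density_right = heights.pdf(180)

# Share of values within one and two standard deviations of the mean.
within_one_sd = heights.cdf(180) - heights.cdf(160)
within_two_sd = heights.cdf(190) - heights.cdf(150)

print(f"density at mean={density_at_mean:.4f}, density far away={density_far:.6f}")
print(f"within 1 sd={within_one_sd:.4f}, within 2 sd={within_two_sd:.4f}")
```

Roughly 68% of the values fall within one standard deviation of the mean and roughly 95% within two, whatever mean and standard deviation are chosen.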

  • Poisson Distribution

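The Poisson distribution describes the number of times an event occurs in a fixed interval of time or space when events occur independently at a constant average rate. It is often used to model rare events in a large population.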

Figure: Poisson Distribution (source: Wikimedia Commons)
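The Poisson probability of seeing exactly k events when the average is λ is P(X = k) = e^(-λ) λ^k / k!. A minimal sketch of this formula follows; the helper name `poisson_pmf` and the average rate of 3 events per interval are assumptions made for the example.

```python
from math import exp, factorial

def poisson_pmf(k, rate):
    """Probability of exactly k events when the average number is `rate`."""
    return exp(-rate) * rate**k / factorial(k)

# Hypothetical average: 3 events per interval (for example, 3 defects per batch).
average_rate = 3

probabilities = [poisson_pmf(k, average_rate) for k in range(11)]
for k, p in enumerate(probabilities):
    print(f"P(X = {k:2d}) = {p:.4f}")

# Probability of at least one event is the complement of seeing none.
p_at_least_one = 1 - poisson_pmf(0, average_rate)
```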

  • Binomial Distribution

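The binomial distribution describes the number of successes in a fixed number of independent trials, where each trial has only two possible outcomes (success or failure) and the same probability of success.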

Figure: Binomial Distribution (source: Wikimedia Commons)
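For n trials with success probability p, the binomial probability of exactly k successes is P(X = k) = C(n, k) p^k (1 - p)^(n - k). Here is a small sketch of that formula; the helper name `binomial_pmf` and the choice of 10 fair coin tosses are assumptions made for the example.

```python
from math import comb

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

# Hypothetical experiment: 10 tosses of a fair coin, success = heads.
n_trials, p_success = 10, 0.5

distribution = [binomial_pmf(k, n_trials, p_success) for k in range(n_trials + 1)]
for k, prob in enumerate(distribution):
    print(f"P({k:2d} heads) = {prob:.4f}")

# The probabilities over all possible counts add up to 1.
total_probability = sum(distribution)
```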

Inferential Statistics

Inferential statistics in data science allows us to draw conclusions and make predictions about a population from the available sample data. Its methods can be classified in two ways:

  1. Hypothesis testing
  2. Estimation method

A hypothesis is an assumption that may or may not be true.

  1. Null Hypothesis: It is the default assumption that nothing is happening (no effect or no difference). It is formulated so that rejecting it provides evidence in favor of the alternate hypothesis; rejection does not prove that the alternate hypothesis is true. Usually, our aim is to find enough evidence to support the alternate hypothesis.
  2. Alternate Hypothesis: It is the statement that something is happening. It is the complement of the null hypothesis.

For example, if we want to show, based on its past performances, that the Indian cricket team is going to win the 2023 World Cup to be held in India, then that is our alternate hypothesis. Our null hypothesis is that India is not going to win the 2023 World Cup.
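A hypothesis test can be sketched in code with a simpler, swapped-in example: deciding whether a coin is biased toward heads. This is not the cricket example above, and the data (60 heads in 100 tosses), the 0.05 significance level, and the helper name `binomial_upper_tail` are all assumptions made for illustration. The null hypothesis is that the coin is fair, and the alternate hypothesis is that it favors heads.

```python
from math import comb

def binomial_upper_tail(successes, trials, p_null):
    """P(X >= successes) when X follows Binomial(trials, p_null)."""
    return sum(
        comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)
        for k in range(successes, trials + 1)
    )

# Hypothetical observation: 60 heads in 100 tosses.
heads, tosses = 60, 100
alpha = 0.05  # significance level, chosen for illustration

# Null hypothesis: the coin is fair (p = 0.5).
# Alternate hypothesis: the coin favors heads (p > 0.5).
# The p-value is the chance of a result at least this extreme if the null is true.
p_value = binomial_upper_tail(heads, tosses, 0.5)
reject_null = p_value < alpha

print(f"p-value = {p_value:.4f}, reject null hypothesis: {reject_null}")
```

A small p-value means the observed data would be unlikely if the null hypothesis were true, so we reject the null hypothesis. That counts as evidence for the alternate hypothesis, not as proof of it.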


Statistics is the backbone of Data Science. It tells us how the data is distributed, how much it varies, and how it needs to be treated for further analysis.

This article has covered the basics of Statistics for data science, along with the basic theory of probability and probability distributions. There is much more to explore when we talk about Statistics.

Hope you enjoyed reading.

