Introduction to Statistics

Parmita Biswas
Published in Analytics Vidhya
Sep 4, 2020

Photo by Isaac Smith on Unsplash

It's very important to know about statistics. Whether you come from a finance background or work as a data scientist or data analyst, life is all about mathematics. As per the Wikipedia definition, “Statistics is the discipline that concerns the collection, organization, analysis, interpretation and presentation of data. In applying statistics to a scientific, industrial, or social problem, it is conventional to begin with a statistical population or a statistical model to be studied.”

In this article, we will go through the basics of statistics; in the next few articles, we will dive deeper.

Things covered in this article:

· Data type

· Distributions

· Sampling and distribution

· Hypothesis testing

Data type:

Roughly, we can divide data into two types: categorical and numerical. Categorical data is further divided into nominal and ordinal, and numerical data into discrete and continuous.

Data Types

Examples:

1. What are the names of the students? [Options: Tony, Harry, Tom, Alex]

[Tony, Harry, Tom, Alex] is called the sample space, and its values are categorical data. More specifically, this is nominal data, because the values only name or label something and carry no quantitative value or order.

2. Which rating would you give to “XYZ” movie? [Very good, Good, Bad, Worse]

This is also categorical data, but it is ordinal because the options follow a set order or scale.

3. How many students are there in a class? [2, 3, 4, …, 10, …, 100]

This is an example of discrete data, because it can take only certain values. A class cannot have 2.5 students, so only whole numbers are possible.

4. What is the height of the students? [1–10]

This is an example of continuous data. Height can take any value within a range, such as 1.2, 1.87, or 1.09. Such values can have any number of decimal places and can always be subdivided further.
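The four data types above can be sketched in plain Python. This is only an illustration: the variable names, the `rating_rank` lookup, and the list of survey responses are assumptions made for the example, not part of any library.

```python
# Nominal: labels with no inherent order.
names = ["Tony", "Harry", "Tom", "Alex"]

# Ordinal: labels with a meaningful order, listed here from worst to best.
rating_scale = ["Worse", "Bad", "Good", "Very good"]
rating_rank = {label: rank for rank, label in enumerate(rating_scale)}


def is_better(a, b):
    """Return True if rating `a` ranks above rating `b` on the scale."""
    return rating_rank[a] > rating_rank[b]


# Discrete: only whole-number counts are possible.
class_sizes = [2, 3, 4, 10, 100]

# Continuous: any value within a range is possible.
heights = [1.2, 1.87, 1.09]

# Ordinal data can be sorted by rank; nominal data has no such ordering.
responses = ["Good", "Worse", "Very good", "Bad"]  # hypothetical survey answers
sorted_responses = sorted(responses, key=rating_rank.get)
print(sorted_responses)
```

The key difference shows up in what operations make sense: ratings can be compared and sorted, counts can be added, and heights can be averaged, while names can only be checked for equality.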

Distributions

How are marks of students distributed?

Minimum marks : 20

Maximum marks : 100

This means that the marks are distributed between 20 and 100. This can be represented in the form of a PDF (probability density function).

PDF — Probability Distribution Curve

This can be read as follows: the marks of the students (the population) range from 20 to 100, and every student's marks fall between these two numbers. In terms of the probability density function, the height of the curve at a given mark shows how likely we are to pick a student with roughly that mark when selecting someone at random from the population. So a student is more likely to have marks around the center (60) than marks of 25 or 95. If I select someone at random, I am most likely to pick a student with marks around 60 (the mean). This curve is called a bell curve, or normal distribution curve, and it is symmetrical about the mean.
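As a rough sketch of this idea, the bell curve can be modelled with `statistics.NormalDist` from the Python standard library. The mean of 60 comes from the example above; the standard deviation of 13 is an assumption chosen only so that most marks fall between 20 and 100.

```python
from statistics import NormalDist

# Mean of 60 is from the article; sigma=13 is an assumed value for illustration.
marks = NormalDist(mu=60, sigma=13)

# Height of the density curve at the center and near the two ends.
density_center = marks.pdf(60)
density_low = marks.pdf(25)
density_high = marks.pdf(95)

# Probability of picking a student within a 10-mark window.
prob_near_center = marks.cdf(65) - marks.cdf(55)
prob_near_top = marks.cdf(100) - marks.cdf(90)

print(f"density at 60: {density_center:.4f}")
print(f"density at 25: {density_low:.4f}, at 95: {density_high:.4f}")
print(f"P(55 <= mark <= 65): {prob_near_center:.3f}")
print(f"P(90 <= mark <= 100): {prob_near_top:.3f}")
```

The densities at 25 and 95 come out equal because both marks are the same distance from the mean, which is the symmetry of the curve.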

Some common terms used in statistics:

Terminologies

When we take a sample, the symbols for these quantities change: X̄ for the mean, S for the standard deviation, p for the proportion, r for the correlation, and b for the gradient.
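To make these sample statistics concrete, here is a minimal sketch that computes each of them for a small made-up sample. The data (hours studied and marks for five students) and the pass mark of 60 are assumptions for illustration only.

```python
import math
import statistics

# Hypothetical sample: hours studied and marks for five students.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
marks = [52.0, 55.0, 61.0, 64.0, 70.0]

# X̄: sample mean of the marks.
x_bar = statistics.mean(marks)

# S: sample standard deviation (uses the n - 1 denominator).
s = statistics.stdev(marks)

# p: sample proportion of students scoring at least 60 (assumed pass mark).
p = sum(m >= 60 for m in marks) / len(marks)

# Sums of squares and cross-products around the sample means.
h_bar = statistics.mean(hours)
sxy = sum((h - h_bar) * (m - x_bar) for h, m in zip(hours, marks))
sxx = sum((h - h_bar) ** 2 for h in hours)
syy = sum((m - x_bar) ** 2 for m in marks)

# r: sample correlation between hours and marks.
r = sxy / math.sqrt(sxx * syy)

# b: gradient (slope) of the least-squares line of marks on hours.
b = sxy / sxx

print(f"mean={x_bar}, sd={s:.2f}, proportion={p}, r={r:.3f}, gradient={b}")
```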

Hypothesis testing

Let's understand this with an example.

Example: Did dieters lose more fat than exercisers? We are given the following numbers.

Diet Only:

sample mean = 5.9 kg

sample standard deviation = 4.1 kg

sample size = n = 42

standard error = SEM1 = 4.1/ √42 = 0.633

Exercise Only:

sample mean = 4.1 kg

sample standard deviation = 3.7 kg

sample size = n = 47

standard error = SEM2 = 3.7/ √47 = 0.540

measure of variability = √[(0.633)² + (0.540)²] = 0.83
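The arithmetic above can be checked with a short sketch. The function name `standard_error` and the variable names are chosen for this example; the numbers are the ones given for the two groups.

```python
import math


def standard_error(sample_sd, n):
    """Standard error of a sample mean: sd / sqrt(n)."""
    return sample_sd / math.sqrt(n)


# Diet only: sd = 4.1 kg, n = 42.
sem_diet = standard_error(4.1, 42)

# Exercise only: sd = 3.7 kg, n = 47.
sem_exercise = standard_error(3.7, 47)

# Standard error of the difference between the two sample means:
# the square root of the sum of the squared standard errors.
se_diff = math.sqrt(sem_diet ** 2 + sem_exercise ** 2)

print(f"SEM (diet):     {sem_diet:.3f}")
print(f"SEM (exercise): {sem_exercise:.3f}")
print(f"SE of difference: {se_diff:.2f}")
```

Combining the two standard errors this way assumes the two groups are independent samples.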

Step 1: Determine the null and alternative hypotheses.

Null hypothesis: There is no difference in the average fat lost in the population between the two methods. The population mean difference is zero.

Alternative hypothesis: There is a difference in the average fat lost in the population between the two methods. The population mean difference is not zero.

Step 2. Collect and summarize data into a test statistic.

The sample mean difference = 5.9 - 4.1 = 1.8 kg

The standard error of the difference is 0.83.

So the test statistic: z = (1.8 - 0)/0.83 = 2.17

Step 3. Determine the p-value.

Recall that the alternative hypothesis was two-sided, so p-value = 2 × [proportion of bell-shaped curve above 2.17].

From a standard normal table, that proportion is about 0.015, so the p-value is about 2 × 0.015 = 0.03.
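Instead of a printed table, the tail proportion can be looked up with `statistics.NormalDist` from the Python standard library. This is a small sketch using the numbers from Steps 2 and 3; the variable names are chosen for the example.

```python
from statistics import NormalDist

sample_diff = 5.9 - 4.1   # difference in sample means (kg)
se_diff = 0.83            # standard error of the difference
null_diff = 0.0           # difference claimed by the null hypothesis

# Test statistic: how many standard errors the observed difference
# lies from the value stated in the null hypothesis.
z = (sample_diff - null_diff) / se_diff

# Proportion of the standard normal curve above |z|.
upper_tail = 1 - NormalDist().cdf(abs(z))

# Two-sided test: count both tails.
p_value = 2 * upper_tail

print(f"z = {z:.2f}")
print(f"upper tail = {upper_tail:.3f}")
print(f"p-value = {p_value:.2f}")
```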

Step 4. Decide.

The p-value of 0.03 is less than 0.05, so we reject the null hypothesis:

• If there were really no difference between dieting and exercise as fat loss methods, we would see such an extreme result only 3% of the time, or 3 times out of 100.

• We prefer to believe that the truth does not lie with the null hypothesis. We conclude that there is a statistically significant difference between the average fat loss for the two methods.
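The "3 times out of 100" reading can be illustrated with a small simulation. This is a sketch under stated assumptions: the null hypothesis is true, the difference in sample means follows a normal curve centered at zero with the standard error of 0.83, and the number of trials and the random seed are arbitrary choices.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

SE_DIFF = 0.83        # standard error of the difference, from the example
OBSERVED_DIFF = 1.8   # observed difference in sample means (kg)
N_TRIALS = 100_000    # number of simulated studies (arbitrary choice)

extreme = 0
for _ in range(N_TRIALS):
    # One simulated study in a world where the true difference is zero.
    simulated_diff = random.gauss(0.0, SE_DIFF)
    # Count results at least as extreme as the observed one, in either direction.
    if abs(simulated_diff) >= OBSERVED_DIFF:
        extreme += 1

fraction_extreme = extreme / N_TRIALS
print(f"fraction at least as extreme: {fraction_extreme:.3f}")
```

The printed fraction should land close to the p-value from Step 3, since both measure how often chance alone would produce a difference this large.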

Congratulations, you did it.

For now, thank you all for making it this far. We covered the basics of hypothesis testing and the bell curve. In upcoming articles, we will dive deeper into the various types of distributions and their terminology.

And as always, if you have any questions, remarks, or comments, feel free to contact me!

Reference :

Statistics How To

https://www2.stat.duke.edu/courses

Parmita Biswas
Analytics Vidhya

I am an enthusiastic data scientist and Python developer with ten years of overall industry experience.