Jargon and Concepts from Probability and Statistics for a Data Science Interview

Sample Space:

Set of all possible outcomes of a random experiment.
For a die-rolling experiment, we can write the sample space as S = {1, 2, 3, 4, 5, 6}

Probability:

Probability is a measure of how likely an event is to happen.
When all outcomes are equally likely: Probability of an event = (number of ways it can happen) / (total number of outcomes)
Example 1: Flip a coin
There are 2 possible outcomes: heads or tails. So, what’s the probability of getting heads?
P(H) = 1/2 = 50%

Example 2: Roll a die
There are 6 possible outcomes: 1, 2, 3, 4, 5, 6

So, what’s the probability of rolling a 1?
P(1) = 1/6
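What’s the probability of rolling a one OR a six?
The two outcomes cannot happen on the same roll (they are mutually exclusive), so their probabilities add:
P(1 or 6) = P(1) + P(6) = 1/6 + 1/6 = 2/6 = 1/3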
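
The counting rule above can be checked with a few lines of Python. This is a minimal sketch that assumes a fair die, so every outcome is equally likely; the helper name `probability` is made up for illustration.

```python
from fractions import Fraction

# Sample space for one roll of a fair six-sided die.
sample_space = {1, 2, 3, 4, 5, 6}

def probability(event, space):
    """P(event) = favourable outcomes / total outcomes (equally likely outcomes only)."""
    favourable = event & space  # ignore anything outside the sample space
    return Fraction(len(favourable), len(space))

p_one = probability({1}, sample_space)
p_one_or_six = probability({1, 6}, sample_space)

print(p_one)         # 1/6
print(p_one_or_six)  # 1/3
```

Using `Fraction` keeps the results exact, so 2/6 is reduced to 1/3 automatically.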

Independent Events:

In probability, two events are independent if the occurrence of one event does not affect the outcome of the other event. The formula for finding the probability of independent events:

P(A and B) = P(A) * P(B)
Example: Flip a coin and roll a die simultaneously. These 2 events are independent of each other. Now, let’s answer a question:

Q: What is the probability of rolling a number less than 5 and getting Tails?
P(no. < 5 and T) = P(no. < 5) * P(T)
= 4/6 * 1/2
= 2/3 * 1/2
= 1/3
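
One way to sanity-check the multiplication rule is to list every (die, coin) pair and count the favourable ones. The sketch below assumes a fair die and a fair coin; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import product

die = [1, 2, 3, 4, 5, 6]
coin = ["H", "T"]

# Joint sample space: every (die, coin) pair, 12 equally likely outcomes.
joint = list(product(die, coin))

# Count the pairs where the die shows less than 5 AND the coin shows tails.
favourable = [(d, c) for d, c in joint if d < 5 and c == "T"]
p_by_counting = Fraction(len(favourable), len(joint))

# Multiplication rule for independent events.
p_by_formula = Fraction(4, 6) * Fraction(1, 2)

print(p_by_counting)  # 1/3
print(p_by_formula)   # 1/3
```

Both routes give 1/3, which is what independence predicts.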

Dependent Events:

If the occurrence of one event does affect the outcome of the other event, then the events are dependent.
Example: Draw a card from a deck of 52 cards without replacement 2 times in a row. The outcome of the second draw depends on the first, because only 51 cards remain and the card already drawn cannot appear again.
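
To make the dependence concrete, here is a small sketch for one specific question that is not in the example above and is chosen only for illustration: the probability of drawing two aces in a row from a standard deck. For dependent events the rule becomes P(A and B) = P(A) * P(B given A).

```python
from fractions import Fraction

deck_size = 52
aces = 4

# First draw: 4 aces among 52 cards.
p_first_ace = Fraction(aces, deck_size)

# Second draw, given the first was an ace: 3 aces among the 51 remaining cards.
p_second_ace_given_first = Fraction(aces - 1, deck_size - 1)

p_two_aces = p_first_ace * p_second_ace_given_first
print(p_two_aces)  # 1/221

# For contrast: with replacement the draws are independent.
p_two_aces_with_replacement = p_first_ace * p_first_ace
print(p_two_aces_with_replacement)  # 1/169
```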

Descriptive Statistics:

Descriptive statistics involves summarizing and describing the data in a sample so they can be easily understood.
There are 2 categories:
1. Measures of central tendency – Mean, Median, Mode
2. Measures of variability (spread) – Standard Deviation, Range, Variance, Percentile
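
Most of these measures are available in Python's standard `statistics` module. The sketch below uses a small made-up data set, invented purely for illustration.

```python
import statistics

# Hypothetical sample, invented for illustration.
data = [2, 4, 4, 4, 5, 5, 7, 9]

# Measures of central tendency
mean_value = statistics.mean(data)      # 5
median_value = statistics.median(data)  # 4.5
mode_value = statistics.mode(data)      # 4

# Measures of variability (spread)
data_range = max(data) - min(data)           # 7
sample_variance = statistics.variance(data)  # divides by n - 1
sample_std_dev = statistics.stdev(data)      # square root of the sample variance
quartiles = statistics.quantiles(data, n=4)  # 25th, 50th and 75th percentiles

print(mean_value, median_value, mode_value)
print(data_range, sample_variance, sample_std_dev, quartiles)
```

`variance` and `stdev` treat the data as a sample (dividing by n - 1); `pvariance` and `pstdev` are the population versions.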

Inferential Statistics:

Inferential statistics takes data from a sample and makes inferences about the larger population from which the sample was drawn.
The most common methodologies in inferential statistics are hypothesis tests, confidence intervals, and regression analysis.

Hypothesis Testing:

Hypothesis testing uses sample data to decide whether there is enough evidence to reject a hypothesis about a population. It does not give the probability that the hypothesis is true. The process consists of the steps below:

1. State the Null Hypothesis(H0)

2. State an Alternate Hypothesis (Ha)

3. Identify a Test Statistic to assess the truth of the Null Hypothesis

4. Decide on the significance level or alpha value (usually 5%)

5. Compute the p-value from the test statistic. If the p-value is less than or equal to the alpha value, reject the Null Hypothesis; otherwise, fail to reject it.

Type I Error:

The error of rejecting a null hypothesis when it is actually true.

Type II Error:

The error of not rejecting a null hypothesis when the alternative hypothesis is true.

Summary:

Action (rows) vs. Reality (columns):

                    | H0 True            | H0 False
Reject H0           | Type I Error       | Correct Conclusion
Fail To Reject H0   | Correct Conclusion | Type II Error
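
A simulation can show what a Type I error looks like in practice. In the sketch below the Null Hypothesis is true in every trial, so every rejection is a Type I error, and the rejection rate should land near the significance level. The population values, sample size, and trial count are all hypothetical, and the test used is a two-sided z-test with a known standard deviation.

```python
import random
from statistics import NormalDist, mean

random.seed(0)  # fixed seed so the sketch is repeatable

ALPHA = 0.05
MU_0, SIGMA, N = 0.0, 1.0, 30  # hypothetical population under H0
TRIALS = 2000

def rejects_h0(sample, mu_0, sigma, alpha):
    """Two-sided z-test: True if H0 (mean == mu_0) is rejected."""
    z = (mean(sample) - mu_0) / (sigma / len(sample) ** 0.5)
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return p_value <= alpha

# H0 is true in every trial, so each rejection is a Type I error.
false_rejections = sum(
    rejects_h0([random.gauss(MU_0, SIGMA) for _ in range(N)], MU_0, SIGMA, ALPHA)
    for _ in range(TRIALS)
)
type_1_rate = false_rejections / TRIALS
print(f"Observed Type I error rate: {type_1_rate:.3f} (alpha = {ALPHA})")
```

Lowering alpha makes Type I errors rarer, but it also makes it harder to reject a false Null Hypothesis, which raises the Type II error rate.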

Example(Khan Academy): A large nationwide poll recently showed an unemployment rate of 9% in the US. The Mayor of a local town wonders if the national result holds true in her town. So she plans on taking a sample of her residents to see if the unemployment rate is significantly different than 9% in her town.

H0: The unemployment rate is 9% in her town (H0: p = 0.09)
Ha: The unemployment rate is NOT 9% in her town (Ha: p != 0.09)

Under which conditions would the Mayor commit a Type I error and a Type II error?

Answer:
Type I error: The Mayor concludes that the town’s unemployment rate is NOT 9% when it actually is 9%.
Type II error: The Mayor fails to conclude that the town’s unemployment rate differs from 9% when it actually is NOT 9%.
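
The Mayor's test could be carried out along the following lines. This is only a sketch: the survey numbers (200 residents, 22 unemployed) are invented for illustration, and it assumes a one-proportion z-test with a normal approximation, which the example itself does not specify.

```python
from math import sqrt
from statistics import NormalDist

P_0 = 0.09    # unemployment rate claimed by H0
ALPHA = 0.05  # significance level

# Hypothetical survey result, invented for illustration only.
sample_size = 200
unemployed = 22

p_hat = unemployed / sample_size
standard_error = sqrt(P_0 * (1 - P_0) / sample_size)  # uses p under H0
z = (p_hat - P_0) / standard_error
p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # two-sided, since Ha is p != 0.09

reject_h0 = p_value <= ALPHA
print(f"z = {z:.2f}, p-value = {p_value:.3f}, reject H0: {reject_h0}")
```

With these made-up numbers the sample rate of 11% is not far enough from 9% to reject H0. If the town's true rate were in fact different from 9%, that decision would be a Type II error.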

P-Value:

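When we perform a hypothesis test in statistics, a p-value helps to determine the significance of the results. It is the probability of obtaining results at least as extreme as those observed, assuming the null hypothesis is true.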

  • A small p-value (typically ≤ 0.05) indicates strong evidence against the null hypothesis, so you reject the null hypothesis.
  • A large p-value (> 0.05) indicates weak evidence against the null hypothesis, so you fail to reject the null hypothesis.
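
As a concrete illustration, the sketch below computes an exact two-sided p-value for a made-up coin experiment: 60 heads in 100 flips, with the null hypothesis that the coin is fair. The function name and the experiment are assumptions for illustration.

```python
from math import comb

def two_sided_binomial_p_value(heads, flips):
    """P(result at least as far from flips/2 as observed | coin is fair)."""
    observed_gap = abs(heads - flips / 2)
    # Count the equally likely flip sequences whose head count is at least
    # as extreme as the observed one, in either direction.
    extreme_sequences = sum(
        comb(flips, k)
        for k in range(flips + 1)
        if abs(k - flips / 2) >= observed_gap
    )
    return extreme_sequences / 2 ** flips

p_value = two_sided_binomial_p_value(60, 100)
print(f"p-value = {p_value:.4f}")
print("reject H0" if p_value <= 0.05 else "fail to reject H0")
```

Here the p-value comes out slightly above 0.05, so at the 5% level the result is not strong enough evidence that the coin is unfair.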

Z-Test:

– Statistical calculations that can be used to compare the population mean to the sample mean
– Deals with large samples (n >= 30)
– Usually used when the population standard deviation is known
z = (x - μ) / (σ / √n)
where
x = sample mean
μ = population mean
σ = population standard deviation
n = sample size
σ / √n = standard error of the sample mean
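
A minimal sketch of the z-test in Python, using only the standard library. The numbers in the usage line (population mean 100, population standard deviation 15, a sample of 36 with mean 105) are hypothetical, and the p-value is two-sided.

```python
from math import sqrt
from statistics import NormalDist

def z_test(sample_mean, population_mean, population_sd, n):
    """Return (z, two-sided p-value) for a one-sample z-test."""
    standard_error = population_sd / sqrt(n)
    z = (sample_mean - population_mean) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical numbers, for illustration only.
z, p_value = z_test(sample_mean=105, population_mean=100, population_sd=15, n=36)
print(f"z = {z:.2f}, p-value = {p_value:.4f}")  # z = 2.00, p-value = 0.0455
```

Since 0.0455 is below 0.05, this sample would lead to rejecting the null hypothesis at the 5% level.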

T-Test:

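  • A t-test is used to compare the means of two given samples
  • Deals with a limited sample size (n < 30)
  • A t-test is used when the population parameters (mean and standard deviation) are not known, so the standard deviations are estimated from the samples.
    t = (x1 - x2) / √(s1²/n1 + s2²/n2)    (unequal-variance, or Welch, form)
    where
    x1 = mean of sample 1
    x2 = mean of sample 2
    s1 = standard deviation of sample 1
    s2 = standard deviation of sample 2
    n1 = size of sample 1
    n2 = size of sample 2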
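
One common form of the two-sample t statistic is Welch's, t = (x1 - x2) / √(s1²/n1 + s2²/n2), where s1 and s2 are the sample standard deviations. The sketch below computes it, along with the Welch-Satterthwaite degrees of freedom, using the standard library. The two groups are made-up numbers and the function name is illustrative.

```python
from statistics import mean, variance

def welch_t(sample_1, sample_2):
    """Return (t, degrees of freedom) for Welch's two-sample t-test."""
    n1, n2 = len(sample_1), len(sample_2)
    v1, v2 = variance(sample_1), variance(sample_2)  # sample variances s1^2, s2^2
    standard_error = (v1 / n1 + v2 / n2) ** 0.5
    t = (mean(sample_1) - mean(sample_2)) / standard_error
    # Welch-Satterthwaite approximation for the degrees of freedom.
    df = (v1 / n1 + v2 / n2) ** 2 / (
        (v1 / n1) ** 2 / (n1 - 1) + (v2 / n2) ** 2 / (n2 - 1)
    )
    return t, df

# Hypothetical measurements, invented for illustration.
group_a = [20.1, 22.3, 19.8, 21.5, 20.9, 22.0]
group_b = [18.2, 19.1, 17.8, 18.9, 19.5, 18.4]

t, df = welch_t(group_a, group_b)
print(f"t = {t:.2f}, df = {df:.1f}")
```

Turning t into a p-value requires the t distribution, which the standard library does not provide. If SciPy is available, `scipy.stats.ttest_ind(group_a, group_b, equal_var=False)` performs the same test and returns the p-value as well.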

Central Limit Theorem:

The Central Limit Theorem states that the sampling distribution of the sample mean approaches a normal distribution as the sample size gets larger, regardless of the shape of the population distribution (provided the population has a finite variance).

Example: There are 10 sections in the science department of a university. Each section has 100 students. Calculate the average weight of students in the science department.

Approach1:

  • First, measure the weights of all the students in the science department
  • Add all the weights
  • Finally, divide the total sum of weights by the total number of students to get the average

Approach2:

  • First, draw groups of students at random from the department. We will call each group a sample.
  • Draw multiple samples, each consisting of 30 students.
  • Calculate the individual mean of these samples
  • Calculate the mean of these sample means
  • This value will give us the approximate mean weight of the students in the science department

Thus, as per the Central Limit Theorem, the distribution of sample means, calculated from repeated sampling, will tend toward normality as the size of the samples gets larger.
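
The two approaches can be compared with a small simulation. The sketch below invents a deliberately skewed population of 1,000 student weights (the distribution and its parameters are assumptions, not real data), then checks that the mean of many sample means lands close to the true mean and that the sample means spread out by roughly σ / √n.

```python
import random
from statistics import mean, stdev

random.seed(42)  # fixed seed so the sketch is repeatable

# Hypothetical, right-skewed weights (kg) for 10 sections x 100 students.
population = [45 + random.expovariate(1 / 20) for _ in range(1000)]

# Approach 1: measure everyone.
true_mean = mean(population)

# Approach 2: draw many random samples of 30 and average the sample means.
SAMPLE_SIZE = 30
NUM_SAMPLES = 2000
sample_means = [
    mean(random.sample(population, SAMPLE_SIZE)) for _ in range(NUM_SAMPLES)
]
mean_of_sample_means = mean(sample_means)

# The spread of the sample means should be roughly sigma / sqrt(n).
expected_spread = stdev(population) / SAMPLE_SIZE ** 0.5
observed_spread = stdev(sample_means)

print(f"true mean            = {true_mean:.2f}")
print(f"mean of sample means = {mean_of_sample_means:.2f}")
print(f"spread of means      = {observed_spread:.2f} (expected about {expected_spread:.2f})")
```

Plotting a histogram of `sample_means` would show a roughly bell-shaped curve even though the individual weights are skewed.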
