Statistical Functions and Statistical Distributions

Mohamed Niang · 11 min read · May 20, 2021

Statistical functions

Averages and measures of central location

Mean

The mean is the sum of all values divided by the number of values.

import statistics as st

x = [4,5,1,2,7,2,6,9,3,7,2]

print('Mean:', st.mean(x))
Mean: 4.363636363636363

1) ✍ Implement mean from scratch.

def mean(X):
    total = 0              # avoid shadowing the built-in sum()
    for n in X:
        total += n
    return total / len(X)

print('Mean:', mean(x))
Mean: 4.363636363636363

Median

The median is the middle value of a sorted list of numbers; for a list with an even number of values, it is the average of the two middle values.

print('Median:', st.median(x))
Median: 4

2) ✍ Implement median from scratch.

def median(X):
    N = len(X)
    X = sorted(X)          # sort a copy so the input list is untouched
    if N % 2 != 0:
        return X[N // 2]   # odd length: the single middle value
    else:
        return (X[N // 2] + X[N // 2 - 1]) / 2   # even length: average of the two middle values

print('Median:', median(x))
Median: 4

Mode

The mode is the value that occurs most often. If no number in the list is repeated, then there is no mode for the list.

x = [1,2,3,3,5]
print(x)
print('Mode:', st.mode(x))
[1, 2, 3, 3, 5]
Mode: 3
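
Note that statistics.mode raised a StatisticsError on ties in Python 3.7 and earlier, while from Python 3.8 it returns the first mode encountered. Assuming Python 3.8+, statistics.multimode sidesteps the ambiguity by returning every value tied for the highest count:

print(st.multimode([1, 2, 3, 3, 5]))   # [3]
print(st.multimode([1, 1, 2, 2, 3]))   # [1, 2]: two modes, no single answer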

3) ✍ Implement mode from scratch.

# One-liner alternative: mode = max(set(x), key=x.count)

def mode(X):
    # Pair every value with its count, then de-duplicate the pairs
    pairs = list(set((v, X.count(v)) for v in X))

    # Two stable sorts: by value, then by count, so ties in count
    # resolve to the largest value
    pairs.sort(key=lambda p: p[0], reverse=True)
    pairs.sort(key=lambda p: p[1], reverse=True)

    return pairs[0][0]


x = [1, 2, 2, 3]
print('Mode:', mode(x))
Mode: 2

An alternative approach makes a single pass over the sorted list, tracking the length of the current run of equal values:

X = [2,2,2,3,3,5,5,5,5,8,8]
X.sort()

count = 1
count_number = X[0]
count_max = 1
for i in range(1, len(X)):
    if X[i] == X[i-1]:
        count = count + 1
        if count > count_max:
            count_max = count
            count_number = X[i]
    else:
        count = 1

print(count_number)
5

The same result with collections.Counter:

from collections import Counter
Counter(X).most_common()[0][0]
5

And a third variant using a dictionary of counts:

l = [2,5,4,8,3,8,6,8,6]
s = list(set(l))
c = []
for i in s:
    c.append(l.count(i))
d = {k: v for k, v in zip(s, c)}
for k in d.keys():
    if d[k] == max(c):
        print(k)
8

Measures of Spread

Range

The range is the difference between the largest data value and the smallest data value in the set.

import numpy as np

x = [1, 4, 8]
print(x)
print('Range:', np.ptp(x))
[1, 4, 8]
Range: 7

4) ✍ Implement range from scratch.

np.max(x) - np.min(x)
7

def my_range(l):
    # Track the running min and max in a single pass
    l_min, l_max = float("inf"), -float("inf")
    for elem in l:
        l_min = min(l_min, elem)
        l_max = max(l_max, elem)
    return l_max - l_min

my_range([1,2,4,3])
3

Interquartile Range (IQR)

The interquartile range (IQR), also called the midspread, is the length of the interval covering the middle 50% of the data. It is calculated as the difference between the third or upper quartile ($Q_3$) and the first or lower quartile ($Q_1$).

import numpy as np
from scipy.stats import iqr

#x = [4,5,1,2,7,2,6,9,3,7,2]
x = [1, 4, 5, 6, 6, 7, 11, 12, 15, 17]

print('IQR of data:', iqr(x))
IQR of data: 6.5

q3, q1 = np.percentile(x, [75, 25], interpolation="nearest")
print('IQR of data:', q3 - q1)
IQR of data: 7

print(q1, q3)
5 12

The two results differ because iqr defaults to linear interpolation between data points, while interpolation="nearest" snaps each quartile to an actual data value.

5) ✍ Implement IQR from scratch.

def median(X, N):
    X = sorted(X)
    if N % 2 != 0:
        return X[N // 2]
    else:
        return (X[N // 2] + X[N // 2 - 1]) / 2.

def Interquartile_Range(X):
    N = len(X)
    X.sort()

    # Q1: median of the lower half of the data
    Q1 = median(X[:N // 2], N // 2)
    print(Q1)

    # Q3: median of the upper half (skip the middle element when N is odd)
    if N % 2 == 0:
        Q3 = median(X[N // 2:], N // 2)
    else:
        Q3 = median(X[N // 2 + 1:], N // 2)
    print(Q3)

    return Q3 - Q1

x = [1, 4, 5, 6, 6, 7, 11, 12, 15, 17]
print("%0.1f" % Interquartile_Range(x))
5
12
7.0

Variance

Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers is spread out from its average value.

print(x)
print('Population variance:', st.pvariance(x))

print('Sample variance:', st.variance(x))
[1, 4, 5, 6, 6, 7, 11, 12, 15, 17]
Population variance: 23.64
Sample variance: 26.266666666666666
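
In the spirit of the from-scratch exercises above, both estimators can be written directly; the only difference is the divisor (N for the population variance, N - 1 for the unbiased sample variance). A minimal sketch:

def variance(X, sample=True):
    N = len(X)
    mu = sum(X) / N
    ss = sum((xi - mu)**2 for xi in X)   # sum of squared deviations
    return ss / (N - 1) if sample else ss / N

print('Population variance:', variance(x, sample=False))   # 23.64
print('Sample variance:', variance(x))                      # 26.266...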

Standard Deviation

The standard deviation is the square root of the variance.

print('Population standard deviation:', st.pstdev(x))

print('Sample standard deviation:', st.stdev(x))
Population standard deviation: 4.862098312457287
Sample standard deviation: 5.125101625008686
print('Population standard deviation:', np.std(x, ddof=0)) # ddof: delta degrees of freedom

print('Sample standard deviation:', np.std(x, ddof=1))
Population standard deviation: 4.862098312457287
Sample standard deviation: 5.125101625008686

6) ✍ Implement population standard deviation from scratch.

def mean(X, N):
    total = 0
    for i in range(len(X)):
        total += X[i]
    return total / N

def standard_deviation(X):
    N = len(X)
    mu = mean(X, N)

    # Sum of squared deviations from the mean
    tmp = 0
    for i in range(N):
        tmp += (X[i] - mu)**2

    return (tmp / N)**0.5

x = [1,2,3,4,5,6]
standard_deviation(x)
1.707825127659933

Skewness

Skewness is a measure of symmetry, or more precisely, the lack of symmetry. A distribution, or data set, is symmetric if it looks the same to the left and right of the center point. A from-scratch cross-check is sketched after the list below.

  • The skewness for a normal distribution is zero, and any symmetric data should have a skewness near zero.
  • Negative values for the skewness indicate data that are skewed left and positive values for the skewness indicate data that are skewed right. By skewed left, we mean that the left tail is long relative to the right tail.
  • Similarly, skewed right means that the right tail is long relative to the left tail.
  • If the data are multi-modal, then this may affect the sign of the skewness.
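
As that cross-check, skewness can be computed from scratch as the (biased) Fisher-Pearson coefficient $g_1 = m_3 / m_2^{3/2}$, where $m_2$ and $m_3$ are the second and third central moments; this minimal sketch matches scipy.stats.skew's default estimator:

import numpy as np

def skewness(X):
    X = np.asarray(X)
    mu = X.mean()
    m2 = ((X - mu)**2).mean()   # second central moment
    m3 = ((X - mu)**3).mean()   # third central moment
    return m3 / m2**1.5
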
from scipy.stats import skew

#data = np.random.normal(0, 1, 1000000)
data = np.random.uniform(0, 1, 100000)

print("skewness:", skew(data))
skewness: -0.0030195334962366182

Kurtosis

Kurtosis is a measure of whether the data are heavy-tailed or light-tailed relative to a normal distribution. That is, data sets with high kurtosis tend to have heavy tails, or outliers, while data sets with low kurtosis tend to have light tails and a lack of outliers. A uniform distribution is an extreme light-tailed case. Note that scipy's kurtosis reports excess kurtosis by default (Fisher's definition), so a normal distribution scores 0 and a uniform distribution scores -1.2.

from scipy.stats import kurtosis

print("kurtosis:", kurtosis(data))
kurtosis: -1.198085296060503
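
A minimal from-scratch sketch of the same (biased) estimator scipy uses by default, built from the fourth and second central moments:

import numpy as np

def excess_kurtosis(X):
    X = np.asarray(X)
    mu = X.mean()
    m2 = ((X - mu)**2).mean()   # second central moment
    m4 = ((X - mu)**4).mean()   # fourth central moment
    return m4 / m2**2 - 3       # subtract 3 so a normal distribution scores 0

print("kurtosis:", excess_kurtosis(data))   # matches kurtosis(data) above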

Statistical distributions

Discrete distributions

Discrete uniform distribution

The discrete uniform distribution is a symmetric probability distribution whereby a finite number of values are equally likely to be observed. Its probability mass function is $$ p(x) = \frac{1}{b - a} $$ for $x = a, \dots, b - 1$.

Python syntax:

scipy.stats.randint(low, high+1)

np.random.randint(low, high+1)

from scipy.stats import randint

low = 3
high = 8
X = randint(low, high + 1) # Declare X to be a uniform random variable
X.rvs(size=10) # Get 10 random samples from X
array([8, 6, 7, 6, 6, 5, 4, 3, 5, 7])

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

def plot_dist(X, title):
    # Note: distplot is deprecated in recent seaborn versions; histplot is the replacement
    ax = sns.distplot(X, kde=False, color='crimson', hist_kws={"linewidth": 15, 'alpha': 1})
    ax.set(xlabel='x', ylabel='Frequency', title=title)

X = randint.rvs(low, high + 1, size= 10000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Uniform Distribution')
skewness: -0.019059873221003085
kurtosis: -1.2545266059524007
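
As a quick check of the PMF, with low = 3 and high = 8 there are b - a = 6 equally likely values, so each has probability 1/6:

U = randint(low, high + 1)                      # frozen distribution over 3, 4, ..., 8
for k in range(low, high + 1):
    print('P(X = %d) = %.4f' % (k, U.pmf(k)))   # each prints 0.1667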



Bernoulli distribution

A Bernoulli trial is an experiment that has two outcomes that can be encoded as success ($x = 1$) or failure ($x = 0$). The Bernoulli Probability Mass Function (PMF) is $$ p(x) = p^x (1 - p)^{1 - x} $$ where $x \in \{0, 1\}$.

Python syntax:

scipy.stats.bernoulli(theta)

np.random.choice([0, 1], p=[1-theta, theta])

from scipy.stats import bernoulli

p = 0.25 # probability of success
X = bernoulli(p) # Declare X to be a bernoulli random variable
X.rvs(10) # Get 10 random samples from X
array([0, 0, 0, 0, 1, 1, 0, 0, 0, 0])

print('P(X = 0) =', X.pmf(0))
print('P(X = 1) =', X.pmf(1))
print('P(X <= 0.5) =', X.cdf(0.5)) # P(X<= 0.5) = P(X = 0) = .75
print('E[X] =', X.mean())
print('Var(X) =', X.var())
print('Std(X) =', X.std())
P(X = 0) = 0.75
P(X = 1) = 0.25
P(X <= 0.5) = 0.75
E[X] = 0.25
Var(X) = 0.1875
Std(X) = 0.4330127018922193
import seaborn as sns

X = bernoulli.rvs(p= 0.6, size= 1000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Bernoulli Distribution')
skewness: -0.37860678521640406
kurtosis: -1.8566569021880999



Binomial Distribution

The binomial distribution is used to obtain the probability of observing $x$ successes in $n$ trials, with the probability of success on a single trial denoted by $p$. The binomial distribution assumes that $p$ is fixed for all trials. The formula for the binomial probability mass function is $$ p(x) = {n \choose x} p^x (1 - p)^{n - x} $$ where $x \in \{0, 1, \dots, n\}$, and ${n \choose x}$ is the binomial coefficient that counts the different ways of distributing $x$ successes in a sequence of $n$ trials.

Python syntax:

scipy.stats.binom(N, theta)

np.random.binomial(N, theta)

from scipy.stats import binom

n = 7 # Number of trials (number of coins)
p = 0.5 # Probability of success in each trial
sample_size = 5 # Sample size (number of experiments)

X = binom.rvs(n = n, p = p, size = sample_size)
X
array([6, 4, 2, 3, 2])

x = 2 # Number of successes sought
X = binom(n = n, p = p)

print('P(X = %d) = %f' %(x, X.pmf(x)))
print('P(X <= %d) = %f' % (x, X.cdf(x)))
print('E[X] =', X.mean())
print('Var(X) =', X.var())
print('Std(X) =', X.std())
X = binom.rvs(n = 10, p = 0.3, size= int(1e6))

print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Binomial Distribution')
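
To tie the code back to the formula, a minimal from-scratch PMF using math.comb (available in Python 3.8+) can be checked against scipy:

from math import comb
from scipy.stats import binom

def binom_pmf(x, n, p):
    # comb(n, x) ways to place x successes, each sequence having prob p^x (1-p)^(n-x)
    return comb(n, x) * p**x * (1 - p)**(n - x)

print(binom_pmf(2, 7, 0.5))       # 0.1640625
print(binom.pmf(2, n=7, p=0.5))   # same value from scipy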

Geometric distribution

We perform a series of Bernoulli trials with probability of success $p$ until we get a success. The probability that the first success occurs on the $x$-th trial is $$ p(x) = (1 - p)^{x - 1} p $$ where $x \ge 1$.

Python syntax:

scipy.stats.geom(theta)

np.random.geometric(theta)

from scipy.stats import geom

p = 0.2 # probability of success on each trial
X = geom.rvs(p, size= 1000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Geometric Distribution')
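
The PMF matches the formula directly: the first success on trial x requires x - 1 failures followed by one success. A quick check with p = 0.2:

p = 0.2
x = 3
print((1 - p)**(x - 1) * p)   # 0.8**2 * 0.2 = 0.128
print(geom.pmf(x, p))         # same value from scipy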

Poisson distribution

The Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known constant rate and independently of the time since the last event. The Poisson probability that exactly $x$ successes occur in an interval, with the mean number of successes $\lambda$, is $$ p(x) = e^{-\lambda} \frac{\lambda^x}{x!} $$ where $x = 0, 1, 2, \dots$

Python syntax:

scipy.stats.poisson(lam)

np.random.poisson(lam)

from scipy.stats import poisson

lam = 3 # average number of events per interval
X = poisson.rvs(lam, size= 10000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Poisson Distribution')
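
As a sanity check of the formula, the PMF can be evaluated by hand with math.exp and math.factorial and compared to scipy:

from math import exp, factorial

x = 2   # number of events
print(exp(-lam) * lam**x / factorial(x))   # 0.2240...
print(poisson.pmf(x, lam))                 # same value from scipy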

Continuous distributions

Uniform distribution

Every value in a given range is equally likely; the probability density is constant over the interval.

Python syntax:

scipy.stats.uniform(loc=low, scale=high - low)

np.random.uniform(low, high)

from scipy.stats import uniform

low = 3
high = 8
X = uniform.rvs(low, high - low, size= 10000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Uniform Distribution')

Normal Distribution

The probability density of the normal (Gaussian) distribution is $$ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) $$ where $\mu$ is the mean and $\sigma^2$ is the variance.

Python syntax:

scipy.stats.norm(mu, sigma)

np.random.normal(mu, sigma)

from scipy.stats import norm
X = norm.rvs(loc=0, scale=1, size=int(1e6))
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Normal Distribution')

Student’s t distribution

The Student’s t-distribution (or simply the t-distribution) is given by $$ f(x) = \frac{\Gamma(\frac{\nu + 1}{2})}{\sqrt{\pi \nu}\, \Gamma(\frac{\nu}{2})} \left(1 + \frac{x^2}{\nu}\right)^{-\frac{\nu + 1}{2}} $$ where $x$ is a real number, $\nu > 0$ is the degrees of freedom parameter, and $\Gamma$ is the gamma function.

Python syntax:

scipy.stats.t(nu, mu, sigma)

mu + sigma * np.random.standard_t(nu)

from scipy.stats import t

df = 3

X = t.rvs(df, size=1000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 't-distribution')

Chi-square distribution

The chi-square distribution with $k$ degrees of freedom is the distribution of a sum of the squares of $k$ independent standard normal random variables. Its PDF is $$ f(x) = \frac{1}{2^{k/2} \Gamma(\frac{k}{2})} x^{\frac{k}{2} - 1} \exp\left(-\frac{x}{2}\right) $$ where $x > 0$ and $k > 0$.

Python syntax:

scipy.stats.chi2(nu)

np.random.chisquare(nu)

from scipy.stats import chi2

df = 55
X = chi2.rvs(df, size=1000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Chi-square Distribution')
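
The defining property can be verified empirically by summing the squares of k independent standard normal samples and comparing against the chi2 distribution; a quick sketch:

k = 5
Z = np.random.standard_normal(size=(100000, k))
S = (Z**2).sum(axis=1)                 # sum of squares of k iid standard normals

print('Empirical mean:', S.mean())     # close to k
print('chi2 mean:', chi2(k).mean())    # exactly k
print('Empirical var:', S.var())       # close to 2k, the chi-square variance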

Exponential distribution

The exponential distribution is the probability distribution of the time between events in a Poisson process. The PDF of an exponential distribution is $$ f(x) = \lambda \exp(-\lambda x) $$ where $x \ge 0$ and $\lambda > 0$ is the parameter of the distribution, often called the rate parameter.

Python syntax:

scipy.stats.expon(loc=0, scale=1/lambda)

np.random.exponential(1/lambda)

from scipy.stats import expon

X = expon.rvs(loc=0, scale=1, size=1000)
print("skewness:",skew(X))
print("kurtosis:", kurtosis(X))
plot_dist(X, 'Exponential Distribution')
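
Note that scipy parameterizes the exponential by scale = 1/lambda rather than by the rate itself. A quick check that the sample mean approaches 1/lambda:

lam = 2.0                                 # rate parameter
X = expon.rvs(scale=1/lam, size=100000)
print('Sample mean:', X.mean())           # close to 1/lam = 0.5
print('Theoretical mean:', expon(scale=1/lam).mean())   # exactly 0.5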

Law of Large Numbers

The law of large numbers (LLN) states that the average of the results obtained from a large number of trials should be close to the expected value, and will tend to become closer as more trials are performed.

Let $X_1, X_2, \dots, X_n$ be a random sample of size $n$ drawn iid from a distribution with mean $\mu$. According to the LLN, we have

$$\bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} \to \mu$$ for large $n$.

For example, a single roll of a fair six-sided die produces one of the numbers 1, 2, 3, 4, 5, or 6, each with equal probability. Therefore, the expected value of the average of the rolls is: $$\frac{1+2+3+4+5+6}{6}=3.5$$ According to the law of large numbers, if a large number of six-sided dice are rolled, the average of their values (the sample mean) is likely to be close to 3.5 (the population mean), with the precision increasing as more dice are rolled. In the following, we illustrate the law of large numbers using a particular run of rolls of a single die.

np.random.seed(123)

N = 1000 # sample size

rolls = np.random.randint(1, 7, N)

sample_means = []
for i in range(N):
    sample_means.append(rolls[:i + 1].mean())

plt.plot(range(1, N+1), 3.5*np.ones(N), 'r', label = 'Population mean')
plt.plot(range(1, N+1), sample_means, 'b', label = 'Sample mean')
plt.xlabel('Sample size (N)')
plt.ylabel("Mean")
plt.legend()
plt.show()

7) ✍ Illustrate LLN for samples drawn iid from a normal distribution with mean of 50 and a variance of 10.

from scipy.stats import norm
import numpy as np
import matplotlib.pyplot as plt


X = norm(loc= 50, scale= np.sqrt(10))
mu = X.mean()
var = X.var()
print('Population mean', mu)
print('Population variance', var)

N = 1000 # sample size
x = X.rvs(size=N)

sample_means = []
for i in range(N):
    sample_means.append(sum(x[:i + 1]) / (i + 1))

plt.plot(range(1, N+1), mu*np.ones(N), 'r', label = 'Population mean')
plt.plot(range(1, N+1), sample_means, 'b', label = 'Sample mean')
plt.xlabel('Sample size (N)')
plt.ylabel("Mean")
plt.legend(loc = 'lower right')
plt.show()
Population mean 50.0
Population variance 10.000000000000002

Central Limit Theorem

The central limit theorem (CLT) states that the distribution of sample means (calculated across different random samples from our overall population) is approximately normal for large samples, regardless of the population's own distribution. Mathematically, let $X_1, X_2, \dots, X_n$ be a random sample of size $n$ drawn iid from a distribution with mean $\mu$ and variance $\sigma^2$, and define the sample average of the random variables as $$ \bar{X} = \frac{X_1 + X_2 + \dots + X_n}{n} $$ The CLT states that $$ \bar{X} \sim N(\mu, \frac{\sigma^2}{n}) $$ approximately, for large $n$.

8) ✍ Test CLT for iid Gamma random variables using scipy.

from scipy.stats import gamma, norm
import numpy as np
import matplotlib.pyplot as plt

a, loc = 2, 3
X = gamma(a, loc)   # shape a = 2, with the distribution shifted by loc = 3

mu = X.mean()
var = X.var()

print('Population mean:', mu)
print('Population variance:', var)

N = 50 # sample size - corresponding to n
num_trials = 100000 # Number of trials, i.e. how many sample means we want to generate

samples = X.rvs(size = (num_trials, N))
samples_means = samples.mean(axis= 1)

print('Mean of all the sample means:', np.mean(samples_means))
print('Variance of all the sample means =', np.var(samples_means))
print('var/N:', var/N)

plt.hist(samples_means, bins= 100, density=True)

S_N = norm(loc= mu, scale= np.sqrt(var/N))
x = np.linspace(S_N.ppf(0.001), S_N.ppf(0.999), 100)
plt.plot(x, S_N.pdf(x), 'r')
plt.show()
Population mean: 5.0
Population variance: 2.0
Mean of all the sample means: 4.999268307684889
Variance of all the sample means = 0.039875471278788675
var/N: 0.04

9) ✍ Test CLT for iid uniform random variables in the range [5, 10] using numpy.

import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline

population_size = 100000
s = np.random.uniform(low=5, high=10, size=population_size)

mu = np.mean(s) # mu = (high + low)/2
var = np.var(s) # var = (high - low)**2/12

print('Population mean:', mu)
print('Population variance:', var)

plt.hist(s)
plt.show()

N = 20
num_trials = 10000 # Number of trials
samples = [np.mean(np.random.choice(s, size = N)) for _ in range(num_trials)]

print('Mean of all the sample means:', np.mean(samples))
print('Variance of all the sample means:', np.var(samples))
print('var/N:', var/N)


plt.hist(samples, density=True)

def normal_pdf(x, mu, sigma):
    return (1/(sigma*np.sqrt(2*np.pi)) * np.exp(-(x - mu)**2 / (2*sigma**2)))

x = np.linspace(6, 9, 100)
plt.plot(x, normal_pdf(x, mu = mu, sigma = np.sqrt(var/N)), 'r')
plt.show()
Population mean: 7.5082694056875
Population variance: 2.0802442142948188
Mean of all the sample means: 7.506555371635119
Variance of all the sample means: 0.1031419760228997
var/N: 0.10401221071474094
