Rajesh K · Published in GoPenAI · Jan 24, 2024

In statistics and probability theory, a probability distribution describes the likelihood of different outcomes in an experiment or random process.

Essentially, a probability distribution is a mathematical description of a random variable, showing the probabilities of it taking on different possible values. It assigns probabilities to events, indicating the likelihood of each event occurring. Probability distributions can be categorized into two main types: discrete and continuous.

In machine learning and data science, probability distributions play a crucial role in understanding and modeling uncertainty within data. They describe the probabilities of different outcomes or values occurring for a random variable. Choosing the right distribution is vital for building accurate and effective models.

Discrete Distributions:

Definition: A discrete probability distribution lists each possible outcome and the probability of that outcome occurring. The sections below cover some of the most commonly used discrete distributions in machine learning and data science.
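
As a minimal illustration of this definition (a fair six-sided die is an assumed example, not from the article), scipy.stats.rv_discrete can build a distribution directly from a list of outcomes and their probabilities:

import numpy as np
from scipy.stats import rv_discrete

# A fair six-sided die: each face 1-6 has probability 1/6
outcomes = np.arange(1, 7)
probabilities = np.full(6, 1 / 6)
die = rv_discrete(name="fair_die", values=(outcomes, probabilities))

print(die.pmf(3))               # Probability of rolling a 3, ~0.1667
print(die.pmf(outcomes).sum())  # The probabilities over all outcomes sum to 1
print(die.rvs(size=5))          # Draw 5 random rolls from the distribution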

Bernoulli Distribution:

This models events with only two possible outcomes, such as success/failure or heads/tails, using a single parameter p (the probability of success).

Formula: P(X = 1) = p and P(X = 0) = 1 - p, which can be written compactly as P(X = x) = p^x * (1 - p)^(1 - x) for x in {0, 1}.

In a single coin toss, where p is the probability of getting heads, the Bernoulli distribution can be used to model the outcome.
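
As a minimal sketch (p = 0.7 is an assumed value, matching the simulation code below), the same model can be evaluated directly with scipy.stats.bernoulli:

from scipy.stats import bernoulli

p = 0.7  # Assumed probability of success (e.g. heads)

print(bernoulli.pmf(1, p))  # P(success) = p = 0.7
print(bernoulli.pmf(0, p))  # P(failure) = 1 - p = 0.3
print(bernoulli.mean(p))    # Mean of a Bernoulli variable is p
print(bernoulli.var(p))     # Variance is p * (1 - p) = 0.21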

Application Examples:

  • Classifying binary data like spam/not spam, click/no click, heads/tails.
  • Analyzing success rates in Bernoulli trials, like coin flips or yes/no surveys.

Code

import numpy as np
import matplotlib.pyplot as plt

# Define probability of success
p = 0.7

# Number of trials
n = 1000

# Generate random samples (0 for failure, 1 for success)
samples = np.random.binomial(1, p, size=n)

# Count successes and failures
successes = np.sum(samples)
failures = n - successes

# Calculate probabilities
p_success = successes / n
p_failure = 1 - p_success

# Create bar chart
plt.bar(["Success", "Failure"], [p_success, p_failure])
plt.xlabel("Outcome")
plt.ylabel("Probability")
plt.title("Bernoulli Distribution (p={})".format(p))
plt.ylim(0, 1) # Set y-axis limits
plt.show()

Binomial Distribution:

This builds upon the Bernoulli, describing the number of successes in a fixed number of independent Bernoulli trials.

Formula:

P(k successes in n trials) = (n choose k) * p^k * (1-p)^(n-k), where n is the number of trials, k is the number of successes, p is the probability of success, and (n choose k) is the binomial coefficient (number of combinations of k successes in n trials).
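
As a quick sketch of the formula in action (n = 10 and p = 0.6 match the code below; k = 3 is an assumed example value), the explicit expression can be checked against scipy.stats.binom.pmf:

from math import comb
from scipy.stats import binom

n, p, k = 10, 0.6, 3  # Trials, success probability, successes of interest

# Explicit formula: (n choose k) * p^k * (1 - p)^(n - k)
manual = comb(n, k) * p**k * (1 - p)**(n - k)

# Same probability from SciPy's binomial PMF
from_scipy = binom.pmf(k, n, p)

print(manual, from_scipy)  # Both ~0.0425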

Application Examples:

  • Modeling the number of website conversions in a fixed campaign period.
  • Analyzing successful outcomes in clinical trials with multiple participants.

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import binom

n = 10 # Number of trials
p = 0.6 # Probability of success
x = np.arange(n + 1) # Possible number of successes
y = binom.pmf(x, n, p) # Probability mass function (PMF)

plt.bar(x, y)
plt.xlabel("Number of Successes")
plt.ylabel("Probability Mass")
plt.title("Binomial Distribution (n = {}, p = {})".format(n, p))
plt.show()

Poisson Distribution:

This models the number of occurrences of an event in a fixed interval of time or space, like the number of customers arriving at a shop in an hour.

Formula

P(X = k) = (λ^k * e^(-λ)) / k!, where λ is the average number of events per interval and k is the number of events observed. If events occur on average 4 times per hour (λ = 4), the Poisson distribution can be used to model the probability of a specific number of events (X = k) occurring in an hour.
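
As a worked sketch of this formula (using the λ = 4 figure above and an assumed k = 2), the explicit expression can be checked against scipy.stats.poisson.pmf:

from math import exp, factorial
from scipy.stats import poisson

lam, k = 4, 2  # On average 4 events per hour; probability of exactly 2 events

# Explicit formula: P(X = k) = lambda^k * e^(-lambda) / k!
manual = lam**k * exp(-lam) / factorial(k)

# Same value from SciPy's Poisson PMF
from_scipy = poisson.pmf(k, lam)

print(manual, from_scipy)  # Both ~0.1465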

Application Examples:

  • Predicting customer arrivals at a store or online traffic to a website.
  • Analyzing defect rates in manufacturing processes.

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import poisson

lam = 5 # Average number of events
x = np.arange(0, 2 * lam + 1) # Possible number of events (0 through 2*lambda)
y = poisson.pmf(x, lam) # Probability mass function (PMF)

plt.bar(x, y)
plt.xlabel("Number of Events")
plt.ylabel("Probability Mass")
plt.title("Poisson Distribution (lambda = {})".format(lam))
plt.show()

Multinomial Distribution:

This generalizes the Binomial for cases with more than two possible outcomes, like rolling a die or classifying data into multiple categories.

Formula

P(X1 = x1, ..., Xk = xk) = n! / (x1! * x2! * ... * xk!) * p1^x1 * p2^x2 * ... * pk^xk, where n is the total number of trials, xi is the number of times outcome i occurs (with x1 + ... + xk = n), and pi is the probability of outcome i (with p1 + ... + pk = 1).
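
As a small sketch (a fair six-sided die rolled 6 times is an assumed example, not from the article), the formula can be evaluated directly and checked against scipy.stats.multinomial:

from math import factorial
import numpy as np
from scipy.stats import multinomial

n = 6                        # Number of trials (die rolls)
p = np.full(6, 1 / 6)        # Fair die: equal probability for each face
counts = [1, 1, 1, 1, 1, 1]  # Each face appears exactly once

# Explicit formula: n! / (x1! * ... * xk!) * p1^x1 * ... * pk^xk
manual = factorial(n) / np.prod([factorial(c) for c in counts]) * np.prod(p**np.array(counts))

# Same probability from SciPy's multinomial PMF
from_scipy = multinomial.pmf(counts, n=n, p=p)

print(manual, from_scipy)  # Both ~0.0154 (i.e. 6! / 6^6)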

Application Examples:

  • Classifying text data into multiple categories like sports, politics, entertainment.
  • Modeling customer choices with various product options.

Code

import numpy as np
import matplotlib.pyplot as plt

# Define parameters
n = 10 # Number of trials
p = np.array([0.4, 0.3, 0.2, 0.1]) # Probabilities for each outcome (sum to 1)
k = len(p) # Number of possible outcomes

# Generate random samples: each row holds the counts of the k outcomes in n trials
samples = np.random.multinomial(n, p, size=1000)

# Average the counts across samples and divide by n to get relative frequencies
frequencies = samples.mean(axis=0) / n

# Create bar chart
plt.bar(np.arange(k), frequencies)
plt.xlabel("Event Outcome")
plt.ylabel("Relative Frequency")
plt.title("Multinomial Distribution (n={}, p={})".format(n, p))
plt.xticks(np.arange(k), ["Outcome {}".format(i+1) for i in range(k)])
plt.show()

Continuous Distributions:

Definition: A continuous probability distribution is defined for continuous random variables. It describes the probabilities of intervals rather than individual values.

Normal Distribution:

Also known as the Gaussian distribution, this bell-shaped curve is the most common distribution in natural phenomena and data analysis. It models continuous variables that cluster around a central value (the mean), with the spread controlled by the standard deviation.

Formula

f(x) = (1 / (σ * sqrt(2π))) * e^(-(x - μ)^2 / (2σ^2)), where μ is the mean and σ is the standard deviation.
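
As a quick sketch (μ = 50 and σ = 10 match the code below; x = 60 is an assumed evaluation point), the explicit PDF can be checked against scipy.stats.norm, along with the familiar fact that roughly 68% of the probability lies within one standard deviation of the mean:

import numpy as np
from scipy.stats import norm

mu, sigma = 50, 10  # Mean and standard deviation
x = 60.0            # Point at which to evaluate the density

# Explicit PDF formula vs. SciPy's implementation
manual = 1 / (sigma * np.sqrt(2 * np.pi)) * np.exp(-(x - mu)**2 / (2 * sigma**2))
print(manual, norm.pdf(x, loc=mu, scale=sigma))  # Both ~0.0242

# Probability of falling within one standard deviation of the mean
print(norm.cdf(mu + sigma, mu, sigma) - norm.cdf(mu - sigma, mu, sigma))  # ~0.683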

Application Examples:

  • Modeling continuous data like heights, weights, stock prices.
  • Building regression models to predict real-valued outcomes.

Code

import numpy as np
import matplotlib.pyplot as plt

# --- Normal Distribution ---
mu, sigma = 50, 10
x = np.linspace(mu - 3*sigma, mu + 3*sigma, 1000)
y = 1 / (sigma * np.sqrt(2*np.pi)) * np.exp(-(x-mu)**2 / (2*sigma**2))
plt.plot(x, y)
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.title("Normal Distribution (mu = {}, sigma = {})".format(mu, sigma))
plt.show()

Uniform Distribution:

This describes data with equal probability for all values within a specific range, like random numbers generated within a certain interval.

Formula

f(x) = 1 / (b - a) for a ≤ x ≤ b, and 0 otherwise, where a and b are the lower and upper bounds of the interval.
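
As a short sketch (a = 0 and b = 10 match the code below; the interval [2, 5] is an assumed example), interval probabilities follow directly from the constant density:

from scipy.stats import uniform

a, b = 0, 10  # Lower and upper bounds

# SciPy parameterizes the uniform distribution by loc (start) and scale (width)
dist = uniform(loc=a, scale=b - a)

# Probability that a value falls between 2 and 5: (5 - 2) / (b - a) = 0.3
print(dist.cdf(5) - dist.cdf(2))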

Application Examples:

  • Generating random numbers for simulations or experiments.
  • Modeling data where all values within a range are equally likely.

Code

import numpy as np
import matplotlib.pyplot as plt

a = 0 # Lower bound
b = 10 # Upper bound
x = np.linspace(a, b, 1000)
y = np.where((x >= a) & (x <= b), 1 / (b - a), 0) # Uniform PDF

plt.plot(x, y)
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.title("Uniform Distribution (a = {}, b = {})".format(a, b))
plt.show()

Exponential Distribution:

This models the time between occurrences of independent events, like the time between customer arrivals or device failures.

Formula

f(x) = λ * e^(-λx) for x ≥ 0, where λ is the rate parameter (average number of events per unit time); the mean time between events is 1/λ.
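
As a quick sketch (λ = 0.5 matches the code below; t = 3 is an assumed example), the probability that the waiting time exceeds t has the closed form e^(-λt), which can be checked against scipy.stats.expon:

import numpy as np
from scipy.stats import expon

lam = 0.5  # Rate parameter
t = 3.0    # Waiting time of interest

# Closed form: P(T > t) = e^(-lambda * t)
manual = np.exp(-lam * t)

# SciPy parameterizes the exponential distribution by scale = 1 / lambda
from_scipy = expon.sf(t, scale=1 / lam)  # Survival function = 1 - CDF

print(manual, from_scipy)  # Both ~0.2231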

Application Examples:

  • Analyzing time between customer purchases or device failures.
  • Modeling waiting times in queues or call centers.

Code

import numpy as np
import matplotlib.pyplot as plt

lam = 0.5 # Rate parameter
x = np.linspace(0, 10, 1000)
y = lam * np.exp(-lam * x) # PDF

plt.plot(x, y)
plt.xlabel("Time")
plt.ylabel("Probability Density")
plt.title("Exponential Distribution (lambda = {})".format(lam))
plt.show()

Gamma Distribution:

This flexible distribution can model skewed data and is often used as a prior distribution in Bayesian statistics.

Formula

f(x) = x^(k-1) * e^(-x/θ) / (Γ(k) * θ^k) for x > 0, where k is the shape parameter, θ is the scale parameter, and Γ is the gamma function.
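
As a small sketch (k = 2 and θ = 3 match the code below; x = 4 is an assumed evaluation point), the explicit density can be checked against scipy.stats.gamma:

from math import exp
from scipy.special import gamma as gamma_fn
from scipy.stats import gamma

k, theta = 2, 3  # Shape and scale parameters
x = 4.0          # Point at which to evaluate the density

# Explicit PDF formula: x^(k-1) * e^(-x/theta) / (Gamma(k) * theta^k)
manual = x**(k - 1) * exp(-x / theta) / (gamma_fn(k) * theta**k)

# Same density from SciPy's gamma distribution (a = shape, scale = theta)
from_scipy = gamma.pdf(x, a=k, scale=theta)

print(manual, from_scipy)  # Both ~0.117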

Application Examples:

  • Modeling rainfall amounts or insurance claim sizes.
  • As a prior distribution in Bayesian statistics for analyzing rates or proportions.

Code

import numpy as np
import matplotlib.pyplot as plt
from scipy.special import gamma

k, theta = 2, 3 # Shape and scale parameters
x = np.linspace(0, 10, 1000)
y = x**(k-1) * np.exp(-x/theta) / (gamma(k) * theta**k) # PDF

plt.plot(x, y)
plt.xlabel("Value")
plt.ylabel("Probability Density")
plt.title("Gamma Distribution (k = {}, theta = {})".format(k, theta))
plt.show()

This is not an exhaustive list, but it covers the main distributions you’ll encounter in most data science and machine learning tasks. Choosing the right distribution depends on the type of data and the nature of your problem. Understanding their properties and applications will significantly enhance your ability to analyze and model data effectively.
