A few tricks in probability and statistics (3): Poissonization trick

Some tricks are well known, while others are less so.

Let us begin with a well-known brain teaser:

The Coupon Collector Problem. Given $m$ distinct coupons, in expectation, how many coupons do you need to draw (with replacement) before seeing each distinct one at least once?

Answer: By the linearity of expectation,

\[\begin{align*} & \text{Expected number of draws to see $m$ different coupons} \\ &= \text{Expected number of draws to see the first coupon} \\ &+ \text{Expected number of draws to see a coupon different from the first one} \\ &+ \text{Expected number of draws to see a coupon different from the first two} \\ &+ \cdots \\ &+ \text{Expected number of draws to see a coupon different from the first $m-1$} \\ &= 1 + \frac{m}{m-1} + \frac{m}{m-2} + \cdots + \frac{m}{1} \\ &= \boxed{m \left( \frac{1}{m} + \frac{1}{m-1} + \frac{1}{m-2} + \cdots + \frac{1}{1} \right) = m H_m,} \end{align*}\]

where $H_m$ is the $m$-th harmonic number. It has a well-known approximation $H_m = \ln m + \gamma + \frac{1}{2m} + O(\frac{1}{m^2})$, where $\gamma \approx 0.5772$ is the Euler constant. Hence, the expected number of draws is roughly $m \ln m + \gamma m + \frac{1}{2}$.
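
As a quick numerical sanity check, here is a minimal Python sketch (the function name is mine) that compares a Monte Carlo estimate of the expected number of draws against $mH_m$:

import random

def draws_until_all_seen(m):
    # Draw coupons uniformly with replacement until all m distinct ones appear.
    seen, draws = set(), 0
    while len(seen) < m:
        seen.add(random.randrange(m))
        draws += 1
    return draws

m, trials = 20, 20000
mc = sum(draws_until_all_seen(m) for _ in range(trials)) / trials
exact = m * sum(1 / k for k in range(1, m + 1))   # m * H_m
print(mc, exact)   # the two values should be close, up to Monte Carlo error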


Poissonization

Poissonization is a useful trick for tackling complex situations involving iid draws from Bernoulli or categorical distributions (e.g., collecting coupons). With a fixed number of draws $n$, the category counts follow binomial or multinomial distributions, which can be nontrivial to handle. Poissonization states that if the number of draws is itself Poisson distributed, then the counts for different categories are independent and each follows a Poisson distribution. This independence is a powerful tool for analyzing the counts. Moreover, heuristically, the Poisson distribution is concentrated around its mean, because its standard deviation is only the square root of the mean. Hence, treating $n$ as Poisson rather than as a fixed value often gives nice approximations to the probabilities we look for when $n$ is large.

Theorem. Let $N$ follow a Poisson distribution with rate $\lambda$ (that is, $N \sim \text{Pois}(\lambda)$), and suppose that, given any realization $N=n$, the random vector $X = (X_1, \ldots, X_m)$ follows an $m$-category, $n$-trial multinomial distribution: $X \mid N=n \sim \text{Mult}(n; p_1, \ldots, p_m)$, where $p_1, \ldots, p_m$ are the category probabilities, summing to unity. Then, marginally, $X_1, \ldots, X_m$ are independent with $X_k \sim \text{Pois}(\lambda p_k)$ for $k=1, \ldots, m$.

Proof. A direct calculation gives

\[\begin{align*} \underbrace{\Pr(X_1=n_1, \ldots, X_m=n_m)}_{\text{joint of } X_1, \ldots, X_m} &= \sum_{n=0}^{\infty} \underbrace{\Pr(N=n)}_{\text{Poisson for } N} \cdot \underbrace{\Pr\left(X_1=n_1, \ldots, X_m=n_m \mid N=n \right)}_{\text{multinomial for } X_1, \ldots, X_m \text{ given } N=n} \\ &= \sum_{n=0}^{\infty} \frac{\lambda^n e^{-\lambda}}{n!} \cdot \frac{n!}{n_1!\cdots n_m!} p_1^{n_1}\cdots p_m^{n_m} \cdot I(n_1+\cdots+n_m=n) \\ &= \frac{\lambda^{n_1+\cdots+n_m} e^{-\lambda}}{n_1!\cdots n_m!} p_1^{n_1}\cdots p_m^{n_m} \\ &= \prod_{k=1}^m \frac{(\lambda p_k)^{n_k} e^{-\lambda p_k}}{n_k!}. \end{align*}\]

For each $k$, $\frac{(\lambda p_k)^{n_k} e^{-\lambda p_k}}{n_k!}$ is the pmf of $\text{Pois}(\lambda p_k)$ evaluated at $n_k$; the last equality uses $\sum_{k=1}^m p_k = 1$, so that $e^{-\lambda} = \prod_{k=1}^m e^{-\lambda p_k}$. Hence, the joint pmf of $X_1, \ldots, X_m$ factors into a product of $\text{Pois}(\lambda p_k)$ pmfs, so each $X_k$ is marginally $\text{Pois}(\lambda p_k)$ and $X_1, \ldots, X_m$ are independent. $\qquad\square$
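
To see the theorem in action, here is a minimal simulation sketch (assuming NumPy; the variable names are mine): draw $N \sim \text{Pois}(\lambda)$, then the counts from a multinomial given $N$, and check that the counts behave like independent Poissons.

import numpy as np

rng = np.random.default_rng(0)
lam, p, trials = 50.0, np.array([0.2, 0.3, 0.5]), 200000

# Poissonized multinomial: N ~ Pois(lam), then X | N = n ~ Mult(n; p)
N = rng.poisson(lam, size=trials)
X = np.array([rng.multinomial(n, p) for n in N])

# Each marginal should look like Pois(lam * p_k): mean = variance = lam * p_k
print("means:", X.mean(axis=0), "target:", lam * p)
print("vars: ", X.var(axis=0), "target:", lam * p)
# Independence: pairwise correlations should be near zero
print("corr(X_1, X_2):", np.corrcoef(X[:, 0], X[:, 1])[0, 1])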

The theorem, particularly its proof technique, gives us a mechanism to derive probabilities for a specific number of draws $n$. All we need to do is write down the probability of interest under the Poissonized model, which, thanks to the product of marginals, is free of $n$, and then reverse engineer the $n$-dependent probability $P_{n,m}$ from “Poissonized probability $= \sum_{n=0}^{\infty} \frac{\lambda^n e^{-\lambda}}{n!} \cdot P_{n,m}$”. An example will be shown in the next section.

Besides the rigorous application of Poissonization, I want to highlight a heuristic treatment proposed by Matthew Aldridge. The treatment gives only approximate answers, but the approximations are generally quite good. The idea is to recast a problem with a fixed number of trials $n$ as one in which the number of trials is Poisson with rate $\lambda=n$. Because $\text{Pois}(\lambda)$ has mean $\lambda$ and a comparatively small standard deviation $\sqrt{\lambda}$, what happens for a fixed $n$ is well approximated by the average outcome when the number of trials varies according to $\text{Pois}(n)$. The approximation is particularly good for large $n$. We will illustrate this in an application discussed in the subsequent sections.


Applying Poissonization in a rigorous manner

We consider the following problem:

Application of Poissonization. Given $m$ distinct coupons and $n$ draws with replacement, what is the probability of seeing each distinct coupon at least once?

Answer: First Poissonize the number of draws: suppose it is $N \sim \text{Pois}(\lambda)$ rather than a fixed $n$. By the theorem, the count for each distinct coupon follows $\text{Pois}(\lambda/m)$ and the counts are independent, so the probability of seeing each distinct coupon at least once is

\[\prod_{k=1}^m \underbrace{\Pr(X_k>0)}_{X_k \sim \text{Pois}(\lambda/m)} = (1 - e^{-\lambda/m})^m.\]

This probability is marginalized over all possible $n$. Let $P_{n,m}$ denote the probability given a specific $n$. Following the proof technique of the above theorem, we have:

\[(1 - e^{-\lambda/m})^m = \sum_{n=0}^{\infty} \frac{\lambda^n e^{-\lambda}}{n!} \cdot P_{n,m}.\]

To extract $P_{n,m}$, we multiply both sides by $e^{\lambda}$, expand the left-hand side as a power series in $\lambda$, and match the coefficients term by term. To proceed, write

\[e^{\lambda} (1 - e^{-\lambda/m})^m = (e^{\lambda/m} - 1)^m = \sum_{k=0}^m \binom{m}{k} (-1)^k e^{(\lambda/m)(m-k)} = \sum_{k=0}^m \binom{m}{k} (-1)^k \sum_{n=0}^{\infty} \frac{(\lambda(1-\frac{k}{m}))^n}{n!}.\]

Rearranging terms, we see

\[\sum_{k=0}^m \binom{m}{k} (-1)^k \sum_{n=0}^{\infty} \frac{(\lambda(1-\frac{k}{m}))^n}{n!} = \sum_{n=0}^{\infty} \frac{\lambda^n}{n!} \sum_{k=0}^m (-1)^k \binom{m}{k} \left(1-\frac{k}{m}\right)^n,\]

which readily leads to

\begin{equation}\label{eqn:P} \boxed{P_{n,m} = \sum_{k=0}^m (-1)^k \binom{m}{k} \left(1-\frac{k}{m}\right)^n.} \end{equation}
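
For small $n$ and $m$, where the alternating sum is still numerically harmless (see the caveats later in this post), a short Python check (the helper names are mine) compares \eqref{eqn:P} with a Monte Carlo estimate:

import random
from math import comb

def P_exact(n, m):
    # Inclusion-exclusion formula derived above; fine for small n and m
    return sum((-1) ** k * comb(m, k) * (1 - k / m) ** n for k in range(m + 1))

def P_mc(n, m, trials=200000):
    # Fraction of trials in which n draws cover all m coupons
    hits = sum(len({random.randrange(m) for _ in range(n)}) == m for _ in range(trials))
    return hits / trials

print(P_exact(30, 10), P_mc(30, 10))   # should agree up to Monte Carlo error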


Applying Poissonization in a heuristic manner

The above problem admits a quicker, approximate answer, inspired by Matthew Aldridge: assume that the number of draws follows $\text{Pois}(n)$. Then the count for each distinct coupon follows $\text{Pois}(n/m)$, and thus the probability of seeing each distinct coupon at least once is

\[\boxed{\widetilde{P}_{n,m} = \prod_{k=1}^m \underbrace{\Pr(X_k>0)}_{X_k \sim \text{Pois}(n/m)} = (1 - e^{-n/m})^m.}\]

The rationale here is very similar to the rigorous approach in the preceding section. The difference is that the rigorous approach introduces $\text{Pois}(\lambda)$ with an arbitrary $\lambda$, which eventually disappears by equating two Taylor series; whereas the heuristic approach here directly treats $\lambda = n$.

We can convince ourselves that $\widetilde{P}_{n,m}$ is a good approximation by performing binomial expansion:

\[\widetilde{P}_{n,m} = (1 - e^{-n/m})^m = \sum_{k=0}^m \binom{m}{k} (-e^{-n/m})^k = \sum_{k=0}^m (-1)^k \binom{m}{k} e^{-nk/m}.\]

When $n$ is large, $e^{-nk/m}$ is very close to $(1-\frac{k}{m})^n$ for every $k$. On the other hand, when $n < m$, $P_{n,m}$ must be zero, while $\widetilde{P}_{n,m} \le (1-e^{-1})^m \le (1-e^{-1})^n$ because $n/m < 1$; this bound is also close to zero unless $n$ is very small.
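
A quick numerical illustration for a small $m$, reusing the same kind of direct evaluation as above (safe here because $m$ is small; the helper names are mine):

from math import comb, exp

def P_exact(n, m):
    # Inclusion-exclusion formula; numerically fine for small m
    return sum((-1) ** k * comb(m, k) * (1 - k / m) ** n for k in range(m + 1))

def P_approx(n, m):
    # Heuristic Poissonization approximation
    return (1 - exp(-n / m)) ** m

for n in (5, 20, 50, 100):
    print(n, P_exact(n, 10), P_approx(n, 10))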


How good is the approximation? (Alternative title: Nice formulas can be a numerical nightmare)

We want to numerically evaluate the goodness of the heuristic approach. While the approximate probability is straightforward to compute, the exact one $P_{n,m}$ is not. It is an alternating sum, and some of the terms with opposite signs can be exceedingly large in magnitude. The typical behavior of $\binom{m}{k} \left(1-\frac{k}{m}\right)^n$ is that it grows rapidly as $k$ increases from zero, reaches a peak, and then decays just as rapidly. The peak value can be exceedingly large compared with the $k=0$ term; say, $10^{22}$ versus $1$. This means that computing $P_{n,m}$ by summing the terms in \eqref{eqn:P} from $k=0$ to $k=m$ in floating-point arithmetic incurs catastrophic cancellation, a notorious numerical pitfall. A numerical example at the end of this post shows that the probability $P_{1000,480}$ evaluates to $3.2\times10^7$ when formula \eqref{eqn:P} is applied directly, while the actual value is $2.8\times10^{-34}$.
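
To make the pitfall concrete, here is a naive direct evaluation of \eqref{eqn:P} in double precision (a sketch; the function name is mine). For arguments as large as $n=1000$, $m=480$, the roundoff left over after the cancellation completely swamps the true value of about $2.8\times10^{-34}$ quoted above:

from math import comb

def P_direct(n, m):
    # Naive evaluation of the alternating sum; the terms can exceed 1e20,
    # so catastrophic cancellation destroys all accuracy for large n and m.
    return sum((-1) ** k * comb(m, k) * (1 - k / m) ** n for k in range(m + 1))

print(P_direct(1000, 480))   # nowhere near a probability; do not use for large n, m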

Thus, we seek an alternative way to compute $P_{n,m}$. The trick is to resort to the Stirling numbers of the second kind:

\[\left\{ n \atop m \right\} = \sum_{k=0}^m (-1)^{m-k} \binom{m}{k} \frac{k^n}{m!}.\]

By using these numbers, we can readily write

\[P_{n,m} = \left\{ n \atop m \right\} \frac{m!}{m^n}.\]

The reason to use these numbers is that they admit a nice recurrence relation:

\[\left\{ n+1 \atop m \right\} = m \left\{ n \atop m \right\} + \left\{ n \atop m-1 \right\} \quad\text{with initial condition}\quad \left\{ 0 \atop 0 \right\} = 1 \text{ and } \left\{ n \atop 0 \right\} = \left\{ 0 \atop m \right\} = 0 \text{ for } m,n > 0.\]

This in turn produces a recurrence for $P_{n,m}$:

\[P_{n+1,m} = P_{n,m} + (1 - m^{-1})^n P_{n,m-1} \quad\text{with initial condition}\quad P_{0,0} = 1 \text{ and } P_{n,0} = P_{0,m} = 0 \text{ for } m,n > 0.\]

The recurrence lends itself to dynamic programming: it suffices to precompute a table that stores all the needed $P_{n,m}$. The table is filled with a nested for loop; here is a runnable Python version (assuming NumPy):

import numpy as np

N, M = 6000, 480   # table size; the example below uses m = 480 and n up to 6000
P = np.zeros((N + 1, M + 1))   # P[n][m] = probability that n draws cover all m coupons
P[0, 0] = 1.0
for n in range(N):
    for m in range(1, min(M, n + 1) + 1):
        P[n + 1, m] = P[n, m] + (1 - 1 / m) ** n * P[n, m - 1]

With the dynamic programming approach, we can now compute $P_{n,m}$ in a numerically stable manner and compare it with $\widetilde{P}_{n,m}$.
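
For instance, continuing with the table P filled in above, a few spot checks at $m=480$ look like this (a sketch of the comparison, not the notebook itself):

from math import exp

m = 480
for n in (500, 1000, 2000, 3000, 5000):
    approx = (1 - exp(-n / m)) ** m    # heuristic Poissonization
    print(n, P[n, m], approx)          # exact (dynamic programming) vs approximation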

In the following Jupyter notebook, we set $m=480$ and plot $P_{n,m}$ and $\widetilde{P}_{n,m}$ for $0 \le n < 6000$. The two curves are visually indistinguishable. You can download the Jupyter notebook here.



