A few tricks in probability and statistics (2): Poisson estimator trick
Some tricks are well-known, while others are less so.
How do you estimate $\psi = e^{\theta}$ when you have an unbiased estimator $\hat{\theta}(x)$ for $\theta$ (that is, $\mathbb{E}[\hat{\theta}(x)] = \theta$)?
It is tempting to use $e^{\hat{\theta}(x)}$ as the estimator; however, this estimator is biased because, by Jensen’s inequality,
\[\mathbb{E} [ e^{\hat{\theta}(x)} ] \ge e^{\mathbb{E}[\hat{\theta}(x)]} = e^{\theta},\]with equality only in the degenerate case where $\hat{\theta}(x)$ is almost surely constant.
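To see the bias concretely, here is a minimal simulation sketch (not part of the argument itself); it assumes, purely for illustration, that $\hat{\theta}$ is the sample mean of a few Gaussian observations:

```python
import numpy as np

rng = np.random.default_rng(0)
theta, sigma, n = 1.0, 1.0, 5   # true parameter; noise level and sample size for theta_hat

# theta_hat = mean of n Gaussian observations: an unbiased estimator of theta
theta_hat = rng.normal(theta, sigma, size=(100_000, n)).mean(axis=1)

print(np.exp(theta))             # e^theta           ~ 2.72
print(np.exp(theta_hat).mean())  # E[exp(theta_hat)] ~ 3.00, biased upward, as Jensen predicts
```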
The Poisson trick uses the following estimator instead:
\[\boxed{\hat{\psi} = e^{\lambda} \prod_{i=1}^K \frac{\hat{\theta}(x_i)}{\lambda}, \quad\text{where}\quad K \sim \text{Pois}(\lambda) \text{ and } x_1,\ldots,x_K \text{ are iid and independent of } K.}\]We can prove that this estimator is unbiased, because
\[\mathbb{E} \left[ e^{\lambda} \prod_{i=1}^K \frac{\hat{\theta}(x_i)}{\lambda} \right] = \sum_{k=0}^{\infty} \Pr(K=k) \cdot e^{\lambda} \prod_{i=1}^k \frac{\mathbb{E}[\hat{\theta}(x_i)]}{\lambda} = \sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} \cdot e^{\lambda} \prod_{i=1}^k \frac{\theta}{\lambda} = \sum_{k=0}^{\infty} \frac{\theta^k}{k!},\]which simplifies to $e^{\theta}$. (The first equality conditions on $K$ and uses the independence of the $x_i$.)
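Here is a minimal Monte Carlo sketch of this estimator, again under the illustrative assumption that $\hat{\theta}$ is a Gaussian sample mean; $\lambda$ is kept small because, as the variance computation later in the post shows, the variance of this plain version grows quickly with $\lambda$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta, sigma, n = 1.0, 1.0, 5   # true parameter; noise level and sample size for theta_hat
lam = 2.0                        # Poisson mean = expected number of theta_hat evaluations

def theta_hat():
    """One unbiased estimate of theta (illustrative: a Gaussian sample mean)."""
    return rng.normal(theta, sigma, size=n).mean()

def psi_hat(lam):
    """One draw of e^lam * prod_{i=1}^K theta_hat_i / lam, with K ~ Pois(lam)."""
    K = rng.poisson(lam)
    # np.prod([]) == 1.0, so K = 0 contributes e^lam, matching the empty-product convention
    return np.exp(lam) * np.prod([theta_hat() / lam for _ in range(K)])

draws = [psi_hat(lam) for _ in range(100_000)]
print(np.exp(theta), np.mean(draws))  # both ~ 2.72: the estimator is unbiased for e^theta
```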
In fact, a more general form of the estimator was suggested:
\[\boxed{\hat{\psi} = e^{\lambda {\color{red} +c }} \prod_{i=1}^K \frac{\hat{\theta}(x_i) {\color{red} -c }}{\lambda},}\]which can be easily shown to be unbiased as well, for any $c \in \mathbb{R}$, by following the same derivation steps above.
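The shifted version can be sanity-checked the same way. In this sketch (same illustrative Gaussian $\hat{\theta}$ as before), the empirical mean stays near $e^{\theta}$ for several arbitrary choices of $c$:

```python
import numpy as np

rng = np.random.default_rng(2)
theta, sigma, n, lam = 1.0, 1.0, 5, 2.0

def theta_hat():
    return rng.normal(theta, sigma, size=n).mean()  # illustrative unbiased estimator of theta

def psi_hat(lam, c):
    """Shifted estimator: e^{lam + c} * prod_{i=1}^K (theta_hat_i - c) / lam, K ~ Pois(lam)."""
    K = rng.poisson(lam)
    return np.exp(lam + c) * np.prod([(theta_hat() - c) / lam for _ in range(K)])

for c in (0.0, 0.5, -1.0):        # unbiasedness should hold for every real c
    draws = [psi_hat(lam, c) for _ in range(100_000)]
    print(c, np.mean(draws))      # each mean ~ e^theta ~ 2.72
```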
How to set $\lambda$ and $c$?
Is there an optimal choice for $\lambda$ and $c$? Naturally, we want to set them such that they minimize the variance of $\hat{\psi}$. Let us compute the variance:
\[\begin{align*} \text{Var}(\hat{\psi}) = \mathbb{E} \left[ \left( e^{\lambda {\color{red} +c }} \prod_{i=1}^K \frac{\hat{\theta}(x_i){\color{red} -c }}{\lambda} \right)^2 \right] - (e^{\theta})^2 &= \sum_{k=0}^{\infty} \frac{\lambda^k e^{-\lambda}}{k!} \cdot e^{2(\lambda {\color{red} +c })} \prod_{i=1}^k \frac{\mathbb{E}[(\hat{\theta}{\color{red} -c })^2]}{\lambda^2} - (e^{\theta})^2 \\ &= \exp \left( \lambda {\color{red} +2c } + \frac{\mathbb{E}[(\hat{\theta} {\color{red} -c })^2]}{\lambda} \right) - e^{2\theta} \\ &= \exp \left( \lambda {\color{red} +2c } + \frac{\text{Var}(\hat{\theta}) + (\theta - {\color{red} c})^2}{\lambda} \right) - e^{2\theta}. \end{align*}\]The usual approach to finding the optimal $\lambda$ and $c$ is to set the partial derivatives of $\lambda {\color{red} +2c } + \frac{\text{Var}(\hat{\theta}) + (\theta - {\color{red} c})^2}{\lambda}$ to zero. However, this yields no interior solution: setting the partial derivative with respect to $c$ to zero gives $c = \theta - \lambda$; substituting this back, the exponent becomes $ \lambda {\color{red} +2c } + \frac{\text{Var}(\hat{\theta}) + (\theta - {\color{red} c})^2}{\lambda} = 2\theta + \frac{\text{Var}(\hat{\theta})}{\lambda}, $ which no longer depends on $c$ and is monotonically decreasing in $\lambda$. The optimum is therefore degenerate: $\lambda \to \infty$ with $c = \theta - \lambda \to -\infty$.
The result sounds a bit suspicious. Is there anything wrong? No: when $\lambda \to \infty$ with $c = \theta - \lambda$, we have $\hat{\psi} \to e^{\theta}$ and $\text{Var}(\hat{\psi}) \to 0$, which is entirely self-consistent. It means that if we knew $\theta$, the best approach would be to take $\lambda$ to infinity and set $c = \theta - \lambda$. Of course, we do not know $\theta$, and we cannot take $\lambda$ to infinity.
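Before moving on, the variance formula itself can be checked numerically. This sketch uses the same illustrative Gaussian $\hat{\theta}$ as above; the particular $(\lambda, c)$ pair is arbitrary:

```python
import numpy as np

rng = np.random.default_rng(3)
theta, sigma, n = 1.0, 1.0, 5
var_theta_hat = sigma**2 / n     # Var(theta_hat) for the illustrative Gaussian sample mean
lam, c = 2.0, 0.5                # an arbitrary (lambda, c) pair

def theta_hat():
    return rng.normal(theta, sigma, size=n).mean()

def psi_hat(lam, c):
    K = rng.poisson(lam)
    return np.exp(lam + c) * np.prod([(theta_hat() - c) / lam for _ in range(K)])

# Analytic formula: exp(lam + 2c + (Var(theta_hat) + (theta - c)^2) / lam) - e^{2 theta}
analytic = np.exp(lam + 2 * c + (var_theta_hat + (theta - c) ** 2) / lam) - np.exp(2 * theta)
empirical = np.var([psi_hat(lam, c) for _ in range(200_000)])
print(analytic, empirical)       # the two values should roughly agree (~17.8 here)
```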
So, how should we set $\lambda$ and $c$ in practice? First, suppose we have a rough prior estimate of $\theta$; call it $\bar{\theta}$. We can use it to replace the ideal choice $c = \theta - \lambda$ with $c = \bar{\theta} - \lambda$. Such a prior estimate can be obtained in many ways: for example, we could run the estimator once and take $\bar{\theta} = \hat{\theta}(x_1)$, or let $\bar{\theta}$ be the average of the last batch of samples.
Next, we let $\lambda$ be a multiple of $\vert \bar{\theta} \vert$, say $\lambda = m \vert \bar{\theta} \vert$ with $m$ a positive integer. Because $\lambda$ is the mean of the Poisson distribution and thus (stochastically) determines the number of samples, we choose $m$ so that $m \vert \bar{\theta} \vert$ is close to a reasonable sample budget, such as $10$ (in which case we expect to sample about ten times). This only works when $\vert \bar{\theta} \vert$ is not larger than about $10$, but that is usually a mild assumption: if $\vert \bar{\theta} \vert$ is large, $e^{\theta}$ is either astronomically large or vanishingly small, and estimating it is of little practical interest. Finally, we let $c = \bar{\theta} - m \vert \bar{\theta} \vert$.
Overall, the practical estimator is
\[\hat{\psi} = e^{\bar{\theta}} \prod_{i=1}^K \left( \frac{\hat{\theta}(x_i)-\bar{\theta}}{m \vert \bar{\theta} \vert} + 1 \right)\]and its variance is
\[\exp\left( 2\bar{\theta}-m \vert \bar{\theta} \vert + \frac{\text{Var}(\hat{\theta}) + (\theta - \bar{\theta} + m \vert \bar{\theta} \vert)^2}{m \vert \bar{\theta} \vert} \right) - e^{2\theta}.\]
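Putting everything together, here is a sketch of the practical recipe. The Gaussian $\hat{\theta}$, the pilot batch of 20 estimates used for $\bar{\theta}$, and the target of roughly ten samples are illustrative choices rather than part of the recipe itself:

```python
import numpy as np

rng = np.random.default_rng(4)
theta, sigma, n = 1.0, 1.0, 5    # true parameter; noise model for the illustrative theta_hat

def theta_hat():
    return rng.normal(theta, sigma, size=n).mean()  # unbiased estimator of theta

# Step 1: a rough prior estimate of theta, e.g. the average of a small pilot batch.
theta_bar = np.mean([theta_hat() for _ in range(20)])

# Step 2: choose integer m so that lam = m * |theta_bar| is near a sensible sample budget (~10);
# c = theta_bar - lam is then implicit in the "(... / lam) + 1" form of the product below.
m = max(1, round(10 / abs(theta_bar)))
lam = m * abs(theta_bar)

def psi_hat_practical():
    """One draw of e^{theta_bar} * prod_i ((theta_hat_i - theta_bar) / lam + 1), K ~ Pois(lam)."""
    K = rng.poisson(lam)
    return np.exp(theta_bar) * np.prod([(theta_hat() - theta_bar) / lam + 1.0 for _ in range(K)])

draws = [psi_hat_practical() for _ in range(20_000)]
print(np.exp(theta), np.mean(draws), np.var(draws))  # mean ~ e^theta ~ 2.72; variance is small here
```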