What is the intuition behind the parameter-shift rule?

The $\frac{\pi}{2}$-shift is a good pilot, but let's try an arbitrary $\epsilon$-shift

To numerically compute the derivative of a function $f$ with respect to a scalar variable $\theta$, we use the finite difference:

\begin{equation}\label{eqn:finite.diff} \frac{df}{d\theta} \approx \frac{f(\theta+\epsilon) - f(\theta-\epsilon)}{2\epsilon}. \end{equation}

Mathematically, as $\epsilon \to 0$, the finite difference on the right converges to the exact derivative $f'$. Numerically, we should choose $\epsilon$ according to the machine precision: too large an $\epsilon$ incurs truncation error, while too small an $\epsilon$ amplifies round-off error.
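
Here is a quick numerical experiment illustrating this trade-off (a minimal NumPy sketch; the test function $\sin$ and the sample values of $\epsilon$ are arbitrary choices):

```python
import numpy as np

f, df = np.sin, np.cos   # test function and its exact derivative
theta = 1.0

for eps in [1e-1, 1e-3, 1e-5, 1e-7, 1e-9, 1e-11]:
    fd = (f(theta + eps) - f(theta - eps)) / (2 * eps)
    print(f"eps = {eps:.0e}   error = {abs(fd - df(theta)):.2e}")

# The error first decreases (truncation error ~ eps^2) and then increases
# (round-off error ~ machine_eps / eps); the sweet spot for the central
# difference is around the cube root of machine precision, ~1e-5 for doubles.
```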

While we acquire the above knowledge from Numerical Analysis 101 for electronic computers (also called “classical computers”), quantum computers paint a rather different picture. Suppose that $f$ is “computed” (or “evaluated”, or “measured”, whatever you call it) by a quantum computer. Such an $f$ appears in variational quantum algorithms, such as VQE (computing the smallest eigenvalue of a Hamiltonian matrix), QAOA (finding the maximum cut of a graph), and quantum neural networks. The derivative formula is called the parameter-shift rule:

\[\frac{ df }{ d\theta} = \frac{ f(\theta+\frac{\pi}{2}) - f(\theta-\frac{\pi}{2}) }{2}.\]

Note that this rule is an equality, as opposed to the approximation \eqref{eqn:finite.diff}.
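
To see the exactness concretely, here is a single-qubit sanity check (an illustrative sketch; the choice $U(\theta) = e^{i\frac{\theta}{2}Y}$ with observable $Z$ and initial state $\ket{0}$ is an assumption for simplicity). In this setup $f(\theta) = \cos\theta$, and the rule recovers $f'(\theta) = -\sin\theta$ to machine precision:

```python
import numpy as np

# Single-qubit example (assumed setup): U(theta) = exp(i theta/2 * Y),
# observable Z, initial state |0>.  Then f(theta) = cos(theta).
Y = np.array([[0, -1j], [1j, 0]])
Z = np.array([[1, 0], [0, -1]])
psi0 = np.array([1.0, 0.0])

def f(theta):
    # exp(i theta/2 Y) = cos(theta/2) I + i sin(theta/2) Y, since Y^2 = I
    U = np.cos(theta / 2) * np.eye(2) + 1j * np.sin(theta / 2) * Y
    psi = U @ psi0
    return (psi.conj() @ Z @ psi).real

theta = 0.37  # arbitrary point
shift = (f(theta + np.pi / 2) - f(theta - np.pi / 2)) / 2
print(shift, -np.sin(theta))  # equal up to round-off: the rule is exact
```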

Where does the shift $\frac{\pi}{2}$ come from? How do people come up with this formula? For numerical analysts, $\frac{\pi}{2}$ is such a large number that it lives in a world entirely different from that of the small $\epsilon$. A commonly described intuition is the following equality:

\begin{equation}\label{eqn:param.shift} \frac{ d e^{i\theta} }{ d\theta} = \frac{ e^{i(\theta+\frac{\pi}{2})} - e^{i(\theta-\frac{\pi}{2})} }{2} \end{equation}

or a sine/cosine variant of the complex exponential. As the complex exponential is so natural in quantum computing, I can live with this explanation.

Still, a lingering question is whether we can put our favorite $\epsilon$ into the parameter-shift rule; that is, can $\epsilon$ be arbitrary? A more important question is: will doing so be useful? I answer both questions affirmatively by introducing the identity:

\begin{equation}\label{eqn:param.shift.new} \boxed{\frac{ d e^{i\omega\theta} }{ d\theta} = \frac{ e^{i\omega(\theta+\epsilon)} - e^{i\omega(\theta-\epsilon)} }{ (2/\omega) \sin(\omega\epsilon) }.} \end{equation}

This identity offers a better intuition to understand derivative computations in quantum computing. Given any frequency $\omega \ne 0$, the identity holds for any $\epsilon$ such that $\omega\epsilon$ is not a multiple of $\pi$. When $\omega=1$ and $\epsilon=\frac{\pi}{2}$, this identity is the same as \eqref{eqn:param.shift}.
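
The identity is easy to verify numerically (a quick sketch; the random ranges for $\omega$, $\theta$, and $\epsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
for _ in range(5):
    omega = rng.uniform(0.1, 5.0)
    theta = rng.uniform(-np.pi, np.pi)
    eps = rng.uniform(0.1, 3.0)   # almost surely omega*eps is not a multiple of pi
    lhs = 1j * omega * np.exp(1j * omega * theta)   # d/dtheta of e^{i omega theta}
    rhs = (np.exp(1j * omega * (theta + eps))
           - np.exp(1j * omega * (theta - eps))) / ((2 / omega) * np.sin(omega * eps))
    print(abs(lhs - rhs))   # ~1e-16 for every draw
```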

An interesting observation is that for the specific function $f(\theta) = e^{i\omega\theta}$, the right-hand side of \eqref{eqn:param.shift.new} can be written as

\[\underbrace{\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}}_{\text{finite difference}} \underbrace{\frac{\omega\epsilon}{\sin(\omega\epsilon)}}_{\text{reciprocal of sinc}}.\]

For a given $\omega$, as $\epsilon\to0$, the finite-difference part converges to $f'$, while the reciprocal of sinc converges to $1$. Notably, this limiting behavior holds for any differentiable $f$, not just $e^{i\omega\theta}$. Hence, the formula agrees with both the numerical analyst’s intuition and the quantum scientist’s intuition.
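
In code, the decomposition says that a plain finite difference with a large $\epsilon$ is biased for $f(\theta) = e^{i\omega\theta}$, and multiplying by the reciprocal-of-sinc factor removes the bias exactly (a sketch with arbitrary sample values):

```python
import numpy as np

omega, theta, eps = 2.7, 0.4, 0.9   # arbitrary sample values
f = lambda t: np.exp(1j * omega * t)
exact = 1j * omega * f(theta)

fd = (f(theta + eps) - f(theta - eps)) / (2 * eps)    # biased: eps is far from 0
corrected = fd * (omega * eps) / np.sin(omega * eps)  # reciprocal-of-sinc factor

print(abs(fd - exact))         # noticeable bias
print(abs(corrected - exact))  # ~1e-16: exact up to round-off
```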

I will give the specific parameter-shift rules based on the intuition \eqref{eqn:param.shift.new} in the following sections.


General parameter-shift rule for Pauli operators

We first set up the notation. Let $H$ be an observable and $\ket{\psi(\theta)}$ be the normalized state vector, obtained through $\ket{\psi(\theta)} = U(\theta) \ket{\psi_0}$, where the unitary $U(\theta)$ is the parameterized circuit and $\ket{\psi_0}$ is the initial state. The function $f$ to be differentiated is the expectation value of $H$:

\[f(\theta) = \langle \psi(\theta) | H | \psi(\theta) \rangle = \langle \psi_0 | U(\theta)^{\dagger} H U(\theta) | \psi_0 \rangle.\]

Theorem 1. When the unitary $U(\theta) = e^{i \frac{\theta}{2} P}$ where $P \in \mathbb{C}^{2 \times 2}$ has eigenvalues $\pm1$ (e.g., $P$ is a Pauli operator), we have

\[\boxed{\frac{ df }{ d\theta} = \frac{ f(\theta+\epsilon) - f(\theta-\epsilon) }{ 2\sin\epsilon }}\]

for any $\epsilon$ not a multiple of $\pi$.
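
Before the proof, here is a numerical check of Theorem 1 (a minimal sketch; the random constructions of $P$, $H$, and $\ket{\psi_0}$ and the sample shifts are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(1)

# Random P with eigenvalues +1 and -1 (assumed construction for illustration)
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
P = Q @ np.diag([1.0, -1.0]) @ Q.conj().T

A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
H = (A + A.conj().T) / 2                      # random observable
psi0 = rng.normal(size=2) + 1j * rng.normal(size=2)
psi0 /= np.linalg.norm(psi0)                  # normalized initial state

def f(theta):
    # U(theta) = exp(i theta/2 P) = cos(theta/2) I + i sin(theta/2) P, since P^2 = I
    U = np.cos(theta / 2) * np.eye(2) + 1j * np.sin(theta / 2) * P
    psi = U @ psi0
    return (psi.conj() @ H @ psi).real

theta = 0.81
for eps in [0.3, 1.0, 2.5]:                   # arbitrary shifts, none a multiple of pi
    print((f(theta + eps) - f(theta - eps)) / (2 * np.sin(eps)))  # same value each time
fd = (f(theta + 1e-6) - f(theta - 1e-6)) / 2e-6
print(fd)                                     # agrees to ~1e-10
```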

Proof. Note that $e^{i \frac{\theta}{2} P}$ and $P$ share the same eigen-basis and hence they commute. Therefore,

\[f' = \langle \psi_0 | U(\theta)^{\dagger} (-\tfrac{i}{2}P) H U(\theta) | \psi_0 \rangle + \langle \psi_0 | U(\theta)^{\dagger} H (\tfrac{i}{2}P) U(\theta) | \psi_0 \rangle = \langle \psi_0 | U(\theta)^{\dagger} B U(\theta) | \psi_0 \rangle\]

where

\[B = -\tfrac{i}{2} (PH - HP).\]

On the other hand, we can easily derive that $ f(\theta + \epsilon) = \langle \psi_0 | U(\theta)^{\dagger} e^{-i \frac{\epsilon}{2} P} H e^{i \frac{\epsilon}{2} P} U(\theta) | \psi_0 \rangle. $ Then, by a similar derivation for $f(\theta - \epsilon)$, we have

\[f(\theta + \epsilon) - f(\theta - \epsilon) = \langle \psi_0 | U(\theta)^{\dagger} C U(\theta) | \psi_0 \rangle\]

where

\[C = e^{-i \frac{\epsilon}{2} P} H e^{i \frac{\epsilon}{2} P} - e^{i \frac{\epsilon}{2} P} H e^{-i \frac{\epsilon}{2} P}.\]

We proceed to establish the relationship between $B$ and $C$, which in turn determines the relationship between $f'$ and $f(\theta + \epsilon) - f(\theta - \epsilon)$.

The operator $P$ has eigenvalues $\pm1$. When the two eigenvalues have the same sign, $P$ is either $I$ or $-I$, which makes $B = 0 = C$, trivially proving the theorem. The more interesting case is when the eigenvalues have different signs. Without loss of generality, we write

\[P = V D V^{\dagger} \qquad\text{where}\qquad D = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}.\]

That is, the columns of $V$ form an orthonormal eigen-basis of $P$. Then, letting $\hat{H} := V^{\dagger} H V$, we obtain

\[B = -\tfrac{i}{2} V (D \hat{H} - \hat{H} D) V^{\dagger} = -i V \begin{bmatrix} 0 & \hat{h}_{12} \\ -\hat{h}_{21} & 0 \end{bmatrix} V^{\dagger}.\]

On the other hand, we have

\[C = V \left( e^{-i \frac{\epsilon}{2} D} \hat{H} e^{i \frac{\epsilon}{2} D} - e^{i \frac{\epsilon}{2} D} \hat{H} e^{-i \frac{\epsilon}{2} D} \right) V^{\dagger} = -2 i (\sin\epsilon) V \begin{bmatrix} 0 & \hat{h}_{12} \\ -\hat{h}_{21} & 0 \end{bmatrix} V^{\dagger}.\]

Therefore, $C = (2\sin\epsilon) B$, which makes $f(\theta + \epsilon) - f(\theta - \epsilon) = (2\sin\epsilon) f'$, completing the proof. $\qquad\square$
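
The key step $C = (2\sin\epsilon) B$ can itself be checked numerically (a quick sketch; the random $P$ and $H$ and the value of $\epsilon$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
Q, _ = np.linalg.qr(rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2)))
P = Q @ np.diag([1.0, -1.0]) @ Q.conj().T       # eigenvalues +1 and -1
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
H = (A + A.conj().T) / 2
eps = 0.7                                       # arbitrary shift

B = -0.5j * (P @ H - H @ P)
E = lambda s: np.cos(s / 2) * np.eye(2) + 1j * np.sin(s / 2) * P   # exp(i s/2 P)
C = E(-eps) @ H @ E(eps) - E(eps) @ H @ E(-eps)
print(np.linalg.norm(C - 2 * np.sin(eps) * B))  # ~1e-16
```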


General parameter-shift rule for Hermitian generators

Generally, when the circuit $U(\theta)$ is defined by a generator $G$ whose eigenvalues may no longer be $\pm1$, the above proof technique is not applicable. In this case, the differences of the (real) eigenvalues of $G$ come into play, and we will see that the sine/cosine version of the intuition \eqref{eqn:param.shift.new} explicitly appears in the derivative formula.

Theorem 2. When the unitary $U(\theta) = e^{i \theta G}$ where $G$ is a Hermitian generator, we can express the expectation value of $H$ as

\begin{equation}\label{eqn:Fourier.series} f(\theta) = a_0 + \sum_{\ell=1}^R \left[ a_{\ell} \cos(\omega_{\ell}\theta) + b_{\ell} \sin(\omega_{\ell}\theta) \right] \end{equation}

for some positive integer $R$, real frequencies $\omega_{\ell}$, and real coefficients $a_{\ell}$ and $b_{\ell}$. Consequently, the derivative of $f$ can be computed, term by term, by using the general parameter-shift rule

\[\boxed{\frac{ df_{\omega} }{ d\theta} = \frac{ f_{\omega}(\theta+\epsilon) - f_{\omega}(\theta-\epsilon) }{ (2/\omega) \sin(\omega\epsilon) } \qquad\text{where}\qquad f_{\omega}(\theta) \text{ is } \cos(\omega\theta) \text{ or } \sin(\omega\theta)}\]

with any $\epsilon$ such that $\omega_{\ell} \epsilon$ is not a multiple of $\pi$ for all $\ell$.
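
As a concrete check (a sketch under an assumed setup): for a $2 \times 2$ Hermitian $G$ there is only one eigenvalue gap, so $f$ consists of a single frequency $\omega$ plus a constant (which cancels in the symmetric difference), and the general rule applies to $f$ directly:

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
G = (A + A.conj().T) / 2                       # random Hermitian generator
lam, V = np.linalg.eigh(G)
omega = lam[1] - lam[0]                        # the single eigenvalue gap

A = rng.normal(size=(2, 2)) + 1j * rng.normal(size=(2, 2))
H = (A + A.conj().T) / 2                       # random observable
psi0 = rng.normal(size=2) + 1j * rng.normal(size=2)
psi0 /= np.linalg.norm(psi0)

def f(theta):
    U = V @ np.diag(np.exp(1j * theta * lam)) @ V.conj().T   # exp(i theta G)
    psi = U @ psi0
    return (psi.conj() @ H @ psi).real

theta, eps = 0.5, 0.9                          # omega*eps assumed not a multiple of pi
rule = (f(theta + eps) - f(theta - eps)) / ((2 / omega) * np.sin(omega * eps))
fd = (f(theta + 1e-6) - f(theta - 1e-6)) / 2e-6
print(rule, fd)                                # agree to ~1e-9
```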

Proof. We write

\[G = V D V^{\dagger} \qquad\text{where}\qquad D = \begin{bmatrix} \lambda_1 \\ & \ddots \\ & & \lambda_n \end{bmatrix},\]

where $\lambda_1, \ldots, \lambda_n$ are real eigenvalues and the columns of $V$ form an orthonormal eigen-basis of $G$. Then,

\[f(\theta) = \langle \psi_0 | e^{-i \theta G} H e^{i \theta G} | \psi_0 \rangle = \langle \psi_0 | V e^{-i \theta D} \hat{H} e^{i \theta D} V^{\dagger} | \psi_0 \rangle \qquad\text{where}\qquad \hat{H} := V^{\dagger} H V.\]

We let the elements of $\hat{H}$ be $\hat{h}_{jk}$ and the elements of $V^{\dagger} \ket{\psi_0}$ be $v_j$. Then,

\[f(\theta) = \sum_{j,k=1}^n \bar{v}_j e^{-i\theta\lambda_j} \hat{h}_{jk} e^{i\theta\lambda_k} v_k = \sum_{j,k=1}^n \bar{v}_j v_k \hat{h}_{jk} e^{i\theta (\lambda_k - \lambda_j)}.\]

We split the double summation into three parts: the diagonal part $j=k$, the lower triangular part $j > k$, and the upper triangular part $j < k$. Note that, because $\hat{H}$ is Hermitian, the lower triangular part is the complex conjugate of the upper triangular part. Therefore,

\[f(\theta) = \sum_{j=1}^n |v_j|^2 \hat{h}_{jj} + \overline{\sum_{j<k} \bar{v}_j v_k \hat{h}_{jk} e^{i\theta (\lambda_k - \lambda_j)}} + \sum_{j<k} \bar{v}_j v_k \hat{h}_{jk} e^{i\theta (\lambda_k - \lambda_j)}.\]

By Euler’s formula, all the summation terms of $f$ become real:

\[f(\theta) = \underbrace{ \sum_{j=1}^n |v_j|^2 \hat{h}_{jj} }_{a_0} + \sum_{j<k} \underbrace{ 2 \Re(\bar{v}_j v_k \hat{h}_{jk}) }_{c_{jk}} \cos(\theta (\lambda_k - \lambda_j)) + \underbrace{ 2 \Im(-\bar{v}_j v_k \hat{h}_{jk}) }_{d_{jk}} \sin(\theta (\lambda_k - \lambda_j)).\]

We merge the $c_{jk}$ terms whose gaps $\lambda_k - \lambda_j$ are the same or differ only by sign (cosine is even), and call the merged term $a_{\ell} \cos(\theta \omega_{\ell})$. The same merge applies to the $d_{jk}$ terms (with a sign flip where needed, since sine is odd), resulting in $b_{\ell} \sin(\theta \omega_{\ell})$. Then,

\[f(\theta) = a_0 + \sum_{\ell} \left[ a_{\ell} \cos(\theta\omega_{\ell}) + b_{\ell} \sin(\theta\omega_{\ell}) \right],\]

which completes the proof. $\qquad\square$
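
One can verify the expansion \eqref{eqn:Fourier.series} numerically by comparing the circuit evaluation of $f$ against the $a_0$, $c_{jk}$, $d_{jk}$ expression above, before any merging (a sketch; the random $G$, $H$, and $\ket{\psi_0}$ are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 4
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
G = (A + A.conj().T) / 2                       # random Hermitian generator
A = rng.normal(size=(n, n)) + 1j * rng.normal(size=(n, n))
H = (A + A.conj().T) / 2                       # random observable
psi0 = rng.normal(size=n) + 1j * rng.normal(size=n)
psi0 /= np.linalg.norm(psi0)

lam, V = np.linalg.eigh(G)
Hhat = V.conj().T @ H @ V                      # \hat{H} = V^dagger H V
v = V.conj().T @ psi0                          # components v_j

def f_circuit(theta):
    U = V @ np.diag(np.exp(1j * theta * lam)) @ V.conj().T   # exp(i theta G)
    psi = U @ psi0
    return (psi.conj() @ H @ psi).real

def f_series(theta):
    val = sum(abs(v[j]) ** 2 * Hhat[j, j].real for j in range(n))   # a_0
    for j in range(n):
        for k in range(j + 1, n):
            z = np.conj(v[j]) * v[k] * Hhat[j, k]
            gap = lam[k] - lam[j]                                   # lambda_k - lambda_j
            # c_{jk} cos + d_{jk} sin, with c = 2 Re(z), d = -2 Im(z)
            val += 2 * z.real * np.cos(theta * gap) - 2 * z.imag * np.sin(theta * gap)
    return val

theta = 1.234
print(f_circuit(theta), f_series(theta))       # agree to machine precision
```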

Remark. Clearly, Theorem 1 is a special case of Theorem 2, because $\frac{1}{2}P$ is a special case of $G$. When $P$ has one eigenvalue $-1$ and the other $1$, the eigenvalues of $\frac{1}{2}P$ are $\pm\frac{1}{2}$, whose difference is $1$. Then, there is only $R=1$ term in \eqref{eqn:Fourier.series}, with $\omega_{1}=1$.



