The State Value Function and Q Function in Reinforcement Learning

1. Some concepts

1.1 Return

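The agent's goal is to maximize the return. The return is usually defined with a discount factor \(\gamma \in [0, 1]\), which weights future rewards less than immediate ones: \[ \begin{aligned} G_t &= R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots\\ &= \sum_{k = 0}^{\infty}\gamma^k R_{t+k+1} \end{aligned} \]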
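As a small illustration of this definition, the sketch below computes the discounted return of a finite reward sequence. The function name `discounted_return` and the reward values are made up for the example, and truncating the infinite sum to a finite episode is an assumption.

```python
def discounted_return(rewards, gamma):
    """Return sum_k gamma**k * rewards[k] for a finite list of rewards.

    rewards[0] plays the role of R_{t+1}, rewards[1] of R_{t+2}, and so on.
    """
    g = 0.0
    # Work backwards so that each step applies G_t = R_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]  # hypothetical values for R_{t+1}, R_{t+2}, R_{t+3}
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

The backward loop uses the recursion \(G_t = R_{t+1} + \gamma G_{t+1}\), the same identity that the Bellman equation is built on.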

1.2 Policy

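A policy is a mapping from states to actions, \(\pi : S \rightarrow A\). A stochastic policy is written \(\pi(a | s)\), the probability of taking action \(a\) in state \(s\).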

1.3 State value function

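The state value function describes how good it is to be in state \(s\) under policy \(\pi\), measured as the expected return when starting from \(s\) and following \(\pi\): \[ V^{\pi}(s)=\mathbb{E}_{\pi}[G_t|s_t=s] \]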

1.4 Q function

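The Q function describes how good it is to take action \(a\) in state \(s\) under policy \(\pi\), measured as the expected return when taking \(a\) in \(s\) and following \(\pi\) afterwards: \[ Q^\pi(s, a) = \mathbb{E}_{\pi}[G_t|s_t=s,a_t=a] \]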


2. Solving for the state value function and the Q function

The Bellman equation, derived by Richard Bellman, helps solve Markov decision processes (MDPs).

By the definition of the return, the state value function can be written as: \[ \begin{aligned} V^\pi(s)&=\mathbb{E}_\pi\left[R_{t+1}+\gamma R_{t+2}+\gamma^2 R_{t+3}+\cdots \mid s_t=s\right]\\ &=\mathbb{E}_\pi\left[\sum_{k=0}^{\infty}\gamma^k R_{t+k+1} \,\Big|\, s_t=s\right] \end{aligned} \]

Theorem 1. Writing \(v_{\pi}\) for \(V^{\pi}\) and \(S_t\) for the state at time \(t\): \[ \mathbb{E}_{\pi}\left[v_{\pi}\left(S_{t+1}\right) | S_{t}\right]=\mathbb{E}_{\pi}\left[\mathbb{E}_{\pi}\left[G_{t+1} | S_{t+1}\right] | S_{t}\right]=\mathbb{E}_{\pi}\left[G_{t+1} | S_{t}\right] \]

Proof

To shorten the notation, let \[ s=S_{t},\quad g^{\prime}=G_{t+1},\quad s^{\prime}=S_{t+1} \]

\[ \begin{aligned} \mathbb{E}_{\pi}\left[\mathbb{E}_{\pi}\left[G_{t+1} | S_{t+1}\right] | S_{t}\right] &=\mathbb{E}_{\pi}\left[\mathbb{E}_{\pi}\left[g^{\prime} | s^{\prime}\right] | s\right] \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} g^{\prime} p\left(g^{\prime} | s^{\prime}\right) p\left(s^{\prime} | s\right) \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} g^{\prime} p\left(g^{\prime} | s^{\prime}, s\right) p\left(s^{\prime} | s\right) \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} \frac{g^{\prime} p\left(g^{\prime} | s^{\prime}, s\right) p\left(s^{\prime} | s\right) p(s)}{p(s)} \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} \frac{g^{\prime} p\left(g^{\prime} | s^{\prime}, s\right) p\left(s^{\prime}, s\right)}{p(s)} \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} \frac{g^{\prime} p\left(g^{\prime}, s^{\prime}, s\right)}{p(s)} \\ &=\sum_{s^{\prime}} \sum_{g^{\prime}} g^{\prime} p\left(g^{\prime}, s^{\prime} | s\right) \\ &=\sum_{g^{\prime}} \sum_{s^{\prime}} g^{\prime} p\left(g^{\prime}, s^{\prime} | s\right) \\ &=\sum_{g^{\prime}} g^{\prime} p\left(g^{\prime} | s\right) \\ &=\mathbb{E}_{\pi}\left[g^{\prime} | s\right]=\mathbb{E}_{\pi}\left[G_{t+1} | S_{t}\right] \end{aligned} \]

The third equality uses the Markov property: given \(S_{t+1}\), the return \(G_{t+1}\) does not depend on \(S_t\), so \(p(g^{\prime} | s^{\prime}) = p(g^{\prime} | s^{\prime}, s)\).

Using Theorem 1, we obtain the Bellman equation for the state value function:

\[ \begin{aligned} v_{\pi}(s) &=\mathbb{E}_{\pi}\left[G_{t} | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma R_{t+2}+\gamma^{2} R_{t+3}+\cdots | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma \sum_{k=0}^{\infty} \gamma^{k} R_{(t+1)+k+1} | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma G_{t+1} | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma \mathbb{E}_{\pi}\left[G_{t+1} | S_{t+1}\right] | S_{t}=s\right] \\ &=\mathbb{E}_{\pi}\left[R_{t+1}+\gamma v_{\pi}\left(S_{t+1}\right) | S_{t}=s\right] \\ &=\sum_{a} \pi(a | s) \sum_{r} p(r | s, a) r+\gamma \sum_{a} \pi(a | s) \sum_{s^{\prime}} p\left(s^{\prime} | s, a\right) v_{\pi}\left(s^{\prime}\right) \\ &=\sum_{a} \pi(a | s) \sum_{r} \sum_{s^{\prime}} p\left(s^{\prime}, r | s, a\right) r+\gamma \sum_{a} \pi(a | s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right) v_{\pi}\left(s^{\prime}\right) \\ &=\sum_{a} \pi(a | s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right) r+\gamma \sum_{a} \pi(a | s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right) v_{\pi}\left(s^{\prime}\right) \\ &=\sum_{a} \pi(a | s) \sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right] \end{aligned} \]
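One common way to use the last line of this derivation is to treat it as an update rule and apply it repeatedly until the values stop changing (iterative policy evaluation). The sketch below does this for a hypothetical two-state, two-action MDP. The transition table `P`, the uniform policy `pi`, the discount `gamma = 0.9`, and the function name `policy_evaluation` are all assumptions made for illustration, not part of the original text.

```python
import numpy as np

# Hypothetical MDP: P[s][a] is a list of (prob, next_state, reward) triples,
# i.e. a tabular form of the joint distribution p(s', r | s, a).
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.0)]},
}
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a]: uniform random policy
gamma = 0.9


def policy_evaluation(P, pi, gamma, tol=1e-10):
    """Iterate v(s) <- sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma v(s')]."""
    v = np.zeros(len(P))
    while True:
        v_new = np.zeros_like(v)
        for s in P:
            for a, outcomes in P[s].items():
                for prob, s_next, r in outcomes:
                    v_new[s] += pi[s, a] * prob * (r + gamma * v[s_next])
        # Stop once a full sweep changes no value by more than tol.
        if np.max(np.abs(v_new - v)) < tol:
            return v_new
        v = v_new


v = policy_evaluation(P, pi, gamma)
print(v)
```

Because \(\gamma < 1\), each sweep shrinks the distance to the true \(v_\pi\), so the loop terminates. For small MDPs the same values can also be obtained by solving the Bellman equation as a linear system.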

The Q function can be derived in the same way:

\[ q_{\pi}(s, a)=\sum_{s^{\prime}} \sum_{r} p\left(s^{\prime}, r | s, a\right)\left[r+\gamma v_{\pi}\left(s^{\prime}\right)\right] \]
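To make the relation between \(q_\pi\) and \(v_\pi\) concrete, the sketch below computes \(q_\pi\) from \(v_\pi\) with the formula above and checks that \(v_\pi(s) = \sum_a \pi(a | s)\, q_\pi(s, a)\). The MDP table `P`, the policy `pi`, the discount `gamma`, and the helper name `q_from_v` are hypothetical, and solving the Bellman equation directly as a linear system is one possible choice that is only practical for small state spaces.

```python
import numpy as np

# Hypothetical MDP: P[s][a] is a list of (prob, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},
    1: {0: [(1.0, 0, 2.0)], 1: [(1.0, 1, 0.0)]},
}
pi = np.array([[0.5, 0.5], [0.5, 0.5]])  # pi[s, a]: uniform random policy
gamma = 0.9
n_states, n_actions = pi.shape

# Build the expected one-step reward r_pi and the state transition matrix P_pi
# under pi, then solve v = r_pi + gamma * P_pi v for v.
r_pi = np.zeros(n_states)
P_pi = np.zeros((n_states, n_states))
for s in P:
    for a, outcomes in P[s].items():
        for prob, s_next, r in outcomes:
            r_pi[s] += pi[s, a] * prob * r
            P_pi[s, s_next] += pi[s, a] * prob
v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)


def q_from_v(P, v, gamma, n_actions):
    """q(s, a) = sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]."""
    q = np.zeros((len(P), n_actions))
    for s in P:
        for a, outcomes in P[s].items():
            for prob, s_next, r in outcomes:
                q[s, a] += prob * (r + gamma * v[s_next])
    return q


q = q_from_v(P, v, gamma, n_actions)
print(q)
# Averaging q over the policy's action probabilities recovers v.
print(np.allclose((pi * q).sum(axis=1), v))
```

The final check follows from comparing the formula for \(q_\pi(s, a)\) with the last line of the derivation of \(v_\pi(s)\): the state value is the policy-weighted average of the action values.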