Learning a model of facial shape and expression from 4D scans

liblaf, 1/16/23

[PDF] games-cn.org

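## 3. Model formulation

FLAME is described by a function that maps shape, pose, and expression coefficients to $N$ mesh vertices:

$$
M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}} : \mathbb{R}^{\abs{\vec{\beta}} \times \abs{\vec{\theta}} \times \abs{\vec{\psi}}} \to \mathbb{R}^{3 N}
$$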

- $\vec{\beta}$ --- coefficients describing shape
- $\vec{\theta}$ --- coefficients describing pose
- $\vec{\psi}$ --- coefficients describing expression
- $\overline{\vb{T}} \in \mathbb{R}^{3 N}$ --- template mesh in the “zero pose” $\vec{\theta}^{*}$
- $B_S\pqty{\vec{\beta}; \mathcal{S}} : \mathbb{R}^{\abs{\vec{\beta}}} \to \mathbb{R}^{3 N}$ --- shape blendshape function that accounts for identity-related shape variation
- $B_P\pqty{\vec{\theta}; \mathcal{P}} : \mathbb{R}^{\abs{\vec{\theta}}} \to \mathbb{R}^{3 N}$ --- corrective pose blendshapes that correct pose deformations which linear blend skinning (LBS) alone cannot explain
- $B_E\pqty{\vec{\psi}; \mathcal{E}} : \mathbb{R}^{\abs{\vec{\psi}}} \to \mathbb{R}^{3 N}$ --- expression blendshapes that capture facial expressions
- $W\pqty{\overline{\vb{T}}, \vb{J}, \vec{\theta}, \mathcal{W}}$ --- a standard skinning function that rotates the vertices of $\overline{\vb{T}}$ around joints $\vb{J} \in \mathbb{R}^{3 K}$, linearly smoothed by blendweights $\mathcal{W} \in \mathbb{R}^{K \times N}$

$$
M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}} = W\pqty{T_P\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}, \vb{J}\pqty{\vec{\beta}}, \vec{\theta}, \mathcal{W}}
$$

$$
T_P\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}} = \overline{\vb{T}} + B_S\pqty{\vec{\beta}; \mathcal{S}} + B_P\pqty{\vec{\theta}; \mathcal{P}} + B_E\pqty{\vec{\psi}; \mathcal{E}}
$$

$$
\vb{J}\pqty{\vec{\beta}; \mathcal{T}, \overline{\vb{T}}, \mathcal{S}} = \mathcal{T}\pqty{\overline{\vb{T}} + B_S\pqty{\vec{\beta}; \mathcal{S}}}
$$

- $\mathcal{T}$ --- a sparse matrix defining how to compute joint locations from mesh vertices
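The additive template and the joint regression above can be sketched in a few lines of NumPy. This is only an illustration on random toy data: the array names, the `(x, y, z)`-per-vertex flattening, and the dense `(K, N)` regressor applied per coordinate are assumptions, not details taken from the paper, and the pose offsets $B_P$ are passed in precomputed.

```python
import numpy as np

def shaped_template(T_bar, S, E, beta, psi, pose_offsets):
    """T_P = T_bar + B_S(beta) + B_P(theta) + B_E(psi), all in R^{3N}."""
    return T_bar + S @ beta + pose_offsets + E @ psi

def joints(T_bar, S, J_reg, beta):
    """J(beta): regressor applied to the shaped template; returns (K, 3)."""
    verts = (T_bar + S @ beta).reshape(-1, 3)   # (N, 3), assumed xyz order
    return J_reg @ verts

# Toy data (hypothetical sizes, random values).
rng = np.random.default_rng(0)
N, K, n_shape, n_expr = 6, 2, 3, 2
T_bar = rng.normal(size=3 * N)
S = rng.normal(size=(3 * N, n_shape))
E = rng.normal(size=(3 * N, n_expr))
J_reg = rng.random(size=(K, N))
J_reg /= J_reg.sum(axis=1, keepdims=True)       # each joint is a weighted mean

beta = rng.normal(size=n_shape)
psi = rng.normal(size=n_expr)
T_P = shaped_template(T_bar, S, E, beta, psi, np.zeros(3 * N))
J = joints(T_bar, S, J_reg, beta)
```

Note that the joints depend only on $\vec{\beta}$: pose and expression offsets move the surface but not the joint locations.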

### Shape blendshapes

$$
B_S\pqty{\vec{\beta}; \mathcal{S}} = \sum_{n = 1}^{\abs{\vec{\beta}}} \beta_n \vb{S}_n
$$

- $\vec{\beta} = \bqty{\beta_1, \dots, \beta_{\abs{\vec{\beta}}}}^T$ --- shape coefficients
- $\mathcal{S} = \bqty{\vb{S}_1, \dots, \vb{S}_{\abs{\vec{\beta}}}} \in \mathbb{R}^{3 N \times \abs{\vec{\beta}}}$ --- orthonormal shape basis, learned with PCA (Section 6.3)
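Because the basis is orthonormal, the blendshape sum is a plain matrix-vector product, and the coefficients can be read back by projecting onto the basis. A minimal sketch, assuming a random orthonormal basis built with a QR factorization in place of a trained PCA basis:

```python
import numpy as np

def shape_blendshape(beta, S):
    """B_S(beta; S) = sum_n beta_n * S_n, written as a matrix-vector product."""
    return S @ beta

rng = np.random.default_rng(0)
N, n_shape = 5, 3
# Stand-in basis with orthonormal columns; a real one would come from PCA.
S, _ = np.linalg.qr(rng.normal(size=(3 * N, n_shape)))

beta = np.array([0.5, -1.0, 2.0])
offsets = shape_blendshape(beta, S)   # vertex offsets in R^{3N}
beta_recovered = S.T @ offsets        # S^T S = I, so this returns beta
```

The same product applies to the expression blendshapes, with the expression basis and coefficients in place of the shape ones.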

### Pose blendshapes

$$
B_P\pqty{\vec{\theta}; \mathcal{P}} = \sum_{n = 1}^{9 K} \pqty{R_n\pqty{\vec{\theta}} - R_n\pqty{\vec{\theta}^{*}}} \vb{P}_n
$$

- $R\pqty{\vec{\theta}} : \mathbb{R}^{\abs{\vec{\theta}}} \to \mathbb{R}^{9 K}$ --- a function from a face / head / eye pose vector $\vec{\theta}$ to a vector containing the concatenated elements of all the corresponding rotation matrices
- $R_n\pqty{\vec{\theta}}, R_n\pqty{\vec{\theta}^{*}}$ --- the $n$-th elements of $R\pqty{\vec{\theta}}$ and $R\pqty{\vec{\theta}^{*}}$
- $\vb{P}_n \in \mathbb{R}^{3 N}$ --- vertex offsets from the rest pose activated by $R_n$
- $\mathcal{P} = \bqty{\vb{P}_1, \dots, \vb{P}_{9 K}} \in \mathbb{R}^{3 N \times 9 K}$ --- pose space, a matrix containing all pose blendshapes
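The pose blendshapes are linear in the rotation-matrix entries, not in the pose vector itself. The sketch below assumes each of the $K$ joints is parameterized by a 3-vector in axis-angle form and that the zero pose corresponds to identity rotations; both are assumptions made for illustration, and the function names are hypothetical.

```python
import numpy as np

def rodrigues(axis_angle):
    """Rotation matrix for one axis-angle vector (Rodrigues' formula)."""
    angle = np.linalg.norm(axis_angle)
    if angle < 1e-12:
        return np.eye(3)
    k = axis_angle / angle
    Kx = np.array([[0.0, -k[2], k[1]],
                   [k[2], 0.0, -k[0]],
                   [-k[1], k[0], 0.0]])
    return np.eye(3) + np.sin(angle) * Kx + (1.0 - np.cos(angle)) * (Kx @ Kx)

def rotation_features(theta):
    """R(theta): entries of all K rotation matrices, concatenated into R^{9K}."""
    return np.concatenate([rodrigues(t).ravel() for t in theta.reshape(-1, 3)])

def pose_blendshape(theta, P, theta_star=None):
    """B_P(theta; P) = sum_n (R_n(theta) - R_n(theta*)) * P_n."""
    if theta_star is None:
        theta_star = np.zeros_like(theta)   # assumed zero pose
    return P @ (rotation_features(theta) - rotation_features(theta_star))
```

Subtracting $R\pqty{\vec{\theta}^{*}}$ guarantees that the correction vanishes in the zero pose, so the template is left untouched there.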

### Expression blendshapes

$$
B_E\pqty{\vec{\psi}; \mathcal{E}} = \sum_{n = 1}^{\abs{\vec{\psi}}} \psi_n \vb{E}_n
$$

- $\vec{\psi} = \bqty{\psi_1, \dots, \psi_{\abs{\vec{\psi}}}}^T$ --- expression coefficients
- $\mathcal{E} = \bqty{\vb{E}_1, \dots, \vb{E}_{\abs{\vec{\psi}}}} \in \mathbb{R}^{3 N \times \abs{\vec{\psi}}}$ --- orthonormal expression basis

### Template shape

## 4. Temporal registration

### 4.1. Initial model

#### Shape

#### Pose

#### Expression

### 4.2. Single-frame registration

#### Model-only

Estimate the model coefficients $\Bqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}$ by optimizing

$$
E\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}} = E_D + \lambda_L E_L + E_P
$$

$$
E_D = \lambda_D \sum_{\vb{v}_s} \rho\pqty{\min_{\vb{v}_m \in M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}} \norm{\vb{v}_s - \vb{v}_m}}
$$

- $E_D$ --- measures the scan-to-mesh distance between each scan vertex $\vb{v}_s$ and the closest point on the surface of the model
- $\vb{v}_s$ --- scan vertices
- $\lambda_D$ --- weight controlling the influence of the data term
- $\rho$ --- a Geman-McClure robust penalty function
- $E_L$ --- a landmark term, measuring the L2-norm distance between image landmarks and corresponding vertices on the model template, projected into the image using the known camera calibration

$$
E_P = \lambda_{\vec{\theta}} E_{\vec{\theta}} + \lambda_{\vec{\beta}} E_{\vec{\beta}} + \lambda_{\vec{\psi}} E_{\vec{\psi}}
$$

- $E_P$ --- regularizes the pose coefficients $\vec{\theta}$, shape coefficients $\vec{\beta}$, and expression coefficients $\vec{\psi}$ to be close to zero by penalizing their squared values
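A rough sketch of the data and prior terms follows. Two simplifications are assumed: the closest point on the model surface is approximated by the closest model vertex, and the robust penalty uses one common form of the Geman-McClure function, $\rho(r) = r^2 / (r^2 + \sigma^2)$, with a hypothetical scale `sigma`. The landmark term is omitted because it needs camera calibration.

```python
import numpy as np

def geman_mcclure(r, sigma=1.0):
    """One common Geman-McClure form: r^2 / (r^2 + sigma^2), bounded by 1."""
    r2 = np.square(r)
    return r2 / (r2 + sigma ** 2)

def data_term(scan_pts, model_verts, lam_d=1.0, sigma=1.0):
    """E_D, with the nearest model vertex standing in for the nearest surface point."""
    diff = scan_pts[:, None, :] - model_verts[None, :, :]   # (n_scan, n_model, 3)
    nearest = np.linalg.norm(diff, axis=2).min(axis=1)      # (n_scan,)
    return lam_d * geman_mcclure(nearest, sigma).sum()

def prior_term(theta, beta, psi, lam_theta=1.0, lam_beta=1.0, lam_psi=1.0):
    """E_P: weighted squared norms pulling all coefficients toward zero."""
    return (lam_theta * (theta @ theta)
            + lam_beta * (beta @ beta)
            + lam_psi * (psi @ psi))
```

Because the penalty saturates for large residuals, scan points far from the model (noise, hair, clothing) contribute a bounded cost and do not dominate the fit.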

#### Coupled

Allow the optimization to leave the model space by optimizing

$$
E\pqty{\vb{T}, \vec{\beta}, \vec{\theta}, \vec{\psi}} = E_D + E_C + E_R + E_P
$$

- $\vb{T}$ --- template mesh
- $E_D$ --- measures the scan-to-mesh distance from the scan to the aligned mesh $\vb{T}$
- $E_C$ --- constrains $\vb{T}$ to be close to the current statistical model by penalizing edge differences between $\vb{T}$ and the model $M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}$:

$$
E_C = \sum_e \lambda_e \norm{\vb{T}_e - M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}_e}
$$

- $\vb{T}_e, M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}_e$ --- edges of $\vb{T}$ and $M\pqty{\vec{\beta}, \vec{\theta}, \vec{\psi}}$
- $\lambda_e$ --- an individual weight assigned to each edge
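The edge-based coupling term can be sketched as below, assuming both meshes share the same topology and that edges are given as pairs of vertex indices (a hypothetical `edges` array). The norm is left unsquared, matching the formula as written here.

```python
import numpy as np

def edge_vectors(verts, edges):
    """Edge vectors v_b - v_a for each (a, b) index pair; shape (n_edges, 3)."""
    return verts[edges[:, 1]] - verts[edges[:, 0]]

def coupling_term(T, M, edges, lam_e):
    """E_C = sum_e lam_e * ||T_e - M_e|| over corresponding edges."""
    d = edge_vectors(T, edges) - edge_vectors(M, edges)
    return float(np.sum(lam_e * np.linalg.norm(d, axis=1)))

# Toy triangle: the model mesh M and an aligned mesh T scaled by 2.
M = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
edges = np.array([[0, 1], [1, 2], [2, 0]])
lam_e = np.ones(len(edges))
cost_scaled = coupling_term(2.0 * M, M, edges, lam_e)
cost_shifted = coupling_term(M + 5.0, M, edges, lam_e)
```

Comparing edges instead of vertex positions makes the term invariant to translation, so `cost_shifted` is zero while `cost_scaled` is not.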

$$
E_R = \frac{1}{N} \sum_{k = 1}^N \lambda_k \norm{U\pqty{\vb{v}_k}}^2
$$
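A minimal sketch of this regularizer, reading $U\pqty{\vb{v}}$ as the mean offset from a vertex to its one-ring neighbours (a discrete Laplacian). The triangle-list input and helper names are assumptions for illustration, and every vertex is assumed to belong to at least one face.

```python
import numpy as np

def one_ring(n_verts, faces):
    """Neighbour sets N(v), built from triangle faces given as index triples."""
    nbrs = [set() for _ in range(n_verts)]
    for a, b, c in faces:
        nbrs[a].update((b, c))
        nbrs[b].update((a, c))
        nbrs[c].update((a, b))
    return nbrs

def laplacian_term(verts, faces, lam):
    """E_R = (1/N) * sum_k lam_k * ||U(v_k)||^2."""
    nbrs = one_ring(len(verts), faces)
    total = 0.0
    for k, ring in enumerate(nbrs):
        u = verts[sorted(ring)].mean(axis=0) - verts[k]   # U(v_k)
        total += lam[k] * (u @ u)
    return total / len(verts)

# Regular tetrahedron centred at the origin.
tet = np.array([[1.0, 1.0, 1.0], [1.0, -1.0, -1.0],
                [-1.0, 1.0, -1.0], [-1.0, -1.0, 1.0]])
tet_faces = [(0, 1, 2), (0, 1, 3), (0, 2, 3), (1, 2, 3)]
e_r = laplacian_term(tet, tet_faces, np.ones(4))
```

The term is small where each vertex sits near the centroid of its neighbours, so it favours smooth deformations of the aligned mesh.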

- $E_R$ --- regularization term for each vertex $\vb{v}_k \in \mathbb{R}^3$ in $\vb{T}$
- $U\pqty{\vb{v}} = \frac{\sum_{\vb{v}_r \in \mathcal{N}\pqty{\vb{v}}} \pqty{\vb{v}_r - \vb{v}}}{\abs{\mathcal{N}\pqty{\vb{v}}}}$
- $\mathcal{N}\pqty{\vb{v}}$ --- the set of vertices in the one-ring neighborhood of $\vb{v}$

#### Texture-based

$$
E\pqty{\vb{T}, \vec{\beta}, \vec{\theta}, \vec{\psi}} = E_D + E_C + \lambda_T E_T + E_R + E_P
$$

- $E_T$ --- measures the **photometric error** between real image $I$ and the rendered textured image $\hat{I}$ of $\vb{T}$ from all $V$ views

$$
E_T = \sum_{l = 0}^3 \sum_{v = 1}^V \norm{\Gamma\pqty{I_l^{\pqty{v}}} - \Gamma\pqty{\hat{I}_l^{\pqty{v}}}}_F^2
$$

- $\norm{\vb{X}}_F$ --- Frobenius norm of $\vb{X}$
- $\Gamma$ --- ratio of Gaussian filters, which helps minimize the influence of lighting changes between real and rendered images
- $I_l^{\pqty{v}}$ --- the image $I$ at resolution level $l$ from view $v$

### 4.3. Sequential registration

#### Personalization

- use a coupled registration (Equation $\eqref{eq:9}$) and average the results $\vb{T}_i$ across multiple sequences to get a personalized template for each subject
- randomly select one of the $\vb{T}_i$ for each subject to generate a personalized texture map

#### Sequence fitting

- replace the generic model template $\overline{\vb{T}}$ in $M$ (Equation $\eqref{eq:1}$) with the personalized template
- fix $\vec{\beta}$ to zero
- initialize the model parameters from the previous frame and run the single-frame registration (Section 4.2)

## 6. Model training

**Decouple** shape, pose, and expression variations:

- $\Bqty{\mathcal{P}, \mathcal{W}, \mathcal{T}}$ --- pose parameters
- $\mathcal{E}$ --- expression parameters
- $\Bqty{\overline{\vb{T}}, \mathcal{S}}$ --- shape parameters

### 6.1. Pose parameter training

- $\vb{T}_i^P$ --- personalized rest-pose templates
- $\vb{J}_i^P$ --- person-specific joints
- $\mathcal{W}$ --- blendweights
- $\mathcal{P}$ --- pose blendshapes
- $\mathcal{T}$ --- joint regressor

Alternate between:

- solve for the pose parameters $\vec{\theta}_j$ of each registration $j$
- optimize the subject-specific parameters $\Bqty{\vb{T}_i^P, \vb{J}_i^P}$
- optimize the global parameters $\Bqty{\mathcal{W}, \mathcal{P}, \mathcal{T}}$

The objective function being optimized consists of:

- data term $E_D$ --- penalizes the squared Euclidean reconstruction error of the training data
- regularization term $E_{\mathcal{P}}$ --- penalizes the Frobenius norm of the pose blendshapes
- regularization term $E_{\mathcal{W}}$ --- penalizes large deviations of the blendweights from their initialization

> To avoid $\vb{T}_i^P$ and $\vb{J}_i^P$ being affected by strong facial expressions, expression effects are removed when solving for $\vb{T}_i^P$ and $\vb{J}_i^P$. This is done by jointly solving for pose $\vec{\theta}$ and expression parameters $\vec{\psi}$ for each registration, subtracting $B_E$ (Equation $\eqref{eq:5}$), and solving for $\vb{T}_i^P$ and $\vb{J}_i^P$ on those residuals.

### 6.2. Expression parameter training

- solve for the pose parameters $\vec{\theta}_j$ of each registration $j$
- **unpose**: remove the pose influence by applying the inverse transformation entailed by $M\pqty{\vec{0}, \vec{\theta}, \vec{0}}$ (Equation $\eqref{eq:1}$)
  - $\vb{V}_j^U$ --- the vertices resulting from unposing the registration $j$
  - $\vb{V}_i^{NE}$ --- the vertices of the neutral expression of subject $i$, also unposed
- compute expression residuals $\vb{V}_j^U - \vb{V}_{s\pqty{j}}^{NE}$ for each registration $j$
  - $s\pqty{j}$ --- the subject index of registration $j$
- compute the expression space $\mathcal{E}$ by applying PCA to these residuals

### 6.3. Shape parameter training

- $\overline{\vb{T}}$ --- computed as the mean of the expression- and pose-normalized registrations
- $\mathcal{S}$ --- formed by the first $\abs{\vec{\beta}}$ principal components computed using PCA

### 6.4. Optimization structure

Due to the high capacity and flexibility of the expression space formulation, **pose blendshapes** should be trained **before** **expression parameters** in order to avoid expression overfitting.
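The PCA steps in Sections 6.2 and 6.3 can be sketched with a singular value decomposition. The example below builds expression residuals from random toy data and extracts a small basis; the array names and sizes are hypothetical, and mean-centering the residuals is the standard PCA convention assumed here, not something these notes specify.

```python
import numpy as np

def pca_basis(samples, n_components):
    """Leading principal directions of row-wise samples, returned as columns."""
    centered = samples - samples.mean(axis=0)   # assumed: standard centering
    _, sing, vt = np.linalg.svd(centered, full_matrices=False)
    variances = sing ** 2 / (len(samples) - 1)
    return vt[:n_components].T, variances[:n_components]

# Toy stand-ins for unposed registrations and unposed neutral meshes.
rng = np.random.default_rng(0)
n_reg, n_subj, N = 20, 4, 5
subject_of = rng.integers(0, n_subj, size=n_reg)   # s(j) for each registration j
V_ne = rng.normal(size=(n_subj, 3 * N))            # V_i^{NE}, flattened
V_u = V_ne[subject_of] + 0.1 * rng.normal(size=(n_reg, 3 * N))   # V_j^U

residuals = V_u - V_ne[subject_of]                 # expression residuals
E_basis, E_var = pca_basis(residuals, n_components=3)
```

The columns of `E_basis` are orthonormal, which matches the orthonormal expression basis assumed in the model formulation. The same helper applied to pose- and expression-normalized registrations would give a shape basis.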