In this work, we present FitDiff, a diffusion-based 3D facial avatar generative model. This model accurately generates relightable facial avatars, utilizing an identity embedding extracted from an "in-the-wild" 2D facial image.
Our multi-modal diffusion model concurrently outputs facial reflectance maps (albedo, specular, and normals) and shapes, demonstrating strong generalization capabilities. It is trained solely on an annotated subset of a public facial dataset paired with 3D reconstructions. We revisit the typical 3D facial fitting approach by guiding a reverse diffusion process with perceptual and face recognition losses.
Being the first latent diffusion model (LDM) conditioned on face recognition embeddings, FitDiff reconstructs relightable human avatars that can be used as-is in common rendering engines, starting only from a single unconstrained facial image, while achieving state-of-the-art performance.
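The guidance mechanism can be pictured as a classifier-guidance-style correction applied at each denoising step. The following is a minimal, hypothetical PyTorch sketch of one guided reverse-diffusion step; the callables (`denoiser`, `scheduler_step`, `decode`, `render`, `id_encoder`) and the simplified losses are placeholder assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def guided_reverse_step(z_t, t, denoiser, scheduler_step, decode, render,
                        id_encoder, target_id_emb, target_image,
                        guidance_scale=1.0):
    """One reverse-diffusion step steered by identity and perceptual losses (sketch)."""
    z_t = z_t.detach().requires_grad_(True)

    # Predict noise and take an unguided scheduler step on the latent.
    eps = denoiser(z_t, t)
    z_prev = scheduler_step(z_t, eps, t)

    # Decode the current latent into avatar assets and render them differentiably.
    assets = decode(z_prev)          # placeholder: reflectance maps + shape
    rendered = render(assets)        # placeholder: differentiable renderer

    # Identity loss: match face-recognition embeddings of the rendering and the target.
    id_loss = 1.0 - F.cosine_similarity(id_encoder(rendered),
                                        target_id_emb, dim=-1).mean()
    # Perceptual/photometric term against the input image (simplified to L1 here).
    percep_loss = F.l1_loss(rendered, target_image)

    # Backpropagate the combined guidance loss to the noisy latent.
    grad = torch.autograd.grad(id_loss + percep_loss, z_t)[0]

    # Nudge the denoised latent against the guidance gradient.
    return (z_prev - guidance_scale * grad).detach()
```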
FitDiff can also generate diverse facial identities unconditionally, without any input image. The resulting assets offer significant potential across a range of applications, from augmenting and enriching existing datasets to creating genuinely random identities for computer-based applications.