Asset Generation
High-quality object meshes are essential for many use cases in movies, gaming, e-commerce, and AR/VR. In this work, we tackle the problem of generating a high-quality 3D object mesh from a single image. This is an ill-posed and challenging problem, as it requires reasoning about the object's 3D shape and texture from only a single 2D projection (image) of that object. Single-image object generation can simplify the tedious, manual object creation process.
We achieve text-to-3D generation by utilizing a pretrained text-to-image diffusion model as an image prior to optimize a 3D representation parameterized by θ. The image x = g(θ), rendered at random viewpoints by a volumetric renderer, is expected to represent a sample drawn from the text-conditioned image distribution p(x|y) modeled by the pretrained diffusion model φ. The diffusion model is trained to predict the noise ε_φ(x_t; y, t) added to the noisy image x_t at noise level t, conditioned on the text prompt y. A score distillation sampling (SDS) loss encourages the rendered images to match the distribution modeled by the diffusion model. Specifically, the SDS loss computes the gradient:

∇_θ L_SDS(φ, x = g(θ)) = E_{t,ε}[ ω(t) (ε_φ(x_t; y, t) − ε) ∂x/∂θ ],

which is the per-pixel difference between the noise predicted by the diffusion model and the noise added to the rendered image, where ω(t) is a noise-level-dependent weighting function and the expectation is taken over noise levels t and noise samples ε.
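The SDS gradient above can be sketched in a few lines. This is a minimal, self-contained illustration with NumPy: `render` and `eps_pred` are hypothetical stand-ins for the differentiable renderer g(θ) and the pretrained denoiser ε_φ, and the noise schedule is a simplified assumption, not the one used by any particular diffusion model.

```python
import numpy as np

rng = np.random.default_rng(0)

def render(theta):
    # Stand-in for the differentiable renderer x = g(theta);
    # the identity map here, so dx/dtheta is also the identity.
    return theta

def eps_pred(x_t, t):
    # Stand-in for the pretrained denoiser eps_phi(x_t; y, t).
    # A real model would condition on the text prompt y.
    return 0.9 * x_t

def sds_grad(theta, omega=lambda t: 1.0):
    x = render(theta)
    t = rng.uniform(0.02, 0.98)             # sampled noise level
    eps = rng.standard_normal(x.shape)      # added Gaussian noise
    alpha = 1.0 - t                         # toy noise schedule (assumption)
    x_t = np.sqrt(alpha) * x + np.sqrt(1.0 - alpha) * eps
    # SDS gradient: omega(t) * (eps_phi(x_t; y, t) - eps) * dx/dtheta,
    # skipping the U-Net Jacobian as in score distillation sampling.
    return omega(t) * (eps_pred(x_t, t) - eps)

theta = rng.standard_normal((4, 4))
g = sds_grad(theta)  # one stochastic gradient estimate, same shape as theta
```

In practice the expectation over t and ε is approximated by averaging such single-sample estimates across optimization steps.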
One way to improve the generation quality of a conditional diffusion model is the classifier-free guidance (CFG) technique, which steers the sampling slightly away from the unconditional prediction, i.e., ε̂ = ε_φ(x_t; y, t) + s (ε_φ(x_t; y, t) − ε_φ(x_t; t, ∅)), where s is the guidance scale and ∅ represents the "empty" text prompt.
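The CFG combination is a simple linear extrapolation of the two noise predictions. A minimal sketch, where the conditional and unconditional predictions are placeholder arrays rather than outputs of a real denoiser:

```python
import numpy as np

def cfg(eps_cond, eps_uncond, s):
    # Classifier-free guidance: push the conditional prediction
    # away from the unconditional one by guidance scale s.
    return eps_cond + s * (eps_cond - eps_uncond)

eps_cond = np.array([1.0, 2.0])    # eps_phi(x_t; y, t), placeholder values
eps_uncond = np.array([0.5, 1.0])  # eps_phi(x_t; t, empty prompt)
guided = cfg(eps_cond, eps_uncond, s=2.0)
# guided = eps_cond + 2.0 * (eps_cond - eps_uncond) = [2.0, 4.0]
```

Setting s = 0 recovers the plain conditional prediction; larger s trades sample diversity for stronger adherence to the prompt.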
Text-to-image synthesis has been explored with models such as Imagen [22] and DALL-E 3 [23]. We use diffusion models for the text-to-image task, as they tend to be faster. Many text-to-image models also give users control over the generated picture by prompting the model to change specific regions of an image via text, a capability known as inpainting.
However, inpainting for these models is often limited to text conditioning. Our method first leverages a state-of-the-art text-to-image generative model to generate a high-quality 2D image from a text prompt. In this way, we can exploit the full power of state-of-the-art 2D diffusion models to depict the intricate visual semantics described in the text, retaining the same creative freedom as 2D models.
We then lift this image to 3D through cascaded stages of geometric sculpting and texture boosting. By decomposing the problem, we can apply specialized techniques at each stage. For geometry, we prioritize multi-view consistency and global 3D structure, allowing for some compromise on detailed textures. With the geometry fixed, we then focus solely on optimizing realistic and coherent texture, for which we jointly learn a 3D-aware diffusion prior that bootstraps the 3D optimization. In the following, we elaborate on the key design considerations for these two stages.