This workshop is also streamed virtually on Zoom. For access (registration required), click here.
Overview
Virtual Try-On is an emerging consumer application that enables users to see products on their own bodies in a virtual or mixed-reality space. The retail e-commerce industry is rapidly adopting these technologies, letting shoppers visualize products, especially in the beauty, fashion, and accessories space, before they make purchases, and opening opportunities to customize and personalize products. In principle, try-on experiences can have a significant environmental impact by reducing product returns, improving satisfaction with purchased items, and improving accessibility. Enabling these applications requires solving diverse challenges in computer vision, 3D modeling and reconstruction, geometry processing and learning, generative AI, and perception, making this an active, multi-disciplinary area of research. The primary goal of this inaugural workshop is to bring together expert academic and industry researchers, as well as young researchers working in this space, to present, discuss, and understand the state of the art and the open challenges that are core to enabling a convincing, useful, and safe try-on experience.
Keynote Speakers
Ming Lin
Ira Kemelmacher-Shlizerman
Gerard Pons-Moll
Sunil Hadap
Invited Short Talks
Abstract: We present M&M VTO, a mix-and-match virtual try-on method that takes as input multiple garment images, a text description of the garment layout, and an image of a person. An example input includes: an image of a shirt, an image of a pair of pants, "rolled sleeves, shirt tucked in", and an image of a person. The output is a visualization of how those garments, in the desired layout, would look on the given person. Key contributions of our method are: 1) a single-stage diffusion-based model, with no super-resolution cascading, that allows mixing and matching multiple garments at 1024x512 resolution while preserving and warping intricate garment details; 2) an architecture design (VTO UNet Diffusion Transformer) that disentangles denoising from person-specific features, allowing for a highly effective finetuning strategy for identity preservation (a 6 MB model per individual vs. 4 GB with, e.g., DreamBooth finetuning), solving a common identity-loss problem in current virtual try-on methods; 3) layout control for multiple garments via text inputs, specifically finetuned over PaLI-3 for the virtual try-on task. Experimental results indicate that M&M VTO achieves state-of-the-art performance both qualitatively and quantitatively, and opens up new opportunities for language-guided and multi-garment try-on.
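The per-identity finetuning footprint quoted above (about 6 MB per person versus 4 GB for full-model finetuning) is consistent with updating only a small person-specific module while the large diffusion backbone stays frozen. The sketch below illustrates that general pattern only; the module names, sizes, and loss are hypothetical stand-ins, not the authors' architecture.

```python
import torch
import torch.nn as nn

# Illustrative stand-in for a large, frozen try-on backbone (hypothetical sizes).
backbone = nn.Sequential(nn.Linear(512, 2048), nn.GELU(), nn.Linear(2048, 512))
for p in backbone.parameters():
    p.requires_grad_(False)  # backbone is not updated during identity finetuning

class PersonAdapter(nn.Module):
    """Tiny person-specific module: only these weights are trained and saved
    per identity, which is why the per-person checkpoint can be megabytes
    rather than gigabytes."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.identity_embed = nn.Parameter(torch.zeros(1, dim))
        self.proj = nn.Linear(dim, dim)

    def forward(self, person_feat: torch.Tensor) -> torch.Tensor:
        # Inject the learned identity embedding into the person features.
        return person_feat + self.proj(self.identity_embed)

adapter = PersonAdapter()
opt = torch.optim.AdamW(adapter.parameters(), lr=1e-4)

# One toy optimization step with random tensors standing in for real features.
person_feat = torch.randn(4, 512)
target_feat = torch.randn(4, 512)
pred = backbone(adapter(person_feat))
loss = nn.functional.mse_loss(pred, target_feat)
loss.backward()
opt.step()

trainable = sum(p.numel() for p in adapter.parameters())
frozen = sum(p.numel() for p in backbone.parameters())
print(f"per-identity trainable params: {trainable}, frozen backbone params: {frozen}")
```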
Abstract: In this talk, we introduce ucVTON, a novel approach for photorealistic virtual try-on of personalized clothing on human images. Unlike previous methods limited by input types, ours allows flexible specification of style (text or image) and texture (full garment, cropped sections, or patches). To tackle the entanglement challenge posed by full garment inputs, we use a two-stage pipeline that separates style and texture: we first generate a human parsing map reflecting the desired style, and then composite textures onto it based on the input. Our method introduces hierarchical CLIP features and position encoding to VTON for complex, non-stationary textures, setting a new standard in fashion editing.
Abstract: The growing digital landscape of fashion e-commerce calls for interactive and user-friendly interfaces for virtually trying on clothes. Traditional try-on methods struggle to adapt to diverse backgrounds, poses, and subjects. While newer methods, building on recent advances in diffusion models, achieve higher-quality image generation, the human-centered dimensions of mobile interface delivery and privacy remain largely unexplored. We present Mobile Fitting Room, the first on-device diffusion-based virtual try-on system. To address multiple inter-related technical challenges, such as high-quality garment placement and model compression for mobile devices, we present a novel technical pipeline and an interface design that enables privacy preservation and user customization. A usage scenario highlights how our tool can provide a seamless, interactive virtual try-on experience for customers and a valuable service for fashion e-commerce businesses.
Abstract: Despite recent advances in content generation and rendering with generative models, real-time video virtual try-on remains challenging, especially on mobile devices and in web browsers. We present our framework and a series of works that bridge the gap between state-of-the-art neural networks and real-world deployment under device and data limitations.
Abstract: We introduce two types of makeup prior models, PCA-based and StyleGAN2-based, to enhance existing 3D face prior models. These priors are pivotal in estimating 3D makeup patterns from single makeup face images. Such patterns play a significant role in a broad spectrum of makeup-related applications, substantially enriching virtual try-on technologies with more realistic and customizable experiences. Our contributions support crucial functionalities, including 3D makeup face reconstruction, user-friendly makeup editing, makeup removal, makeup transfer, and interpolation.
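As a rough illustration of what a PCA-based prior of this kind looks like in practice, the snippet below builds a mean texture plus a low-dimensional basis from synthetic data; all names and sizes are made up and this is not the speakers' model. Interpolating the coefficients of two faces gives a simple form of makeup interpolation.

```python
import numpy as np

# Synthetic "makeup texture" dataset: N flattened textures (toy sizes).
rng = np.random.default_rng(0)
N, D = 200, 32 * 32 * 3            # 200 samples of a 32x32 RGB texture
textures = rng.normal(size=(N, D))

# Build the PCA prior: mean texture + top-k principal components.
mean = textures.mean(axis=0)
U, S, Vt = np.linalg.svd(textures - mean, full_matrices=False)
k = 10
components = Vt[:k]                # (k, D) basis of makeup variation

def encode(texture: np.ndarray) -> np.ndarray:
    """Project a texture onto the low-dimensional makeup coefficients."""
    return (texture - mean) @ components.T

def decode(coeffs: np.ndarray) -> np.ndarray:
    """Reconstruct (or edit) a makeup texture from coefficients."""
    return mean + coeffs @ components

coeffs = encode(textures[0])
recon = decode(coeffs)
print("reconstruction error:", np.linalg.norm(recon - textures[0]))

# Blending coefficients of two faces is a simple form of makeup interpolation.
blend = decode(0.5 * (encode(textures[0]) + encode(textures[1])))
```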
Abstract: Makeup transfer aims to realistically and naturally reproduce diverse makeup styles on a given face image. Due to the inherently unsupervised nature of makeup transfer, most previous approaches adopt a pseudo-ground-truth (PGT) guided strategy for model training. In this talk, we first show that the quality of the pseudo ground truth is the key factor limiting the performance of makeup transfer. We then propose a Content-Style Decoupled Makeup Transfer (CSD-MT) method, which works in a purely unsupervised manner and thus eliminates the negative effects of generating PGTs. Finally, extensive quantitative and qualitative analyses show the effectiveness of our CSD-MT method.
Abstract: We present a framework that takes as input a single-layer clothed 3D human mesh and decomposes it into complete multi-layered 3D assets. The outputs can then be combined with other assets to create novel clothed human avatars in any pose. We first separate the input mesh using a 3D surface segmentation extracted from multi-view 2D segmentations. We then synthesize the missing geometry of the different layers in both posed and canonical spaces using a novel pose-guided Score Distillation Sampling (SDS) loss. Once the high-fidelity 3D geometry has been inpainted, we apply the same SDS loss to its texture to obtain the complete appearance, including the initially occluded regions. Through a series of decomposition steps, we obtain multiple layers of 3D assets in a shared canonical space normalized in terms of poses and human shapes, hence supporting effortless composition onto novel identities and reanimation with novel poses.
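For readers unfamiliar with Score Distillation Sampling, the sketch below shows a generic SDS gradient step: the current latents are noised, a frozen diffusion model predicts the noise, and the weighted difference is injected back as a gradient on the latents. This is a plain SDS illustration with a dummy model, not the pose-guided variant described above; `diffusion_eps_fn`, the weighting, and the noise schedule are assumptions.

```python
import torch

def sds_grad(latents, diffusion_eps_fn, cond, alphas_cumprod, t=None):
    """Generic SDS gradient: noise the latents, query a frozen diffusion
    model, and return the weighted residual between its noise prediction
    and the injected noise. No backprop through the diffusion model."""
    if t is None:
        t = torch.randint(1, len(alphas_cumprod), (1,), device=latents.device)
    alpha_bar = alphas_cumprod[t]
    noise = torch.randn_like(latents)
    noisy = alpha_bar.sqrt() * latents + (1 - alpha_bar).sqrt() * noise
    with torch.no_grad():
        eps_pred = diffusion_eps_fn(noisy, t, cond)   # frozen score model
    w = 1 - alpha_bar                                  # a common weighting choice
    return (w * (eps_pred - noise)).detach()

# Toy usage: latents stand in for a differentiable rendering of one layer,
# and a dummy model stands in for a real pretrained diffusion network.
latents = torch.randn(1, 4, 64, 64, requires_grad=True)
alphas_cumprod = torch.linspace(0.999, 0.01, 1000)
dummy_eps = lambda z, t, cond: torch.zeros_like(z)
grad = sds_grad(latents, dummy_eps, cond=None, alphas_cumprod=alphas_cumprod)
# Standard trick to push the SDS gradient onto the optimized parameters.
loss = (grad * latents).sum()
loss.backward()
print(latents.grad.shape)
```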
Abstract: Given a clothing image and a person image, image-based virtual try-on aims to generate a customized image that appears natural and accurately reflects the characteristics of the clothing. In this presentation, we introduce StableVITON, which learns the semantic correspondence between the clothing and the human body within the latent space of a pre-trained diffusion model in an end-to-end manner. Our proposed zero cross-attention blocks not only preserve clothing details by learning this semantic correspondence but also generate high-fidelity images by utilizing the inherent knowledge of the pre-trained model in the warping process. Through our proposed attention total variation loss and data augmentation, we achieve sharper attention maps, resulting in a more precise representation of clothing details. StableVITON outperforms existing virtual try-on models in both qualitative and quantitative evaluations. Moreover, evaluation of the trained model on multiple datasets demonstrates promising quality in real-world settings.
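The attention total variation loss mentioned above regularizes how cross-attention distributes over the garment. The sketch below shows one illustrative formulation: compute each query's attention centroid on the key grid and penalize total variation of that centroid field, so neighboring query pixels attend to nearby garment regions rather than scattered ones. The paper's exact loss may differ, and all shapes here are toy values; such a term would be used alongside the main denoising objective.

```python
import torch

def attention_centroid_tv(attn: torch.Tensor) -> torch.Tensor:
    """Illustrative attention regularizer (not necessarily StableVITON's
    exact loss). `attn` holds, for each query position on an Hq x Wq grid,
    a normalized attention distribution over an Hk x Wk key grid:
    shape [Hq, Wq, Hk, Wk], each map summing to 1."""
    Hk, Wk = attn.shape[-2:]
    ys = torch.linspace(0, 1, Hk).view(Hk, 1)   # key-grid row coordinates
    xs = torch.linspace(0, 1, Wk).view(1, Wk)   # key-grid column coordinates
    cy = (attn * ys).sum(dim=(-2, -1))          # [Hq, Wq] centroid rows
    cx = (attn * xs).sum(dim=(-2, -1))          # [Hq, Wq] centroid cols
    centers = torch.stack([cy, cx], dim=-1)     # [Hq, Wq, 2]
    # Total variation of the centroid field across neighboring queries.
    tv = (centers[1:, :] - centers[:-1, :]).abs().mean() \
       + (centers[:, 1:] - centers[:, :-1]).abs().mean()
    return tv

# Toy usage with random softmax-normalized attention maps.
logits = torch.randn(24, 16, 12, 8)
attn = torch.softmax(logits.flatten(2), dim=-1).view(24, 16, 12, 8)
print("attention TV penalty:", attention_centroid_tv(attn).item())
```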
Abstract: With the rising success of image foundation models in recent years, we are also witnessing exciting advances in the world of VTO. However, there remains a large gap between theory and practice. In our conversations with customers, we uncovered a number of challenges that are under-explored in academia, namely hyper-fidelity, visual aesthetics, and style diversity. We will share our early efforts in refining the technology into a commercial-grade product and discuss the shortcomings of current evaluation benchmarks in accurately representing industrial needs.
Abstract: As online shopping grows, the ability for buyers to virtually visualize products in their own settings, a capability we define as "Virtual Try-All", has become crucial. Recent diffusion models inherently contain a world model, rendering them suitable for this task within an inpainting context. However, traditional image-conditioned diffusion models often fail to capture the fine-grained details of products. In contrast, personalization-driven models such as DreamPaint are good at preserving the item's details, but they are not optimized for real-time applications. We present "Diffuse to Choose," a novel diffusion-based, image-conditioned inpainting model that efficiently balances fast inference with the retention of high-fidelity details in a given reference item while ensuring accurate semantic manipulation of the given scene content. Our approach incorporates fine-grained features from the reference image directly into the latent feature maps of the main diffusion model, along with a perceptual loss to further preserve the reference item's details. We conduct extensive testing on both in-house and publicly available datasets, and show that Diffuse to Choose is superior to existing zero-shot diffusion inpainting methods as well as few-shot diffusion personalization algorithms such as DreamPaint.
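The core idea of injecting fine-grained reference features into the main model's latent feature maps can be sketched generically: a lightweight encoder processes the reference item, and its features are resized and added to an intermediate feature map of the denoising network (with a perceptual loss on a pretrained feature extractor applied during training to further preserve details). The code below is an illustrative stand-in, not the paper's architecture; `ReferenceInjector` and all sizes are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ReferenceInjector(nn.Module):
    """Hypothetical sketch: a small encoder extracts fine-grained features
    from the reference item image, which are resized and added to an
    intermediate feature map of the main inpainting diffusion network."""
    def __init__(self, channels: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, channels, 3, stride=2, padding=1), nn.SiLU(),
            nn.Conv2d(channels, channels, 3, stride=2, padding=1), nn.SiLU(),
        )
        self.align = nn.Conv2d(channels, channels, 1)  # match main-branch channels

    def forward(self, main_feat: torch.Tensor, ref_image: torch.Tensor) -> torch.Tensor:
        ref_feat = self.align(self.encoder(ref_image))
        ref_feat = F.interpolate(ref_feat, size=main_feat.shape[-2:],
                                 mode="bilinear", align_corners=False)
        # Inject reference details into the latent feature map by addition.
        return main_feat + ref_feat

# Toy usage: a fake intermediate UNet feature map and a reference item crop.
injector = ReferenceInjector(channels=64)
main_feat = torch.randn(1, 64, 32, 32)     # intermediate diffusion feature map
ref_image = torch.randn(1, 3, 128, 128)    # reference product image
fused = injector(main_feat, ref_image)
print(fused.shape)                          # torch.Size([1, 64, 32, 32])
```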
Schedule
Organizers
Vidya Narayanan
Sunil Hadap
Javier Romero
Katie Lewis
Hanbyul Joo
Alla Sheffer
Hao (Richard) Zhang
Contact Info
E-mail: vtocvpr24 AT gmail.com
Header image credits: Sunil Hadap. Medium: ChatGPT 4 / DALL-E. Prompt: "Photorealistic wide aspect image of a lady in simple clothes shopping using virtual try-on technology for a fancy outfit." Seed: 13.
Website based on https://futurecv.github.io/.