Fashion Style Editing with Generative Human Prior

[Figure: overview]

Abstract

Image editing has been a long-standing challenge in the research community, with far-reaching impact on numerous applications. Recently, text-driven methods have started to deliver promising results in domains like human faces, but their application to more complex domains has been relatively limited. In this work, we explore the task of fashion style editing, where we aim to manipulate the fashion style of human imagery using text descriptions. Specifically, we leverage a generative human prior and achieve fashion style editing by navigating its learned latent space. We first verify that existing text-driven editing methods fall short for our problem due to their overly simplified guidance signal, and propose two directions to reinforce the guidance: textual augmentation and visual referencing. Combined with our empirical findings on the latent space structure, our Fashion Style Editing framework (FaSE) successfully projects abstract fashion concepts onto human images and introduces exciting new applications to the field.

Problem: Fashion Style Editing

Our goal is to develop a learning-based system that can manipulate the fashion style of human imagery using text descriptions. This task has a wide range of relevant applications, including virtual try-on, product design and content creation. Despite being a subtask of general image editing, it poses a unique set of challenges due to the innate elusiveness of fashion style concepts (e.g., 'street') and the need for delicate modification of whole-body human images.

Approach: FaSE

[Figure: FaSE framework overview]

We propose Fashion Style Editing (FaSE), a novel framework that leverages a generative human prior to achieve fashion style editing by navigating its learned latent space. We choose StyleGAN-Human (Fu and Li et al., 2022) as our generative prior for its impressive capacity for whole-body human image generation.
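For concreteness, a minimal sketch of sampling from this prior is shown below, assuming the StyleGAN2-ADA-style generator interface (G.mapping / G.synthesis) that the StyleGAN-Human release follows; the checkpoint filename and truncation value are illustrative.

```python
# Minimal sketch: sample a whole-body human image from the pretrained StyleGAN-Human prior.
# Assumes the StyleGAN2-ADA-style codebase shipped with StyleGAN-Human (dnnlib, legacy).
import torch
import dnnlib
import legacy

device = torch.device("cuda")
with dnnlib.util.open_url("stylegan_human_v2_1024.pkl") as f:   # illustrative checkpoint name
    G = legacy.load_network_pkl(f)["G_ema"].to(device).eval()

z = torch.randn(1, G.z_dim, device=device)            # latent z ~ N(0, I)
w = G.mapping(z, None, truncation_psi=0.7)            # W+ code, shape [1, num_ws, 512]
img = G.synthesis(w, noise_mode="const")              # generated image in [-1, 1]
```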


On top of this pretrained human generation model, we envision a text-driven editing pipeline in the spirit of StyleCLIP (Patashnik et al., 2021). However, our initial experiments with a naive StyleCLIP latent-mapper implementation yield little success, as illustrated below.


[Figure: results of the naive StyleCLIP baseline]
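For reference, the naive baseline trains a StyleCLIP-style latent mapper to push the W+ code toward a single CLIP text embedding. The sketch below (PyTorch with OpenAI's CLIP package) illustrates this objective; the prompt, loss weight, and mapper interface are illustrative rather than our exact training code.

```python
import torch
import torch.nn.functional as F
import clip  # OpenAI CLIP: https://github.com/openai/CLIP

device = "cuda"
clip_model, _ = clip.load("ViT-B/32", device=device)
CLIP_MEAN = torch.tensor([0.48145466, 0.4578275, 0.40821073], device=device).view(1, 3, 1, 1)
CLIP_STD = torch.tensor([0.26862954, 0.26130258, 0.27577711], device=device).view(1, 3, 1, 1)

def clip_image_features(images):
    # Generator outputs are in [-1, 1]; map them to CLIP's expected input distribution.
    images = (images + 1) / 2
    images = F.interpolate(images, size=(224, 224), mode="bilinear", align_corners=False)
    images = (images - CLIP_MEAN) / CLIP_STD
    return F.normalize(clip_model.encode_image(images).float(), dim=-1)

# A single text prompt yields a single guidance vector -- the "overly simplified" signal.
with torch.no_grad():
    tokens = clip.tokenize(["street fashion style"]).to(device)
    text_feat = F.normalize(clip_model.encode_text(tokens).float(), dim=-1)

def mapper_loss(mapper, G, w, lambda_l2=0.8):
    delta_w = mapper(w)                                  # residual offset in W+ space
    img = G.synthesis(w + delta_w, noise_mode="const")
    clip_term = (1.0 - clip_image_features(img) @ text_feat.T).mean()
    return clip_term + lambda_l2 * delta_w.norm(dim=-1).mean()
```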

We find that the existing text-driven editing methods fall short for our problem due to their overly simplified guidance signal. To address this, we propose two directions to reinforce the guidance: textual augmentation and visual referencing. The former enriches the guidance signal by generating additional text descriptions, while the latter provides the model with visual references retrieved from our fashion database. These two strategies successfully enhance the guidance signal and enable fashion style editing, as shown below.
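Continuing the snippet above, the sketch below shows one way such reinforced guidance could be wired in: the single text embedding is replaced by a CLIP-space target fused from several augmented prompts and from retrieved reference images. Here `augment_prompt` and `fashion_db.retrieve` are hypothetical helpers standing in for our textual augmentation and fashion-database retrieval.

```python
def build_guidance_target(style, fashion_db, n_refs=8):
    """Fuse augmented text prompts and retrieved reference images into one CLIP-space target."""
    # Textual augmentation: multiple phrasings of the abstract style word.
    prompts = augment_prompt(style)   # hypothetical helper, e.g. "a person wearing street fashion", ...
    with torch.no_grad():
        text_feats = F.normalize(
            clip_model.encode_text(clip.tokenize(prompts).to(device)).float(), dim=-1)

    # Visual referencing: CLIP features of reference images retrieved from the fashion database.
    refs = fashion_db.retrieve(style, k=n_refs)   # hypothetical: [n_refs, 3, 224, 224], CLIP-preprocessed
    with torch.no_grad():
        ref_feats = F.normalize(clip_model.encode_image(refs.to(device)).float(), dim=-1)

    # Average both sources into a single, richer guidance vector.
    target = torch.cat([text_feats, ref_feats], dim=0).mean(dim=0, keepdim=True)
    return F.normalize(target, dim=-1)

# Used in place of `text_feat` in the mapper loss above.
```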


[Figure: fashion style editing results with reinforced guidance]

Experiments

[Figure: comparison of FaSE and the naive StyleCLIP baseline]

As evident from the figure above, our FaSE framework successfully projects abstract fashion concepts onto human images with minimal irrelevant modification to the base image. In contrast, applying the naive StyleCLIP method to the fashion domain fails to visualize high-level target concepts.


[Figure: analysis of the W+ latent space of StyleGAN-Human]

We further share insights on the latent space structure of StyleGAN-Human. When we compartmentalize the W+ space into three sections as done in StyleCLIP, we empirically find that the mid and fine layers contribute most to fashion style editing, while the coarse layers are mainly responsible for the posture of the generated human. Between the two, the mid layers are more sensitive to high-level fashion concepts (e.g., 'formal'), while the fine layers are more sensitive to texture details (e.g., 'blue color').
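As a minimal sketch of how this observation can be exploited, the mapper below predicts offsets only for the mid and fine groups and leaves the coarse, pose-controlling layers untouched; the 18-layer W+ code and the exact coarse/mid/fine indices (borrowed from StyleCLIP's convention) are assumptions.

```python
import torch
import torch.nn as nn

# Assumed split of an 18-layer W+ code, following StyleCLIP's coarse/mid/fine convention.
COARSE, MID, FINE = slice(0, 4), slice(4, 8), slice(8, 18)

class SubMapper(nn.Module):
    """Small MLP predicting a residual offset for one group of W+ layers."""
    def __init__(self, dim=512, depth=4):
        super().__init__()
        layers = []
        for _ in range(depth):
            layers += [nn.Linear(dim, dim), nn.LeakyReLU(0.2)]
        self.net = nn.Sequential(*layers)

    def forward(self, w):
        return self.net(w)

class FashionMapper(nn.Module):
    """Edits only the mid and fine layers; coarse (pose) layers are left untouched."""
    def __init__(self):
        super().__init__()
        self.mid_mapper = SubMapper()    # high-level fashion concepts (e.g., 'formal')
        self.fine_mapper = SubMapper()   # texture and color details (e.g., 'blue color')

    def forward(self, w):                # w: [B, 18, 512]
        delta_w = torch.zeros_like(w)
        delta_w[:, MID] = self.mid_mapper(w[:, MID])
        delta_w[:, FINE] = self.fine_mapper(w[:, FINE])
        return delta_w

# Edited image: G.synthesis(w + FashionMapper().to(w.device)(w))
```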