This patent describes a novel system and method for image inpainting, in which missing or undesirable regions of an input image are filled in using guidance from a separate “guide image.” The core innovation lies in the use of machine learning models, particularly a style-based generative adversarial network (StyleGAN), to combine visual features from both the input image and the guide image in a deep latent space. This approach aims to generate inpainted content that is consistent with the remaining parts of the input image while also incorporating desirable visual characteristics from the guide image, offering greater control and improved quality, especially for large or complex missing regions.
Most Important Ideas
Problem Addressed: Traditional image inpainting, which relies solely on the remaining parts of an incomplete image, can struggle with large missing regions and complex patterns, and it offers little control over the style and content of the generated pixels. Existing methods may also produce undesirable artifacts.
Proposed Solution: Example-Guided Inpainting: The invention introduces an inpainting system that utilizes a “guide image” to provide new and controllable image content for filling in the missing region of an “input image.”
“An inpainting system is configured to inpaint a missing region of an input image based on a guide image. The input image and the guide image may be different images, and the guide image may thus provide new and controllable image content for guiding the inpainting of the missing region.”
System Architecture: The inpainting system comprises the following components (a simplified sketch of the pipeline follows the list):
- Encoders: One or more encoders to generate latent representations of both the input and guide images.
- Latent Representation Operator: A component to combine these latent representations, for instance, through concatenation or cross-attention mechanisms.
- StyleGAN Model: A StyleGAN model generates an intermediate output image containing the inpainted content from the combined latent representation. The StyleGAN is trained to blend visual features from both the input and guide images in its deep latent space.
- Output Image Generator: This component replaces the missing region of the original input image with the inpainted content from the StyleGAN’s output, often guided by an “inpainting mask.”
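The patent does not prescribe concrete implementations for these modules, but the data flow can be sketched in PyTorch as follows. This is a minimal sketch under the assumptions that the encoders return fixed-size latent vectors, that the latent representation operator is simple concatenation followed by a linear projection, and that compositing uses a binary inpainting mask (1 inside the missing region); all class and argument names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class GuidedInpainter(nn.Module):
    """Illustrative pipeline: encode both images, combine latents, synthesize, composite."""

    def __init__(self, input_encoder, guide_encoder, stylegan_synthesis, latent_dim=512):
        super().__init__()
        self.input_encoder = input_encoder      # maps input image -> latent code
        self.guide_encoder = guide_encoder      # maps guide image -> latent code
        self.synthesis = stylegan_synthesis     # StyleGAN synthesis network
        # Latent representation operator: concatenate, then project back to latent_dim.
        self.combine = nn.Linear(2 * latent_dim, latent_dim)

    def forward(self, input_image, guide_image, mask):
        # mask: 1 inside the missing region, 0 elsewhere.
        z_in = self.input_encoder(input_image * (1 - mask))   # encode only known pixels
        z_guide = self.guide_encoder(guide_image)
        w = self.combine(torch.cat([z_in, z_guide], dim=-1))
        intermediate = self.synthesis(w)                       # intermediate output image
        # Output image generator: keep known pixels, take inpainted pixels from synthesis.
        return input_image * (1 - mask) + intermediate * mask
```

In the patent’s alternative embodiments, the plain concatenation step above would be replaced by the similarity-gated or cross-attention mechanisms described later in this summary.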
Advantages of the Approach:
- Control: The guide image allows users to control the visual appearance (e.g., color palette, geometric features, style) of the inpainted region.
- Improved Quality for Large Missing Regions: The guide image provides additional information beyond the input image, enabling better inpainting for substantial missing areas.
- Feature Blending in Latent Space: Combining features in a deep latent space (rather than pixel space) enhances the coherence and plausibility of the inpainted content.
StyleGAN Architectures Considered: The patent discusses using different versions and architectural aspects of StyleGAN (e.g., StyleGAN1, StyleGAN2, StyleGAN3), highlighting components like mapping networks, synthesis networks with style blocks, affine transformations, noise injection, adaptive instance normalization, weight demodulation, and alias-free architectures.
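As a concrete illustration of one of these components, the following is a minimal sketch of the modulation and weight-demodulation step used in StyleGAN2-style style blocks. It is not code from the patent, and it shows the single-sample case rather than the batched grouped convolution used in practice.

```python
import torch
import torch.nn.functional as F

def modulated_conv2d(x, weight, style, eps=1e-8, demodulate=True):
    """StyleGAN2-style modulated convolution (single-sample sketch).

    x:      input feature map, shape (1, in_ch, H, W)
    weight: convolution weight, shape (out_ch, in_ch, kh, kw)
    style:  per-input-channel scales from the affine transform, shape (in_ch,)
    """
    # Modulation: scale each input channel of the weights by the style vector.
    w = weight * style.view(1, -1, 1, 1)
    if demodulate:
        # Weight demodulation: rescale each output filter to unit L2 norm.
        d = torch.rsqrt((w ** 2).sum(dim=(1, 2, 3), keepdim=True) + eps)
        w = w * d
    return F.conv2d(x, w, padding=weight.shape[-1] // 2)
```

Weight demodulation folds the normalization that the original StyleGAN performed with adaptive instance normalization directly into the convolution weights.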
Specific Architectural Element: Similarity Calculation: One embodiment introduces a “similarity calculator” to explicitly quantify the resemblance between the input and guide images. This metric can then be used to modulate how the features of the guide image are incorporated into the inpainted region, ensuring better contextual consistency.
“By quantifying the similarity between different aspects of guide image 304 and input image 302, similarity metric 442 may allow StyleGAN model 330B to incorporate into intermediate output image 332 aspects of guide image 304 that are similar to aspects of input image 302.”
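The excerpt does not fix a specific similarity metric. A simple sketch, assuming cosine similarity between the two latent codes is used to gate how strongly guide features are injected, might look like the following; the function and variable names are hypothetical.

```python
import torch
import torch.nn.functional as F

def similarity_gated_guide_latent(input_latent, guide_latent):
    """Hypothetical similarity calculator: scale the guide latent by how similar
    it is to the input latent, so dissimilar guide content contributes less.

    input_latent, guide_latent: shape (batch, latent_dim)
    """
    # Similarity metric in [0, 1] (cosine similarity remapped from [-1, 1]).
    sim = F.cosine_similarity(input_latent, guide_latent, dim=-1, eps=1e-8)
    sim = (sim + 1) / 2
    # Modulate how strongly guide features are incorporated downstream.
    return guide_latent * sim.unsqueeze(-1), sim
```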
Specific Architectural Element: Cross-Attention: Another embodiment utilizes cross-attention mechanisms between the intermediate states of the encoder (processing the guide image) and the intermediate outputs of the StyleGAN’s synthesis network. This allows the model to find and utilize similar patches or features from the guide image in a semantically relevant way for inpainting.
“Cross-attention calculator 464 may thus search guide image 304 for patches and/or features that are similar to input image 302, thereby allowing similar, rather than dissimilar patches and/or features of guide image 304 to be used for generating the inpainted image content.”
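A minimal sketch of such a cross-attention step, assuming flattened feature maps from the synthesis network act as queries while the guide encoder’s intermediate features supply keys and values; the module and parameter names are illustrative, not taken from the patent.

```python
import torch
import torch.nn as nn

class GuideCrossAttention(nn.Module):
    """Illustrative cross-attention: synthesis-network features query the guide
    encoder's features, so the most similar guide patches are emphasized."""

    def __init__(self, synth_dim, guide_dim, attn_dim=256):
        super().__init__()
        self.to_q = nn.Linear(synth_dim, attn_dim)
        self.to_k = nn.Linear(guide_dim, attn_dim)
        self.to_v = nn.Linear(guide_dim, synth_dim)

    def forward(self, synth_feats, guide_feats):
        # synth_feats: (batch, n_synth_tokens, synth_dim), e.g. a flattened feature map
        # guide_feats: (batch, n_guide_tokens, guide_dim) from the guide-image encoder
        q = self.to_q(synth_feats)
        k = self.to_k(guide_feats)
        v = self.to_v(guide_feats)
        attn = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
        # Each synthesis location gathers features from the guide patches it attends to.
        return synth_feats + attn @ v
```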
Training Process: The system is trained using a combination of loss functions:
- Perceptual Loss: Measures the visual similarity between the inpainted content and the guide image (and potentially the original input image) in the feature space of a perceptual loss model, i.e., by comparing deep features rather than raw pixels.
- Adversarial Loss: Employed through a discriminator network to ensure the inpainted content looks realistic and natural by distinguishing between generated and real images.
The training process involves adjusting the parameters of the inpainting system (encoders, StyleGAN) to minimize these loss values.
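A sketch of how these losses might be combined on the generator side of training, assuming a feature-extraction network for the perceptual loss and a discriminator that returns realness logits; the loss weights and the non-saturating adversarial form are illustrative choices, not values from the patent.

```python
import torch
import torch.nn.functional as F

def generator_loss(output_image, guide_image, input_image, mask,
                   perceptual_net, discriminator,
                   w_perc=1.0, w_adv=0.1):
    """Sketch of a combined training objective for the inpainting system."""
    # Perceptual loss: compare deep features of the inpainted output against the
    # guide image, plus a reconstruction term on the known parts of the input image.
    perc = F.l1_loss(perceptual_net(output_image), perceptual_net(guide_image))
    perc = perc + F.l1_loss(output_image * (1 - mask), input_image * (1 - mask))
    # Adversarial loss: the discriminator should score the output as "real"
    # (non-saturating GAN loss shown here).
    adv = F.softplus(-discriminator(output_image)).mean()
    return w_perc * perc + w_adv * adv
```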
Example Use Case: The document provides an example where a region of a building with an “Industrial” architectural style is inpainted using a guide image of a building with a “Gothic” style. The output image shows the inpainted region now resembling the Gothic style while remaining coherent with the rest of the scene.