However, learning S²-GAN is still not an easy task. To tackle this challenge,
we first learn the Style-GAN and the Structure-GAN independently. We use the
NYUv2 RGBD dataset [14], with more than 200K frames, for learning the initial
networks. We train the Structure-GAN using the ground truth surface normals
from Kinect. Because the perspective distortion of texture is more directly
related to normals than to depth, we use surface normals to represent image
structure in this paper. In parallel, we learn our Style-GAN, which is
conditioned on the ground truth surface normals. While training the Style-GAN,
we have two
loss functions: the first takes an image and the surface normals and tries to
predict whether they correspond to a real scene. However, this loss alone does
not enforce explicit pixel-based constraints for aligning the generated images
with the input surface normals. To enforce pixel-wise constraints, we make the
following assumption: if the generated image is realistic enough, we should be
able to reconstruct or predict its 3D structure from it. We achieve this by
adding another discriminator network. More specifically, the generated image is
not only forwarded to the discriminator network of the GAN but also used as
input to the trained surface normal predictor network. Once we have trained
an initial Style-GAN and Structure-GAN, we combine them and perform joint
end-to-end learning, in which images are generated from ẑ and z̃ and fed to
the discriminators for the real/fake task.
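To make this pipeline concrete, below is a minimal sketch of such a joint
generator update in PyTorch. It is not the authors' implementation: the module
names, the MSE form of the pixel-wise loss, and the weight lam are illustrative
assumptions.

import torch
import torch.nn.functional as F

def joint_generator_step(structure_G, style_G, style_D, fcn,
                         z_hat, z_tilde, lam=1.0):
    """One generator update: the generated image must fool the image
    discriminator and remain consistent with its input surface normals."""
    normals = structure_G(z_hat)        # z_hat -> surface-normal map
    image = style_G(normals, z_tilde)   # (normals, z_tilde) -> RGB image

    # Adversarial term: the conditional discriminator should score the
    # generated (image, normals) pair as real.
    d_logits = style_D(image, normals)
    adv_loss = F.binary_cross_entropy_with_logits(
        d_logits, torch.ones_like(d_logits))

    # Pixel-wise term: a pretrained surface normal predictor (kept fixed
    # by excluding its parameters from the optimizer) should recover the
    # input normals from the generated image. The MSE loss here is an
    # illustrative choice, not necessarily the authors'.
    pred_normals = fcn(image)
    pixel_loss = F.mse_loss(pred_normals, normals)

    return adv_loss + lam * pixel_loss

In practice the discriminators would be updated in alternation with this
generator step, as in standard GAN training.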
2 Related Work
Unsupervised learning of visual representations is one of the most challenging
problems in computer vision. There are two primary approaches to unsupervised
learning. The first is the discriminative approach, where we use auxiliary
tasks for which ground truth can be generated without manual labeling. Examples
of such auxiliary tasks include predicting the relative location of two patches
[2], ego-motion in videos [15,16], and physical signals [17–19].
A more common approach to unsupervised learning is to use a generative
framework. Two types of generative frameworks have been used in the past.
Non-parametric approaches perform matching of an image or patch with the
database for tasks such as texture synthesis [20] or super-resolution [21]. In
this paper, we are interested in developing a parametric model of images. One
common approach is to learn a low-dimensional representation which can be used
to reconstruct an image. Some examples include the deep auto-encoder [22,23] or
Restricted Boltzmann machines (RBMs) [24–28]. However, in most of the above
scenarios it is hard to generate new images since sampling in latent space is not
an easy task. The recently proposed variational auto-encoder (VAE) [10,11]
tackles this problem by generating images with a variational sampling approach.
However, these approaches are restricted to simple datasets such as MNIST. To
generate interpretable images with richer information, the VAE has been
extended to be conditioned on captions [29] and graphics code [30]. Besides
RBMs and auto-encoders, there are also many novel generative models in the
recent literature [31–34].
For example, Dosovitskiy et al. [31] proposed using CNNs to generate chairs.