In contrast to recommendations based on collaborative filtering or
textual product attributes, like the number of CPU cores for a laptop,
visual product recommendation needs to capture more than just
attributes. For instance, it is fairly easy to classify an image as a
“dress”, but it is much more challenging to provide a descriptive
summary of the whole content. Besides the category, there are
sub-categories, colors, textures, patterns, buttons, different materials
and uses, like outdoor or business, and, not to forget, (highly)
subjective attributes like “sexy” or “hot” and tags that are not even
real words, like “superfunkay”. It can best be described as _what you
see is what you get_, but the problem is that each customer “sees”
differently. One customer is interested in the color and only the
color, while another likes a particular print and does not care for
other details. What does that mean for a visual recommendation engine?
The short answer is that we need preference-based learning to keep
customers satisfied. In other words, as soon as a customer expresses
her interest in some concepts, these concepts should be weighted
differently, and the results are unique in the sense that they are
strongly correlated with the taste of this customer. For a different
customer who focuses on other concepts, the results might therefore be
totally different. In short, we first need to recognize those concepts
before we can learn user preferences, and in this blog post, we focus
on the first part. So much for the introduction, now let’s get technical.
It is no secret that we are using deep neural networks to learn a
feature space for images that can be used both for classification,
i.e. categories, and for the embedding of whole concepts: attributes,
but also user-defined tags. That means we can relate one or more
attributes with a set of images by considering the distance in the
learned feature space. How is this done? Let’s assume that we have a
trained neural network “fnn” which can be fed with images “x_images”.
Depending on what we need, we use different layers to get information
from the image. To relate images vs. images, we use the embedding layer
and some pre-defined distance function to sort items by their relevance
against the reference picture. In case of image vs. tags, we use the
embedding layer for the image and a different layer for the embedding
of the tags.
The scoring of tags is then done by ranking labels according to their
affinity score, which can be the cosine similarity of the embedding of
the image and the embeddings of the tags. Ideally, the output should be
>> 0.5 for positive matches and <= 0 for negative ones. In case of a
simple prediction of the category, we use the softmax layer of the
network, which outputs a probability distribution over all categories
with respect to the image. With a well-trained network, the output for
the correct class should be >> 0.9.
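To make this concrete, here is a minimal numpy sketch of both lookups. Everything in it is a placeholder: the embeddings would come from the layers of “fnn”, and the sizes and random vectors merely stand in for real data.

```python
import numpy as np

def cosine_similarity(a, b):
    # affinity score between two embedding vectors
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# placeholders for the embeddings produced by the network's layers
img_emb = np.random.randn(128)             # embedding of the reference picture
catalog_embs = np.random.randn(1000, 128)  # embeddings of all catalog images
tag_embs = np.random.randn(50, 128)        # one embedding per tag (different layer)

# images vs. images: sort the catalog by distance to the reference picture
dists = np.linalg.norm(catalog_embs - img_emb, axis=1)
ranked_items = np.argsort(dists)           # most similar items first

# image vs. tags: rank tags by their affinity score to the image
tag_scores = np.array([cosine_similarity(img_emb, t) for t in tag_embs])
ranked_tags = np.argsort(-tag_scores)      # highest-affinity tags first
```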
To summarize what we got so far:
(1) (Conceptual) similarity of images
(2) Category prediction of images
(3) Annotation of images
User-defined tags are treated like ordinary tags but are weighted
differently because they are noisier and might be highly subjective.
In a simplified way, a concept consists of a set of tags that describe
something, just like objects can be broken into parts. For instance, a
dress consists of color, texture, shape, material, length and many
other, more fine-grained, tags.
If we consider the bottom-up view, we can say that if all tags for
dresses are present in a picture, it should match the concept of a
dress. Of course there are some more constraints, like the spatial
position of the parts, but we omit these details to keep it simple.
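As a toy illustration of this bottom-up view (the tag names and the matching rule are made up for this sketch, not our actual taxonomy):

```python
# purely illustrative: a concept as a set of required tag families
concept_dress = {"category", "color", "texture", "shape", "material", "length"}

def matches_concept(image_tags, concept):
    # bottom-up view: the image matches if every required family
    # is covered by at least one predicted tag, e.g. "color.blue"
    families = {tag.split(".")[0] for tag in image_tags}
    return concept <= families

print(matches_concept(
    ["category.dress", "color.blue", "texture.soft", "shape.aline",
     "material.cotton", "length.midi"], concept_dress))  # True
```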
Now, let’s actually get technical. To build such a network, we have to
decide which network architecture to use: maybe it is inception-based,
residual or VGG-like. Let’s say we made the decision and we start
right after the last convolutional layer. Next, we have to decide how
we want to tackle the existing problems.
For simplicity, we focus on these problems:
(1) Multi-class prediction, which means we have N classes and want to
predict exactly one of them (one-hot encoding)
(2) Multi-label prediction, which means we have M tags and we
want to predict a subset of the tags
Problem (1) is the classic one for which a lot of outstanding
results have been published, with AlexNet as the first candidate for a
1,000-category problem. Depending on how we choose categories, we
might end up with about 100 categories in the domain of fashion. The
solution is straightforward: we use a softmax operation to normalize
the prediction: y_hat = T.exp(input); y_hat /= T.sum(y_hat)
which results in an array of N=100 dimensions, one for each class,
whose values add up to one. With this operation, we do not have to care
about opposing forces, since if one category goes up, others have to go
down because of the normalization. This is exactly what we need for
the classification: raise the probability of one class and reduce it
for all others. Of course, related classes, like different kinds of
dresses, will be correlated, but in the end one class wins, which is
the final prediction: y_category = T.argmax(y_hat).
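Written out as a small, runnable Theano sketch; the only change to the formula above is that we subtract the maximum first, a standard trick that leaves the result unchanged but avoids overflow:

```python
import numpy as np
import theano
import theano.tensor as T

x = T.dvector("input")            # raw scores for the N classes
x_shift = x - T.max(x)            # subtract the max: same softmax, no overflow
y_hat = T.exp(x_shift) / T.sum(T.exp(x_shift))
y_category = T.argmax(y_hat)

predict = theano.function([x], [y_hat, y_category])
probs, winner = predict(np.array([2.0, 0.5, 1.0]))
# probs sums to one; winner is 0, the class with the highest score
```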
Problem (2) is more challenging, since now we have to predict the tags
simultaneously. However, there are more constraints than that,
especially in the fashion domain. For instance, a shirt can be
striped or plaid, but not both. Therefore, we have a hierarchy of
tags with the semantic that two tags can be either:
I) disjoint [cannot occur together]
II) correlated [often occur together]
III) independent [do not influence each other]
What does this mean for a model? Case (III) is the simplest, since the
prediction of the tag does not depend on any other tags. For case (II),
the presence of one tag might encourage the model to predict another
tag, while for case (I) the model knows that if a particular tag is
present, all disjoint tags should _not_ be predicted. Incorporating
all these constraints into a model can be very complex and might
require an extra inference step for each image to get all predictions.
Thus, we focus on a rather simple model that serves as our example.
The simplest approach is to use an independent classifier per tag;
all classifiers are then jointly trained.
The most popular classifier is logistic regression, which also has the
advantage that tags are usually binary and can be weighted from 0…1
without adjusting the classifier. To train such a model, we have to
encode all labels as a list where all entries are 0 by default and
only those representing tags active in the image are 1. In contrast
to the softmax, the output is not normalized: y_tags_hat = T.nnet.sigmoid(input)
which means we get a prediction from 0..1 for each tag. A good
classifier would push non-present tags down to 0 and the other ones up
to 1. This can be efficiently optimized with the binary cross-entropy
function: -T.sum(y_tags * T.log(y_tags_hat) + (1 - y_tags) * T.log(1 - y_tags_hat))
This might look horrible but is pretty straightforward to explain.
Let’s assume we want to predict a tag at position 0.
Then y_tags is marked with 1 and y_tags_hat should be 1.0; since
y_tags_hat comes from the sigmoid, its range is 0 to 1. So, what
happens if the prediction is perfect? 1 * -T.log(1) = 0. And if
not? For 0.5, the loss is 1 * -T.log(0.5) ~ 0.69, and for 0.1: 1 *
-T.log(0.1) ~ 2.3. Thus, the lower the prediction, the higher the
error. The minus is required since the log is negative on (0, 1). The
other case, a tag that should be 0, is similar: y_tags_hat should be 0.0
and y_tags is marked with 0: (1 - 0) * -T.log(1 - 0) = 0, (1 - 0) *
-T.log(1 - 0.5) ~ 0.69. It is easy to see that this is the inverse
of the first part, which pushes the prediction down to zero. Stated
differently, minimizing the cross-entropy function is equivalent to
training a network that correctly classifies image tags.
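A quick numpy check of these numbers, per tag and before the sum over all tags:

```python
import numpy as np

def binary_crossentropy(y, y_hat):
    # the loss from above, written in numpy for a single tag
    return -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# tag should be present (y = 1): the worse the prediction, the higher the loss
print(binary_crossentropy(1, 0.99))  # ~0.01, nearly perfect
print(binary_crossentropy(1, 0.5))   # ~0.69
print(binary_crossentropy(1, 0.1))   # ~2.30

# tag should be absent (y = 0): the mirror image of the first case
print(binary_crossentropy(0, 0.01))  # ~0.01
print(binary_crossentropy(0, 0.5))   # ~0.69
```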
Let’s take a look at what we got so far: the category is predicted with
a softmax classifier and the labels with a sigmoid classifier. The
easiest approach is to train them jointly on a dataset and hope that
both loss functions interact in a (mostly) non-destructive way.
Furthermore, as an initial setup, both classifiers will be fed from
the same hidden layer, but we are free to attach a loss function to
any layer we want. For instance, in some cases it might be beneficial
to attach the softmax to a convolutional layer and the sigmoids to the
last hidden layer. The exact details depend on the problem and the
network architecture, and they are everything but trivial to decide.
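To illustrate this initial setup, here is a minimal Theano sketch with both heads attached to the same hidden layer. The weights, layer sizes and inputs are placeholders (the real hidden activations would come from the conv stack), and biases are omitted for brevity:

```python
import numpy as np
import theano
import theano.tensor as T

rng = np.random.RandomState(0)
n_hidden, n_classes, n_tags = 256, 100, 50

# placeholder inputs; in a real model, h is the shared hidden layer output
h = T.dmatrix("hidden")             # (batch, n_hidden)
y_cat = T.ivector("category")       # one category index per image
y_tags = T.dmatrix("tags")          # (batch, n_tags) with 0/1 entries

# hypothetical weights for the two heads
W_cat = theano.shared(rng.randn(n_hidden, n_classes) * 0.01)
W_tag = theano.shared(rng.randn(n_hidden, n_tags) * 0.01)

# softmax head for the category, sigmoid head for the tags
p_cat = T.nnet.softmax(T.dot(h, W_cat))
p_tags = T.nnet.sigmoid(T.dot(h, W_tag))

# joint loss: categorical cross-entropy plus binary cross-entropy
loss = T.nnet.categorical_crossentropy(p_cat, y_cat).mean() \
     + T.nnet.binary_crossentropy(p_tags, y_tags).sum(axis=1).mean()
```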
After we have decided what loss and architecture to use, we can train
the network. This is done by choosing some optimizer, like Adam or
RMSprop, which involves selecting some hyper-parameters like the
learning rate. But now we can really start the training. Depending on
the size of the dataset and the number of labels/categories, the
training can take a while, but finally we get a model to make predictions.
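Continuing the sketch above, a single compiled training step could look like this; plain gradient descent stands in for Adam or RMSprop to keep the example short:

```python
# pick a learning rate and derive the parameter updates from the joint loss
learning_rate = 0.001
grads = T.grad(loss, [W_cat, W_tag])
updates = [(p, p - learning_rate * g) for p, g in zip([W_cat, W_tag], grads)]
train_step = theano.function([h, y_cat, y_tags], loss, updates=updates)

# one toy batch of 8 "hidden activations" with random labels
batch_h = rng.randn(8, n_hidden)
batch_cat = rng.randint(0, n_classes, size=8).astype("int32")
batch_tags = (rng.rand(8, n_tags) < 0.1).astype("float64")
print(train_step(batch_h, batch_cat, batch_tags))  # loss shrinks over iterations
```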
To conceptualize an image, we feed the raw RGB image data to the
network and get all the predictions in return. Depending on the
encoded taxonomy, we might get something like this:
color.blue = 1.0
domain.fashion = 1.0
category.dress = 1.0
pattern.animalprint = 1.0
style.midi = 1.0
style.tight = 1.0
attr.bow = 1.0
attr.quilling = 1.0
which is a reasonable summary of the image, if it is correct ;-), and with
this information, we can form a mental picture of the image. Surely,
some details are missing, but we could easily add more information to
the taxonomy to encode other details.
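Mechanically, such a summary is just the sigmoid outputs above a threshold; a toy sketch with made-up tag names and scores:

```python
import numpy as np

# hypothetical tag names and sigmoid outputs for one image
tag_names = ["color.blue", "category.dress", "pattern.animalprint",
             "style.midi", "attr.bow", "color.red"]
y_tags_hat = np.array([0.98, 0.99, 0.95, 0.91, 0.88, 0.03])

threshold = 0.5  # a common default; in practice tuned per tag
for name, score in zip(tag_names, y_tags_hat):
    if score > threshold:
        print(f"{name} = {score:.2f}")  # only active tags survive
```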
However, it is no secret that the real world is not as simple as that.
As we mentioned before, tags can be related, correlated or disjoint,
and some can be easily confused. In other words, standard loss
functions will hardly get you proper results, and usually a great deal
of work is required to encode the constraints of the tag space into the
loss function. For that reason, simple classifiers often serve as a
baseline and/or are combined with other methods to boost the
accuracy. In contrast to plain classification, there is no standard way
to do multi-label classification, and depending on the taxonomy,
engineering a loss function for tags can be very challenging.
Because our Deep Learning Lab is very interested in maximizing customer
satisfaction, we invested a considerable amount of time in building a
taxonomy. We then started a smart annotation process for a big image
dataset to semi-automatically tag tens of thousands of images. In
parallel, we conducted a lot of research to find and optimize a loss
function that fits our need to conceptualize images and to allow a
personalization of the products. The results can soon be seen in our
new, deep learning-powered V4, the successor of our old recommender
engine. Stay tuned.