Artificial intelligence Deep Learning Lab Picalike Products

Visual Concept Learning

In contrast to recommendations based on collaborative filtering or
textual product attributes, like the number of CPU cores for a laptop,
visual product recommendation needs to capture more than just the
attributes. For instance, it is pretty easy to classify an image as a
“dress”, but it is much more challenging to provide a descriptive
summary of the whole content. For example, besides the category, there
are sub-categories, colors, textures, patterns, buttons, different materials and
uses, like outdoor or business, and not to forget (highly) subjective
attributes like “sexy” or “hot” and tags that are not even real words,
like “superfunkay”. It can be best described as _what you see is what
you get_, but the problem is that each customer “sees” differently. One
customer is interested in the color and only in the color, while another
likes a particular print and does not care about other details. What
does that mean for a visual recommendation engine?

The short answer is that we need preference-based learning for
satisfied customers. In other words, as soon as a customer has expressed
her interest in some concepts, these concepts should be weighted
differently so that the results are unique in the sense that they are
strongly correlated with the taste of this customer. For a different
customer, the results might therefore be totally different if she
focuses on other concepts. This means we first need to recognize those
concepts before we can learn the user preferences, and in this post we focus on the first part.
So much for the introduction, now let’s get technical.

It is no secret that we are using deep neural networks to learn a
feature space for images that can be used both for classification
(categories) and for the embedding of whole concepts, attributes and
user-defined tags. That means we can relate one or more attributes
to a set of images by considering the distance in the learned
feature space. How is this done? Let’s assume that we have a trained
neural network “fnn” which can be fed with images “x_images”. Depending
on what we need, we use different layers to get information from
the image. To relate images to images, we use the embedding layer and
some pre-defined distance function to sort items by their
relevance to the reference picture. In the case of images vs.
tags, we use the embedding layer for the image and a different
layer for the embedding of the tags.

The scoring of tags is then done by ranking labels according to their affinity score,
which can be the cosine similarity of the embedding of the image and the
embedding of the tag. Ideally, the output should be >> 0.5 for positive matches and <= 0 for negative ones. In the case of a simple prediction of the category, we use the softmax layer of the network, which outputs a probability distribution over all categories for the image. With a well-trained network, the output for the correct class should be >> 0.9.
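
To make the ranking step more concrete, here is a minimal sketch in plain Python. It assumes that the affinity score is the cosine similarity; the tag names and the 3-dimensional embeddings are made up for illustration and are not taken from our network.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_tags(image_embedding, tag_embeddings):
    """Return (tag, score) pairs sorted by descending affinity."""
    scores = {tag: cosine_similarity(image_embedding, vec)
              for tag, vec in tag_embeddings.items()}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Made-up embeddings, for illustration only.
image_embedding = [1.0, 0.0, 1.0]
tag_embeddings = {
    "pattern.striped": [1.0, 0.0, 1.0],   # same direction as the image
    "pattern.dotted": [0.0, 1.0, 0.0],    # orthogonal to the image
    "style.tight": [-1.0, 0.0, -1.0],     # opposite direction
}
ranking = rank_tags(image_embedding, tag_embeddings)
```

The tag that points in the same direction as the image ends up first with a score close to 1, the orthogonal one gets 0 and the opposite one comes last with a score close to -1.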

To summarize what we got so far:
(1) (Conceptual) similarity of images
(2) Category prediction of images
(3) Annotation of images

User-defined tags are treated like regular tags but are weighted
differently because they are noisier and might be highly subjective.

In a simplified way, a concept consists of a set of tags that
describe something, just like objects can be broken into parts. For
instance, a dress consists of color, texture, shape, material, length
and many other, more fine-grained, tags.
If we consider the bottom-up view, we can say that if all tags
for dresses are present in a picture, it should match the concept of
the dress. Of course there are some more constraints, like the spatial
position of the parts, but we omit these details to keep it simple.

Now, let’s get actually technical. To build such a network, we have to
decide what network architecture to use. Maybe it is inception-based,
residual or VGG-like. Let’s say we made the decision and we start
right after the last convolutional layer. Next, we have to decide how
we want to tackle the existing problems.

For simplicity, we focus on these problems:
(1) Multi-class prediction, which means we have N classes and want to
predict exactly one of them (one-hot encoding)
(2) Multi-label prediction, which means we have M tags and we
want to predict a subset of the tags

Problem (1) is the most classic one for which a lot of outstanding
results have been published, with AlexNet as the first candidate for a
1,000 category problem. Depending on how we choose categories, we
might end up with about 100 categories in the domain of fashion. The solution
is straightforward: we use a softmax operation to normalize the
prediction: y_hat = T.exp(input); y_hat /= T.sum(y_hat)
which results in an array of N=100 dimensions, one for each class, whose
values add up to one. With this operation, we do not have to care
about opposing forces, since if one category goes up, the others have to go
down because of the normalization. This is exactly what we need for
classification: raise the probability of one class and reduce it
for all others. Of course, related classes, like different kinds of
dresses, will be correlated, but in the end one class wins, which is
the final prediction: y_category = T.argmax(y_hat).
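
The same two steps can be tried out without Theano. The following sketch uses only the Python standard library; the three scores are made up, and subtracting the maximum before the exponential is a common numerical safeguard that we add here, it is not part of the formula above.

```python
import math

def softmax(logits):
    """Normalize raw scores into a probability distribution."""
    shift = max(logits)  # subtracting the max avoids overflow in exp
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # made-up scores for three categories
y_hat = softmax(logits)
# argmax: index of the category with the highest probability
y_category = max(range(len(y_hat)), key=lambda i: y_hat[i])
```

The values in y_hat add up to one, and the category with the largest raw score also gets the largest probability.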

Problem (2) is more challenging, since now we have to predict the tags
simultaneously. However, there are more constraints than that,
especially for the fashion domain. For instance, a shirt can be
striped or plaid, but not both. Therefore, we have a hierarchy of
tags with the semantics that two tags can be either:
I) disjoint [cannot occur together]
II) correlated [often occur together]
III) unrelated
What does it mean for a model? Case (III) is the simplest, since the
prediction of the tag does not depend on any other tags. For case (II)
the presence of one tag might encourage the model to predict another
tag, while for case (I) the model knows that if a particular tag is
present, all disjoint tags should _not_ be predicted. Incorporating
all these constraints into a model can be very complex and might
require an extra inference step for each image to get all predictions.
Thus, we focus on a rather simple model that serves as our example.
The simplest approach is to use one independent classifier per tag
and to train all of them jointly.

The most popular classifier is logistic regression, which also has the
advantage that tags are usually binary but can be weighted from 0 to 1
without adjusting the classifier. To train such a model, we have to
encode all labels as a list where all entries are 0 by default and
only those that represent active tags in the image are 1. In contrast
to the softmax, the output is not normalized: y_tags_hat = T.nnet.sigmoid(input)
which means we get a prediction from 0 to 1 for each tag. A good
classifier would push down non-present tags to 0 and the other ones to
1. This can be efficiently optimized with the binary cross-entropy
function: -T.sum(y_tags * T.log(y_tags_hat) + (1 - y_tags) * T.log(1 - y_tags_hat))

This might look horrible but is pretty straightforward to explain.
Let’s assume we want to predict a tag at position 0.

Therefore, y_tags_hat[0] should be 1.0 and y_tags[0] is marked with 1. Since
y_tags_hat comes from the sigmoid, its range is from 0 to 1. So, what
happens if the prediction is perfect? 1 * -T.log(1) = 0. And what if
not? For 0.5, the loss is 1 * -T.log(0.5) ~ 0.69 and for 0.1 it is 1 *
-T.log(0.1) ~ 2.3. Thus, the lower the prediction, the higher the
error. The minus is required since the log is never positive on (0, 1]. The
other case, a tag that should be 0, is similar: y_tags_hat[1] should be 0.0
and y_tags[1] is marked with 0: (1 - 0) * -T.log(1 - 0) = 0 for a perfect
prediction and (1 - 0) * -T.log(1 - 0.5) ~ 0.69 for a prediction of 0.5.
It can be easily seen that this is the mirror image
of the first part and pushes the prediction down to zero. Stated differently,
minimizing the cross-entropy function drives the network
towards correctly classifying the image tags.
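
The numbers from this walk-through can be reproduced with a few lines of plain Python. This is only a sketch of the formula for a single image; it assumes that all predictions are strictly between 0 and 1, because the logarithm is undefined at exactly 0.

```python
import math

def binary_cross_entropy(y_tags, y_tags_hat):
    """Summed binary cross-entropy over all tags of one image."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_tags, y_tags_hat))

loss_half = binary_cross_entropy([1], [0.5])      # active tag predicted at 0.5
loss_low = binary_cross_entropy([1], [0.1])       # active tag predicted at 0.1
loss_inactive = binary_cross_entropy([0], [0.5])  # inactive tag predicted at 0.5
```

Here loss_half is about 0.69 and loss_low about 2.3, and the inactive tag predicted at 0.5 is penalized exactly like the active tag predicted at 0.5.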

Let’s take a look at what we have so far: the category is predicted with
a softmax classifier and the labels with sigmoid classifiers. The
easiest approach is to train them jointly on a dataset and hope that
both loss functions interact in a (mostly) non-destructive way.
Furthermore, as an initial setup, both classifiers will be fed from
the same hidden layer, but we are free to attach a loss function to
any layer we want. For instance, in some cases it might be beneficial to
attach the softmax to a convolutional layer and the sigmoids to the
last hidden layer. The exact details depend on the problem and the
network architecture, and they are anything but trivial to decide.

After we decided what loss and architecture to use, we can train the
network. This is done by choosing some optimizer, like Adam or RMSprop
which involves selecting some hyper-parameters like the learning rate.
But now, we can really start the training. Depending on the size of
the data set and the number of labels/categories, the training can
take a while, but finally we get a model to make predictions.

To conceptualize an image, we feed the raw RGB image data to the
network and get all the predictions in return. Depending on the
encoded taxonomy, we might get something like this:
category.dress = 1.0
pattern.animalprint = 1.0
style.midi = 1.0
style.tight = 1.0
attr.bow = 1.0
attr.quilling = 1.0

which is a reasonable summary of the image, if it is correct ;-), and with
this information, we can form a mental picture of the image. Surely,
some details are missing, but we could easily add more information to
the taxonomy to encode other details.
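
How the raw sigmoid outputs turn into such a list of tags is not spelled out above. One simple possibility, shown here as a sketch, is to keep every tag whose score reaches a fixed threshold. The threshold of 0.5 and the scores are assumptions for illustration only.

```python
def active_tags(tag_names, y_tags_hat, threshold=0.5):
    """Keep the names of all tags whose score reaches the threshold."""
    return [name for name, score in zip(tag_names, y_tags_hat)
            if score >= threshold]

tag_names = ["pattern.animalprint", "style.midi", "style.tight", "attr.bow"]
y_tags_hat = [0.97, 0.88, 0.12, 0.64]   # made-up sigmoid outputs
summary = active_tags(tag_names, y_tags_hat)
```

With these scores, only style.tight falls below the threshold and is dropped from the summary.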

However, it is no secret that the real world is not as simple as that.
As we mentioned before, tags can be unrelated, correlated or disjoint
and some can be easily confused. In other words, standard loss
functions will hardly get you proper results and usually a great deal
of work is required to encode the constraints of the tag space into the
loss function. For that reason, simple classifiers often serve as a
baseline and/or are combined with other methods to boost the
accuracy. In contrast to single-label classification, there is no standard way
to do multi-label classification and, depending on the taxonomy,
engineering a loss function for tags can be very challenging.

Because our Deep Learning Lab is very interested in maximizing customer
satisfaction, we invested a considerable amount of time in building a taxonomy.
We then started a smart annotation process for a big image data set to semi-automatically tag tens of thousands of images. In parallel, we conducted a lot of research
to find and optimize a loss function that fits our need to conceptualize images and allows a personalization of the products. The results can soon be seen in our new, deep learning-powered V4, the successor of the old recommender engine. Stay tuned.

Computer Vision Picalike Products

Computer Vision for Outfit Recommendation

There are traditionally two ways of creating an outfit recommendation: manually, or automatically in a data-driven way. The manual way uses stylists or fashionistas who pick products and create high-quality recommendations based on their expertise and know-how. The data-driven approach uses data from various sources to generate outfits.

Both have pros and cons:

Manually picked outfits yield high-quality recommendations, which generally perform better. However, this approach is time-consuming and not scalable, and when the chosen products are sold out or sizes are out of stock, the outfit starts to fall apart or vanishes.

Data-driven outfits may use shopping cart data, behavioral data and/or any other statistics combined with text mining to create outfits. There are a few flaws, as it is often not clear whether the statistics reflect the intention that the items of an outfit work in concert. If someone buys a jacket, shoes and a bag, they might be bought for three different occasions or even for different people, without the aim of being worn together. The jacket might be for daily use, the shoes for the weekend party and the bag for an upcoming trip or as a present for a friend.

However, there are several ways to reduce these drawbacks by using rules, e.g. combining products of certain categories or searching for products with matching descriptions, brands, etc.

But there is a third way, one that allows easy creation and scalable use of high-quality outfits. By combining hand-picked, high-quality outfits with computer vision technology, we can create outfits in a way that is scalable, easy and safeguards your investment of time and money: just as with a hand-picked outfit, someone selects some products and creates a bundle.


Scale your outfits

Using our visual technology system, we can find the best alternatives if a product is out of stock or, which is even more interesting, create new outfits that follow the same fashion rules as the reference outfit. We can simply scale several outfits, extend the style and cover your entire inventory with a few outfits. It basically combines the best of both worlds, as it extends the initial fashion intelligence and transfers its reach to the entire inventory. Here is our Visualytics Suite Manual. Check the last pages for some inspiration on how to use our looks API.

Check this video to see how, with only a few clicks, you can create a long-lasting, scalable, transferable and high-quality look that can have a high impact on your conversion.

Artificial intelligence Deep Learning Lab Programming

Multi-Task Learning: Learning Beyond Classification

This year was a year of new state-of-the-art results for Deep Learning, especially in the domain of images. The error of models on the ImageNet dataset went down continually and, even if some of those models are not easily applicable in practice, there is still a clear trend. Despite the fact that a recent trend combines ideas from NLP and image data, which can be seen as multi-modal learning, most of the reported results were about supervised classification. To be clear, there is nothing wrong with predicting the correct label, but depending on the label diversity of the underlying dataset, such a model just learns (non-)linear boundaries to separate the labels in some high-dimensional space without preserving the intra-class variance. Therefore, there is no guarantee that such models are good at providing general-purpose features for other tasks, as needed for transfer learning.

This can be easily demonstrated with an example from the fashion domain: To correctly predict shirt vs. pants, a model just has to learn enough discriminative features to drive the loss to zero and the work is done. That means with a standard loss function, the negative log-likelihood for instance, it is fairly reasonable to assume that even the learned convolutional filters are not very sophisticated, because the decision boundary is very simple. Of course this also depends on the kind of images that are used for learning, but the problem remains pretty simple.

However, in the case of fine-grained labels, like “striped shirt” or “dotted shirt”, a model not only has to learn to separate high-level concepts, but also needs to understand low-level details to solve the task. This is one reason why pre-trained ImageNet models serve as very good feature extractors: the labels combine very different concepts, but also categories that are very similar, like different breeds of dogs. Therefore, those models are forced to learn a broad range of concepts to correctly classify unseen images, which is the reason why the extracted features also work so well for many other problems.

At the very beginning, before we trained our own models, we also analyzed and tested the extracted features of a pre-trained ImageNet model, and the tests confirmed that classifiers trained with those features already deliver a very good performance. Nevertheless, a lot of fine-grained concepts that are present in the fashion domain are only weakly developed in ImageNet models, which is why we began to work on our own ‘FashionNet’ that is supposed to serve as an end-to-end feature extractor for different tasks. A loss function that is solely based on low-entropy labels, as in classification, does not guarantee features that can be used for a wide range of applications, so a different loss function is required. For example, training a specialized model that can predict if a shirt is striped requires different features than one that only needs to predict if the category is a shirt or not.

For that and other reasons, our journey to find a good FashionNet has not come to an end yet, but we want to give at least some examples of how such a model could be trained. The most basic approach is multi-task learning, which means that instead of using a single loss function, several objective functions are combined, for instance for categories, super categories, an embedding of textual attributes and so forth. The approach forces the model to be good at recognizing simple categories, but also to focus on other aspects and low-level details in parallel, which strongly influences the kind of features such a model learns. A different choice would be to use a supervised objective, but to add a regularizer that ensures that a certain criterion is met, like preserving the structure of images in a category with an additional embedding.
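
In its simplest form, combining several objectives just means adding up the individual losses with a weight per task. The following sketch assumes such a weighted sum; the task names, loss values and weights are made up and do not describe our actual FashionNet setup.

```python
def multi_task_loss(task_losses, task_weights):
    """Combine several per-task losses into one weighted objective."""
    return sum(task_weights[name] * loss
               for name, loss in task_losses.items())

# Made-up loss values for one mini-batch and hand-picked weights.
task_losses = {"category": 0.8, "super_category": 0.3, "attributes": 1.5}
task_weights = {"category": 1.0, "super_category": 0.5, "attributes": 0.2}
total_loss = multi_task_loss(task_losses, task_weights)
```

The weights decide how strongly each task pulls on the shared features, which is why they usually have to be tuned like any other hyper-parameter.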

In a nutshell, if the problem is simple, a supervised (convolutional) network with a softmax often suffices, but for more sophisticated problems, or in the case of a generic feature extractor, a model with more capacity is required that is forced to learn features that can be used in a broader context. However, since labels, regardless of how fine-grained they are, carry very little information, we continue to pursue the track of unsupervised learning to break image data into concepts, rather than describing it with simple labels.

Artificial intelligence Deep Learning Lab

ReLU Was Yesterday, Tomorrow Comes ELU

Modern deep neural networks hardly use sigmoid or tanh units, because they saturate and thus the gradient might vanish in the case of many layers. The answer to the problem was ReLU, a rectified linear unit, that is fast and does not saturate for positive inputs: max(x, 0). The results of models with these units were pretty impressive and they very quickly became the standard.

However, even if it is not possible for ReLUs to saturate, those units can turn “dead”, which means they are never activated because the pre-activation value is always negative. For such units, no gradient can flow through the net. Furthermore, since the output of a ReLU unit is always non-negative, its mean activation is positive whenever the unit is active at all.

The Problem

Without going into the details (see [1] for a thorough analysis), a positive mean introduces a bias for the next layer which can slow down the learning. As a solution, Clevert, Unterthiner and Hochreiter [1] introduced a new type of neuron called “Exponential Linear Unit”, ELU for short:

f(x) = x * (x > 0) + (x < 0) * (alpha * (T.exp(x) - 1))

In plain English, it acts like a ReLU unit if x is positive, but for negative values it is a function that is bounded from below by “-alpha”, i.e. by “-1” for alpha=1. This behavior helps to push the mean activation of neurons closer to zero, which is beneficial for learning, and it helps to learn representations that are more robust to noise [1, Chapter 3].
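
The effect on the mean activation can be seen in a tiny experiment. The sketch below re-implements both units in plain Python and applies them to a handful of made-up pre-activation values; it illustrates the definition and is not the code of our networks.

```python
import math

def relu(x):
    """Rectified linear unit: max(x, 0)."""
    return max(x, 0.0)

def elu(x, alpha=1.0):
    """Identity for positive inputs, alpha * (exp(x) - 1) otherwise."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

inputs = [-3.0, -1.0, 0.0, 1.0, 3.0]   # made-up pre-activation values
mean_relu = sum(relu(x) for x in inputs) / len(inputs)
mean_elu = sum(elu(x) for x in inputs) / len(inputs)
```

For these symmetric inputs the mean ReLU activation is 0.8, while the negative ELU outputs pull the mean down to roughly 0.48, which is closer to zero.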

From Theory To Practice

To compare both types of neurons, we use a standard ConvNet for classification with a fixed initialization. The only difference is whether we used ReLU or ELU units in all fully-connected layers. Each network was trained for 10 epochs and we used a held-out validation set to compare the scores (accuracy in %):

Epoch  ELU     ReLU
1      28.770  25.430
2      44.790  40.420
3      56.660  47.030
4      63.610  61.280
5      67.080  62.420
6      69.030  64.590
7      76.680  72.770
8      75.260  73.970
9      77.270  76.680
10     77.400  76.390

Plot of the score for the first 10 epochs


The claim of the paper, that the learning is faster, can be confirmed by the results of the experiment. For instance, at epoch 3, the ELU net has an accuracy of 56.66%, while the ReLU net only has one of 47.03%. The ReLU net catches up later but during all our tests, the ELU net outperformed the ReLU net at least by a small margin at the end.


Despite the fact that we performed only limited experiments, ELU units actually seem to learn faster than ReLU units and they are able to learn models which are at least as good as ReLU-like networks. Therefore, we plan to further investigate the use of this kind of neuron for our production networks.


Artificial intelligence Deep Learning Lab

Complex Loss Functions: Loopings With Theano

After more than a year of marriage, we are still in love with Theano, our machine learning wife. Why? Once you have done some basic work for model setup and data handling, she allows rapid prototyping and lets you focus on the problem instead of spending hours fiddling with gradient expressions. However, most tutorials and examples focus on rather simple models, like the classification of some categories, which is good for learning how things work, but it kind of idealizes reality. For instance, in the domain of visual technology, we rarely have problems that can be reduced to classification tasks. The challenges we are facing have more to do with similarity, ranking and especially the learning of concepts.

We illustrate the hurdles by sketching a model that embeds images with similar concepts into one feature space. For the sake of simplicity, we assume that we have a pre-trained convnet that converts an RGB image “x” into a feature vector “y” of length 128: y = f(x). Furthermore, we have a set of 16 abstract concepts to describe an image (like “red sneakers”, “striped shirts”, …). The aim of the model is to push an image close to its corresponding concepts, therefore minimizing the distance, and to pull it away from its counterparts. The concepts can be thought of as a lookup table “Wt” where each concept is represented by a vector of a fixed size, like 32. The image features “y” are also projected into this space via an embedding matrix “Wy”, which gives the image embedding “y_e”. For an image with the active concepts 1, 4 and 6, we want to minimize the distances:

 (y_e - Wt[1])
 (y_e - Wt[4])
 (y_e - Wt[6])

At the same time, we need to push y_e away from non-active concepts:

 max(0, 1 - (y_e - Wt[8]))
 max(0, 1 - (y_e - Wt[9]))
 max(0, 1 - (y_e - Wt[10]))

and so forth. The penalizing continues until the image embedding and its counterpart have a distance that is larger than one. Since the post is about loops in Theano, we mainly illustrate the implementation of the loss function and do not describe the whole model. So, what do we need to perform a single training step? First, the features of the image; second, the corresponding concepts, which are described by positive integer labels like [1, 4, 6]; and third, a subset of non-active concepts like [8, 9, 10]. The push part of the loss function in Theano looks like this:

 import numpy as np
 import theano
 import theano.tensor as T
 bound = 0.1  # assumed range of the random initialization
 W_t = theano.shared(np.random.uniform(-bound, +bound, size=(16, 32))) # random initialization of the embedding
 y_e = T.vector()
 pos_label1, pos_label2, pos_label3 = T.iscalar(), T.iscalar(), T.iscalar()
 loss_push = (T.sum((y_e - W_t[pos_label1])**2) + T.sum((y_e - W_t[pos_label2])**2)
              + T.sum((y_e - W_t[pos_label3])**2))

The limitation is obvious: if we have a variable number of active concepts, which is very likely, we have to fix the maximum number of active concepts and juggle binary flags. However, with loops the problem can be solved in a much niftier way. So, loops in Theano, is that possible at all? The answer is, yes of course(!), but it requires a bit of rethinking. All the magic is provided by a function named “scan” which comes, according to the official documentation, “with many bells and whistles”[1]. But in our case, the use is straightforward. The push part of the loss with scan:

 pos_labels = T.ivector()
 result, _ = theano.scan(lambda v: T.sum((y_e - W_t[v])**2), sequences=pos_labels)
 loss_push = T.sum(result)

Pretty awesome, right? To calculate the loss we compile a function:

 func = theano.function([y_e, pos_labels], loss_push)

Then, we can evaluate the loss, not only for three concepts, but for all kinds of combinations:

 obj_push = func(np.zeros(32), [1, 4, 6])
 obj_push = func(np.zeros(32), [1, 4, 6, 8])

Finally, let us explain what “scan” exactly does: the idea is to call a function once for each item of the list given by “sequences”. In our case, the function is f(v) = sum((y_e - W_t[v])**2) and the sequence is [1, 4, 6]. The scan function then generates calls to f() with the next item of the sequence: f(1), f(4), f(6). Each of these function calls returns the distance between the image features and the concept with the given id 1, 4 or 6, and the values are then combined into a “list” result which can be treated like any other tensor object in Theano.

In a nutshell, simple loops in Theano are not scary at all if one understands their semantics. It should be noted that scan is capable of much more, but also that there are limitations, such as no support for the Cartesian product of two sequences. Thus, it is a good idea to avoid complex loops in Theano if there is a different solution for the problem, also because of the performance overhead. However, if a loop is required, it can be done pretty easily as demonstrated.
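
For readers without a Theano installation, the semantics of the loop can be mimicked in plain Python. The sketch below also includes the hinge part for the non-active concepts; using the squared distance inside the hinge and a margin of one are our assumptions here, and the tiny lookup table is made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance of two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def concept_loss(y_e, W_t, pos_labels, neg_labels, margin=1.0):
    """Plain-Python counterpart of the loss, one loop step per label."""
    # Active concepts: penalize any distance to the image embedding.
    attract = sum(squared_distance(y_e, W_t[v]) for v in pos_labels)
    # Non-active concepts: penalize only while closer than the margin.
    repel = sum(max(0.0, margin - squared_distance(y_e, W_t[v]))
                for v in neg_labels)
    return attract + repel

# Tiny made-up lookup table: 4 concepts with 2-dimensional embeddings.
W_t = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [0.5, 0.0]]
y_e = [0.0, 0.0]
loss = concept_loss(y_e, W_t, pos_labels=[0, 1], neg_labels=[2, 3])
```

Each generator expression plays the role of one scan call: it walks over a label sequence of arbitrary length and the results are summed up afterwards.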

[1] <>





Computer Vision

Need for speed in [image] processing: Integral images – a neglected way to speed up algorithms

(integral image aka. summed area table)

Integral images are used to efficiently sum up pixels in areas of an image. This is necessary, for example, to calculate mean values of image regions and is used in image filtering. Each entry of the summed area table is calculated by summing up all pixels in the rectangle from the origin to the given pixel.


The sum of all pixels in the green area in the figure above for example is (without an integral image) given by


But if there are plenty of areas to sum up, it is much more efficient to compute an integral image once and then get each area sum with just a few additions and subtractions.


This method was introduced by Frank Crow in 1984, but it is often forgotten as a simple way to speed up an algorithm.
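
As a small illustration of the idea, here is a sketch in plain Python that builds the summed area table for a 3 x 3 image and then sums up a rectangular region with four table lookups. The extra zero row and column at the origin and the function names are choices made for this example.

```python
def integral_image(image):
    """Summed area table with an extra zero row and column at the origin."""
    height, width = len(image), len(image[0])
    table = [[0] * (width + 1) for _ in range(height + 1)]
    for y in range(height):
        row_sum = 0
        for x in range(width):
            row_sum += image[y][x]
            table[y + 1][x + 1] = table[y][x + 1] + row_sum
    return table

def area_sum(table, top, left, bottom, right):
    """Sum of image[top..bottom][left..right], both ends inclusive."""
    return (table[bottom + 1][right + 1] - table[top][right + 1]
            - table[bottom + 1][left] + table[top][left])

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
table = integral_image(image)
region = area_sum(table, 1, 1, 2, 2)   # pixels 5, 6, 8 and 9
```

The table is built in a single pass over the image; after that, every area sum costs the same four lookups, no matter how large the region is.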

Artificial intelligence Deep Learning Lab

Classification And Feature Extraction – Why Catalog Images Are So Different

When we focus on image classification and understanding with Deep Learning pipelines, most of the time the utilized images can be considered as “natural” or at least cluttered. In other words, images are very often photos made by users or professionals with a background that contributes to a scene. This is in stark contrast to most catalog images where the picture often contains only a single product, maybe centered, with an all-white background. This post describes some issues that need to be considered, if one trains a classifier for catalog images.

As noted earlier, the background of a natural image is essential for the correct prediction of the assigned label. For product images, it is usually the opposite, which means the background is largely a distraction and mainly used for aesthetic reasons, but it is often irrelevant for a correct prediction of the label. To sum it up, the statistics of catalog images are very different compared to “natural” images. This includes the color distribution, especially at the borders, primitive shapes and therefore also first-order statistics like the mean and the variance.

The next issue is the recognition of very similar categories like for instance outdoor shoes vs. running shoes in the domain of fashion. Even for humans, it can be challenging to tell what is what, depending on the image size, quality and the model of the shoe which brings us to the realms of Deep Learning.

Most of the ConvNet models today resize the image to a fixed size, like 256 x 256, and ignore the aspect ratio. If the image is already very small, like 96 x 144 for instance, upscaling will not help, because even in the original image, the details can hardly be recognized by either a human or a computer.
That means high-quality and larger images are required that preserve enough details when they are downscaled to the network size. In addition, for some categories a high diversity of images is required to capture its most relevant concepts.

The last issue is how an image is fed to the network. Usually, the image is resized to a fixed size N x N and a center crop is extracted and fed to the network. This rather simple method works when there is only a single object in the image and this object is centered, or at least when a center crop has a large overlap with the object. What is the problem with this method? For starters, a square image (N x N) can distort the semantic meaning of an image category so much that the classification fails. For instance, some shorts might look like pants when the image is resized. In other words, for some domains, like fashion, the aspect ratio is important for the correct prediction of the label.
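
One common way to keep the aspect ratio, shown here only as a sketch and not necessarily the method we use, is to scale the longer side to N and to pad the shorter side (“letterboxing”). The function name and the example image size are made up for illustration.

```python
def letterbox_geometry(width, height, target):
    """Fit (width, height) into a target x target square, keeping the aspect ratio.

    Returns the resized width and height plus the left and top offsets
    that center the resized image inside the square.
    """
    scale = target / max(width, height)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    pad_x = (target - new_width) // 2
    pad_y = (target - new_height) // 2
    return new_width, new_height, pad_x, pad_y

# A made-up portrait-format catalog image of 600 x 900 pixels.
geometry = letterbox_geometry(600, 900, 256)
```

The image content keeps its proportions, so long pants stay long and shorts stay short; the price is that a part of the network input consists of padding only.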

In a nutshell, the domain of catalog images is rather special, which means that the standard learning pipeline for Deep Learning definitely needs some adjustments. Because this issue needs special attention, the picalike Deep Learning Lab has decided to dedicate some of its resources to optimizing ConvNets for this kind of images and to providing dedicated ConvNet models for those domains.

Artificial intelligence Computer Vision Deep Learning Lab Picalike Products

Photo enhancement with machine learning

Although we started our deep learning lab only a few weeks ago, we would like to share one of many achievements: an intelligent free of charge image enhancement tool that does not require any registration – called the Photo Enhancer.

Photos taken by smartphones are strongly affected by color changes through illumination and relatively poor camera quality. Colors in photos often do not appear the same way as in reality. For us this was something we needed to work on, since our similarity search, amongst others, uses colors for similarity calculation. Consequently one of the first tasks of our deep learning lab was to create a system, which predicts the correct illumination and adjusts colors to generate a better version of the photo.

To achieve this, we first created a data set with a lot of photos taken by different smartphones and digital cameras. With different photo software, we created better, true-to-life versions of the photos and saved the parameters that were used to generate the enhanced versions.

Our next step was to set up a learning system that finds the best generalizing parameters. These were then used to enhance the images of a test set.

After a short training phase, we were able to achieve significant improvements with the model parameters. We decided to create a free online tool where you can upload up to 5 images at once (each smaller than 2MB) to get 3 suggested improved versions of each uploaded image.

Test it at:

To give you an example, we uploaded the black and blue dress, that has been subject to lengthy discussion in the past few weeks. See the results for yourself.

Black & Blue Dress enhanced

Enhancing the black & blue dress

Artificial intelligence Deep Learning Lab

Theano – A Convolutional Face-Lift

This year, Friday the 13th was not the Jason-kind of day but actually
a pretty good day. Why? Because there was a new release candidate for
our favorite Machine Learning library Theano. The new version 0.7
brings a lot of new and cool stuff, plus lots of bugfixes. In
our experience, release candidates of Theano are very well tested and stable,
and thus we decided to give it a try.

As usual, there was nothing to complain about. The upgrade went smoothly and
after 5 minutes, we were able to hack some code. The most notable
highlight (for us), except for better GPU support on Windows – which we do not
use – is the integration of a new back-end library for convolutions and pooling.
Those are the operations which are heavily used by big ConvNets and therefore very important!
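
For readers who have not met these two operations, here is a minimal
NumPy sketch of what they compute. It is a naive reference version for
illustration only, not Theano or CuDNN code, and the function names are
made up for this example. Like most ConvNet code, it computes the
cross-correlation form of convolution (the kernel is not flipped).

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Naive 'valid' 2D cross-correlation of a single-channel image."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            # Weighted sum of the patch under the kernel.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool2d(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fit are dropped."""
    h = feature_map.shape[0] // size * size
    w = feature_map.shape[1] // size * size
    blocks = feature_map[:h, :w].reshape(h // size, size, w // size, size)
    return blocks.max(axis=(1, 3))
```

A big ConvNet repeats these loops for many kernels, channels and images
in every training step, which is why a tuned GPU implementation makes
such a difference.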

Support for the CuDNN[1] back-end is still marked as
experimental[2], but we wanted to at least take a closer look at the new
interface to find out whether we could get it to work. With the steps
described at [2], we were able to use the new GPU back-end in no time
and we were also able to train a smaller proof-of-concept ConvNet
without any (software) problems.

Compared to the old convolution code used in Theano, CuDNN seems to
give those operations a great performance boost. We
don’t want to give a benchmark here, but the integration of the new GPU
back-end seems to be very successful, at least with our
hardware, and is definitely worth further investigation.


Artificial intelligence Computer Vision Deep Learning Lab

One Image Is Worth 1,000 Labels

Very recently there was an interview[0] with Yann LeCun about Deep Learning, which led to an open letter[1]. The letter raised the question of whether industries with a limited amount of data need different approaches, especially unsupervised ones, to make sense of their data.

To add to this discussion, we would like to mention the recent hype around big, deep convolutional networks (ConvNets) that are used by many companies today. The drawbacks of these systems are that they are very data hungry and that they require labeled data. In other words, if a company has millions of samples but no corresponding labels, or a labeled dataset that is rather small, ConvNets are no (real) option. Since almost every deep learning system in use today is supervised, unsupervised end-to-end alternatives for larger images are rare and are usually not drop-in replacements for the supervised model.

In the domain of fashion, we have problems neither with limited datasets nor with missing labels. However, the use of labels is rather limited. Why? Without a doubt, it is useful to predict the previously unknown category of an image, for instance shirt, blouse, coat, …, but beyond that we need a deeper, more conceptual description of the image. That means that in order to understand an image, we need hierarchical features that describe its concepts and go far beyond a simple label like ‘coat’. For these features we use Deep Learning, but not in a strictly supervised way, and this is why we cannot simply use traditional ConvNets.

The reason is that ConvNets try to separate object classes, for instance shirts and coats, but there are also high-level concepts that might be present in both classes, for instance buttons on clothes, just to name one. Of course, there are thousands of other examples like that. A different perspective is that a label does not tell much about the details of the image. For instance, a coat has a color, a shape and a texture, and can be made of different materials. All these features are completely ignored by the label, because the ultimate goal is just to decide whether the image is *some* kind of coat or not.

So, in the domain of fashion, at least for most approaches, like a similarity search, the use of labels is rather limited because such models mostly learn features to separate classes. The result is already useful, but since the aim is not to describe all (important) details of an image, the learned features are unlikely to be optimal for fine-grained approaches like a visual similarity search.
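
To make the dependence on features concrete, here is a minimal sketch of a similarity search over feature vectors. It assumes, purely for illustration, that every image has already been mapped to a fixed-length feature vector and that cosine similarity is the measure used; the function name is made up for this example. The search itself is trivial, so the quality of the results depends entirely on how much detail the features capture.

```python
import numpy as np

def most_similar(query, features, k=3):
    """Indices of the k rows of `features` closest to `query` by cosine similarity."""
    q = query / np.linalg.norm(query)
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    scores = f @ q  # one cosine similarity per stored image
    return np.argsort(-scores)[:k]  # best match first
```

If the features only encode what separates ‘coat’ from ‘shirt’, two coats with very different colors and materials can end up next to each other in this ranking.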

Most of the image services offered today try to predict something, such as a category or tags. Deep Learning in this area is clearly focused on supervised methods. However, we agree with the open letter that unsupervised learning is very important for special domains, for companies with limited data, and whenever labels are not sufficient for the task at hand, as in the fashion domain.

Since the revolution of Deep Learning started with unsupervised learning, there are plenty of methods available. However, most of them cannot be easily generalized to full-size images. For instance, it is known that the features of a pre-trained ConvNet can be used for a broad range of applications, but to train such a net, labels are required. This leads to a chicken-and-egg problem, and if we use hand-crafted features from traditional computer vision instead, the expressive power might be limited.

Bottom line: for services like similarity search, we need to learn features from images that describe the concepts in the data and that are able to disentangle most of the explanatory factors without needing any labels. Because of the importance of this topic, we decided to make it the first project of the brand new Deep Learning lab! We have run first tests with very impressive results and welcome you to take part in the discussion.