In the domain of visual search, with a focus on image similarity for online shops, understanding images means more than just classifying them into coarse categories like “tops”. In order to get a meaningful description of the whole image, we need to decompose the image into its parts, which can only partly be done with a multiclass model. Thus, we need to make sure that we really understand an image from top to bottom to serve the best results for customers. For instance, a composed label like “fashion-tops-shirt-striped-long_sleeved” can be broken down into a sequence of classifications of categories: (1) fashion, (2) tops, (3) shirt, and tags: (a) striped, (b) long_sleeved. For categories, the classes at each level are disjoint, while tags can occur together (multi-label), and the combination of categories and tags forms a taxonomy with a hierarchical structure. This can be seen as a tree where the categories are the internal nodes, the tags correspond to the leaves, and the labels (e.g., “fashion-tops-shirt-striped-long_sleeved”) thus reflect a complete path from the root to a leaf.
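To make the label structure concrete, here is a minimal sketch of how such a composed label splits into a category path (internal nodes of the taxonomy tree) and a set of tags (leaves). The label format and the fixed category depth are assumptions for illustration, not the production schema:

```python
# Assumed label format: "<category>-<category>-<category>-<tag>-...-<tag>"
CATEGORY_DEPTH = 3  # e.g. fashion -> tops -> shirt

def split_label(label):
    """Split a composed label into a category path and a set of tags."""
    parts = label.split("-")
    return parts[:CATEGORY_DEPTH], set(parts[CATEGORY_DEPTH:])

categories, tags = split_label("fashion-tops-shirt-striped-long_sleeved")
# categories is the root-to-node path; tags are the multi-label leaves.
```

The category path is ordered (it is a path in the tree), whereas the tags form an unordered set, which is why they are returned as a list and a set respectively.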
In previous posts, we emphasized that the predictions of categories and tags happen simultaneously by learning a feature space that can be used for both multiclass and multi-label classification. In this sense, our learning model acts like a black box that, given an image, learns all the concepts that characterize it and outputs a (hopefully correct) label. Note that this is a one-time output, occurring when the signal reaches the end of the network.
However, such an approach does not fully exploit the hierarchical structure of our taxonomy. Therefore, it would make sense to employ a model that, given an image, “routes” the image through the taxonomy tree and constructs the final output (i.e., the label) in an iterative manner, going from a coarse classification to a fine-grained one. This is exactly the purpose of feedback-based neural networks.
A feedback neural network is made of recurrent feedforward layers that can be unrolled in time to form a stack of feedforward layers trained to produce the best possible output at each iteration, given the output produced so far. At each iteration, the loss is attached to the output of the last layer in the feedforward module. As a consequence, the network is forced to tackle the entire task at each iteration in order to pass its output on to the next one, which makes the network build its representation across iterations in a coarse-to-fine manner. In other words, each module of the network is forced to predict the final outcome, as opposed to a feedforward network, where concepts are built layer by layer and only the final layer is used for classification.
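The unrolling can be sketched with a toy forward pass: one shared block is applied for a fixed number of iterations, and a readout produces a full prediction at every iteration rather than only at the end. This is a minimal numpy illustration with random weights (the dimensions, the additive state-plus-input coupling, and the shared readout are assumptions for the sketch, not the actual model):

```python
import numpy as np

rng = np.random.default_rng(0)
D, C, T = 16, 5, 4                            # feature dim, classes, iterations
W_block = rng.standard_normal((D, D)) * 0.1   # shared recurrent block
W_out = rng.standard_normal((D, C)) * 0.1     # shared readout

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

x = rng.standard_normal(D)    # placeholder for the image features
h = np.zeros(D)               # hidden state carried across iterations
per_iteration_preds = []
for t in range(T):
    h = np.maximum(0.0, W_block @ (h + x))    # feedback: prior state + input
    per_iteration_preds.append(softmax(h @ W_out))
# In training, a classification loss would be applied to every entry of
# per_iteration_preds, which is what drives the coarse-to-fine refinement.
```

The key structural point is that the loss touches every iteration's output, not just the last one, so even the earliest pass through the block must produce a usable prediction.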
In a nutshell, the feedback approach has several advantages over the feedforward one.
Early Predictions: Since the outcome is approximated in an iterative manner, a coarse estimate of the output is available before the signal reaches the end of the network, and can thus be provided in a fraction of the total inference time. This could be especially useful to speed up queries when only a high-level categorization is required.
Taxonomy Compliance: Predictions given in an episodic (rather than one-time) manner, supported by the coarse-to-fine representation resulting from feedback-based training, naturally conform to a hierarchical structure in the label space. More specifically, early predictions provide a coarse classification of the image, with such coarse classes further refined through successive iterations.
Curriculum Learning: The process of learning simple concepts first and then more complex abstractions (based on those previously learned) has proven to be an efficient way to speed up both human and machine learning. The iterative predictions provided by the feedback-based model enforce curriculum learning through the episodes of prediction for a single query. Note that, in feedforward networks, curriculum learning can only be enforced by presenting the training data in a meaningful order, i.e., with increasing complexity.
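The early-prediction advantage translates directly into an early-exit inference loop: when the caller only needs a coarse level of the taxonomy, the unrolled model can simply stop after the corresponding number of iterations. A hypothetical sketch (the `run_iteration` callable and the level names are stand-ins, not our actual API):

```python
def predict(run_iteration, num_iterations, stop_after=None):
    """Run the unrolled model, optionally exiting early.

    run_iteration: callable returning the prediction of one iteration.
    stop_after: number of iterations to run (None = run all of them).
    """
    preds = []
    for t in range(num_iterations):
        preds.append(run_iteration(t))
        if stop_after is not None and t + 1 >= stop_after:
            break  # early prediction: coarser, but a fraction of the cost
    return preds

# Dummy stand-in for the per-iteration output of a feedback model.
levels = ["fashion", "tops", "shirt", "shirt-striped"]
preds = predict(lambda t: levels[t], 4, stop_after=2)
# Only the two coarsest levels were computed.
```

A feedforward classifier offers no such trade-off: its single output only exists once the signal has traversed the whole network.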
In contrast to the original approach, we simplified the implementation by using vanilla conv-bn-relu blocks instead of a recurrent convolution operation. To comply with the taxonomy, we concatenate the outputs of the lower, middle, and high layers and feed them into a recurrent fully connected layer with a fixed step size that predicts the output of each level of the hierarchy.
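A hedged PyTorch sketch of that simplified architecture, under stated assumptions: the block sizes, the GRU cell standing in for the recurrent fully connected layer, and the pooling of the three feature maps are illustrative choices, not the production configuration:

```python
import torch
import torch.nn as nn

def conv_block(c_in, c_out):
    # Vanilla conv-bn-relu block in place of a recurrent convolution.
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class FeedbackTaxonomyNet(nn.Module):
    def __init__(self, num_classes_per_level, hidden=128):
        super().__init__()
        self.low = conv_block(3, 16)
        self.mid = conv_block(16, 32)
        self.high = conv_block(32, 64)
        self.pool = nn.AdaptiveAvgPool2d(1)
        # Recurrent fully connected layer over the concatenated features.
        self.rnn = nn.GRUCell(16 + 32 + 64, hidden)
        self.heads = nn.ModuleList(
            nn.Linear(hidden, c) for c in num_classes_per_level
        )

    def forward(self, x):
        f1 = self.low(x)
        f2 = self.mid(f1)
        f3 = self.high(f2)
        # Concatenate pooled low/mid/high-level feature maps.
        feats = torch.cat([self.pool(f).flatten(1) for f in (f1, f2, f3)], 1)
        h = feats.new_zeros(x.size(0), self.rnn.hidden_size)
        outputs = []
        # Fixed step size: one recurrent step per level of the hierarchy.
        for head in self.heads:
            h = self.rnn(feats, h)
            outputs.append(head(h))  # logits for this taxonomy level
        return outputs

model = FeedbackTaxonomyNet([3, 10, 40])  # e.g. 3 coarse, 10 mid, 40 fine classes
model.eval()
with torch.no_grad():
    logits = model(torch.randn(2, 3, 64, 64))
```

Each element of `logits` corresponds to one level of the taxonomy, so a per-level cross-entropy (and a multi-label loss on the tag level) can be attached to each step of the recurrence.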
The experiments we conducted confirmed that a feedback-based model is a viable alternative to feedforward architectures for classification, showing excellent results for simple taxonomies. The issue of multi-label classification has not been fully addressed by our experiments, but we noted a positive impact of using the output of the feedback-driven layers. As usual, some classes/tags are more challenging than others due to the large variance of the images, and require more fine-tuning. Bottom line: with the insights gained so far, it seems beneficial to pursue the idea of feedback-based networks in order to fully utilize our fashion taxonomy and to learn feature representations that better capture the preferences of our customers, allowing us to deliver matching results in fewer steps and at higher precision.