Context-Based Interpretability for Visual Attention using AI


How does the visual representation of the world is structured by the brain? How could it be useful to react adaptively to new situations and scenarios? What if an autonomous system could learn that behavior?

Recent Computer Vision and Deep Learning techniques enable the possibility to solve complex visual problems. One desirable property for most applications is the dynamic adaptation ability of the system to unknown contexts. This ability could be useful in a software production environment, where the data dynamics are business and human behavior dependent, providing flexibility while keeping robustness. This concept upgrades a trainable solution to a self-adaptive solution.

Information, as modeled by humans, has some structure and hierarchy. Therefore, it can be defined as latent representations of high level concepts, which are represented by their visual structure. Capsule Networks are shown to be a promising technology that allows us to explore new possibilities at Computer Vision field. Moreover, they make the assumption that visual world can be modeled by a parse-tree like structure. This latent representation is dynamically instantiated for each new input stimulus, for example, at each new video frame of a sequence. The information flows through the dynamic structure by means of routing elements called capsules. This mechanism could be used to model every hierarchical structure designed by the human being with the aim of modeling the visual world: contexts, scenarios as well as intermediate concepts. Imagine that you could define a conceptual graph as your world model, insert it in your Deep Learning system and see how it self-organizes the information each time is being required for service, profiling the performance for each context.

Many applications such as image segmentation, object detection or object tracking could be benefited from this techniques. Visual attention is one of the most interesting problems to solve while modeling latent structure and context. The reason behind this is that human attention is complex and depends on visual semantics, temporal dynamics, particular experience, motivations, actions, tasks, etc. Most of them, even their combinations, define contexts where to perform the visual attention task. These contexts could be modeled using the latent structure represented by capsules’ routing mechanisms. Specifically, visual attention applied to autonomous driving is an attractive challenge due to its behavior variability across observers, then it provides multiple possibilities when trying to model contexts and to structure information. For example, a driving context could be defined as a sunny morning at downtown, a cloudy evening at countryside or a rainy night on a highway.

Interpretability is another desirable property for Deep Learning systems. Computer Vision applications are slightly interpretable by humans as they’re common tasks for us, but the decisions made by algorithms are not fully understandable yet. Autonomous driving systems could be more expressive if they held that kind of properties. Capsule Networks variants and the information structure models implemented with them are the enablers to build interpretable systems at concept and latent structure levels. For example, they could describe relationships between low level features (color, optical flow, semantics, etc.), intermediate level environment properties (evening, rainy, downtown, etc.) and their combinations in abstract levels of information, which can be denoted as contexts.

New horizons: “What if my data or my system is required to change its information structure fast, dynamically and to other complex structures?” This is the real case of a situation where data structure could change based on business needs, affecting to dataset construction (new video recordings and context labels) and model retraining (for new information maps). This is a challenge proposed for One Shot Learning line of research.