
One may ask where this new task stands relative to the previous one: are agents that can answer questions more intelligent than those that produce captions? The answer is yes. In [17], the authors show that VQA agents require a deeper, more detailed understanding of the image, and more reasoning, than captioning models.

3.3.3 Embodied Visual Recognition

Passive or fixed agents may fail to recognize objects in a scene when they are partially or heavily occluded. Embodiment comes to the rescue here: it grants the agent the ability to move through the environment and actively control its viewing position and angle, removing ambiguity about object shapes and semantics.

Jayaraman and Grauman [37] set out to learn representations that exploit the link between how an agent moves and how its motion changes its visual surroundings. To do this, they used raw unlabeled video together with an external GPS sensor providing the agent's coordinates, and trained their model to learn a representation tying the two together. After training, the agent can predict the outcome of its future actions, anticipating how the scene will look after it moves forward or turns to one side.
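The sketch below illustrates the core idea under simplifying assumptions; it is not the authors' original implementation. We assume frame pairs from unlabeled video, an egomotion signal (e.g. from GPS or odometry) discretized into a few motion classes, and an illustrative Siamese encoder: a learned linear map per motion class must carry the feature of the current frame onto the feature of the frame actually observed after that motion, which encourages the representation to change predictably with the agent's movement.

```python
# Minimal sketch of egomotion-equivariant representation learning.
# Architecture, dimensions, and the MSE loss are illustrative assumptions.

import torch
import torch.nn as nn

class EquivariantEncoder(nn.Module):
    def __init__(self, feat_dim=128, num_motions=4):
        super().__init__()
        # Shared CNN encoder applied to both frames (Siamese weights).
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )
        # One learned linear map per discrete egomotion class: it should
        # carry the feature of frame t onto the feature of frame t+1.
        self.motion_maps = nn.ModuleList(
            [nn.Linear(feat_dim, feat_dim) for _ in range(num_motions)]
        )

    def forward(self, frame_t, frame_next, motion_id):
        z_t = self.encoder(frame_t)        # features before the motion
        z_next = self.encoder(frame_next)  # features after the motion
        # Predict the post-motion feature by applying the map that
        # corresponds to the egomotion actually performed.
        z_pred = torch.stack(
            [self.motion_maps[int(m)](z) for m, z in zip(motion_id, z_t)]
        )
        return z_pred, z_next

# Equivariance loss: the transformed feature should match the feature
# of the frame the agent really observed after moving.
model = EquivariantEncoder()
frames_t = torch.randn(8, 3, 64, 64)      # batch of current frames
frames_next = torch.randn(8, 3, 64, 64)   # frames after the agent moved
motions = torch.randint(0, 4, (8,))       # egomotion class per pair
z_pred, z_next = model(frames_t, frames_next, motions)
loss = nn.functional.mse_loss(z_pred, z_next)
loss.backward()  # trains the encoder and motion maps jointly
```

A representation trained this way pairs each known motion with a predictable transformation in feature space, which is what lets the agent "imagine" the view that a forward step or a turn would produce.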
