3.3.2 Language Plus Vision
Now that we know that machines can understand language, and that sophisticated models have been built for exactly this purpose [30], it is time to bring another sense into play. One of the most popular ways to demonstrate the potential of jointly training vision and language is image and video captioning [31, 35].
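To make the idea of joint training concrete, the sketch below shows one common captioning recipe: a convolutional encoder summarizes the image, and a recurrent decoder generates the caption one word at a time conditioned on that summary. This is a minimal illustration, not the architecture of [31] or [35]; the backbone, layer sizes, and vocabulary handling are all illustrative assumptions.

```python
# Minimal encoder-decoder captioning sketch (PyTorch assumed).
# Sizes, backbone, and vocabulary are illustrative, not taken from the chapter.
import torch
import torch.nn as nn
from torchvision import models

class CaptionModel(nn.Module):
    def __init__(self, vocab_size, embed_dim=256, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)                 # visual encoder
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(512, embed_dim)           # map image features to word space
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.decoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        feats = self.encoder(images).flatten(1)             # (B, 512) image summary
        img_token = self.img_proj(feats).unsqueeze(1)       # (B, 1, E) fed as the first "word"
        words = self.embed(captions)                        # (B, T, E) caption tokens
        seq = torch.cat([img_token, words], dim=1)
        out, _ = self.decoder(seq)
        return self.head(out)                               # next-word logits at every step
```

Training then amounts to maximizing the likelihood of the reference caption given the image, exactly the kind of shared objective that ties the two modalities together.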
More recently, a new line of work has taken advantage of this connection. Visual question answering (VQA) [17] is the task of receiving an image together with a natural language question about that image as input and producing an accurate natural language answer as output. The beauty of this task is that both the questions and the answers can be open‐ended, and the questions can target different aspects of the image, such as the objects present, their relationships or relative positions, colors, and background.
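A common way to operationalize the open‐ended setting is to treat answering as classification over a set of frequent answers. The sketch below, which is an assumption‐laden simplification rather than the model of [17], encodes the image with a CNN, the question with an LSTM, fuses the two by element‐wise multiplication, and scores candidate answers.

```python
# Minimal VQA sketch: image encoder + question encoder + fusion + answer classifier.
# Architecture choices here are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision import models

class SimpleVQA(nn.Module):
    def __init__(self, vocab_size, num_answers, embed_dim=300, hidden_dim=512):
        super().__init__()
        cnn = models.resnet18(weights=None)
        self.img_enc = nn.Sequential(*list(cnn.children())[:-1])   # image -> 512-d feature
        self.img_proj = nn.Linear(512, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.q_enc = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_answers)

    def forward(self, image, question_ids):
        v = self.img_proj(self.img_enc(image).flatten(1))           # (B, H) visual feature
        _, (h, _) = self.q_enc(self.embed(question_ids))
        q = h[-1]                                                   # (B, H) question feature
        joint = v * q                                               # element-wise fusion
        return self.classifier(joint)                               # logits over candidate answers
```

Because the question can ask about objects, relations, colors, or background, the fused representation has to carry whichever visual evidence the question makes relevant.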
Following this research, Singh et al. [36] cleverly added an optical character recognition (OCR) module to the VQA model, enabling the agent to read the text present in the image, answer questions about that text, or use it as additional context to answer other questions better.
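In spirit, this means the answer can either come from the usual answer vocabulary or be copied directly from a detected OCR token. The fragment below hedges a possible way to extend the previous sketch's output head in that direction; it mirrors the idea only, and the actual architecture of Singh et al. [36] differs in its details.

```python
# Hypothetical OCR-aware answer head: the model either picks a vocabulary answer
# or "copies" one of the OCR tokens detected in the image. Illustrative only.
import torch
import torch.nn as nn

class OCRAwareHead(nn.Module):
    def __init__(self, hidden_dim, num_fixed_answers):
        super().__init__()
        self.fixed_head = nn.Linear(hidden_dim, num_fixed_answers)  # scores vocabulary answers
        self.copy_head = nn.Linear(hidden_dim, hidden_dim)          # scores detected OCR tokens

    def forward(self, joint, ocr_feats):
        # joint: (B, H) fused image+question feature; ocr_feats: (B, N, H) embedded OCR tokens
        vocab_logits = self.fixed_head(joint)
        copy_logits = torch.bmm(ocr_feats, self.copy_head(joint).unsqueeze(2)).squeeze(2)
        return torch.cat([vocab_logits, copy_logits], dim=1)        # answer = word or copied text
```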