The human brain extracts complex information from visual inputs, including objects, their spatial and semantic interrelations, and their interactions with the environment. However, a quantitative ...
Abstract: A good knowledge-based visual question answering (KB-VQA) model requires detailed visual information, semantically clear questions, and relevant external knowledge to address open visual ...
Abstract: The rapidly evolving field of robotics necessitates methods that can facilitate the fusion of multiple modalities. Specifically, when it comes to interacting with tangible objects, ...
If you find any work missing or have any suggestions (papers, implementations, and other resources), feel free to pull requests. We will add the missing papers to this repo as soon as possible. You ...