1 |
Multimodal Semantic Understanding and Navigation in Outdoor Scenes
|
|
|
|
Abstract:
From indoor robots to automated cars, there has been tremendous growth in the number of robots in our day-to-day lives, with products such as smart speakers, wearable devices, home robots, self-driving cars, and many more smart assistants expected in the coming years. These robotic systems interact with humans and their surrounding environments to perform their designated tasks. Research on robotic perception, vision-and-language navigation, speech recognition, and related topics drives these applications, and there has been significant progress over the past decade. The focus of this thesis is to develop models that tackle some of the remaining challenges and enable better robot perception and navigation systems. A perception system must handle a multitude of tasks, such as understanding human cues and visually perceiving the environment. To this end, we propose an approach to the Object Referring (OR) task that uses spoken language, human gaze, and natural language text. We train and evaluate our method on the Cityscapes dataset, augmented with human gaze and speech captured in an indoor setup. We observe that language-guided OR performance improves with the addition of the human-side gaze and speech modalities and with the visual scene modalities of RGB, depth, and motion. Next, the thesis turns to the challenge of robot navigation. The vast majority of research targets indoor or simulated outdoor navigation. Here, we define the problem of language-based robot navigation in real outdoor environments, where the agent must understand and execute natural language instructions from a first-person view. We create a large-scale dataset of verbal navigation instructions based on Google Street View. Experiments on our dataset show that the proposed approach enables language-guided automatic wayfinding.
Finally, what happens to a robot's visual perception system under poor lighting conditions or camera malfunction? Robots can then hear the environment to perceive it, as humans do. There are few works in the literature on sound perception in outdoor environments. We develop an approach for dense semantic object labelling based on binaural sounds from the environment. We propose a novel sensor setup and record a new audio-visual dataset of street scenes with eight binaural microphones and a 360° camera. We also propose two auxiliary tasks, namely a) a novel task of spatial sound super-resolution, and b) dense depth prediction of the scene. We then formulate the three tasks in one end-to-end multi-task network, and the evaluation on our dataset shows that all three tasks are mutually beneficial.
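The end-to-end multi-task formulation described in the abstract (one shared representation feeding the semantic, sound super-resolution, and depth heads, trained with a combined loss) can be sketched as follows. This is a minimal illustrative sketch only: all layer sizes, head names, and the uniform loss weights are assumptions for the example, not the thesis' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

D_IN, D_SHARED = 32, 16              # input feature size, shared feature size (assumed)
N_CLASSES, N_MICS, D_DEPTH = 5, 8, 10  # per-head output sizes (assumed)

# Randomly initialised weights for the shared encoder and the three heads.
W_enc = rng.normal(size=(D_IN, D_SHARED)) * 0.1
W_sem = rng.normal(size=(D_SHARED, N_CLASSES)) * 0.1  # semantic labelling head
W_ssr = rng.normal(size=(D_SHARED, N_MICS)) * 0.1     # sound super-resolution head
W_dep = rng.normal(size=(D_SHARED, D_DEPTH)) * 0.1    # depth prediction head

def forward(x):
    """Shared encoding of audio features, followed by the three task heads."""
    h = np.tanh(x @ W_enc)            # shared representation
    return h @ W_sem, h @ W_ssr, h @ W_dep

def multitask_loss(x, y_sem, y_ssr, y_dep, weights=(1.0, 1.0, 1.0)):
    """Weighted sum of the three per-task losses (MSE used for simplicity)."""
    sem, ssr, dep = forward(x)
    losses = (np.mean((sem - y_sem) ** 2),
              np.mean((ssr - y_ssr) ** 2),
              np.mean((dep - y_dep) ** 2))
    return sum(w * l for w, l in zip(weights, losses))

x = rng.normal(size=(4, D_IN))        # a toy batch of 4 audio-clip feature vectors
loss = multitask_loss(x,
                      rng.normal(size=(4, N_CLASSES)),
                      rng.normal(size=(4, N_MICS)),
                      rng.normal(size=(4, D_DEPTH)))
```

Because all heads share the encoder, gradients from each task shape the common representation, which is the mechanism behind the mutual benefit the abstract reports.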
|
|
Keyword:
Computer science; Computer vision; Data processing; Gaze estimation; info:eu-repo/classification/ddc/004; Natural language processing; Sound perception; Vision-and-language navigation
|
|
URL: https://hdl.handle.net/20.500.11850/509287 https://doi.org/10.3929/ethz-b-000509287
|
|
BASE
|
|
|
|
2 |
The perception of the French vowel /y/ by Polish learners of French ; De la perception de la voyelle /y/ chez des apprenants polonais du français langue étrangère
|
|
|
|
In: ISSN: 1641-6961 ; Białostockie Archiwum Językowe ; https://halshs.archives-ouvertes.fr/halshs-02289502 ; Białostockie Archiwum Językowe, Pologne, Białystok : Wydawnictwo Uniwersytetu w Białymstoku, 2019, n° 19, p. 91-112 (2019)
|
|
|
|
3 |
The Effects of English Pronunciation Instruction on Listening Skills among Vietnamese Learners
|
|
|
|
In: Masters Theses (2019)
|
|
|
|
5 |
“The sound of society”: A method for investigating sound perception in Cairo
|
|
|
|
In: ISSN: 1745-8927 ; EISSN: 1745-8935 ; Senses and Society ; https://hal.archives-ouvertes.fr/hal-01380972 ; Senses and Society, Taylor & Francis (Routledge), 2016, Contemporary French Sensory Ethnography, 11 (3), pp.298-319. ⟨10.1080/17458927.2016.1195112⟩ (2016)
|
|
|
|
8 |
SimScene: a web-based acoustic scenes simulator
|
|
|
|
In: 1st Web Audio Conference (WAC) ; https://hal.archives-ouvertes.fr/hal-01078098 ; 1st Web Audio Conference (WAC), IRCAM & Mozilla, Jan 2015, Paris, France ; http://wac.ircam.fr/ (2015)
|
|
|
|
10 |
Perception of /q/ in the Arabic /q/-/k/ contrast by native speakers of American English: a discrimination task
|
|
|
|
In: Theses (2015)
|
|
|
|
13 |
Intuitive Control of Solid-Interaction Sounds Synthesis: Toward Sonic Metaphors ; Contrôle Intuitif de la Synthèse Sonore d'Interactions Solidiennes : vers les Métaphores Sonores
|
|
|
|
In: https://hal.archives-ouvertes.fr/tel-01121888 ; Sound [cs.SD]. Ecole Centrale de Marseille, 2014. French (2014)
|
|
|
|
14 |
Intuitive control of solid-interaction sound synthesis : toward sonic metaphors ; Contrôle intuitif de la synthèse sonore d’interactions solidiennes : vers les métaphores sonores
|
|
|
|
In: https://tel.archives-ouvertes.fr/tel-01152851 ; Acoustics [physics.class-ph]. Ecole Centrale Marseille, 2014. French. ⟨NNT : 2014ECDM0012⟩ (2014)
|
|
|
|
15 |
Perception of English voiceless plosives under multiple Voice Onset Time (VOT) manipulations in an identification task by Brazilian and American listeners ; Percepção de plosivas surdas do inglês sob múltiplas manipulações de Voice Onset Time (VOT) em tarefa de identificação por brasileiros e americanos
|
|
|
|
|
|
|
|
|
|