VIZARD: Reliable Visual Localization for Autonomous Vehicles in Urban Outdoor Environments

Mathias Buerki, Lukas Schaupp, Marcyn Dymczyk, Renaud Dube, Cesar Cadena, Roland Siegwart, and Juan Nieto

IEEE Intelligent Vehicles Symposium (IV) 2019

Changes in appearance is one of the main sources of failure in visual localization systems in outdoor environments. To address this challenge, we present VIZARD, a visual localization system for urban outdoor environments. By combining a local localization algorithm with the use of multi-session maps, a high localization recall can be achieved across vastly different appearance conditions. The fusion of the visual localization constraints with wheel-odometry in a state estimation framework further guarantees smooth and accurate pose estimates. In an extensive experimental evaluation on several hundreds of driving kilometers in challenging urban outdoor environments, we analyze the recall and accuracy of our localization system, investigate its key parameters and boundary conditions, and compare different types of feature descriptors. Our results show that VIZARD is able to achieve nearly 100% recall with a localization accuracy below 0.5m under varying outdoor appearance conditions, including at night-time.

pdf   video

Title = {Map Management for Efficient Long-Term Visual Localization in Outdoor Environments},
Author = {M. Buerki and L. Schaupp and M. Dymczyk and R. Dube and C. Cadena and R. Siegwart and J. Nieto},
Fullauthor = {Mathias Buerki and Lukas Schaupp and Marcyn Dymczyk and Renaud Dube and Cesar Cadena and Roland Siegwart and Juan Nieto},
Booktitle = {{IEEE} Intelligent Vehicles Symposium ({IV})},
Month = {June},
Year = {2019},

Object Classification Based on Unsupervised Learned Multi-Modal Features for Overcoming Sensor Failures

Julia Nitsch, Juan Nieto, Roland Siegwart, Max Schmidt, and Cesar Cadena

IEEE International Conference on Robotics and Automation (ICRA) 2019

For autonomous driving applications it is critical to know which type of road users and road side infrastructure are present to plan driving manoeuvres accordingly. Therefore autonomous cars are equipped with different sensor modalities to robustly perceive its environment. However, for classification modules based on machine learning techniques it is challenging to overcome unseen sensor noise. This work presents an object classification module operating on unsupervised learned multi-modal features with the ability to overcome gradual or total sensor failure. A two stage approach composed of an unsupervised feature training and a uni-modal and multimodal classifiers training is presented. We propose a simple but effective decision module switching between uni-modal and multi-modal classifiers based on the closeness in the feature space to the training data. Evaluations on the ModelNet 40 data set show that the proposed approach has a 14% accuracy gain compared to a late fusion approach operating on a noisy point cloud data and a 6% accuracy gain when operating on noisy image data.


Title = {Object Classification Based on Unsupervised Learned Multi-Modal Features for Overcoming Sensor Failures},
Author = {J. Nitsch and J. Nieto and R. Siegwart and M. Schmidt and C. Cadena},
Fullauthor = {Julia Nitsch and Juan Nieto and Roland Siegwart and Max Schmidt and Cesar Cadena},
Booktitle = {{IEEE} International Conference on Robotics and Automation ({ICRA})},
Month = {May},
Year = {2019},

SegMap: Segment-based Mapping and Localization using Data-driven Descriptors

Renaud Dube, Andrei Cramariuc1, Daniel Dugas, Hannes Sommer, Marcin Dymczyk, Juan Nieto, Roland Siegwart, and Cesar Cadena

International Journal of Robotics Research (IJRR) 2019

Precisely estimating a robot’s pose in a prior, global map is a fundamental capability for mobile robotics, e.g. autonomous driving or exploration in disaster zones. This task, however, remains challenging in unstructured, dynamic environments, where local features are not discriminative enough and global scene descriptors only provide coarse information. We therefore present SegMap: a map representation solution for localization and mapping based on the extraction of segments in 3D point clouds. Working at the level of segments offers increased invariance to view-point and local structural changes, and facilitates real-time processing of large-scale 3D data. SegMap exploits a single compact data-driven descriptor for performing multiple tasks: global localization, 3D dense map reconstruction, and semantic information extraction. The performance of SegMap is evaluated in multiple urban driving and search and rescue experiments. We show that the learned SegMap descriptor has superior segment retrieval capabilities, compared to state-of-the-art handcrafted descriptors. In consequence, we achieve a higher localization accuracy and a 6% increase in recall over state-of-the-art. These segment-based localizations allow us to reduce the open-loop odometry drift by up to 50%. SegMap is open-source available along with easy to run demonstrations.


 title = {{SegMap}: Segment-based Mapping and Localization using Data-driven Descriptors},
 author = {R. Dube and A. Cramariuc and D. Dugas and H. Sommer and M. Dymczyk and J. Nieto and R. Siegwart and C. Cadena},
 fullauthor ={Renaud Dube and Andrei Cramariuc and Daniel Dugas and Hannes Sommer and Marcin Dymczyk and Juan Nieto and Roland Siegwart and Cesar Cadena},
 journal = {{International Journal of Robotics Research}},
 year = {2019},
 volume = {XX},
 number = {X},
 pages  = {1--16},

Multiple Hypothesis Semantic Mapping for Robust Data Association

Lukas Bernreiter, Abel Gawel, Hannes Sommer, Juan Nieto, Roland Siegwart and Cesar Cadena

IEEE Robotics and Automation Letters, 2019

We present a semantic mapping approach with multiple hypothesis tracking for data association. As semantic information has the potential to overcome ambiguity in measurements and place recognition, it forms an eminent modality for autonomous systems. This is particularly evident in urban scenarios with several similar-looking surroundings. Nevertheless, it requires the handling of a non-Gaussian and discrete random variable coming from object detectors. Previous methods facilitate semantic information for global localization and data association to reduce the instance ambiguity between the landmarks. However, many of these approaches do not deal with the creation of completely globally consistent representations of the environment and typically do not scale well. We utilize multiple hypothesis trees to derive a probabilistic data association for semantic measurements by means of position, instance, and class to create a semantic representation. We propose an optimized mapping method and make use of a pose graph to derive a novel semantic SLAM solution. Furthermore, we show that semantic covisibility graphs allow for a precise place recognition in urban environments. We verify our approach using real-world outdoor dataset and demonstrate an average drift reduction of 33% w.r.t. the raw odometry source. Moreover, our approach produces 55% less hypotheses on average than a regular multiple hypothesis approach.


title={Multiple Hypothesis Semantic Mapping for Robust Data Association}, 
author={L. {Bernreiter} and A. {Gawel} and H. {Sommer} and J. {Nieto} and R. {Siegwart} and C. {Cadena}}, 
journal={{IEEE Robotics and Automation Letters}}, 

Empty Cities: Image Inpainting for a Dynamic-Object-Invariant Space

Berta Bescos, Jose Neira, Roland Siegwart, and Cesar Cadena

IEEE International Conference on Robotics and Automation (ICRA) 2019

In this paper we present an end-to-end deep learning framework to turn images that show dynamic content, such as vehicles or pedestrians, into realistic static frames. This objective encounters two main challenges: detecting all the dynamic objects, and inpainting the static occluded background with plausible imagery. The former challenge is addressed by the use of a convolutional network that learns a multiclass semantic segmentation of the image. The second problem is approached with a conditional generative adversarial model that, taking as input the original dynamic image and its dynamic/static binary mask, is capable of generating the final static image. These generated images can be used for applications such as augmented reality or vision-based robot localization purposes. To validate our approach, we show both qualitative and quantitative comparisons against other state-of-the-art inpainting methods by removing the dynamic objects and hallucinating the static structure behind them. Furthermore, to demonstrate the potential of our results, we carry out pilot experiments that show the benefits of our proposal for visual place recognition.

pdf   website   code   video

Title = {Empty Cities: Image Inpainting for a Dynamic-Object-Invariant Space},
Author = {B. Bescos and J. Neira and R. Siegwart and C. Cadena},
Fullauthor = {Berta Bescos and Jose Neira and Roland Siegwart and Cesar Cadena},
Booktitle = {{IEEE} International Conference on Robotics and Automation ({ICRA})},
Month = {May},
Year = {2019},

Modular Sensor Fusion for Semantic Segmentation

Hermann Blum, Abel Gawel, Roland Siegwart and Cesar Cadena

IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS) 2018

Sensor fusion is a fundamental process in robotic systems as it extends the perceptual range and increases robustness in real-world operations. Current multi-sensor deep learning based semantic segmentation approaches do not provide robustness to under-performing classes in one modality, or require a specific architecture with access to the full aligned multi-sensor training data. In this work, we analyze statistical fusion approaches for semantic segmentation that overcome these drawbacks while keeping a competitive performance. The studied approaches are modular by construction, allowing to have different training sets per modality and only a much smaller subset is needed to calibrate the statistical models. We evaluate a range of statistical fusion approaches and report their performance against state-of-the-art baselines on both real-world and simulated data. In our experiments, the approach improves performance in IoU over the best single modality segmentation results by up to 5%. We make all implementations and configurations publicly available.

pdf   code

Title = {Modular Sensor Fusion for Semantic Segmentation},
Author = {Blum, H. and Gawel, A. and Siegwart, R. and Cadena, C.},
Fullauthor = {Blum, Hermann and Gawel, Abel and Siegwart, Roland and Cadena, Cesar},
Booktitle = {2018 {IEEE/RSJ} International Conference on Intelligent Robots and Systems ({IROS})}, 
Month = {October},
Year = {2018},

Fusion Scheme for Semantic and Instance-level Segmentation

Arthur Daniel Costea, Andra Petrovai, Sergiu Nedevschi

Proceedings of 2018 IEEE 21th International Conference on Intelligent Transportation Systems (ITSC 2018), Maui, Hawaii, USA, 4-7 Nov. 2018, pp. 3469-3475

A powerful scene understanding can be achieved by combining the tasks of semantic segmentation and instance level recognition. Considering that these tasks are complementary, we propose a multi-objective fusion scheme which leverages the capabilities of each task: pixel level semantic segmentation performs well in background classification and delimiting foreground objects from background, while instance level segmentation excels in recognizing and classifying objects as a whole. We use a fully convolutional residual network together with a feature pyramid network in order to achieve both semantic segmentation and Mask R-CNN based instance level recognition. We introduce a novel heuristic fusion approach for panoptic segmentation. The instance and semantic segmentation output of the network is fused into a panoptic segmentation based on object sub-category class and instance propagation guidance by semantic segmentation for more general classes. The proposed solution achieves significant improvements in semantic object segmentation and object mask boundaries refinement at low computational costs.


Super-sensor for 360-degree Environment Perception: Point Cloud Segmentation Using Image Features

R. Varga, A.D. Costea, H. Florea, I. Giosan, S. Nedevschi

Proceedings of 2017 IEEE 20th International Conference on Intelligent Transportation Systems (ITSC 2017), Yokohama, Japan, 16-19 Oct. 2017,  pp. 1-8

This paper describes a super-sensor that enables 360-degree environment perception for automated vehicles in urban traffic scenarios. We use four fisheye cameras, four 360-degree LIDARs and a GPS/IMU sensor mounted on an automated vehicle to build a super-sensor that offers an enhanced low-level representation of the environment by harmonizing all the available sensor measurements. Individual sensors cannot provide a robust 360-degree perception due to their limitations: field of view, range, orientation, number of scanning rays, etc. The novelty of this work consists of segmenting the 3D LIDAR point cloud by associating it with the 2D image semantic segmentation. Another contribution is the sensor configuration that enables 360-degree environment perception. The following steps are involved in the process: calibration, timestamp synchronization, fisheye image unwarping, motion correction of LIDAR points, point cloud projection onto the images and semantic segmentation of images. The enhanced low-level representation will improve the high-level perception environment tasks such as object detection, classification and tracking.


Semantic segmentation-based stereo reconstruction with statistically improved long range accuracy

V.C. Miclea, S. Nedevschi

Proceedings of 2017 IEEE Intelligent Vehicles Symposium (IV 17), Los Angeles, CA, USA, 11-14 June 2017, pp. 1795-1802

Lately stereo matching has become a key aspect in autonomous driving, providing highly accurate solutions at relatively low cost. Top approaches on state of the art benchmarks rely on learning mechanisms such as convolutional neural networks (ConvNets) to boost matching accuracy. We propose a new real-time stereo reconstruction method that uses a ConvNet for semantically segmenting the driving scene. In a ”divide and conquer” approach this segmentation enables us to split the large heterogeneous traffic scene into smaller regions with similar features. We use the segmentation results to enhance Census Transform with an optimal census mask and the SGM energy optimization step with an optimal P1 penalty for each predicted class. Additionally, we improve the sub-pixel accuracy of the stereo matching by finding optimal interpolation functions for each particular segment class. In both cases we propose new stochastic optimization methods based on genetic algorithms that can incrementally adjust the parameters for better solutions. Tests performed on Kitti and real traffic scenarios show that our method outperforms the accuracy of previous solutions.


Semi-Automatic Image Annotation of Street Scenes

Andra Petrovai, Arthur D. Costea and Sergiu Nedevschi

Proceedings of 2017 IEEE Intelligent Vehicles Symposium (IV 17), Los Angeles, CA, USA, 11-14 June 2017, pp. 448-455

Scene labeling enables very sophisticated and powerful applications for autonomous driving. Training classifiers for this task would not be possible without the existence of large datasets of pixelwise labeled images. Manually annotating a large number of images is an expensive and time consuming process. In this paper, we propose a new semi-automatic annotation tool for scene labeling tailored for autonomous driving. This tool significantly reduces the effort of the annotator and also the time spent to annotate the data, while at the same time it offers the necessary features to produce precise pixel-level semantic labeling. The main contribution of our work represents the development of a complex annotation framework able to generate automatic annotations for 20 classes, which the user can control and modify accordingly. Automatic annotations are obtained in two separate ways. First, we employ a pixelwise fully-connected Conditional Random Field (CRF). Second, we perform grouping of similar neighboring superpixels based on 2D appearance and 3D information using a boosted classifier. Polygons represent the manual correction mechanism for the automatic annotations.