With the commercialization of autonomous driving and assisted driving systems, the demand for high-performance, efficient, and scalable machine learning solutions is more urgent than ever. Visual perception is a key research area in self-driving that consistently attracts attention because (1) visual data provides much richer information than other sensor modalities; (2) cameras are affordable and pervasive on vehicles as well as other robotic systems; and (3) visual foundation models are an emerging direction in machine learning. This workshop embraces topics around vision-centric and data-driven autonomous driving.
Following the successful 1st VCAD Workshop at CVPR 2023, the 2nd VCAD Workshop will take place at ECCV 2024 on September 30, 2024, from 8:55 am to 1:30 pm at MiCo Milano, Suite 9.
Please submit your draft to vision-centric-autonomous-driving@outlook.com. Submissions must follow the ECCV 2024 template and will be peer-reviewed in a single-blind manner. Submissions must be no more than 14 pages (excluding references). Accepted papers will be presented as posters, with several selected for oral presentations.
8:55 am – 9:00 am: Workshop Introduction
9:00 am – 9:45 am: Jose M. Alvarez - Towards Robust and Reliable AV with Foundational Models
Abstract: In this talk, I will present our latest progress on vision-centric and data-driven methods for autonomous vehicles. I will introduce Hydra, our end-to-end architecture using a multi-target distillation paradigm, and then explain how we leverage MLLMs across the autonomous driving ecosystem, from the car, with OmniDrive, to the cloud, with SSE, an MLLM-based data selection framework. Finally, I will cover recent methods that leverage data generation to improve the robustness of our algorithms.
Dr. Jose M. Alvarez is a research director at NVIDIA, leading the Autonomous Vehicle Applied Research team. His team maximizes the impact of the latest research advances on the AV product. Jose's research interests include model-centric and data-centric deep learning toward more efficient and scalable systems. Jose completed his Ph.D. in computer science in Barcelona, specializing in road-scene understanding for autonomous driving when datasets were very limited. He also worked as a postdoctoral researcher at NYU under Yann LeCun.
9:45 am – 10:30 am: Sergio Casas - Perceiving and Forecasting Anything with Self-Supervision
Abstract: Perceiving and forecasting the world are critical tasks for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world -- traditionally with object detections and trajectory predictions, or semantic bird's-eye-view occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we explore methods to learn to perceive and forecast with self-supervision from LiDAR data. We show that such self-supervised world models can be easily and effectively transferred to downstream tasks.
Dr. Sergio Casas is a Senior Staff Tech Lead Manager at Waabi, where he leads the Perception and Behavior Reasoning team. He completed his Ph.D. at the University of Toronto, under the supervision of Prof. Raquel Urtasun. His research lies at the intersection of computer vision, machine learning, and robotics.
10:30 am – 11:00 am: Poster Sessions
11:00 am – 11:45 am: Boyi Li - Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving
Abstract: The autonomous driving industry is increasingly adopting end-to-end learning from sensory inputs to minimize human biases in system design. Traditional end-to-end driving models, however, suffer from long-tail events due to rare or unseen inputs within their training distributions. To address this, we propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge, enabling better utilization of LLMs' reasoning capabilities to enhance autonomous vehicle planning in long-tail scenarios. TOKEN effectively alleviates data scarcity and inefficient tokenization by leveraging a traditional end-to-end driving model to produce condensed and semantically enriched representations of the scene, which are optimized for LLM planning compatibility through deliberate representation and reasoning alignment training stages. Our results demonstrate that TOKEN excels in grounding, reasoning, and planning capabilities, outperforming existing frameworks with a 27% reduction in trajectory L2 error and a 39% decrease in collision rates in long-tail scenarios.
Dr. Boyi Li is a Research Scientist at NVIDIA Research and a Postdoctoral Scholar at UC Berkeley, where she is advised by Prof. Jitendra Malik and Prof. Trevor Darrell. She received her Ph.D. from Cornell University, under the guidance of Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Dr. Li’s research focuses on learning from multimodal data and developing generalizable algorithms and interactive intelligent systems, with a focus on areas such as reasoning, large language models, generative models, and robotics. Her work involves aligning representations from multimodal data, including 2D pixels, 3D geometry, language, and audio.
11:45 am – 12:30 pm: Fergal Cotter - Building Digital Cities
Abstract: We explore the space of neural reconstruction methods for resimulation and its importance for self-driving. At Wayve, we challenge the status quo of self-driving and advocate heavily for an end-to-end “AV2.0” approach, where the driving model connects directly to sensor inputs and control outputs. In the same vein, we advocate for doing resimulation in the wild with the barest of requirements. In this talk, we look at how we’ve put this into practice at Wayve in building our Ghost-Gym resimulation engine on top of PRISM-1.
Dr. Fergal Cotter is a Staff Applied Scientist at Wayve, where he has mostly looked at the Perception side of End-to-End driving. In particular, he has focused on how perception and geometry information can be distilled into driving models, as well as ensuring a good grasp of what is happening in the training data. Recently, he has explored the resimulation problem and how driving models can be benchmarked without ever taking them on the road. Before joining Wayve, Dr. Cotter completed his PhD at the University of Cambridge, where he worked on using the wavelet and frequency domain as a good representation and learning space for image understanding.
12:30 pm – 1:30 pm: Oral Sessions
For any questions, please contact us at yimingli@nyu.edu.