ECCV 2024 2nd Workshop on Vision-Centric Autonomous Driving (VCAD)

Introduction

With the commercialization of autonomous driving and assisted driving systems, the demand for high-performance, efficient, and scalable machine learning solutions is becoming more urgent than ever before. Visual perception is a key research area in self-driving that consistently attracts attention because (1) visual data provide much richer information than other sensor modalities; (2) cameras are affordable and pervasive on vehicles as well as other robotic systems; and (3) visual foundation models are a trending direction in machine learning. This workshop embraces topics around vision-centric and data-driven autonomous driving.

Key Dates

Following the successful CVPR 2023 1st VCAD Workshop, the ECCV 2024 2nd VCAD Workshop will take place on September 30, 2024, from 8:55 am to 1:30 pm at MiCo Milano, Suite 9.

Please submit your draft to vision-centric-autonomous-driving@outlook.com. Submissions must follow the ECCV 2024 template and will be peer-reviewed in a single-blind manner. Submissions must be no more than 14 pages (excluding references). Accepted papers will be presented as posters, with several selected for oral presentation.

Keynote Speakers

Workshop Schedule

8:55 am – 9:00 am: Workshop Introduction

9:00 am – 9:45 am: Jose M. Alvarez - Towards Robust and Reliable AV with Foundational Models

Abstract: In this talk, I will present our latest progress in vision-centric and data-driven methods for autonomous vehicles. I will introduce Hydra, our end-to-end architecture built on a multi-target distillation paradigm, and then explain how we leverage MLLMs across the autonomous driving ecosystem, from the car, with OmniDrive, to the cloud, with SSE, an MLLM data selection framework. Finally, I will present recent methods that leverage data generation to improve the robustness of our algorithms.

Dr. Jose M. Alvarez is a research director at NVIDIA, leading the Autonomous Vehicle Applied Research team. His team maximizes the impact of the latest research advances on the AV product. Jose's research interests include model-centric and data-centric deep learning toward more efficient and scalable systems. Jose completed his Ph.D. in computer science in Barcelona, specializing in road-scene understanding for autonomous driving when datasets were very limited. He also worked as a postdoctoral researcher at NYU under Yann LeCun.

9:45 am – 10:30 am: Sergio Casas - Perceiving and Forecasting Anything with Self-Supervision

Abstract: Perceiving and forecasting the world are critical tasks for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world -- traditionally with object detections and trajectory predictions, or semantic bird's-eye-view occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we explore methods to learn to perceive and forecast with self-supervision from LiDAR data. We show that such self-supervised world models can be easily and effectively transferred to downstream tasks.

Dr. Sergio Casas is a Senior Staff Tech Lead Manager at Waabi, where he leads the Perception and Behavior Reasoning team. He completed his Ph.D. at the University of Toronto, under the supervision of Prof. Raquel Urtasun. His research lies at the intersection of computer vision, machine learning, and robotics.

10:30 am – 11:00 am: Poster Session

  • Foundation Models for Amodal Video Instance Segmentation in Automated Driving
    Jasmin Breitenstein, Franz Jünger, Andreas Bär, Tim Fingscheidt [PDF]
  • A New Dataset for Monocular Depth Estimation Under Viewpoint Shifts
    Aurel Pjetri, Stefano Caprasecca, Leonardo Taccari, Matteo Simoncini, Henrique Pineiro Monteagudo, Walter Wallace, Douglas Coimbra de Andrade, Francesco Sambo, Andrew David Bagdanov [PDF]
  • Accuracy Evaluation and Improvement of the Calibration of Stereo Vision Datasets
    Kai Cordes, Hellward Broszio [PDF]
  • Robust Unsupervised Optical Flow Under Low-Visibility Conditions
    Libo Long, Tianran Liu, Robert Laganière, Jochen Lang [PDF]
  • RoDUS: Robust Decomposition of Static and Dynamic Elements in Urban Scenes
    Thang-Anh-Quan Nguyen, Luis Roldão, Nathan Piasco, Moussab Bennehar, Dzmitry Tsishkou [PDF]
  • FSMDet: Vision-guided feature diffusion for fully sparse 3D detector
    Tianran Liu, Morteza Mousa Pasandi, Robert Laganiere [PDF]
  • Beyond Entropy: Style Transfer Guided Single Image Continual Test-Time Adaptation
    Younggeol Cho, Youngrae Kim, Dongman Lee [PDF]
  • MapNeXt: Revisiting Training and Scaling Practices for Vectorized HD Map Construction
    Toyota Li [PDF]
  • Cross-Spectral Gated-RGB Stereo Depth Estimation
    Samuel Brucker, Stefanie Walz, Mario Bijelic, Felix Heide [PDF]
  • High-Order Evolving Graphs for Enhanced Representation of Traffic Dynamics
    Aditya Humnabadkar, Arindam Sikdar, Benjamin Cave, Huaizhong Zhang, Paul Bakaki, Ardhendu Behera [PDF]
  • Robust Bird’s Eye View Segmentation by Adapting DINOv2
    Merve Rabia Barın, Görkay Aydemir, Fatma Güney [PDF]

11:00 am – 11:45 am: Boyi Li - Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving

Abstract: The autonomous driving industry is increasingly adopting end-to-end learning from sensory inputs to minimize human biases in system design. Traditional end-to-end driving models, however, suffer from long-tail events due to rare or unseen inputs within their training distributions. To address this, we propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge, enabling better utilization of the LLM's reasoning capabilities to enhance autonomous vehicle planning in long-tail scenarios. TOKEN effectively alleviates data scarcity and inefficient tokenization by leveraging a traditional end-to-end driving model to produce condensed and semantically enriched representations of the scene, which are optimized for LLM planning compatibility through deliberate representation and reasoning alignment training stages. Our results demonstrate that TOKEN excels in grounding, reasoning, and planning capabilities, outperforming existing frameworks with a 27% reduction in trajectory L2 error and a 39% decrease in collision rates in long-tail scenarios.

Dr. Boyi Li is a Research Scientist at NVIDIA Research and a Postdoctoral Scholar at UC Berkeley, where she is advised by Prof. Jitendra Malik and Prof. Trevor Darrell. She received her Ph.D. from Cornell University under the guidance of Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Dr. Li's research focuses on learning from multimodal data and developing generalizable algorithms and interactive intelligent systems, with an emphasis on reasoning, large language models, generative models, and robotics. Her work involves aligning representations from multimodal data, including 2D pixels, 3D geometry, language, and audio.

11:45 am – 12:30 pm: Fergal Cotter - Building Digital Cities

Abstract: We explore the space of neural reconstruction methods for resimulation and its importance for self-driving. At Wayve, we challenge the status quo of self-driving and advocate heavily for an end-to-end “AV2.0” approach, where the driving model connects directly to sensor inputs and control outputs. In the same vein, we advocate for doing resimulation in the wild with the barest of requirements. In this talk, we look at how we’ve put this into practice at Wayve in building our Ghost-Gym resimulation engine on top of PRISM-1.

Dr. Fergal Cotter is a Staff Applied Scientist at Wayve, where he has focused primarily on the perception side of end-to-end driving, in particular on how perception and geometry information can be distilled into driving models and on ensuring a clear understanding of what is happening in the training data. Recently, he has explored the resimulation problem and how driving models can be benchmarked without ever taking them on the road. Before joining Wayve, Dr. Cotter completed his PhD at the University of Cambridge, where he worked on using the wavelet and frequency domain as a representation and learning space for image understanding.

12:30 pm – 1:30 pm: Oral Session

  • 12:30 pm – 12:35 pm: VL4AD: Vision-Language Models Improve Pixel-wise Anomaly Detection
  • 12:35 pm – 12:40 pm: Explanation for Trajectory Planning using Multi-modal Large Language Model for Autonomous Driving
  • 12:40 pm – 12:45 pm: ExelMap: Explainable Element-based HD-Map Change Detection and Update
  • 12:45 pm – 12:50 pm: Long-Tailed 3D Detection via 2D Late Fusion
  • 12:50 pm – 12:55 pm: UL-VIO: Ultra-lightweight Visual-Inertial Odometry with Noise Robust Test-time Adaptation
  • 12:55 pm – 1:00 pm: TopoMaskV2: Enhanced Instance-Mask-Based Formulation for the Road Topology Problem
  • 1:00 pm – 1:05 pm: Causal Transformer for Fusion and Pose Estimation in Deep Visual Inertial Odometry
  • 1:05 pm – 1:10 pm: What Matters to Enhance Traffic Rule Compliance of Imitation Learning for End-to-End Autonomous Driving
  • 1:10 pm – 1:15 pm: S3Track: Self-supervised Tracking with Soft Assignment Flow
  • 1:15 pm – 1:20 pm: CraftedVoxels: Improving 3D detector accuracy at the Starting Line
  • 1:20 pm – 1:25 pm: Accurate 3D Automatic Annotation of Traffic Lights and Signs for Autonomous Driving

Organizers

Contact

For any questions, please contact us at yimingli@nyu.edu.