With the commercialization of autonomous driving and assisted driving systems, the demand for high-performance, efficient, and scalable machine learning solutions is more urgent than ever. Visual perception is a key research area in self-driving that consistently attracts attention because (1) visual data provides much richer information than other sensor modalities; (2) cameras are affordable and pervasive on vehicles as well as other robotic systems; and (3) visual foundation models are an emerging direction in machine learning. This workshop embraces topics around vision-centric and data-driven autonomous driving.
Following the successful 1st VCAD Workshop at CVPR 2023, the 2nd VCAD Workshop will take place at ECCV 2024 on September 30, 2024, from 8:55 am to 1:30 pm at MiCo Milano, Suite 9.
Please submit your draft to vision-centric-autonomous-driving@outlook.com. Submissions must follow the ECCV 2024 template and will be peer-reviewed in a single-blind manner. Submissions must be no more than 14 pages (excluding references). Accepted papers will be presented as posters, with several selected for oral presentations.
8:55 am – 9:00 am: Workshop Introduction
9:00 am – 9:45 am: Jose M. Alvarez - Towards Robust and Reliable AV with Foundational Models
Abstract: In this talk, I will present our latest progress on vision-centric and data-driven methods for autonomous vehicles. I will introduce Hydra, our end-to-end architecture using a multi-target distillation paradigm, and then explain how we leverage MLLMs across the autonomous driving ecosystem, from the car, with OmniDrive, to the cloud, with SSE, an MLLM-based data selection framework. Finally, I will cover recent methods that leverage data generation to improve the robustness of our algorithms.
Dr. Jose M. Alvarez is a research director at NVIDIA, leading the Autonomous Vehicle Applied Research team. His team maximizes the impact of the latest research advances on the AV product. Jose's research interests include model-centric and data-centric deep learning toward more efficient and scalable systems. Jose completed his Ph.D. in computer science in Barcelona, specializing in road-scene understanding for autonomous driving when datasets were very limited. He also worked as a postdoctoral researcher at NYU under Yann LeCun.
9:45 am – 10:30 am: Sergio Casas - Perceiving and Forecasting Anything with Self-Supervision
Abstract: Perceiving and forecasting the world are critical tasks for self-driving. Supervised approaches leverage annotated object labels to learn a model of the world -- traditionally with object detections and trajectory predictions, or semantic bird's-eye-view occupancy fields. However, these annotations are expensive and typically limited to a set of predefined categories that do not cover everything we might encounter on the road. Instead, we explore methods to learn to perceive and forecast with self-supervision from LiDAR data. We show that such self-supervised world models can be easily and effectively transferred to downstream tasks.
Dr. Sergio Casas is a Senior Staff Tech Lead Manager at Waabi, where he leads the Perception and Behavior Reasoning team. He completed his Ph.D. at the University of Toronto, under the supervision of Prof. Raquel Urtasun. His research lies at the intersection of computer vision, machine learning, and robotics.
10:30 am – 11:00 am: Poster Sessions
11:00 am – 11:45 am: Boyi Li - Tokenize the World into Object-level Knowledge to Address Long-tail Events in Autonomous Driving
Abstract: The autonomous driving industry is increasingly adopting end-to-end learning from sensory inputs to minimize human biases in system design. Traditional end-to-end driving models, however, suffer from long-tail events due to rare or unseen inputs within their training distributions. To address this, we propose TOKEN, a novel Multi-Modal Large Language Model (MM-LLM) that tokenizes the world into object-level knowledge, enabling better utilization of LLMs' reasoning capabilities to enhance autonomous vehicle planning in long-tail scenarios. TOKEN effectively alleviates data scarcity and inefficient tokenization by leveraging a traditional end-to-end driving model to produce condensed and semantically enriched representations of the scene, which are optimized for LLM planning compatibility through deliberate representation and reasoning alignment training stages. Our results demonstrate that TOKEN excels in grounding, reasoning, and planning capabilities, outperforming existing frameworks with a 27% reduction in trajectory L2 error and a 39% decrease in collision rates in long-tail scenarios.
Dr. Boyi Li is a Research Scientist at NVIDIA Research and a Postdoctoral Scholar at UC Berkeley, where she is advised by Prof. Jitendra Malik and Prof. Trevor Darrell. She received her Ph.D. from Cornell University, under the guidance of Prof. Serge Belongie and Prof. Kilian Q. Weinberger. Dr. Li’s research focuses on learning from multimodal data and developing generalizable algorithms and interactive intelligent systems, with a focus on areas such as reasoning, large language models, generative models, and robotics. Her work involves aligning representations from multimodal data, including 2D pixels, 3D geometry, language, and audio.
11:45 am – 12:30 pm: Fergal Cotter - Building Digital Cities
Abstract: We explore the space of neural reconstruction methods for resimulation and its importance for self-driving. At Wayve, we challenge the status quo of self-driving and advocate heavily for an end-to-end “AV2.0” approach, where the driving model connects directly to sensor inputs and control outputs. In the same vein, we advocate for doing resimulation in the wild with the barest of requirements. In this talk, we look at how we’ve put this into practice at Wayve in building our Ghost-Gym resimulation engine on top of PRISM-1.
Dr. Fergal Cotter is a Staff Applied Scientist at Wayve, where he has mostly looked at the Perception side of End-to-End driving. In particular, he has focused on how perception and geometry information can be distilled into driving models, as well as ensuring a good grasp of what is happening in the training data. Recently, he has explored the resimulation problem and how driving models can be benchmarked without ever taking them on the road. Before joining Wayve, Dr. Cotter completed his PhD at the University of Cambridge, where he worked on using the wavelet and frequency domain as a good representation and learning space for image understanding.
12:30 pm – 1:30 pm: Oral Sessions
For any questions, please contact us at yimingli@nyu.edu.