
DriveMLM integrates multi-modal inputs to improve autonomous vehicle planning and explainability.
Researchers at Tsinghua University introduced DriveMLM, a multi-modal large language model (MLLM) framework designed to serve as the behavioral planner for autonomous vehicles. Unlike conventional rule-based control stacks, DriveMLM fuses multi-view camera images, LiDAR point clouds, system telemetry, and natural language instructions, and aligns the model's outputs with the decision states of a modular behavior-planning system so they can be executed as real-time driving control. The system also produces natural language explanations for its decisions, improving transparency and trustworthiness.
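To make the idea of "aligned behavioral planning states" concrete, here is a minimal Python sketch. It assumes a discrete vocabulary of speed and path decisions similar to what a modular planner consumes, plus a free-text explanation; all class names, the decision labels, and the `parse_mllm_output` helper are hypothetical illustrations, not the authors' actual code or API.

```python
from dataclasses import dataclass
from enum import Enum


class SpeedDecision(Enum):
    KEEP = "keep"
    ACCELERATE = "accelerate"
    DECELERATE = "decelerate"
    STOP = "stop"


class PathDecision(Enum):
    FOLLOW = "follow"            # stay in the current lane
    LEFT_CHANGE = "left_change"  # change to the left lane
    RIGHT_CHANGE = "right_change"


@dataclass
class PlanningState:
    """Structured decision a downstream controller could execute."""
    speed: SpeedDecision
    path: PathDecision
    explanation: str  # natural-language justification emitted with the decision


def parse_mllm_output(text: str) -> PlanningState:
    """Map a hypothetical model response of the form
    'speed: decelerate | path: follow | reason: pedestrian ahead'
    into a structured PlanningState. A real system would constrain
    decoding rather than parse free text after the fact."""
    fields = {}
    for part in text.split("|"):
        key, value = part.split(":", 1)
        fields[key.strip()] = value.strip()
    return PlanningState(
        speed=SpeedDecision(fields["speed"]),
        path=PathDecision(fields["path"]),
        explanation=fields["reason"],
    )


if __name__ == "__main__":
    state = parse_mllm_output(
        "speed: decelerate | path: follow | reason: a pedestrian is crossing ahead"
    )
    print(state.speed, state.path, state.explanation)
```

The point of the sketch is the interface, not the parsing: by committing to a small, planner-compatible set of decision states, the language model's reasoning can be translated into actions a conventional control stack already knows how to execute, while the explanation field carries the human-readable rationale.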
Evaluated in closed loop on the CARLA Town05 Long benchmark, DriveMLM significantly outperformed the Apollo baseline, achieving a driving score of 76.1 and the highest miles per intervention (MPI) among the tested systems. Its ability to interpret complex traffic scenarios, reason about motion decisions, and follow natural language commands marks a step toward human-like reasoning in self-driving technology. By integrating perception, planning, and user-guided instructions, DriveMLM represents a promising approach for the next generation of safe and explainable autonomous vehicles.