Inspur Information AI team sets the best performance in Object Detection in nuScenes Autonomous Driving Dataset
01 November 2022
The multi-camera-based spatiotemporal fusion model architecture greatly improves the prediction of object speed and displacement orientation
SAN JOSE, Calif.—November 1, 2022—Inspur Information, a leading IT infrastructure solutions provider, participated in the latest evaluation of the globally recognized autonomous driving dataset from nuScenes. The Inspur Information AI team won first place in the vision track of the 3D detection task (nuScenes Detection task), raising the key indicator nuScenes Detection Score (NDS) to 62.4%.
Autonomous driving will be transformative for the transportation industry, and is a major focus for automotive manufacturers and AI companies. Object detection is at the core of autonomous driving technology, with the accuracy and stability of its algorithms constantly being improved by AI research teams. The nuScenes dataset is one of the most respected public datasets in the field of autonomous driving. The data is collected from real autonomous driving scenarios in Boston and Singapore. It is the first dataset that integrates multiple sensors such as cameras, LiDAR and millimeter-wave radar to achieve full sensor coverage surrounding the vehicle. The nuScenes dataset provides rich annotation information such as 2D and 3D object annotation, LiDAR point cloud segmentation, and high-precision maps. It includes 1,000 scenes, 1.4 million camera images, 390,000 LiDAR sweeps, 23 object classes, and 1.4 million object bounding boxes. The amount of data annotation is more than 7 times that of the KITTI dataset.
The Inspur Information AI team participated in the vision track of the detection task. It is the most competitive track, attracting top AI teams around the world, such as Baidu, Carnegie Mellon University, Hong Kong University of Science and Technology, MIT, Tsinghua University and the University of California, Berkeley.
The vision track of the 3D detection task permits the use of only 6 cameras to provide complete 3D object detection coverage around a vehicle, without the use of additional sensor information such as LiDAR or millimeter-wave radar. Object detection includes vehicles, pedestrians, obstacles, traffic signs, traffic lights and other types of objects. In addition to detection, objects must be accurately evaluated for their position, size, orientation, speed and other information. The most challenging aspect is accurately obtaining the true depth and speed of targets using 2D images. If the extracted depth information is inaccurate, 3D perception tasks will become extremely difficult. If the extracted speed information is inaccurate, it can result in dangerous decision-making and planning.
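For readers unfamiliar with the key indicator, the sketch below shows how the nuScenes Detection Score is commonly described as combining mAP with five mean true-positive error terms (translation, scale, orientation, velocity and attribute). The function name and all input values are hypothetical and chosen only for illustration; they are not the Inspur Information team's results.

```python
def nuscenes_detection_score(mean_ap, tp_errors):
    """Combine mAP with the five mean true-positive errors into one score.

    NDS = (5 * mAP + sum(1 - min(1, error))) / 10
    Each error is clipped at 1 so that it contributes between 0 and 1.
    """
    if len(tp_errors) != 5:
        raise ValueError("expected five true-positive error terms")
    tp_scores = [1.0 - min(1.0, err) for err in tp_errors.values()]
    return (5.0 * mean_ap + sum(tp_scores)) / 10.0


# Hypothetical inputs, for illustration only.
example_errors = {
    "translation": 0.5,   # mATE
    "scale": 0.25,        # mASE
    "orientation": 0.5,   # mAOE
    "velocity": 0.25,     # mAVE
    "attribute": 0.25,    # mAAE
}
example_nds = nuscenes_detection_score(0.5, example_errors)
print(f"NDS = {example_nds:.3f}")
```

Because half of the score's weight comes from the error terms, a model has to estimate position, size, orientation and speed well, not just find the objects.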
The Inspur Information AI team developed an innovative multi-camera-based spatial-temporal fusion model architecture (Inspur_DABNet4D). Building on the technical framework of unified conversion of multi-view visual input to the BEV (Bird's Eye View) feature space, Inspur Information used data sample enhancement, a depth-enhanced network, a spatiotemporal fusion network and other techniques to obtain more robust and accurate BEV features, greatly improving the prediction of object speed and displacement orientation.
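To illustrate why depth matters when image input is converted to a BEV representation, here is a minimal sketch that back-projects one pixel into 3D with a pinhole camera model and assigns it to a BEV grid cell. It is not the Inspur_DABNet4D implementation; the function names, camera intrinsics, grid layout and cell size are all assumptions made for this example.

```python
import math


def backproject(u, v, depth, fx, fy, cx, cy):
    """Lift pixel (u, v) with a depth estimate to camera coordinates.

    Pinhole model; x points right, y down, z forward (metres).
    """
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)


def bev_cell(x, z, cell_size=0.5, x_min=-50.0, z_min=0.0):
    """Map a ground-plane position to a (row, col) cell of a BEV grid."""
    col = math.floor((x - x_min) / cell_size)
    row = math.floor((z - z_min) / cell_size)
    return (row, col)


# Hypothetical intrinsics and pixel, for illustration only.
fx, fy, cx, cy = 1000.0, 1000.0, 800.0, 450.0
u, v = 1000.0, 450.0

x_true, _, z_true = backproject(u, v, 20.0, fx, fy, cx, cy)  # assumed true depth
x_off, _, z_off = backproject(u, v, 22.0, fx, fy, cx, cy)    # 10% depth error

cell_true = bev_cell(x_true, z_true)
cell_off = bev_cell(x_off, z_off)
print(cell_true, cell_off)
```

In this toy setup a 10% depth error moves the same pixel four rows away in the grid, which is the kind of misplacement that a stronger depth network aims to reduce.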
The multi-camera-based spatial-temporal fusion model architecture has achieved four core technological breakthroughs.
- First, the richer data sample enhancement algorithm maps the ground-truth with real 3D physical coordinates, and provides expansion in the time series, which significantly improves target detection accuracy. The mAP (mean Average Precision) is increased by over 2% on average.
- Second, a more powerful depth enhancement network improves depth information, which is difficult to learn and model. Depth prediction is greatly improved through an optimized deep network architecture, supervised training with point cloud data, depth completion, and other techniques.
- Third, a more refined spatial-temporal fusion network not only further addresses the misalignment that vehicle motion introduces when fusing spatial and temporal information, but also fuses randomly selected sweep frames with the current frame, so that data samples from different frames are augmented synchronously. This lets the model learn more refined time series features with end-to-end learning.
- Fourth, Inspur Information designed a more complete unified modeling form. It designed a unified modeling architecture of end-to-end feature extraction, fusion and detection head for driving scenes with a wide perspective and large scale. The architecture is simple in structure, efficient in training and universally applicable in different scenarios. The pre-trained model can replace the self-supervised model at any time, and the test and accuracy improvement can be completed quickly and conveniently.
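The misalignment caused by vehicle movement, mentioned in the third point, can be pictured with a small geometric sketch: a position observed in the previous frame is moved into the current vehicle frame using the vehicle's pose at both times. This is a simplified 2D illustration under assumed poses of the form (x, y, yaw), not the team's fusion network; all function names and values are hypothetical.

```python
import math


def to_global(point, pose):
    """Transform a point from a vehicle frame to the global frame."""
    px, py = point
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    return (x + c * px - s * py, y + s * px + c * py)


def to_ego(point, pose):
    """Transform a global point into the vehicle frame given by pose."""
    gx, gy = point
    x, y, yaw = pose
    c, s = math.cos(yaw), math.sin(yaw)
    dx, dy = gx - x, gy - y
    return (c * dx + s * dy, -s * dx + c * dy)


def align_previous_to_current(point_prev, pose_prev, pose_cur):
    """Express a previous-frame observation in the current vehicle frame."""
    return to_ego(to_global(point_prev, pose_prev), pose_cur)


# Hypothetical poses: the vehicle drives 5 m straight ahead between frames.
pose_prev = (0.0, 0.0, 0.0)
pose_cur = (5.0, 0.0, 0.0)

# A static object seen 20 m ahead and 2 m to the left in the previous frame.
aligned = align_previous_to_current((20.0, 2.0), pose_prev, pose_cur)
print(aligned)
```

Once observations from different frames share one coordinate frame, any remaining displacement of an object between frames reflects the object's own motion rather than the vehicle's, which is what makes speed estimation from fused frames possible.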
Thanks to more advanced algorithms and higher computing power, results on the nuScenes 3D object detection task improved greatly in 2022. The Inspur Information AI team has increased the key indicator NDS to 62.4%, a remarkable achievement considering the best performance on the list at the beginning of the year was 47%.