How the Zoox robotaxi predicts everything, everywhere, all at once


We humans often lament that we cannot predict the future, but perhaps we don’t give ourselves quite enough credit. With sufficient practice, our short-term predictive skills become truly remarkable.

Driving is a good example, particularly in urban environments. Navigating through a city, you become aware of a colossal number of dynamic aspects in your surroundings. The other cars — some moving, some stationary — pedestrians, cyclists, traffic lights changing. As you drive, your mind is generating predictions of how the universe around you is likely to manifest: “that car looks likely to pull out in front of me”; “that pedestrian is about to sleepwalk off the sidewalk – be ready to hit the brake”; “the front wheels of that parked car have just turned, so it’s about to move”.

Jesse Levinson, co-founder and CTO of Zoox, on the development of fully autonomous vehicles for mobility-as-a-service

Your power of prediction and anticipation throws a protective buffer zone around you, your passengers, and everyone in your vicinity as you travel from A to B. It is a broad yet very nuanced power, making it incredibly hard to recreate in real-world robotics applications.

Nevertheless, the teams at Zoox have achieved noteworthy success.

The integration of cutting-edge hardware, sensor technology, and bespoke machine learning (ML) approaches has resulted in an autonomous robotaxi that can predict the trajectories of vehicles, people, and even animals in its surroundings, as far as 8 seconds into the future — more than enough to enable the vehicle to make sensible and safe driving decisions.

“Predicting the future — the intentions and movements of other agents in the scene — is a core component of safe, autonomous driving,” says Kai Wang, director of the Zoox Prediction team.

Perceiving, predicting, planning

The AI stack at the center of the Zoox driving system broadly consists of three processes, which occur in order: perception, prediction, and planning. These equate to seeing the world and how everything around the vehicle is currently moving, predicting how everything will move next, and deciding how to move from A to B given those predictions.

The Perception team gathers high-resolution data from the vehicle’s dozens of sensors, which include visual cameras, LiDAR, radar, and longwave-infrared cameras. These sensors, positioned high on the four corners of the vehicle, provide an overlapping, 360-degree field of view that can extend for over a hundred meters. To borrow a popular phrase, this vehicle can see everything, everywhere, all at once.

Related content

Advanced machine learning systems help autonomous vehicles react to unexpected changes.

The robotaxi already contains a detailed semantic map of its environment, called the Zoox Road Network (ZRN), which means it understands everything about local infrastructure, road rules, speed limits, intersection layouts, locations of traffic signals, and so on.

Perception quickly identifies and classifies the other cars, pedestrians, and cyclists in the scene, which are dubbed “agents.” And crucially, it tracks each agent’s velocity and current trajectory. These data are then combined with the ZRN to provide the Zoox vehicle with an incredibly detailed understanding of its environment.

Before these combined data are passed to Prediction, they are instantly boiled down to their essentials, into a format optimized for machine learning. To this end, what Prediction ultimately operates on is a top-down, spatially accurate graphical depiction of the vehicle and all the relevant dynamic and static aspects of its environment: a machine-readable, birds-eye representation of the scene with the robotaxi at the center.

“We draw everything into a 2D image and present it to a convolutional neural network [CNN], which in turn determines what distances matter, what relationships between agents matter, and so on,” says Wang.

Learning from data-rich images

While a human can get the gist of this map, such as the relative positions of all the vehicles (represented by boxes) and pedestrians (different, smaller boxes) in the scene, it is not designed for human consumption, explains Andres Morales, staff software engineer.

A complex scene is converted into an image with many layers, each representing different semantic information. The result is fed into a convolutional neural network to generate predictions.

“This is not an RGB image. It’s got about 60 channels, or layers, which also include semantic information,” he notes. “For example, because someone holding a smartphone tends to behave differently, we might have one channel that represents a pedestrian holding their phone as a ‘1’ and a pedestrian with no phone as a ‘0’.”

From this data-rich image, the ML system produces a probability distribution of potential trajectories for each and every dynamic agent in the scene, from trucks right down to that pet dog milling around near the crosswalk.

These predictions consider not only the current trajectory of each agent, but also include factors such as how cars are expected to behave on given road layouts, what the traffic lights are doing, the workings of crosswalks, and so on.

An example of a set of predictions for a truck navigating a 3-way intersection. The green boxes represent where the agent could be up to 6 seconds into the future, while the blue box represents where the agent actually went. Each path is a possible future generated by the Prediction system, with an associated likelihood.

These predictions are typically up to about 8 seconds into the future, but they are constantly recalculated every tenth of a second as new information is delivered from Perception.

These weighted predictions are delivered to the Planner aspect of the AI stack — the vehicle’s executive decision-maker — which uses those predictions to help it decide how the Zoox vehicle will operate safely.

From perception through to planning, the whole process is working in real-time; this robotaxi has lightning-quick reactions, should it need them.

Related content

Predicting the future trajectory of a moving agent can be easy when the past trajectory continues smoothly but is challenging when complex interactions with other agents are involved. Recent deep learning approaches for trajectory prediction show promising performance and partially attribute this to successful reasoning about agent-agent interactions. However, it remains unclear which features such black-box

The team can be confident of its predictions because it has a vast pool of data with which to train its ML algorithms — millions of road miles of high-resolution sensor data collected by the Zoox test fleet: Toyota Highlanders retrofitted with an almost identical sensor architecture as the robotaxi mapping and driving autonomously in San Francisco, Seattle, and Las Vegas.

An example of a Zoox vehicle negotiating a busy intersection in Las Vegas at night. The green boxes show the most likely prediction for each agent in the scene as far as 8 seconds into the future.

Zoox has a further advantage.

“We don’t need to label any data by hand, because our data show where things actually moved into the future,” says Wang. “My team doesn’t have a data problem. Our main challenge is that the future is inherently uncertain. Even humans cannot do this task perfectly.”

Utilizing graph neural networks

While perfect prediction is, by its nature, impossible, Wang’s team is currently taking steps on several fronts to raise the vehicle’s prediction capabilities to the next level, firstly by leveraging a graph neural network (GNN) approach.

“Think of the GNN as a message-passing system by which all the agents and static elements in the scene are interconnected,” says Mahsa Ghafarianzadeh, senior software engineer on the Prediction team.

“What this enables is the explicit encoding of the relationships between all the agents in the scene, as well as the Zoox vehicle, and how these relationships might develop into the future.”

A Zoox test vehicle navigating Las Vegas autonomously.

To give an everyday example, imagine yourself walking down the middle of a long corridor and seeing a stranger walking toward you, also in the middle of the corridor. That act of seeing each other is effectively the passing of a tacit message that would likely cause you both to alter your course slightly, so that by the time you reach each other, you won’t collide or require a sharp course-correction. That’s human nature.

This shows the output of Zoox models on the same initial scene but conditioned on different future actions the vehicle (green) is considering. Zoox is able to predict different yielding behavior of other cars based on when their vehicle enters the intersection. The center animation even shows they predict a collision if we were to take that particular action.

So this GNN approach results in the prediction of more natural behaviors between everyone around the Zoox vehicle, because the algorithm, through training on Zoox’s vast pool of real-world road data, is better able to model how agents, on foot or in cars, affect each other’s behavior in the real world.

Related content

Information extraction, drug discovery, and software analysis are just a few applications of this versatile tool.

Another way the Prediction team is improving accuracy is by embracing the fact that what you do as a driver affects other drivers, which in turn affects you. For example, if you get into your parked car and pull out just a little into busy traffic, a driver coming up the road behind you may slow down or stop to let you out, or they may drive straight past, obliging you to wait for a better opportunity.

“Prediction doesn’t happen in a vacuum. Other people’s behaviors are dependent on how their world is changing. If you’re not capturing that within prediction, you’re limiting yourself,” says Wang.

Next steps

Work is now underway to integrate Prediction even more deeply with Planner, creating a feedback loop. Instead of simply receiving predictions and making a decision on how to proceed, the Planner can now interact with Prediction along these lines: “If I perform action X, or Y, or Z, how are the agents in my vicinity likely to adjust their own behavior in each case?”

I’ve seen Prediction grow from being just three source code files implementing basic heuristics to predict trajectories to where it is now, at the cutting edge of deep learning. It’s incredible how fast everything is evolving.

In this way, the Zoox robotaxi will become even more naturalistic and adept at negotiations with other vehicles, while also creating a smoother-flowing ride for its customers.

“The team and I started to work on this new mode a couple years ago, just as a research project,” says Morales, “and now we’re focused on its integration, ironing everything out, reducing latency, and generally making it production-ready.”

The ever-increasing sophistication of the Zoox robotaxi’s predictive abilities is a clear source of pride for the team dedicated to it.

“I’ve been in this team for over five years. I’ve seen Prediction grow from being just three source code files implementing basic heuristics to predict trajectories to where it is now, at the cutting edge of deep learning. It’s incredible how fast everything is evolving,” says Ghafarianzadeh.

Indeed, at this rate, the Zoox robotaxi may ultimately become the most prescient vehicle on the road. Though that prediction comes with the usual caveat: Nobody can perfectly predict the future.

Source link


Please enter your comment!
Please enter your name here

Share post:


More like this

Nvidia: Chip giant posts record sales as boss sees AI ‘tipping point’

The chip giant's shares have soared by more...

How AI is helping the search for extraterrestrial life

Artificial intelligence software is being used to look...

Gemma, Google’s new open-source AI model, could make your next chatbot safer and more responsible

Google has unveiled Gemma, an open-source AI model...