Alexa Arena is a new embodied-AI framework developed to push the boundaries of human-robot interaction. It offers an interactive, user-centric platform for creating robotic tasks that involve navigating multiroom simulated environments and manipulating a wide range of objects in real time. In a gamelike setting, users can interact with virtual robots through natural-language dialogue, helping the robots complete their tasks. The framework currently includes a large set of multiroom layouts for a home, a warehouse, and a lab.
Arena enables the training and evaluation of embodied-AI models, along with the generation of new training data based on the human-robot interactions. It can thus contribute to the development of generalizable embodied agents with a wide variety of AI capabilities, such as task planning, visual dialogue, multimodal reasoning, task completion, teachable AI, and conversational understanding.
We have publicly released (a) the code repository for Arena, which includes the simulation engine artifacts and a machine learning (ML) toolbox for model training and visual inferencing; (b) comprehensive datasets for training embodied agents; and (c) benchmark ML models that incorporate vision and language planning for task completion. We have also launched a new leaderboard for Arena, to evaluate the performance of embodied agents on unseen tasks.
The simulation engine of Alexa Arena is built using the Unity game engine and includes 330+ assets spanning both commonplace objects in homes (such as refrigerators and chairs) and uncommon objects (such as forklifts and floppy disks). Arena also features more than 200,000 multiroom scenes, each with a unique combination of room specifications and furniture arrangement.
In addition, each scene can randomize the robot’s initial location, the placement of movable objects (such as computers and books), floor materials, wall colors, etc., to provide the rich set of visual variations needed to train embodied agents through both supervised and reinforcement learning methods.
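As a rough illustration of this kind of per-scene randomization, the sketch below samples one visual variation from a seed. It is not Arena's implementation: the option pools, the `SceneVariation` fields, and the `sample_variation` function are all hypothetical names chosen for this example.

```python
import random
from dataclasses import dataclass

# Hypothetical option pools; the real Arena material and asset names are not given in the text.
FLOOR_MATERIALS = ["wood", "tile", "carpet"]
WALL_COLORS = ["white", "beige", "light_blue"]
MOVABLE_OBJECTS = ["computer", "book"]
SURFACES = ["desk", "shelf", "counter", "table"]


@dataclass
class SceneVariation:
    robot_start: tuple       # (x, y) in arbitrary scene units
    floor_material: str
    wall_color: str
    object_placements: dict  # movable object -> surface it is placed on


def sample_variation(seed: int, room_size: float = 10.0) -> SceneVariation:
    """Sample one randomized variation; the same seed reproduces the same variation."""
    rng = random.Random(seed)
    return SceneVariation(
        robot_start=(rng.uniform(0, room_size), rng.uniform(0, room_size)),
        floor_material=rng.choice(FLOOR_MATERIALS),
        wall_color=rng.choice(WALL_COLORS),
        object_placements={obj: rng.choice(SURFACES) for obj in MOVABLE_OBJECTS},
    )


if __name__ == "__main__":
    for seed in range(3):
        print(sample_variation(seed))
```

Seeding the generator is a design choice made here so that a given training or evaluation episode can be regenerated exactly.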
To make games more engaging, Arena includes live background animations and sounds, user-friendly graphics, smooth robot navigation with live visuals, support for multiple viewpoints that can be switched between first-person and third-person cameras, hazards and preconditions that can be incorporated into task completion criteria, a mini-map showing the location of the robot within a scene, and a configurable hint-generation mechanism. After the execution of every action in the environment, Arena generates a rich set of metadata, such as images from RGB and depth cameras, segmentation maps, robot location, and error codes.
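One way an agent might consume this per-action metadata is sketched below. The `ActionResult` container, its field names, and the convention that a missing error code means success are assumptions made for illustration. They do not describe Arena's actual output schema.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Placeholder for a 2-D pixel grid; real frames would typically be image arrays.
Image = List[List[int]]


@dataclass
class ActionResult:
    """Hypothetical container for the metadata produced after one action."""
    rgb: Image
    depth: Image
    segmentation: Image
    robot_location: Tuple[float, float, float]
    error_code: Optional[str] = None  # assumed: None means the action succeeded


def first_failure(results: List[ActionResult]) -> Optional[int]:
    """Return the index of the first action that reported an error, or None."""
    for index, result in enumerate(results):
        if result.error_code is not None:
            return index
    return None
```

A helper like `first_failure` lets a planner stop and replan at the step where execution went wrong instead of continuing with a stale plan.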
Long-horizon robotic tasks (such as “make a hot cup of tea”) can be authored in Arena, using a new challenge definition format (CDF) to specify the initial states of objects (such as “cabinet doors are closed”), goal conditions to be satisfied (such as “cup is filled with milk or water”), and textual hints planted at specific locations in the scene (such as “check the fridge for milk”).
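The three ingredients of a challenge definition (initial states, goal conditions, and hints) can be pictured with a small sketch. The dictionary keys, the hint location, and the `goals_satisfied` checker below are hypothetical and are not the real CDF schema. Only the quoted task, state, goal, and hint come from the description above.

```python
# Hypothetical challenge definition; keys and layout are illustrative, not the real CDF.
challenge = {
    "task": "make a hot cup of tea",
    "initial_states": {"cabinet_door": {"open": False}},  # "cabinet doors are closed"
    "goal_conditions": [
        # "cup is filled with milk or water": any listed value satisfies the goal.
        {"object": "cup", "property": "filled_with", "any_of": ["milk", "water"]},
    ],
    "hints": [
        # The location name is an assumption; the text says only that hints are
        # planted at specific locations in the scene.
        {"location": "kitchen_counter", "text": "check the fridge for milk"},
    ],
}


def goals_satisfied(challenge: dict, world_state: dict) -> bool:
    """Return True when every goal condition holds in the given world state."""
    for goal in challenge["goal_conditions"]:
        value = world_state.get(goal["object"], {}).get(goal["property"])
        if value not in goal["any_of"]:
            return False
    return True


if __name__ == "__main__":
    print(goals_satisfied(challenge, {"cup": {"filled_with": "water"}}))
```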
The Arena framework powers the Alexa Prize Simbot challenge, in which 10 university teams are competing to develop embodied-AI agents that complete tasks with guidance from Alexa customers. Customers with Echo Show or Fire TV devices interact with the agents through voice commands, helping the robots achieve goals displayed on-screen. The challenge finals will take place in early May 2023.
The code repository for Arena includes two datasets: (a) an instruction-following dataset containing 46,000 human-annotated dialogue instructions, along with ground-truth action trajectories and robot-view images, and (b) a vision dataset containing 660,000 images from Arena scenes spanning 160+ semantic-object groups, collected by navigating the robot to various virtual locations and capturing images of the objects there from different perspectives and distances.
The data collection methodology that we used to create the instruction-following dataset is similar to the two-step procedure that we adopted in our earlier work on DialFRED, where we used demonstrative videos (generated by a symbolic planner) to create crowd-sourced natural-language instructions in the form of multiturn Q&A dialogues.
Using the datasets mentioned above, we trained two embodied-agent models as benchmarks for Arena tasks. One is a neuro-symbolic model that uses the contextual history of past actions and a dedicated vision model.
The other is an embodied vision-language (EVL) model that incorporates a joint vision-language encoder and a multihead model for task planning and mask prediction.
To evaluate our benchmarks, we used a metric called mission success rate (MSR), which is the ratio of successfully completed tasks to the total number of tasks in the evaluation set.
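The metric is simple enough to state in a few lines. The sketch below assumes each evaluated task is recorded as a single success flag. The function name and the sample outcomes are made up for illustration.

```python
from typing import Sequence


def mission_success_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of tasks completed successfully (True counts as a success)."""
    if not outcomes:
        raise ValueError("evaluation set is empty")
    return sum(outcomes) / len(outcomes)


# Made-up outcomes for five tasks: two successes out of five.
outcomes = [True, False, False, True, False]
print(f"MSR = {mission_success_rate(outcomes):.2%}")  # MSR = 40.00%
```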
In our experiments, the EVL model achieves an MSR of 34.20%, which is 14.9 percentage points better than the MSR of the neuro-symbolic model. The results also indicate that adding clarification Q&A dialogue boosts the performance of the EVL model by 11.6%, since it enables better object instance segmentation and visual grounding.
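The two figures for the EVL model imply the neuro-symbolic model's MSR, which the short calculation below derives. It also shows why a gap in percentage points differs from a relative improvement. Only the 34.20% and 14.9-point values come from the text. The variable names are illustrative, and the 11.6% dialogue gain is left out because the text does not say whether it is absolute or relative.

```python
evl_msr = 34.20    # percent, stated in the text
gap_points = 14.9  # percentage points, stated in the text

# Implied MSR of the neuro-symbolic model.
neuro_symbolic_msr = round(evl_msr - gap_points, 2)
print(neuro_symbolic_msr)  # 19.3

# The same gap expressed as a relative improvement over the neuro-symbolic model.
relative_gain = round(gap_points / neuro_symbolic_msr * 100, 1)
print(relative_gain)  # 77.2
```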
Alexa Arena is another example of Amazon’s industry-leading research in artificial intelligence and robotics. In the coming years, the Arena framework will be a critical tool for the development and training of new devices and robots that bring about a whole new era of generalizable AI and human-robot interaction.