When it comes to your daily tasks, you probably don’t think twice about the steps involved: washing the dishes, buying groceries, and other minor activities. You likely won’t itemize each tiny step, like “pick up the first dirty dish” or “wash that plate with a sponge,” because completing these actions feels second nature. For a robot, however, executing these everyday tasks requires a sophisticated plan and a far more detailed set of instructions.
Enter the Improbable AI Lab from MIT, a research group within the renowned Computer Science and Artificial Intelligence Laboratory (CSAIL). They are tackling this challenge with a groundbreaking multimodal framework known as Compositional Foundation Models for Hierarchical Planning (HiP), which empowers machines to draft detailed, actionable plans by leveraging three distinct foundation models. Similar to OpenAI’s GPT-4, these foundation models are trained on vast datasets to perform various tasks, including generating images, translating text, and facilitating robotics.
Unlike existing multimodal models like RT-2, which rely on paired vision, language, and action data, HiP incorporates three separate foundation models, each trained on a different type of data. This approach allows each model to capture a unique aspect of decision-making while collaborating effectively during the planning phase. One of HiP’s major advantages is that it eliminates the necessity for hard-to-obtain paired data, simultaneously enhancing the transparency of the reasoning process.
A routine chore for humans transforms into a “long-horizon goal” for a robot—an overarching objective that necessitates executing numerous smaller steps. To achieve this, a substantial amount of data is required to plan, comprehend, and successfully execute tasks. While researchers in computer vision have previously attempted to create all-encompassing foundation models, pairing language, visual, and action data is costly and complex. HiP, however, offers an effective multimodal solution by integrating linguistic, physical, and environmental intelligence into a cohesive unit at a lower cost.
“Foundation models do not have to be monolithic,” explains NVIDIA AI researcher Jim Fan, who was not involved in this study. “This work disaggregates the complex task of embodied agent planning into three distinct models: a language reasoner, a visual world model, and an action planner. This segmentation makes the intricate decision-making process more manageable and transparent.”
The research team believes that HiP could assist robots in various household chores, such as organizing books or loading bowls into dishwashers. Moreover, this framework holds promise for streamlining multi-step operations in construction and manufacturing, such as stacking and positioning materials in precise sequences.
Evaluating HiP
In tests on three manipulation tasks, HiP outperformed comparable frameworks. Its ability to create intelligent plans that adapt to evolving information was particularly noteworthy.
In one task, researchers challenged HiP to stack colored blocks while concurrently placing others in proximity. The twist was that some required colors were missing, prompting the robot to paint the necessary blocks white. HiP adeptly adjusted its plans in response to this real-time challenge, showing greater flexibility than state-of-the-art task planning systems like Transformer BC and Action Diffuser.
In another scenario, the robot was tasked with organizing items like candy and a hammer inside a brown box while disregarding irrelevant objects. Encountering some dirty items, HiP recalibrated its plans to first clean them and then successfully place them in the box. Finally, in a kitchen-related task, the bot ignored unnecessary items, efficiently achieving sub-goals like opening a microwave, moving a kettle, and activating a light. It displayed remarkable adaptability by skipping previously completed steps.
A three-pronged hierarchy
HiP operates through a hierarchical three-tier planning process, enabling the pre-training of each component on diverse datasets, even extending beyond robotics. At the foundational level is a large language model (LLM) that conceptualizes the task at hand by gathering symbolic information and devising an abstract task outline. By leveraging general knowledge sourced from the internet, the model breaks down the main objective into manageable sub-goals. For example, “making a cup of tea” is dissected into tasks like “filling a pot with water” and “boiling the pot.”
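To make this first level concrete, here is a minimal sketch, not the authors’ code, of how a pre-trained language model could be asked to decompose a long-horizon goal into symbolic sub-goals. The `decompose_goal` function, the prompt wording, and the `complete` text-completion callable are all illustrative assumptions rather than HiP’s actual interface.

```python
from typing import Callable, List

def decompose_goal(goal: str, complete: Callable[[str], str]) -> List[str]:
    """Ask a language model to break a long-horizon goal into sub-goals.

    `complete` is any prompt-in, text-out interface (hypothetical here).
    """
    prompt = (
        f"Task: {goal}\n"
        "List the sub-goals needed to complete this task, one per line:"
    )
    response = complete(prompt)
    # Treat each non-empty line of the completion as one symbolic sub-goal.
    return [line.strip("-* ").strip() for line in response.splitlines() if line.strip()]

# Example: decompose_goal("make a cup of tea", complete) might return
# ["fill a pot with water", "boil the pot", ...], depending on the model.
```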
“Our goal is to effectively harness existing pre-trained models and ensure they communicate seamlessly,” states Anurag Ajay, a PhD student within MIT’s Department of Electrical Engineering and Computer Science (EECS) and a CSAIL affiliate. “Rather than pushing for a single model to handle all tasks, we merge several that capitalize on different internet data modalities. When operated together, they enhance robotic decision-making, potentially benefiting households, factories, and construction sites.”
To truly comprehend their environment and proficiently execute each sub-goal, these models require a visual component. The CSAIL team employed a large video diffusion model to supplement the preliminary planning done by the LLM, assimilating geometric and physical data from online videos. This model generates an observation trajectory plan, further refining the LLM’s outline to include newly acquired physical insights.
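The sketch below illustrates this second level under stated assumptions: `VideoDiffusionModel` is a stand-in interface, not a real library class, and the frame horizon is an arbitrary choice. Given the robot’s current camera image and one sub-goal, the model samples a short sequence of imagined future frames, the “observation trajectory plan” described above.

```python
from dataclasses import dataclass
from typing import List, Protocol
import numpy as np

class VideoDiffusionModel(Protocol):
    """Placeholder interface for a video diffusion model (assumed, not HiP's API)."""
    def sample(self, first_frame: np.ndarray, sub_goal: str, horizon: int) -> List[np.ndarray]:
        ...

@dataclass
class VisualPlanner:
    video_model: VideoDiffusionModel
    horizon: int = 16  # assumed number of predicted frames per sub-goal

    def plan(self, observation: np.ndarray, sub_goal: str) -> List[np.ndarray]:
        # The diffusion model imagines how the scene should evolve while the
        # robot pursues this sub-goal, grounding the LLM's outline in the
        # geometry and physics it absorbed from internet video.
        return self.video_model.sample(observation, sub_goal, self.horizon)
```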
This iterative refinement process fosters HiP’s capability to evaluate its strategies over time, incorporating feedback at each stage much like how a writer collaborates with an editor. It may involve multiple revisions until a polished draft is produced.
The pinnacle of the HiP hierarchy is an egocentric action model, which uses first-person imagery to infer necessary actions based on environmental cues. At this stage, the observation plan from the video model is mapped onto the robot’s visible area, guiding the machine in executing each task associated with the overarching goal. For instance, if HiP is employed to brew tea, it will have carefully mapped the location of crucial items like the pot and sink, allowing for efficient action completion.
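Putting the three levels together, the following sketch shows one plausible control loop, building on the placeholder `decompose_goal` and `VisualPlanner` from the earlier snippets. The `ActionModel` protocol, the `robot` interface, and the loop structure are illustrative assumptions, not the paper’s implementation.

```python
from typing import Protocol
import numpy as np

class ActionModel(Protocol):
    """Placeholder for an egocentric action model (assumed interface)."""
    def infer_action(self, current_frame: np.ndarray, target_frame: np.ndarray) -> np.ndarray:
        ...

def run_hierarchical_plan(goal, complete, visual_planner, action_model, robot):
    """One pass through the three HiP levels: decompose, imagine, act."""
    for sub_goal in decompose_goal(goal, complete):           # level 1: LLM outline
        observation = robot.get_camera_image()                # hypothetical robot API
        frames = visual_planner.plan(observation, sub_goal)   # level 2: imagined frames
        for target in frames:                                 # level 3: egocentric actions
            action = action_model.infer_action(robot.get_camera_image(), target)
            robot.execute(action)                             # hypothetical robot API
```

Separating the loop this way mirrors the article’s point: each level can be swapped for a stronger pre-trained model without retraining the others.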
Despite its impressive capabilities, HiP’s multimodal approach is currently constrained by the lack of high-quality video foundation models. Once these become available, they could improve on HiP’s smaller-scale video model, sharpening its visual sequence predictions and the robot’s subsequent actions. A higher-quality alternative would also reduce the video model’s current data requirements.
Nevertheless, the CSAIL team’s method required only a minimal amount of data overall. Furthermore, HiP demonstrated an economical training approach, showcasing the potential of readily accessible foundation models to tackle long-horizon tasks effectively. “Anurag’s findings provide compelling evidence that we can merge models trained on varied tasks and data types for robotic planning,” remarks Pulkit Agrawal, MIT assistant professor of EECS and director of the Improbable AI Lab. The team is now contemplating applying HiP to tackle real-world long-horizon challenges in robotics.
Ajay and Agrawal are the principal authors of a paper detailing this work. They collaborated with MIT professors and CSAIL principal investigators Tommi Jaakkola, Joshua Tenenbaum, and Leslie Pack Kaelbling; CSAIL research affiliate and MIT-IBM AI Lab research manager Akash Srivastava; graduate students Seungwook Han and Yilun Du ’19; former postdoc Abhishek Gupta, now an assistant professor at the University of Washington; and former graduate student Shuang Li PhD ’23.
This research received support from the National Science Foundation, the U.S. Defense Advanced Research Projects Agency, the U.S. Army Research Office, the U.S. Office of Naval Research Multidisciplinary University Research Initiatives, and the MIT-IBM Watson AI Lab. Their findings were presented at the 2023 Conference on Neural Information Processing Systems (NeurIPS).