Imagine a world similar to “The Jetsons,” where a robotic assistant can effortlessly pivot from vacuuming the living room to whipping up dinner. While this idea is entertaining, the reality of training a general-purpose robot has proven to be exceedingly complex.
Currently, engineers gather data tailored to specific robots and tasks, training them in controlled environments. Unfortunately, this method is both time-consuming and costly, often leaving robots ill-equipped to handle unexpected challenges or environments.
Researchers at MIT have introduced an innovative solution aimed at overcoming these obstacles. They’ve developed a versatile technique that combines vast quantities of heterogeneous data from multiple sources into a single system, which can then teach a robot a wide range of tasks.
Their approach focuses on aligning data from various domains, such as simulations and real robots, as well as multiple modalities like visual sensors and robotic arm position encoders, creating a unified “language” that generative AI models can comprehend.
By leveraging this extensive dataset, training a robot to perform numerous tasks becomes more feasible without starting from the ground up each time. This method promises to be both quicker and more cost-effective than traditional techniques, requiring significantly less task-specific data. Impressively, it surpassed traditional training methods by over 20% in both simulated and real-world scenarios.
“In robotics, there’s a frequent claim that we lack sufficient training data. However, I’d argue that another major issue is the diversity of data across various domains, modalities, and robot hardware. Our findings illustrate how we can enable robots to learn from combined datasets,” explains Lirui Wang, an electrical engineering and computer science (EECS) graduate student and the lead author of a recent study on this technique.
Wang collaborated with Jialiang Zhao, a fellow EECS graduate student; Xinlei Chen, a research scientist at Meta; and Kaiming He, an associate professor in EECS and a member of the Computer Science and Artificial Intelligence Laboratory (CSAIL). Their research will be presented at the upcoming Conference on Neural Information Processing Systems.
Driven by Large Language Models
A robotic “policy” interprets sensor data, such as camera images or proprioceptive measurements that track a robotic arm’s position and speed, and tells the robot how to move.
These policies usually rely on imitation learning, which involves humans demonstrating tasks or manually operating robots to create training data for the AI model. Due to this reliance on a limited pool of task-specific data, robots frequently struggle when faced with new environments or assignments.
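To make the idea of a policy trained by imitation concrete, here is a minimal behavior-cloning sketch in Python. The observation and action sizes, the network shape, and the mean-squared-error objective are illustrative assumptions, not details taken from the HPT paper.

```python
import torch
import torch.nn as nn

# A small policy network maps observations (e.g., flattened image features plus
# joint readings) to motor commands, trained to imitate demonstrated actions.
obs_dim, action_dim = 64, 7  # assumed sizes for illustration
policy = nn.Sequential(nn.Linear(obs_dim, 128), nn.ReLU(), nn.Linear(128, action_dim))
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# One gradient step on a batch of (observation, demonstrated action) pairs.
obs = torch.randn(32, obs_dim)
demo_actions = torch.randn(32, action_dim)
loss = nn.functional.mse_loss(policy(obs), demo_actions)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```

Because such a policy only ever sees the narrow set of demonstrations it was trained on, it tends to break down as soon as conditions change, which is the limitation the MIT team set out to address.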
Wang and his research team sought to improve this process by drawing parallels to large language models like GPT-4, which are pre-trained on extensive, diverse datasets before being fine-tuned with smaller amounts of task-specific data. This pre-training enables models to transfer their knowledge and adapt to various tasks effectively.
“In language processing, the data consists of sentences. However, in robotics, with the extensive variety of data types, we required an innovative architecture for pre-training,” he notes.
Robotic data takes numerous forms, including visual imagery, textual instructions, and depth maps. Moreover, each robot has a unique mechanical configuration: the number and orientation of its arms, grippers, and sensors differ greatly from one machine to the next. The environments where data is gathered also vary significantly.
To tackle this complexity, the MIT team created a new architecture termed Heterogeneous Pretrained Transformers (HPT), designed to harmonize data from diverse modalities and domains.
At the core of their architecture is a transformer model, which processes both visual and proprioceptive inputs. This same type of model powers many leading large language models.
The team aligns visual and proprioceptive data into a consistent format, called tokens, allowing the transformer to process them effectively. Each data input is standardized to maintain a fixed number of tokens throughout the architecture.
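The sketch below illustrates this idea: small modality-specific "stems" map a camera image and a robot state vector into the same fixed number of equal-width tokens, which are then concatenated and fed to one shared transformer trunk. The token count, embedding width, and the simple MLP/convolutional stems are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

NUM_TOKENS = 16   # fixed token count per modality (assumed)
TOKEN_DIM = 256   # shared embedding width (assumed)

class ProprioStem(nn.Module):
    """Projects a robot-specific state vector (joint angles, velocities, ...) into tokens."""
    def __init__(self, state_dim: int):
        super().__init__()
        self.proj = nn.Linear(state_dim, NUM_TOKENS * TOKEN_DIM)

    def forward(self, state: torch.Tensor) -> torch.Tensor:
        return self.proj(state).view(state.shape[0], NUM_TOKENS, TOKEN_DIM)

class VisionStem(nn.Module):
    """Turns a camera image into the same fixed number of tokens."""
    def __init__(self):
        super().__init__()
        self.patchify = nn.Conv2d(3, TOKEN_DIM, kernel_size=16, stride=16)
        self.pool = nn.AdaptiveAvgPool2d((4, 4))  # 4 x 4 = 16 tokens

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        feats = self.pool(self.patchify(image))      # (B, TOKEN_DIM, 4, 4)
        return feats.flatten(2).transpose(1, 2)      # (B, 16, TOKEN_DIM)

# Tokens from both modalities are concatenated and processed by one shared trunk.
trunk = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=TOKEN_DIM, nhead=8, batch_first=True),
    num_layers=4,
)
state_tokens = ProprioStem(state_dim=14)(torch.randn(2, 14))
image_tokens = VisionStem()(torch.randn(2, 3, 224, 224))
shared = trunk(torch.cat([state_tokens, image_tokens], dim=1))  # (2, 32, 256)
```

Because every modality arrives as the same kind of token sequence, data collected from very different robots and sensors can flow through the same trunk during pretraining.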
As the transformer ingests more data, it grows into a large, pretrained model whose performance improves with scale. Users can then provide HPT with a small amount of data about their own robot’s configuration and the task at hand, and the system transfers the knowledge gained during pretraining to learn the new task.
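Continuing the earlier sketch (and reusing its ProprioStem, VisionStem, and TOKEN_DIM definitions), the snippet below shows one plausible way such adaptation could work: keep the pretrained trunk fixed and train only small, robot-specific input stems plus an action head on the user's modest dataset. The freezing strategy and head design are assumptions for illustration, not the paper's prescribed recipe.

```python
import torch
import torch.nn as nn

def build_policy_for_new_robot(pretrained_trunk: nn.Module, state_dim: int, action_dim: int):
    # Keep the shared trunk fixed (assumed strategy); only the new parts are trained.
    for p in pretrained_trunk.parameters():
        p.requires_grad = False

    proprio_stem = ProprioStem(state_dim)            # new robot-specific stem
    vision_stem = VisionStem()
    action_head = nn.Linear(TOKEN_DIM, action_dim)   # pooled trunk features -> motor commands

    def policy(image: torch.Tensor, state: torch.Tensor) -> torch.Tensor:
        tokens = torch.cat([proprio_stem(state), vision_stem(image)], dim=1)
        features = pretrained_trunk(tokens).mean(dim=1)  # pool over tokens
        return action_head(features)

    trainable = (list(proprio_stem.parameters())
                 + list(vision_stem.parameters())
                 + list(action_head.parameters()))
    return policy, trainable
```

The appeal of this division of labor is that the expensive, data-hungry trunk is trained once on the pooled heterogeneous corpus, while each new robot only needs enough data to fit its lightweight stems and head.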
Unlocking Dexterity in Robotics
A significant challenge in building HPT was assembling an extensive pretraining dataset, which ultimately comprised 52 datasets totaling more than 200,000 robot trajectories across four categories, including human demonstration videos and simulation.
The researchers also devised an efficient way to convert raw proprioceptive signals from an array of sensors into data the transformer can process.
“Proprioception plays a crucial role in enabling complex movements. Our architecture gives proprioceptive input the same importance as visual input, since the token counts are kept consistent,” Wang emphasizes.
Testing revealed that HPT enhanced robot performance by over 20% in both simulated environments and real-world applications compared to training from scratch. Remarkably, performance improvements persisted even when tasks diverged significantly from the pretrained data.
David Held, an associate professor at Carnegie Mellon University’s Robotics Institute who was not involved in this project, remarked, “This paper presents a groundbreaking method for training a single policy across various robot embodiments. It facilitates learning from diverse datasets, thereby increasing the capacity of robotic training methods. Furthermore, it permits rapid adaptation to new robot designs, an essential aspect as innovation in robotics continues.”
Looking forward, the research team aims to investigate how data diversity can elevate HPT’s performance further. They also plan to refine HPT to manage unlabeled data, emulating the capabilities of large language models like GPT-4.
“Our aspiration is to create a universal robot brain that can be downloaded and implemented on any robot without necessitating training. Although we are still in the initial phases, we are determined to push boundaries, hoping for breakthroughs akin to those achieved with large language models,” Wang concludes.
This research received funding from the Amazon Greater Boston Tech Initiative and the Toyota Research Institute.
Photo credit & article inspired by: Massachusetts Institute of Technology