What if robots learned the same way genAI chatbots do?

There’s no question that robotics is transforming our world. Thanks to computerized machines, manufacturing, healthcare, agriculture, supply chains, retail, automotive, construction, and other industries are seeing rapidly increasing efficiencies and new capabilities.

One challenge with bringing new robots online is that it’s hard, expensive, and time-consuming to train them for the task at hand. Once you’ve trained them, you have to retrain them with every minor tweak to the system. Robots are capable, but highly inflexible. 

Some training is done through conventional programming. Other methods use imitation learning, in which a person teleoperates a robot (which, during training, essentially functions as a puppet) to generate the movement data the robot learns from.

Both approaches are time-consuming and expensive. 

Compounding the difficulty is a lack of standards. Each robot manufacturer uses its own specialized programming language. The interfaces used for teaching robots, especially “teach pendants,” tend to lack the modern attributes of major, non-proprietary software development environments. (A teach pendant is a handheld control device that lets operators program and control a robot, allowing precise manipulation of its movements and functions.)

The lack of standards adds both complexity and costs for obvious reasons. Robot programming courses can cost thousands of dollars, and companies often need to train many employees on several robotics programming platforms. 

Between the lack of standards, the inflexibility of robots once trained, and the manual, task-by-task nature of skill development, robot training remains complex, time-intensive, and costly.

MIT to the rescue?

To solve the enormous problems of robot training, MIT researchers are developing a radical, brilliant new method called Heterogeneous Pretrained Transformers, or HPTs.

The concept is roughly modeled on the large language models (LLMs) now driving the generative AI boom.

LLMs use vast neural networks with billions of parameters to process and generate text based on patterns learned from massive training datasets. 

HPTs work by using a transformer model to process diverse robotic data from multiple sources and modalities. Vision and robot-movement inputs are aligned and fed to the model as tokens, the same kind of unified “language” an LLM processes. The larger the transformer, the better the robot’s performance.
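As a rough illustration of how that tokenized pipeline could fit together, here is a minimal PyTorch sketch. This is not the MIT team’s code; every class name, dimension, and layer count below is invented for the example, which simply shows per-modality encoders emitting tokens into one shared transformer.

```python
import torch
import torch.nn as nn

class ModalityStem(nn.Module):
    """Maps one input modality (camera features, joint angles, etc.)
    into a fixed number of tokens in a shared embedding space."""
    def __init__(self, in_dim: int, embed_dim: int, num_tokens: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, embed_dim * num_tokens)
        self.num_tokens, self.embed_dim = num_tokens, embed_dim

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # (batch, in_dim) -> (batch, num_tokens, embed_dim)
        return self.proj(x).view(-1, self.num_tokens, self.embed_dim)

# One transformer shared across every robot and every dataset.
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
trunk = nn.TransformerEncoder(layer, num_layers=4)

vision_stem = ModalityStem(in_dim=512, embed_dim=256, num_tokens=16)
proprio_stem = ModalityStem(in_dim=7, embed_dim=256, num_tokens=4)
action_head = nn.Linear(256, 7)  # per-robot head: predicts 7 joint targets

vision = torch.randn(1, 512)  # stand-in for encoded camera input
joints = torch.randn(1, 7)    # stand-in for joint-angle readings

# Vision and movement inputs become tokens in one sequence, which the
# shared transformer processes much as an LLM processes a sentence.
tokens = torch.cat([vision_stem(vision), proprio_stem(joints)], dim=1)
action = action_head(trunk(tokens).mean(dim=1))  # (1, 7) action vector
```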

While LLMs and HPTs differ in important ways (for starters, every physical robot is mechanically unique), both rely on vast training datasets drawn from many sources.

In the case of HPTs, the researchers combined data from real physical robots, from simulation environments, and from multiple modalities (vision sensors, robotic-arm position encoders, and others). The result was a massive pretraining corpus: 52 datasets containing more than 200,000 robot trajectories.
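To give a sense of what pooling such heterogeneous data involves, here is a hedged Python sketch: every source, whatever its raw log format, gets converted into one common trajectory schema before pretraining. The field names and the from_sim_log adapter are invented for the illustration, not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List
import numpy as np

@dataclass
class Trajectory:
    """One episode, normalized to a common schema regardless of source."""
    observations: Dict[str, np.ndarray]  # e.g. {"rgb": ..., "joints": ...}
    actions: np.ndarray                  # (timesteps, action_dim)
    source: str                          # e.g. "real", "sim"

def from_sim_log(log: dict) -> Trajectory:
    """Adapter mapping one simulator's raw log into the shared schema;
    each of the pooled datasets would get its own such adapter."""
    return Trajectory(
        observations={"joints": np.asarray(log["qpos"])},
        actions=np.asarray(log["ctrl"]),
        source="sim",
    )

corpus: List[Trajectory] = [
    from_sim_log({"qpos": np.zeros((100, 7)), "ctrl": np.zeros((100, 7))})
]
```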

As a result, HPTs need far less task-specific data. And these are early days for the method: as with LLMs, it’s reasonable to expect massive advances in capability with additional data and optimization.

Researchers found that the HPT method outperformed training from scratch by more than 20% in both simulations and real-world experiments.
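What “needing far less task-specific data” could look like in practice, sketched in PyTorch: reuse a pretrained trunk, bolt a small fresh head onto it for the new robot, and train briefly on a modest task dataset instead of training everything from scratch. The dimensions, learning rates, and dummy data loader here are all placeholders.

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained shared trunk; in practice you would restore
# saved weights, e.g. trunk.load_state_dict(torch.load("trunk.pt")).
layer = nn.TransformerEncoderLayer(d_model=256, nhead=8, batch_first=True)
trunk = nn.TransformerEncoder(layer, num_layers=4)

new_head = nn.Linear(256, 6)  # fresh head for a 6-axis arm

# The new head learns quickly; the pretrained trunk is nudged gently.
optimizer = torch.optim.AdamW([
    {"params": new_head.parameters(), "lr": 1e-3},
    {"params": trunk.parameters(), "lr": 1e-5},
])

# Dummy stand-in for a *small* task-specific dataset of (tokens, action).
task_loader = [(torch.randn(8, 20, 256), torch.randn(8, 6)) for _ in range(10)]

for tokens, target in task_loader:
    pred = new_head(trunk(tokens).mean(dim=1))
    loss = nn.functional.mse_loss(pred, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```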

Limitations to HPT robot training

While HPTs show promise, they’re still limited and need development. 

Just as even the most advanced LLM-based chatbots can “hallucinate” and can be polluted by bad data, HPTs need a mechanism for filtering bad data out of their training sets. Nobody wants a powerful industrial robot “hallucinating” and freaking out on the factory floor.
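What such a filter might look like is an open question. Here is one deliberately simple, invented heuristic in Python (reject trajectories whose joints jump implausibly far between timesteps), just to make the idea concrete.

```python
import numpy as np

def plausible(trajectory: np.ndarray, max_step: float = 0.5) -> bool:
    """Invented sanity check: flag trajectories whose joint positions
    jump more than max_step radians between consecutive timesteps."""
    return bool(np.abs(np.diff(trajectory, axis=0)).max() <= max_step)

# Stand-in corpus: 20 random-walk trajectories of shape (timesteps, joints).
trajectories = [np.cumsum(np.random.randn(100, 7) * 0.05, axis=0)
                for _ in range(20)]
clean = [t for t in trajectories if plausible(t)]
```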

While LLMs and HPTs are similar in concept, LLMs are far more advanced because their available datasets are massively larger. To industrialize the method, the models would need huge quantities of additional data, much of it probably simulated, to supplement the real-world data.

And as with LLMs in their early days, there is plenty of headroom: MIT’s HPT experiments currently average success rates below 90%.

According to the researchers, future research should explore several key directions to overcome the limitations of HPT.

To unlock further potential in robotic learning, researchers should investigate training objectives beyond supervised learning, such as self-supervised or unsupervised learning.
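As one concrete (and entirely illustrative) possibility, a self-supervised objective could mirror masked language modeling: hide random timesteps in a trajectory’s token sequence and train the model to reconstruct them, with no action labels required. A minimal PyTorch sketch, with made-up dimensions:

```python
import torch
import torch.nn as nn

# Invented masked-prediction objective, analogous to masked language
# modeling: hide ~15% of timesteps and reconstruct them from context.
embed_dim, seq_len, batch = 256, 32, 8
layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)
mask_token = nn.Parameter(torch.zeros(embed_dim))

tokens = torch.randn(batch, seq_len, embed_dim)  # stand-in trajectory tokens
mask = torch.rand(batch, seq_len) < 0.15         # which timesteps to hide

inputs = torch.where(mask.unsqueeze(-1), mask_token, tokens)
recon = model(inputs)
loss = nn.functional.mse_loss(recon[mask], tokens[mask])
loss.backward()  # gradients flow with no human-provided action labels
```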

It is important to grow the datasets with diverse, high-quality data. This could include teleoperation data, simulations, human videos, and deployed robot data. Researchers need to learn the optimal blend of data types for higher HPT success rates. 
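One hedged sketch of how researchers could experiment with that blend, using PyTorch’s built-in weighted sampling. The datasets and blend fractions below are placeholders; the point is that the mix becomes a tunable knob.

```python
import torch
from torch.utils.data import (ConcatDataset, DataLoader, TensorDataset,
                              WeightedRandomSampler)

# Placeholder datasets for four data types; sizes are arbitrary.
sources = {
    "teleoperation": TensorDataset(torch.randn(2000, 7)),
    "simulation":    TensorDataset(torch.randn(8000, 7)),
    "human_video":   TensorDataset(torch.randn(4000, 7)),
    "deployed":      TensorDataset(torch.randn(1000, 7)),
}
# The experimental knob: what fraction of each batch comes from each type.
blend = {"teleoperation": 0.4, "simulation": 0.3,
         "human_video": 0.2, "deployed": 0.1}

corpus = ConcatDataset(list(sources.values()))
# Per-sample weight = source share / source size, so each source fills
# its target fraction of every batch regardless of its raw size.
weights = torch.cat([torch.full((len(ds),), blend[name] / len(ds))
                     for name, ds in sources.items()])
sampler = WeightedRandomSampler(weights, num_samples=len(corpus),
                                replacement=True)
loader = DataLoader(corpus, batch_size=64, sampler=sampler)
```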

Researchers, and later industry, will need to create standardized virtual testing grounds to make it easier to compare different robot models. (These would likely come from Nvidia.)

Researchers also need to test robots on more complex, real-world tasks. This could involve robots using both hands (bimanual) or moving around (mobile) to complete longer, more intricate jobs. Think of it as giving robots more demanding, more realistic challenges to solve.

Scientists are also looking into how the amount of data, the size of the robot’s “brain” (model), and its performance are connected. Understanding this relationship could help us build better robots more efficiently.
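The standard tool for that question is a scaling-law fit: measure performance at several model sizes, fit a power law, and then (cautiously) extrapolate. A tiny illustrative example with made-up numbers:

```python
import numpy as np

# Made-up measurements: task error (1 - success rate) at four model sizes.
params = np.array([1e6, 1e7, 1e8, 1e9])    # number of model parameters
error = np.array([0.45, 0.32, 0.24, 0.17])

# A power law error = a * N^slope is a straight line in log-log space.
slope, intercept = np.polyfit(np.log(params), np.log(error), 1)
a = np.exp(intercept)
print(f"error ~= {a:.3g} * N^({slope:.2f})")

# Extrapolate, cautiously, to a model 10x larger than any measured.
print(f"predicted error at 1e10 params: {a * 1e10 ** slope:.2f}")
```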

Another exciting area is teaching robots to understand different types of information. This could include 3D maps of their surroundings, touch sensors, and even data from human actions. By combining all these different inputs, robots could learn to understand their environment more like humans do.

All these research ideas aim to create smarter, more versatile robots that can handle a wider range of tasks in the real world. It’s about overcoming the current limitations of robot learning systems and pushing the boundaries of what robots can do.

According to an MIT article on the research, “In the future, the researchers want to study how data diversity could boost the performance of HPT. They also want to enhance HPT so it can process unlabeled data like GPT-4 and other large language models.”

The ultimate goal is a “universal robot brain” that could be downloaded and used without additional training. In essence, HPTs would let robots perform far closer to the way people do. A new, untrained employee hired to work on an assembly line already knows how to pick things up, walk around, manipulate objects, and identify widgets by sight. They start out haltingly, then gain confidence as practice adds new skills. MIT researchers see HPT-trained robots operating the same way.

This raises obvious concerns about replacing human workers with robots, but that’s a subject for another column. 

In the meantime, I think MIT researchers are onto something here: a new technology that could — and probably will — radically accelerate the industrial robotics revolution.