The year is 2025, and we can generate human-level text, images, audio, and video; can we generate human-level motion by 2050? I think the answer is almost certainly yes.

If we were to jump forward to 2050 and imagine an ideal outcome for robotic foundation models, I’d hope for embodied models that can:

  • learn new tasks in real time, at the speed of a smart human adult (video games, car maneuvering, etc.), via trial and error or imitation
  • practice new skills on their own to the point of competing in a human Olympics, physical ability permitting
  • conform to Isaac Asimov’s “Three Laws of Robotics”1
  • pass some kind of Nikola Tesla safety/alignment benchmark - a clear, convincing public demonstration of safety, similar to how Tesla showed that alternating current was safe
  • learn to cooperate and coordinate in a team of robots to play games like soccer or do long-term tasks like home construction
  • think, reason, plan, and execute complex instructions like organizing a child’s birthday party
  • learn to play complex board games like Go in the physical world, where more hours of practice yields better performance
  • retain physical experiences across time and apply them to new situations
  • transfer across embodiments via fine-tuning
  • produce multi-modal output: actions, text, and voice
  • be deployed onboard on edge computers, with controlled access to the weights

I expect the field to make progress on many of these points in the near future, but conducting this thought experiment now helps us design robotic foundation models optimized for the long term.

Notes

  1. (1) A robot may not injure a human being or, through inaction, allow a human being to come to harm. (2) A robot must obey orders given it by human beings except where such orders would conflict with the First Law. (3) A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.