Open-Loop Dexterous Manipulation with a Robot Hand

At RSS 2021, my colleagues and I published a surprising new finding. It turns out that the softness of hands lets us manipulate some objects completely blind: without any visual, tactile, or force feedback. We studied the nature of these open-loop skills, and identified three key design principles for robust in-hand manipulation. Watch the video below.

This video shows a soft robotic hand (Puhlmann, Steffen, Jason Harris, and Oliver Brock. "RBO Hand 3: A Platform for Soft Dexterous Manipulation." IEEE Transactions on Robotics, 2022) manipulating a cube. Obtaining such behavior on a robot is a hard problem, and it arguably looks as dexterous as OpenAI's Dactyl work. (In a reinforcement learning tour de force, OpenAI's robotics team trained a Deep RL agent to manipulate a wooden cube using the Shadow Hand. A year later, they improved it even further to manipulate a Rubik's cube. To achieve this, they trained the policy with several years of simulated experience. The simulations were domain-randomized to learn a robust policy... robust enough to transfer pretty well to the real hardware. Aside from the RL part, this work was interesting because skills like these had never been seen on a robot hand before.) Contrary to what you'd think, we used no sensory feedback. Everything is 100% open-loop. The demonstration here used:

  • No cameras
  • No motion capture
  • No proprioception
  • No kinematic models

The pneumatic (that means we actuate it by pumping air in and out of its fingers) RBO Hand 3 has fingers made of soft inflatable silicone rubber. Kinematic models are very difficult — if not impossible — to obtain for soft robots that deform during interactions. So we gave up on notions like computing fingertip contact locations (also, who uses only their fingertips, anyway? We humans use our entire fingers to handle objects, including the sides and the back), etc. Yet, these open-loop pre-scripted skills work robustly, unmodified, with objects of different shapes, sizes, and weights. Below, you can see that the exact same sequence of commanded inflation levels manipulates a variety of cuboidal objects, and — shockingly — even a cylinder.

So why the heck does this work? It's been known for a while that compliant, underactuated grippers greatly simplify object grasping: when an object is near enough, just close all the fingers — without looking — and they wrap around it, conforming to its shape. (In wrapping around the object, soft hardware "reacts" to the shape of the object in a very useful way, essentially freeing you from having to do that in code, or from training a neural network to do it. Of course, this doesn't work in all cases, and sometimes you really do have to carefully plan grasps.)
Now, this effect would be difficult to get with a rigid, fully-actuated hand. We've shown here that it helps not only with power grasps, but also with fine manipulation. In fact, it is central to in-hand manipulation!

These open-loop skills are also resilient to a fair amount of variability in object placement. See below: a manipulation episode can start out with somewhat different initial conditions every time, and the skills work just fine with a wide range of those. Note that we do not sense or track the object to adjust the actuation.

These skills seem quite robust, and because they return the object to roughly the same spot where it started out, we can loop their execution to see just how robust they are. While manually editing these inflation sequences, we treated the number of successful loops — an empirical quantification of robustness — as an objective to maximize. Below, you can see that we were able to loop the complex Twist + Pivot + Finger-Gait + Shift combination 54 times, and the simpler Spin + Shift sequence a whopping 140 times before the cube dropped out.

Mind you: all of this is executed completely blind, with no sensory feedback to correct mistakes. So how, exactly, did we design these skills?

The RBO Hand 3

The RBO Hand 3 has 16 actuators. We control it by commanding air-mass "goals" in milligrams for each actuator. So any one vector of 16 numbers prescribes one hand posture.
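To make that interface concrete, here is a minimal sketch (my own notation, not the RBO Hand 3 software) of what one commanded posture looks like: just a vector of 16 air-mass goals in milligrams.

```python
# A hypothetical posture for a 16-actuator hand: air-mass goals in milligrams.
# The specific numbers are made up for illustration.
import numpy as np

NUM_ACTUATORS = 16

posture = np.array([120.0,  80.0,   0.0, 310.0,
                    250.0,  40.0,   0.0,   0.0,
                     95.0, 180.0, 220.0,  60.0,
                      0.0, 130.0,  75.0, 300.0])
assert posture.shape == (NUM_ACTUATORS,)
```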

We commanded these actuation signals using a large mixing board (the kind typically used by DJs for mixing music).

We first observed ourselves manipulating cubes in our own hands. We then tried to copy some moves onto the robot. To do so, we placed a wooden cube on the robot hand, and moved the sliders back and forth to get the cube to turn 90 degrees. Once we perfected the manipulation strategy, we saved intermediate snapshots of these inflation levels as keyframes (a term from animation, "keyframe" just means a postural snapshot; you can see these keyframes below). By linearly interpolating through this sequence of keyframes, we could replay the skill on the robot hand. With a good amount of careful fine-tuning, we were able to make the 90-degree spin highly reliable. We designed the other skills in the same way.
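For concreteness, here is a sketch of what keyframe replay could look like in code. This is not our actual control software; it simply assumes keyframes are 16-dimensional air-mass vectors and interpolates linearly between consecutive ones.

```python
# A sketch of keyframe replay: linearly interpolate through a list of
# 16-dimensional air-mass postures and stream the result, open loop.
# (Hypothetical code, not the actual control software for the hand.)
import numpy as np

def interpolate_keyframes(keyframes, steps_per_segment=50):
    """Return one long open-loop command sequence through the keyframes."""
    keyframes = np.asarray(keyframes, dtype=float)
    commands = []
    for start, end in zip(keyframes[:-1], keyframes[1:]):
        for t in np.linspace(0.0, 1.0, steps_per_segment, endpoint=False):
            commands.append((1.0 - t) * start + t * end)
    commands.append(keyframes[-1])
    return np.stack(commands)

# Toy example: three keyframes for a 16-actuator hand.
keyframes = np.zeros((3, 16))
keyframes[1, :4] = 250.0            # inflate the first four actuators mid-skill
trajectory = interpolate_keyframes(keyframes)
print(trajectory.shape)             # (101, 16)
```

Replaying such a command sequence is all there is to execution: no sensing, no correction.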

From roughly similar initial placements, simply replaying these keyframe sequences onto the hand places the cube in consistently similar outcome positions. This also transferred across differently sized cubes!

Let's stop and think about what's happening here: in-hand manipulation is a hard problem that involves considerations of object geometry, planning contact points, balancing interaction forces, etc. All of these things matter. But we just showed that you can bypass handling them explicitly, well enough that a fixed open-loop policy generalizes to different objects and different initial conditions.

In the video below, we explain this idea by getting the hand to twirl a wooden donut, using an absurdly simple control signal.

Let's step back and take a moment to absorb this. This donut-twirling skill would be super hard to design with a fully-actuated rigid robot hand. You'd probably need force sensors, maybe a geometric model of the donut, inverse kinematics for the fingertips, and a whole lot of math to tie it all together. That problem would involve maintaining adequate contact forces, planning a trajectory, and maybe tracking the object and finger positions along the way.

Yet here, with this soft hand, a simple step adjustment to two actuators' inflations triggers this cool twirl. We do not compute anything; we just hitch a ride on the physics. This approach is super low-tech, and it does something that advanced robot hands have a hard time doing. We published this in 2021, but this work could just as well have been done 30 years ago. Air pumps and inflatable rubber are old things.
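To give a feel for how little is going on computationally, here is a sketch of the kind of command described above. The actuator indices and milligram values are invented; only the shape of the signal, a step on two actuators while the rest hold still, reflects the idea.

```python
# A hedged sketch: a "twirl" command as a step change on two actuators.
# Indices and milligram values are made up for illustration.
import numpy as np

NUM_ACTUATORS = 16
hold = np.full(NUM_ACTUATORS, 300.0)   # baseline posture (mg of air per actuator)

twirl = hold.copy()
twirl[4] += 150.0    # step one actuator up ...
twirl[9] -= 150.0    # ... and another one down; the physics does the rest

# Alternating between the two postures keeps the donut twirling:
command_sequence = [hold, twirl] * 4
```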

A different paradigm: Hardware-Accelerated Robotics

The springy hardware, when placed in certain situations, exhibits behaviors that I like to call Funky Dynamics (serious researchers call it Morphological Computation). Think of the hand's hardware as a low-level library. The actuation signal above is a function call to that library. The hand-object configuration at the start? That's the "argument" we supply to that function. Executing that function triggers some "computation" that is hard to do with a CPU — i.e., we let physics do its thing — and "returns" an eventual state that is the outcome of our manipulation.

Much like we offload expensive matrix multiplications, convolutions, and graphics routines to GPUs, we can offload control primitives to capable hardware. This leaves us with a much simpler control problem: that of pulling a few strings to puppeteer the physics. This hardware acceleration gives us — as a foundation — things we like in a control policy:

  • Robustness
  • Simplicity
  • Generalization

Let's talk about generalization for a bit. Imagine that a control policy function looks like a neural network. There's no reason it has to be one, but it helps me (an RL guy) as a thought scaffold, and perhaps it will help you.

A control policy function has two ends: an input (the perception side) and an output (the action side). Generalization is often discussed in the context of perception: e.g., ConvNets train and generalize better than simple fully-connected MLPs because convolution presupposes some degree of translation invariance, among other priors. Fewer works talk about generalization on the action side. Much like convolution on the perception side, compliant hardware can alias similar-but-different situations in a useful way. (Think of compliance as a broad inductive bias, like convolution, for manipulation problems. Inductive biases can also be more task-specific. To torture this analogy further: just like the U-Net architecture serves as a good inductive bias for image segmentation tasks, a task-appropriate compliant morphology is a great idea for solving special classes of physical problems. A neat and famous example is the dead fish swimming upstream. This fish's body was optimized through evolution to subsume a large part of the swimming policy; imagine the last layer of the policy network above, implemented in body shape and meat. Co-design of policy and morphology is an active area of research.)

Case in point: below, the exact same actuation signal twirls not only that donut, but also cubes, wedges, pyramids, and cylinders (we did not tweak the actuation to make this work; it generalized on the first try). 

Constraining the Dynamics to be Funnel-Shaped

Behaviors like the twirl here are primitives. To build more interesting skills, we must sequence these primitives into chains. That means the result of one primitive movement must be an acceptable precondition of the next. Thanks to the nice qualities of generalization and robustness, the precondition sets of these primitives are rather "tolerant". That said, we must still do the hard work of making sure these primitives align together.

But since we're just surfing the physics instead of explicitly controlling things, we can't really ask the hand to achieve exact desired object positions. What we can do instead is tame the funky dynamics of these primitives by tweaking them — with actuation — to be highly predictable in whatever they do. That means we leverage the physics as the main "driver" of the movement, and robustify it using carefully crafted physical constraints.

This is called constraint exploitation: we reduce the position uncertainty not through sensing and tracking, but instead through actions that, by construction, will narrowly restrict some system feature to a small known range of values (in the doodle above, that feature is the planar rotation angle, but it could be anything). Skills — or primitives — that do such things robustly are called Funnels (Matthew T. Mason, The Mechanics of Manipulation, 1985) in literature dating back to the 80's.
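As a toy illustration of what a funnel does (all numbers below are invented, not measured): no matter where the cube starts within a tolerant entry range, the primitive deposits it in a narrow exit range.

```python
# Toy stand-in for one funnel-like primitive (all numbers invented).
def spin_primitive(start_angle_deg: float) -> float:
    """Open-loop 90-degree spin: a wide range of start angles (the funnel's
    entry) collapses into a narrow range of end angles (the funnel's exit)."""
    if not -30.0 <= start_angle_deg <= 30.0:
        raise ValueError("start state lies outside the funnel's entry region")
    return 90.0 + 0.05 * start_angle_deg   # exit range: roughly 88.5 to 91.5 degrees

print(spin_primitive(-25.0))   # ~88.75
print(spin_primitive(20.0))    # ~91.0
```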

Sequence your Funnels to build Robust Skills

Funnel-like policies are cool thanks to their potential for compositionality. If you've got two funnels, and one funnel's exit resides in the other funnel's entrance, you can blindly execute them in a sequence, and know the eventual outcome before you even begin. What's more, each funnel sequence is also a funnel in its own right. This lets us compose long, dexterous manipulation skills.
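One way to write this composition rule down (my own toy formalization, not the paper's): represent each funnel by its entry set and exit set, and chain two funnels only if the first one's exit fits inside the second one's entry.

```python
# A sketch of funnel composition: a funnel has an entry set (tolerated start
# states) and an exit set (guaranteed outcomes). Two funnels chain safely if
# the first one's exit fits inside the second one's entry, and the resulting
# sequence is itself a funnel. (Intervals and numbers are invented.)
from dataclasses import dataclass

@dataclass
class Funnel:
    name: str
    entry: tuple  # (low, high) interval of acceptable start states
    exit: tuple   # (low, high) interval of guaranteed end states

def composable(a: Funnel, b: Funnel) -> bool:
    return b.entry[0] <= a.exit[0] and a.exit[1] <= b.entry[1]

def compose(a: Funnel, b: Funnel) -> Funnel:
    assert composable(a, b), f"{a.name}'s exit does not fit inside {b.name}'s entry"
    return Funnel(f"{a.name} + {b.name}", a.entry, b.exit)

spin  = Funnel("Spin",  entry=(-30.0, 30.0), exit=(85.0, 95.0))
shift = Funnel("Shift", entry=(80.0, 100.0), exit=(88.0, 92.0))
print(compose(spin, shift))   # the sequence inherits Spin's entry and Shift's exit
```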

Funnels don't need to be open-loop! But we already had a bunch of nontrivial open-loop skills, and we wanted to see how far we could take this. We robustified our skills with carefully chosen actions, and after lots of careful fine-tuning, we were able to build giant skill sequences like the demo below. None of the movements here are specific to any of the cubes' shapes.

We think these design principles could be very powerful inductive biases for robotic reinforcement learning approaches. I hope to explore these ideas in my future research.

Paper: http://www.roboticsproceedings.org/rss17/p089.pdf
RSS 2021 Spotlight Talk
