Human-machine communication – the future is just beginning.

Gesture control: The future of human-machine interaction.

Not so long ago, human-machine communication or interaction by means of gestures was a purely scientific field of research. In everyday life, the ideas and technologies that emerged from it first reached the games industry. Game consoles were developed to give new impetus to classic video gaming: consoles that motivated the user to become physically active. The boundaries between video gaming and sports became fluid when Nintendo launched Wii Sports Resort™. This was made possible by the new Wii MotionPlus controller, which could detect a player's movements in space with great precision.

Pure movements in space are not actually gestures in the classical sense, because, among other things, the linguistic component is missing. The article "Murder in the Smart Home – Inspector Columbo investigates!" examines this question. What a gesture is must be redefined for human-machine communication, even if its purpose is unchanged. Samsung televisions are controlled by movements in space, that is, by gestures. DJI Spark drones, for example, react to hand signals even from a distance and then track their owner via face recognition.

Classical methods of human-machine gesture communication.

Nintendo Wii - the end of pressing buttons.

In 2006, Nintendo revolutionized the way we play with the Wii game console. While the Nintendo GameCube relied on cables, buttons and an analogue stick for communication, the Wii was equipped with a wireless controller. Built-in motion sensors detected its position in space as well as the speed of its movements. The controller enabled a whole new, playful way of communicating with a machine. New applications were created, culminating in Wii Sports Resort™: golf, table tennis, perfecting your tennis serve and much more. This was so well received that the Wii sold over 100 million units. Sony's PlayStation Move and Microsoft's Kinect for the Xbox 360 were the answer to this success.


Smartphones - swipe and drag, maybe even shake.

Since people no longer leave the house without their smartphones, gestures for human-machine communication have become indispensable. Smartphones include not only motion sensors but also proximity sensors, GPS, barometric altimeters, biometric sensors and, soon, others that make them more powerful than any current controller. The trend is to replace controllers with smartphones, for example to control drones or camera gimbals. Pure touch gestures are followed by gestures that use the motion sensors: shaking the device to undo actions, for example. Touch gestures and other smartphone gestures have a big advantage: through vibration or similar cues, they can give immediate feedback on whether a gesture has been understood.

The future of human-machine communication is called "distance image".

Swiping, dragging, shaking, playing tennis with controllers and more: it can be exciting. But human gestures are something far more complex. They are performed by the hands, the head (especially the face) and the body, individually or in combination, and are often completed verbally. Gestures are a complex temporal event that takes place in space: scenes that must be tracked in order to be understood. For real human-machine interaction to take place on the basis of human gestures, more than a gyro or proximity sensor is required.

The future of human-machine gesture interaction takes the form of distance images tracked along a time axis. A gesture can be understood as a scene in space with a defined beginning and end. For a machine to interpret and use it, the spatial positions of the hands and/or head, face and body silhouette must be measured and stored as a time series. Afterwards, or ideally in real time, interpretation can take place.
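The idea of a gesture as a time series of distance images can be sketched in a few lines. This is a minimal illustration, not a real camera API: the frame rate of 160 fps comes from the article, while `DepthFrame`, `GestureRecorder` and their fields are hypothetical names chosen for the example.

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class DepthFrame:
    t: float      # capture time in seconds
    depth: list   # distance image, e.g. rows of per-pixel distances in metres

class GestureRecorder:
    """Buffer depth frames, then cut out a gesture with a defined start and end."""

    def __init__(self, fps=160, seconds=2.0):
        # ring buffer large enough to hold one gesture at the camera's frame rate
        self.frames = deque(maxlen=int(fps * seconds))

    def push(self, frame):
        self.frames.append(frame)

    def clip(self, t_start, t_end):
        # the gesture as a time series: all frames between detected start and end
        return [f for f in self.frames if t_start <= f.t <= t_end]

rec = GestureRecorder()
for i in range(10):
    rec.push(DepthFrame(t=i / 160, depth=[[1.0]]))

gesture = rec.clip(2 / 160, 6 / 160)
print(len(gesture))  # frames that make up this gesture
```

The ring buffer keeps memory bounded at 160 frames per second; detecting where a gesture actually begins and ends is the hard interpretation problem the rest of the article discusses.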

Camera-based measurement and tracking of objects in space.

Stereo cameras – old, good, but slow.

Everyone knows the red-green glasses through which stereo images or films can be viewed. The principle is simple and inexpensive: two cameras simultaneously record the same scene with a horizontal offset of 65 mm (the average human eye distance). For the 3D images to be correctly assembled in the brain, they must be processed so that each eye sees only the corresponding half-image through its red or green filter. Distance information can also be obtained this way. However, the process is computationally intensive. The method is not suitable for tracking entire scenes, since the frame rate required for hand gestures, for example, should not fall below 120 Hz.
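The underlying geometry is simple triangulation: once the same point has been found in both images, its depth follows from the horizontal offset (disparity) between the two views. The 65 mm baseline comes from the text; the focal length here is an assumed example value.

```python
def depth_from_disparity(disparity_px, focal_px=800.0, baseline_m=0.065):
    """Z = f * B / d: the farther the object, the smaller the disparity."""
    if disparity_px <= 0:
        raise ValueError("point at infinity or not matched in both images")
    return focal_px * baseline_m / disparity_px

# a point shifted 52 pixels between the two cameras lies 1 m away
print(depth_from_disparity(52.0))
```

The formula itself is cheap; the computationally intensive part the article refers to is finding the matching point in both images for every pixel, which is why stereo struggles to reach the 120 Hz needed for hand gestures.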

ToF cameras – the means of choice.

ToF cameras, Time of Flight cameras, are the solution to the problem. They can take high-frequency distance pictures. This is the basic prerequisite for tracking and interpreting demanding human gestures. Because of the PMD (Photonic Mixing Device) chips they use, the cameras are also called PMD cameras. The idea and the technical solution go back to the research team around Prof. Rudolf Schwarte in 1996. In contrast to laser scanning, the object is not scanned point by point, which is time-consuming and unsuitable for gesture recognition, but captured in a single measurement. Simplified, a grid of light pulses is emitted towards the object to be measured, and the distance is then determined from the travel time and phase shift of the reflected light. This works at distances of up to 500 m and also in low-contrast environments. Thanks to high frame rates, currently up to 160 fps (frames per second), modern ToF cameras are capable of real time.
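Both ToF measurement principles reduce to short formulas. The sketch below shows the pulse-based variant (distance from the round-trip time of light) and the continuous-wave variant used by PMD chips (distance from the phase shift of modulated light); the 20 MHz modulation frequency is an assumed example value, not a figure from the article.

```python
import math

C = 299_792_458.0  # speed of light in m/s

def distance_pulse(round_trip_s):
    # pulse-based ToF: the light travels to the object and back,
    # so the distance is half the round-trip path
    return C * round_trip_s / 2.0

def distance_phase(phase_rad, f_mod_hz=20e6):
    # continuous-wave (PMD-style) ToF: the phase shift of the modulated
    # light encodes the distance, unambiguous only up to c / (2 * f_mod),
    # here 7.5 m
    return C * phase_rad / (4.0 * math.pi * f_mod_hz)

print(distance_pulse(1e-6))        # a one-microsecond round trip
print(distance_phase(math.pi))     # half a modulation period of phase shift
```

The tiny per-pixel computation is what makes the high frame rates possible: unlike stereo, no correspondence search is needed, since every pixel measures its own distance directly.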

The hybrid creature – Microsoft Kinect.

A hybrid creature for gesture recognition is Microsoft's Kinect controller for the Xbox 360, the answer to Nintendo's Wii game console. It combines the ideas of the stereo and the ToF camera into a new approach to obtaining distance information: a projector casts a dot grid onto an object, and a horizontally offset camera records the projected scene. Distance information is derived from this. The technology is inexpensive but has significant disadvantages compared to the ToF camera: it requires a certain contrast range in the scene, is limited in range, and the high frame rates needed to recognize complex gestures are not possible. For simple gestures, however, Kinect works very well, which prompted Bill Gates to announce in 2009 that he was considering integrating gesture control technology into MS Office and MS Mediacenter. In 2011, Microsoft released its own Kinect SDK for non-commercial developers. This resulted in interesting and also unexpected applications: South Korea, for example, uses Kinect technology to monitor the Korean demarcation line.

BMW "HoloActive" Touch – pure innovation.

Not only the exact tracking and interpretation of human gestures is a challenge, but also the interaction with the person performing them. When a touchpad gesture is detected, the screen state changes or feedback is sent in the form of a vibration.

If gesture communication is to be based on ToF cameras, two new problems arise: the person performing the gesture must know exactly where to "send" it, and needs feedback as to whether the gesture was recognized and triggered an action. Designers currently use unspectacular monitors to provide this feedback.

The BMW prototype "HoloActive Touch" tackles these problems in a technologically revolutionary way. It was presented at CES 2017 in Las Vegas: a virtual touchpad is holographically projected next to the steering wheel, and gestures performed on this "touchpad" are interpreted using a distance camera. Once a gesture is detected, an ultrasound array directs focused ultrasound at the fingertip, giving the user haptic feedback comparable to the vibration of a real touchpad. It is not yet known when the technology, which is being developed in cooperation with Motius GmbH, will appear in BMW vehicles.

Tracking and evaluation.

Powerful ToF cameras deliver up to 160 distance images per second in the form of a data record that contains the position in space for each emitted light pulse. The challenge is, on the one hand, to process the enormous amount of data of a tracked scene in real time or near real time so that it is usable for control, and on the other hand, to interpret the gesture. This requires high-performance time series technology which, if several ToF cameras are used, must also synchronously merge time series from distributed sources. This places high demands on software and hardware.
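Merging time series from several cameras essentially means interleaving timestamped frames into one chronologically ordered stream. A minimal sketch under simplified assumptions (two cameras, already on a common clock, frames represented by placeholder labels):

```python
import heapq

# Hypothetical example data: (timestamp_s, frame_id) pairs from two ToF
# cameras running at different rates but on a synchronized clock.
cam_a = [(0.0000, "A0"), (0.0125, "A1"), (0.0250, "A2")]   # ~80 fps
cam_b = [(0.0060, "B0"), (0.0190, "B1")]

# heapq.merge lazily interleaves the already-sorted streams by timestamp,
# so frames can be consumed in global time order without buffering everything.
merged = list(heapq.merge(cam_a, cam_b))
print([frame for _, frame in merged])
```

In practice the hard part is what this sketch assumes away: putting distributed cameras on a common clock and doing the merge fast enough at 160 frames per second per source.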

On the other hand, the state of the body, head, face or hands must be interpreted in order to infer the gesture. Currently, the wealth of experience of the gaming and film animation industry is being drawn upon: to make avatars and animations as realistic as possible, complex body, head, face and hand models have been developed. These models represent body elements connected via joints; the change of state along the time axis defines the movement. For the interpretation of ToF data, these models are linked with the distance information: a standard skeleton, for example, is projected onto the distance model for analysis.
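The skeleton idea from animation can be illustrated very simply: each frame assigns every named joint a 3D position, and the change of those positions along the time axis is the movement. The joint name and all values below are made up for illustration; real SDKs define their own joint sets.

```python
# A skeleton frame as a plain mapping: joint name -> (x, y, z) in metres.
def joint_velocity(prev, curr, dt, joint):
    """Velocity of one joint between two skeleton frames, in m/s."""
    (x0, y0, z0) = prev[joint]
    (x1, y1, z1) = curr[joint]
    return ((x1 - x0) / dt, (y1 - y0) / dt, (z1 - z0) / dt)

frame0 = {"right_hand": (0.0, 1.0, 0.5)}
frame1 = {"right_hand": (0.1, 1.0, 0.5)}  # the hand moved 10 cm to the right

v = joint_velocity(frame0, frame1, dt=0.1, joint="right_hand")
print(v)  # a swipe-like motion shows up as velocity along one axis
```

A gesture classifier then works on such sequences of joint states and velocities rather than on raw distance pixels, which is exactly why fitting the skeleton model onto the distance image is the crucial step.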

Real-time as a challenge.

For gesture control to make sense, human-machine interaction must take place in real time. Solutions for acquiring the required measurement data are available on the market. The crux at present is the processing and interpretation of this data. Approaches to this will be discussed in the last article of the gesture trilogy.

Contact the author Stefan Komornyik.

Photo credits: © BMW Group | PressClub / HoloActive-Touch