AI can hear by seeing, and it has been able to for years.

Computers can hear by seeing, and it has been possible for years. Six years ago, researchers at MIT found a way to extract sound from silent video of everyday objects sitting in a soundproof room, recorded with special high-speed cameras. They achieved this fascinating feat by processing the tiny vibrations of surrounding objects hit by sound waves, vibrations nearly invisible to the naked eye. The small swings of a potted flower or the minute rattling of a pair of earbuds can tell the story of what is going on around them.

Several computing techniques were involved in the job. The videos were processed using OpenCV to magnify and extract the subtle frame-to-frame changes. That worked well for high-speed footage, and the next step was to do the same thing with regular cameras. However, if you are familiar with signal processing, you probably know the Nyquist–Shannon sampling theorem, according to which the camera's sampling rate must be well above the frequency of the audio signal we intend to extract, i.e. much more than the 60fps of a smartphone camera. To compensate for this limitation, the researchers used common DSLR cameras with a rolling-shutter mechanism, in which each row of the sensor is exposed at a slightly different moment in time. This let them infer a higher-frequency signal from regular 60fps video. As expected, the resulting audio was not as clear as what they recovered from the 6000fps recordings used in the first experiment, but it was still good enough to recognize the number of people speaking in a room, their gender, and even their identities.
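To make the idea more concrete, here is a minimal sketch, assuming Python with OpenCV, NumPy, and SciPy installed. It is not the paper's own pipeline (which recovers motion from phase variations in a complex steerable pyramid); instead it tracks the average optical flow of the filmed object frame by frame and writes that one-dimensional motion signal out as audio, after checking the Nyquist condition that the frame rate must exceed twice the highest audio frequency we hope to recover. All function names, paths, and parameters here are illustrative.

```python
import cv2
import numpy as np
from scipy.io import wavfile

def visual_microphone(video_path, out_wav, max_audio_hz=2000):
    """Crude visual-microphone sketch: the mean vertical optical flow of
    each frame becomes one audio sample. Assumes high-speed video input."""
    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)

    # Nyquist-Shannon: the frame rate must be at least twice the
    # highest audio frequency we want to recover.
    if fps < 2 * max_audio_hz:
        raise ValueError(f"{fps} fps can only capture audio up to {fps / 2:.0f} Hz")

    ok, prev = cap.read()
    prev = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)
    samples = []
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        # Dense optical flow between consecutive frames; its mean vertical
        # displacement is a (very rough) proxy for the object's vibration.
        flow = cv2.calcOpticalFlowFarneback(prev, gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        samples.append(flow[..., 1].mean())
        prev = gray
    cap.release()

    audio = np.array(samples)
    audio -= audio.mean()                   # remove DC offset
    audio /= (np.abs(audio).max() + 1e-9)   # normalise to [-1, 1]
    wavfile.write(out_wav, int(fps), audio.astype(np.float32))
```

Because each frame contributes only one sample here, a 60fps clip processed this way could represent nothing above 30 Hz; the rolling-shutter trick gets around that by treating each scanline of the sensor, rather than each whole frame, as a sample in time.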

Other factors affecting the quality of the extracted audio are distance and the material of the object. The objects used in the experiment, such as plant leaves, earbuds, and a bag of chips, are light and rigid enough to show clear movement when hit by sound waves. Fluids and soft objects absorb too much sound to make good sources for the visual microphone, while heavier objects like bricks need too much energy to show visible vibrations, so filming a wall won't be of much use.

Describing the signal-processing details of the visual microphone in depth would get dry quickly, so let us turn to the most exciting part: the uses the researchers suggest for this new technology. The visual microphone, as they call it, could be useful in biomedical imaging and patient monitoring, extracting a heartbeat signal from video of the tiny movements of a patient's head (a toy sketch of that idea appears below). It could measure the mechanical properties of transparent fluids such as hot or cold air and water. It could apply motion detection to binocular (two-camera) video to extract depth maps and produce 3D versions. It could recover spoken words from the vibrations of a person's neck; imagine what that could do if applied to old 16mm silent movies! And, in a far-sighted, imaginative view, we could even think of recording sound from space.
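As a toy illustration of the heartbeat idea (my own sketch, not the authors' method; the signal names, cut-off frequencies, and usage values below are all assumptions), one could band-pass a one-dimensional head-motion signal around plausible heart-rate frequencies and count the peaks:

```python
import numpy as np
from scipy.signal import butter, filtfilt, find_peaks

def estimate_heart_rate(head_motion, fps):
    """Estimate beats per minute from a 1-D head-motion signal sampled at
    `fps` frames per second. Toy sketch; real systems need far more care."""
    # Band-pass around typical resting heart rates (~0.75-3 Hz, i.e. 45-180 bpm).
    nyquist = fps / 2
    b, a = butter(3, [0.75 / nyquist, 3.0 / nyquist], btype="band")
    filtered = filtfilt(b, a, head_motion - np.mean(head_motion))

    # Count peaks, requiring at least a third of a second between beats.
    peaks, _ = find_peaks(filtered, distance=int(fps / 3))
    duration_min = len(head_motion) / fps / 60
    return len(peaks) / duration_min

# Hypothetical usage with a synthetic 1.2 Hz (72 bpm) motion signal:
fps = 30
t = np.arange(0, 30, 1 / fps)
motion = 0.05 * np.sin(2 * np.pi * 1.2 * t) + 0.01 * np.random.randn(t.size)
print(round(estimate_heart_rate(motion, fps)))   # prints roughly 72
```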

However, as with any other feat of science, this algorithm could be put to dark uses. Espionage is the first concern that comes to mind, followed by the more everyday worry of a creepy neighbour eavesdropping on your conversations. But Michael Rubinstein of Microsoft Research, who worked on the project, isn't concerned. As he told CNN Business, “I don’t think people need to start hiding their bags of chips just yet.”

You may read the original paper here: http://people.csail.mit.edu/mrub/papers/VisualMic_SIGGRAPH2014.pdf