What’s the difference between human eyes and computer vision?

Since the early years of artificial intelligence, scientists have dreamed of creating computers that can “see” the world. As vision plays a key role in many things we do every day, cracking the code of computer vision seemed to be one of the major steps toward developing artificial general intelligence.

But like many other goals in AI, computer vision has proven to be easier said than done. In 1966, scientists at MIT launched “The Summer Vision Project,” a two-month effort to create a computer system that could identify objects and background areas in images. But it took much more than a summer break to achieve those goals. In fact, it wasn’t until the early 2010s that image classifiers and object detectors were flexible and reliable enough to be used in mainstream applications.

Over the past few decades, advances in machine learning and neuroscience have driven great strides in computer vision. But we still have a long way to go before we can build AI systems that see the world as we do.

Biological and Computer Vision, a book by Harvard Medical School Professor Gabriel Kreiman, provides an accessible account of how humans and animals process visual data and how far we’ve come toward replicating these functions in computers.

Kreiman’s book helps us understand the differences between biological and computer vision. It details how millions of years of evolution have equipped us with a complicated visual processing system, and how studying that system has inspired better computer vision algorithms. Kreiman also discusses what separates contemporary computer vision systems from their biological counterparts.

While I would recommend a full read of Biological and Computer Vision to anyone who is interested in the field, I’ve tried here (with some help from Gabriel himself) to lay out some of my key takeaways from the book.

Hardware differences

In the introduction to Biological and Computer Vision, Kreiman writes, “I am particularly excited about connecting biological and computational circuits. Biological vision is the product of millions of years of evolution. There is no reason to reinvent the wheel when developing computational models. We can learn from how biology solves vision problems and use the solutions as inspiration to build better algorithms.”

And indeed, the study of the visual cortex has been a great source of inspiration for computer vision and AI. But before being able to digitize vision, scientists had to overcome the huge hardware gap between biological and computer vision. Biological vision runs on an interconnected network of cortical cells and organic neurons. Computer vision, on the other hand, runs on electronic chips composed of transistors.

Therefore, a theory of vision must be defined at a level of abstraction that can be implemented in computers while remaining comparable to how living beings see. Kreiman calls this the “Goldilocks resolution”: a level of abstraction that is neither too detailed nor too simplified.

For instance, early efforts in computer vision tried to tackle vision at a very abstract level, ignoring how human and animal brains recognize visual patterns. Those approaches proved brittle and inefficient. At the other extreme, studying and simulating brains at the molecular level would be computationally intractable.

“I am not a big fan of what I call ‘copying biology,’” Kreiman told TechTalks. “There are many aspects of biology that can and should be abstracted away. We probably do not need units with 20,000 proteins and a cytoplasm and complex dendritic geometries. That would be too much biological detail. On the other hand, we cannot merely study behavior—that is not enough detail.”

In Biological and Computer Vision, Kreiman defines the Goldilocks scale of neocortical circuits as neuronal activity measured at millisecond resolution. Advances in neuroscience and medical technology have made it possible to study the activity of individual neurons at this time granularity.

And the results of those studies have helped develop different types of artificial neural networks, AI algorithms that loosely simulate the workings of cortical areas of the mammalian brain. In recent years, neural networks have proven to be the most effective algorithms for pattern recognition in visual data and have become a key component of many computer vision applications.

Architecture differences

The recent decades have seen a slew of innovative work in the field of deep learning, which has helped computers mimic some of the functions of biological vision. Convolutional layers, inspired by studies of the animal visual cortex, are very efficient at finding patterns in visual data. Pooling layers help generalize the output of a convolutional layer and make it less sensitive to the displacement of visual patterns. Stacked on top of each other, blocks of convolutional and pooling layers can go from detecting small patterns (corners, edges, etc.) to recognizing complex objects (faces, chairs, cars, etc.).
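
To make this concrete, here is a minimal PyTorch sketch of stacked convolutional and pooling blocks, where early blocks respond to small patterns and later blocks to larger structures. The layer sizes and the ten-class output are arbitrary choices for illustration, not taken from the book:

```python
import torch
import torch.nn as nn

# A hypothetical stack of convolution + pooling blocks. Early blocks pick up
# low-level patterns (edges, corners); later blocks combine them into parts
# and objects. All sizes here are illustrative.
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),   # low-level patterns
    nn.ReLU(),
    nn.MaxPool2d(2),                              # pooling: tolerance to small shifts
    nn.Conv2d(16, 32, kernel_size=3, padding=1),  # mid-level patterns
    nn.ReLU(),
    nn.MaxPool2d(2),
    nn.Conv2d(32, 64, kernel_size=3, padding=1),  # higher-level patterns
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),                            # e.g., scores for 10 object classes
)

x = torch.randn(1, 3, 64, 64)  # a single 64x64 RGB image
print(model(x).shape)          # torch.Size([1, 10])
```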

But there’s still a mismatch between the high-level architecture of artificial neural networks and what we know about the mammalian visual cortex.

“The word ‘layers’ is, unfortunately, a bit ambiguous,” Kreiman said. “In computer science, people use layers to connote the different processing stages (and a layer is mostly analogous to a brain area). In biology, each brain region contains six cortical layers (and subdivisions). My hunch is that six-layer structure (the connectivity of which is sometimes referred to as a canonical microcircuit) is quite crucial. It remains unclear what aspects of this circuitry should we include in neural networks. Some may argue that aspects of the six-layer motif are already incorporated (e.g. normalization operations). But there is probably enormous richness missing.”

Also, as Kreiman highlights in Biological and Computer Vision, information in the brain moves in several directions. Visual signals move from the retina through V1, V2, and other areas of the visual cortex before reaching the inferior temporal cortex. But each area also sends feedback to its predecessors. And within each area, neurons interact and pass information among themselves. All of these interactions and interconnections help the brain fill in gaps in the visual input and make inferences when the information it receives is incomplete.

In contrast, in artificial neural networks, data usually moves in a single direction. Convolutional neural networks are “feedforward networks”: information flows only forward, from the input layer through the intermediate layers to the output layer.

There is a mechanism called “backpropagation,” which propagates errors backward through the network to correct mistakes and tune its parameters. But backpropagation is computationally expensive and is only used while a neural network is being trained. And it’s not clear whether backpropagation directly corresponds to the feedback mechanisms of cortical areas.
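
As a rough illustration, here is a minimal PyTorch sketch (a toy network with arbitrary shapes and hyperparameters) showing that backpropagation runs only during training, while inference is a pure feedforward pass:

```python
import torch
import torch.nn as nn

# A toy network; the point is *when* backpropagation runs, not the architecture.
net = nn.Linear(8, 2)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

# Training: errors are propagated backward to tune the parameters.
x, y = torch.randn(4, 8), torch.tensor([0, 1, 1, 0])
optimizer.zero_grad()
loss = loss_fn(net(x), y)
loss.backward()    # backpropagation happens here, and only here
optimizer.step()

# Inference: a single feedforward pass, with no feedback of any kind.
with torch.no_grad():
    prediction = net(torch.randn(1, 8)).argmax(dim=1)
```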

On the other hand, recurrent neural networks, which feed the output of higher layers back into the input of earlier layers, still have limited use in computer vision.
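
To make the idea of such top-down feedback concrete, here is a hypothetical toy module in PyTorch (my own sketch, not an architecture from the book) in which a higher stage’s output is fed back into a lower stage’s input over a few time steps:

```python
import torch
import torch.nn as nn

class FeedbackBlock(nn.Module):
    """Hypothetical toy module: a higher stage's output is fed back into a
    lower stage's input over several time steps, loosely mimicking the
    brain's top-down connections. Not an architecture from the book."""

    def __init__(self, channels: int = 16, steps: int = 3):
        super().__init__()
        self.steps = steps
        # The lower stage sees the image plus the feedback from the higher stage.
        self.lower = nn.Conv2d(3 + channels, channels, kernel_size=3, padding=1)
        self.higher = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        b, _, h, w = image.shape
        feedback = torch.zeros(b, self.higher.out_channels, h, w, device=image.device)
        for _ in range(self.steps):
            lower_out = torch.relu(self.lower(torch.cat([image, feedback], dim=1)))
            feedback = torch.relu(self.higher(lower_out))  # next step's feedback
        return feedback

out = FeedbackBlock()(torch.randn(1, 3, 32, 32))
print(out.shape)  # torch.Size([1, 16, 32, 32])
```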

Image: visual cortex vs. neural networks