Motivation for the Enhancement of Machine Perception

Overview

In the future, we will interact with computers predominantly by means that are natural to humans: not by typing or clicking a mouse, but by speaking, looking, or simply being there. Computers, and the objects and environments in which they are embedded, will have a good understanding of the real world that is revealed to them both by their own sensors and by recorded multimedia content. Like humans, computers will prefer to use vision and audition to perceive the world, as these modalities offer the best combination of richness of information, versatility of analysis type, and scalability of implementation. Computers will become more ubiquitous, but more friendly and more "human"; cyberspace will dissolve into the real world, and computing will become accessible to and inclusive of those who do not have special "computer skills".

Reaching this future will not be trivial, but the process is well under way, and is accelerating. The dramatic increases in computational power, memory storage capacity, and availability of multimedia input and output devices have led to a recent explosion of advances in the fields of computer vision, speech and language processing, and analysis of audio, video, and images. The tools for building systems with perceptual and communication capabilities that rival those of people are just now becoming widely available, and researchers are rapidly making inroads toward the ultimate goal of perceptive, intelligent computing. A vast area of important intellectual property and business opportunities is being explored and staked out, and any company that hopes to be relevant to the world of computing a few decades from now (a world that promises to be much more expansive than today's) should get in on this land grab as soon as possible. Luckily for Hewlett-Packard, it is still well-positioned to be a leader in this future, provided that it embraces and pursues the changes that are to come, rather than waiting for and reacting to them.

This document is intended to encourage and justify research that will help bring about a future of computers that understand the natural world and that communicate with people as they do with each other. If you already believe all of the above, there is no need to read further, except perhaps for your own enjoyment or inspiration. For the skeptical, the above is repeated below in a longer, and hopefully more convincing, form.

Where Human-Computer Interaction is Headed

Despite the popular notion of computers as powerful "thinking" machines, the typical computer is no more aware of the world around it, and no better equipped to perceive and communicate with people, than an equivalently-sized block of stone. The average computer will not recognize you when you walk into the room, nor will it know that anyone is there at all, unless you type on its keyboard or move its mouse. It also will not know the difference between being placed in a wheat field in Kansas, on a street-side during a Mardi Gras parade, or in an 8.9 magnitude earthquake in San Francisco, except for a possible disruption of its power supply or network connection. To a computer, the world consists only of files of various types, of streams of bytes, of IP addresses, of device drivers and of communication bus protocols. It perceives and interacts with this world via its operating system, and a person must understand this same operating system in order to manipulate the computer's world and put the computer to use. Because we are forced to deal with computers on their own terms, and because the world of cyberspace in which they operate bears little resemblance to the reality in which people live every day, human-computer interaction typically feels difficult, limited, and unnatural. When it does not, it is usually because we have adapted to the computer's means of communication and thinking.

It would seem preferable to have computers adapt to us, learning to communicate on our terms and to understand our world, rather than the other way around. People could then begin to communicate with computers, and with any appliance or other object in which we choose to embed a computer, in the same ways that people communicate with each other:

By actively trying to understand the state of the world around it, a computer can enhance the applications it provides, adapting them in context-sensitive ways without being told to do so. By attempting to recognize the presence, identities, moods, and desires of the people interacting with it, a computer taps into a rich channel of communication that requires little effort on the part of the user.

In short, as computers better understand natural forms of human communication and the world in which human beings live, cyberspace and the real world will merge more seamlessly, a broad range of new applications that empower people and are enjoyable to use will be enabled, and the need to have "computer skills" in order to use a computer will largely go away.

As we better enable computers to understand the real world, we will also be able to automate tasks that currently require human perception.

The automation of human perception will likely bring about substantial societal changes, in some ways similar to those that occurred during the Industrial Revolution, when machines first started replacing human labor in a variety of tasks. Just as happened then, various types of jobs will largely disappear, and some temporary unpleasantness may occur whenever change is implemented without sufficient forethought. On the whole and in the end, however, people will find themselves more empowered, more productive, less concerned with mundane tasks, and more free to pursue whatever interests attract them.

(Note: As computers are made more perceptive and intelligent, it will also become increasingly important to structure their behavior so that they continue to serve us, rather than forcing us to serve them. We would not want all the devices in our home constantly calling out for our attention or demanding answers, nor would we want them to seem like an omniscient, overbearing presence from which we cannot hide. There is little danger of this happening on a large scale, of course, because most consumers will not waste their money on technology that makes them less happy.)

How Do We Get There? Give Computers Sight and Hearing

To build perceptive, intelligent computers that understand natural forms of human communication, it makes sense to model them to a large extent on the most perceptive, best communicating machines that exist: human beings. More of the human brain is devoted to visual processing and visual thinking than to any other form of sensory input; audition (the sense of hearing) ranks second. On the other hand, when human beings communicate with each other, speech and audition are dominant, with visual perception of gestures, eye contact, and expressions playing a strong supporting role. It would seem wise, then, to focus heavily on technologies such as computer vision, speech recognition, natural language processing, and other forms of audio, video, and image analysis, as we begin to build appliances, devices, and computers that are aware of and can interact with the natural world.

It is important to realize, however, that for many applications, sufficient sensory perception can be achieved by means that do not rely at all on vision or audition. For instance, if the location of people in some space must be known in order to adjust the environment in some way, a computer vision specialist might immediately suggest that cameras be used to detect and track people. On the other hand, this information could also be obtained, perhaps more cheaply, accurately, or reliably with today's technology, by instrumenting the floor of the space with pressure sensors, or by requiring the people in the room to wear some sort of radio-frequency-emitting badge whose position can be triangulated using a collection of receivers. Similar choices between alternative technologies appear in many such contexts, and one should not necessarily force the use of vision or audition just because this is how human beings themselves would try to do the task.
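
To make the camera option concrete, here is a minimal sketch, in Python with NumPy, of one common way a camera could detect whether anyone is present: comparing each frame against a stored image of the empty space (simple background subtraction). This example is not from the original text, and its threshold values are illustrative assumptions rather than tuned settings.

    import numpy as np

    def detect_presence(frame, empty_room, diff_thresh=30, min_pixels=500):
        """Return True if the camera frame differs enough from a stored
        image of the empty room to suggest that someone is present.

        frame, empty_room -- grayscale images as 2-D uint8 NumPy arrays
        diff_thresh, min_pixels -- assumed, untuned threshold values
        """
        # Per-pixel absolute difference from the empty-room reference.
        diff = np.abs(frame.astype(np.int16) - empty_room.astype(np.int16))
        changed = diff > diff_thresh  # candidate "person" pixels

        # Crude presence test; a real tracker would group changed pixels
        # into blobs and follow them from frame to frame.
        return np.count_nonzero(changed) > min_pixels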

Nevertheless, as we think about how to go forward in instrumenting computers and environments for perception, it can be strongly argued that the focus should remain on vision and audition. Most importantly, both of these provide much richer, more versatile sources of input than most alternatives, thereby allowing a much greater range of sensory analysis and understanding to be achieved. For example, if we were to instrument that floor with pressure sensors, we might indeed know where exactly all the people are, and maybe even how much they weigh, but it would be difficult or impossible to determine which way their bodies are facing, whether they are standing or sitting, where they are looking, what their facial expressions are, who they are, and whether or not they are carrying something. We might be able to instrument the environment with additional types of sensors that explicitly analyze each of these things, but all of this information could have been obtained with a single set of cameras. A similar argument applies to microphones, which can be used, among other things, to physically locate sound sources, determine the identity and type of a sound source, understand speech, and determine a speaker's mood. In short, as algorithms for doing sophisticated visual and auditory analysis are further developed, it will become much more cost-effective and rewarding to rely on cameras and microphones, rather than other technologies, to provide sensory awareness to machines.
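
As a small illustration of the first of these microphone capabilities, the following sketch (again my addition, in Python with NumPy; the microphone spacing and sample rate are assumed values) estimates the bearing of a sound source from the difference in arrival time at two microphones, found by cross-correlating their signals.

    import numpy as np

    SPEED_OF_SOUND = 343.0  # meters/second in air at room temperature

    def direction_of_arrival(left, right, mic_spacing=0.2, sample_rate=16000):
        """Estimate a sound source's bearing from two microphone signals.

        left, right -- equal-length 1-D arrays of audio samples
        mic_spacing -- distance between the microphones in meters (assumed)
        Returns the angle in radians relative to the array's broadside.
        """
        # Cross-correlate to find the delay at which the signals best align.
        corr = np.correlate(left, right, mode="full")
        lag = np.argmax(corr) - (len(right) - 1)  # delay in samples
        delay = lag / sample_rate                 # delay in seconds

        # For a distant source, delay = (spacing / speed) * sin(angle).
        sin_angle = np.clip(delay * SPEED_OF_SOUND / mic_spacing, -1.0, 1.0)
        return np.arcsin(sin_angle)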

In addition, because vision and audition are passive technologies that do not require any sort of special consent from or instrumentation of the objects being observed, they are applicable in a broader range of contexts than many other technologies. For instance, applications that require users to wear some sort of "active" badge will not function for people who do not know about this requirement or refuse to accept it. A home surveillance system will fail to locate thieves who "forget" to put on their badge, and perhaps your "intelligent" front door will not open if you have somehow lost or broken your badge. Should you keep extra badges in a dish on your coffee table, for use by guests when they come to visit, and how many will you need? A store that tries to understand customer shopping patterns by tracking them as they shop might decide to rely on a badge system, but then it must determine how to entice (or trick) people into carrying the badges, and it must face the strong possibility that the behavior of shoppers who use the badges is not entirely representative of the shopper population as a whole. One must also consider the possibly exorbitant price of adding badges or tags to all of the objects to be sensed. To keep track of which objects in a supermarket are moved or touched, it may not be cost-effective to place an active label on every single item; similar issues arise for refrigerators that want to know what they contain, or homes that want to understand the configuration of all objects within them. Furthermore, if the badges are not at all cheap, say on the order of $30 or more, people with low or moderate income may find this cost a barrier to usage if they are asked to pay for it.

The fact that vision and audition do not modify their environment also makes them preferable, for many applications, to sensing technologies that rely on the emission of electromagnetic radiation, sound, or other signals. While these alternatives are often more accurate than vision or audition, they typically consume more power and are more costly to implement. In some cases, it is not yet well known whether the emissions themselves may be hazardous to human beings or other organisms. Also, many such devices operate less effectively, if at all, when a similar device is in the area, as their emissions begin to interfere with each other. These last two issues become increasingly important when people attempt to deploy such technologies on a wide scale. In general, it does not seem desirable or practical to build a world full of devices that are flashing infrared light, firing lasers, emitting ultrasonic squeaks, and packing the airwaves at all frequencies with broadcast transmissions.

Given all of the above, vision and audition appear to be the most intelligent long-term bet for scalable, flexible, reliable, and powerful sensing systems. It is not a coincidence that biological evolution has settled on this same answer. Alternative technologies may better solve our short-term problems, but they should not distract us from pursuing answers that better fit the direction in which the world is ultimately going.

Why Now, and Why HP?

The most advanced known device for vision and audition is the human brain. There are about 100 billion neurons in the brain, and each one can be thought of as a relatively simple processor with some memory. By massively interconnecting these processors and operating them in an analog (non-clocked) fashion, the brain achieves tremendous computational power, well beyond that of any computer ever built by human beings. It is not surprising, therefore, that man-made computer systems for video, audio, and image analysis, real-time vision, speech recognition, and natural language processing have demonstrated quite limited capabilities in comparison to the average human being.
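
One crude way to quantify that power is a back-of-envelope multiplication; only the neuron count below comes from the text, while the connectivity and event-rate figures are rough assumptions of mine.

    # Rough estimate of the brain's raw throughput.
    neurons = 100e9            # ~100 billion neurons, as noted above
    synapses_per_neuron = 1e3  # assumed average connectivity
    events_per_second = 100    # assumed synaptic update rate (Hz)

    brain_ops = neurons * synapses_per_neuron * events_per_second
    print(f"{brain_ops:.0e} synaptic operations per second")  # ~1e16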

The last few years, however, have seen a dramatic increase in the availability of computational power, so much so that it has become relatively straightforward to build machines whose capabilities may rival those of significant portions of the brain. For a few thousand dollars, one can buy 100 billion bytes of disk storage at the local computer store. Processors capable of billions of operations per second are being made by mainstream hardware companies, and are available for sale in standard computers. Even more powerful processors come as part of your $300 Sony Playstation 2. For tasks that are beyond either of those architectures, it is relatively straightforward to build custom boards from DSP chips, field-programmable gate arrays, fast on-board memory, and other easily obtained components. The trends toward cheaper memory and faster processors have been and will continue to be rapid, so the numbers listed above will probably seem laughably quaint in a few years. We are at or near the point at which virtually any reasonable algorithm we might want to use can be made to run, with less than a year's development time by a small team, at practical speeds in some combination of hardware and software. In addition, once any such system is found to be useful, the cost of its production can usually be brought down, without terrible difficulty, to levels accessible to the mass consumer market.
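
To make the feasibility claim concrete, here is a rough calculation of my own, with all workload figures as labeled assumptions: even a fairly heavy per-pixel analysis of full-rate video fits within the budget of a processor that executes a few billion operations per second.

    # Can a processor doing a few billion ops/second keep up with video?
    width, height = 640, 480  # one camera at a standard resolution
    fps = 30                  # full video rate
    ops_per_pixel = 100       # assumed cost of the per-pixel analysis

    required = width * height * fps * ops_per_pixel  # ops per second
    available = 3e9           # an assumed few-GOPS processor

    print(f"required:  {required:.1e} ops/s")         # ~9.2e8
    print(f"headroom:  {available / required:.1f}x")  # ~3.3x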

While raw computational architectures have been advancing speedily, devices for taking in or generating sensory information, such as cameras, microphones, displays, and speakers, are all becoming smaller, cheaper, and more accurate. Consequently, they are becoming more ubiquitous: sales of "web cams" are following an exponential trend, micro-displays that fit in your glasses are headed for mass production, laptops are shrinking but come with increasingly impressive multimedia capabilities, watches and cell phones with embedded cameras are now on the market, and "wearable" computers have been a trendy topic for years now. Hence, although computers and other objects (your refrigerator, your car, your jeans!) are largely unable to see, hear, talk, gesture, or otherwise interact with and understand the real world, they now often have, or can easily be given, the raw machinery to do so.

Given the wide availability of tremendous computational power and the means to connect computers in a sensory manner with the world around them, it should not be surprising that the last decade or so has seen an explosion of advances in computer vision, speech recognition, and other forms of audio and video processing. The pace of breakthroughs has been accelerating, and the community of researchers in these fields numbers in the tens of thousands. Researchers are also drawn to these areas, no doubt, by the fact that, among all computer science and engineering areas, these remain among the ones in which accomplishments seem to have fallen farthest short of what we believe ought to be possible. While disciplines such as microprocessor design, network communications, information theory, and the development of programming languages and tools have all produced monumental achievements with profound impact on our world, researchers in audio and video processing continue to struggle to make machines do "basic" things like recognize spoken or written words, find and recognize the faces of the people they work with every day, detect scene changes in a movie, or speak in a way that sounds plausibly human. The building of perceptive, intelligent computers represents a vast, wide-open space of intellectual territory to be discovered and claimed, and money from universities, governments, and corporations is flowing ever more rapidly into the hands of those researchers who are eager to take on the challenge.

Much of the reason Hewlett-Packard should be concerned with this area, therefore, applies to any company or institution that hopes to be a driving force in the computing industry fifty, or maybe just twenty, years from now: personal computers and appliances will become more perceptive, more intelligent, and better able to employ and understand natural forms of human expression; people will like this, and will prefer to buy and use it over what we have now; so anyone who hopes to be an important player in providing computing technology in the future should get on this bandwagon before it is too late. A huge paradigm shift in the usage of computers and embedded intelligence is now in its beginning stages, and technology companies that fail to help make it happen will miss out on the biggest of opportunities, if not find themselves cast aside as irrelevant.

Of course, it is never absolutely "too late" for a company to commit to some new area of technology, but if a company waits until others have created mature products based on that technology, have locked up substantial portions of the relevant intellectual property, and have long been following a business strategy aimed at capitalizing on it, that company will be at a very substantial competitive disadvantage and should not hope to become a leader in the field any time soon. It is also possible, on the other hand, to pour resources into developing a new technology too soon, so that the effort is largely wasted. Based on the above discussion, though, it does not appear to be too soon to invest in building perceptive, intelligent computing. Supporting technologies, such as processor power, computing architectures and manufacturing, devices for audio and video capture and presentation, and software for audio and video manipulation, storage, compression, and transmission, are all very well developed, if not crying out to be put to more challenging tasks. There is nothing stopping us from moving quickly to improve computers' understanding of and communication with the real world, and most major industrial and university research laboratories are doing just that.

There are additional reasons why HP, in particular, is well-suited to play an important role in developing tomorrow's intelligent computers. First, as a leading maker of personal computers and PDAs, it has the leverage to introduce innovations in computer hardware on a mass-market scale, and thereby create opportunities for itself and forge its own path into the future. HP does not need to wait for someone else to build a better camera or a special-purpose media processor, nor would it need to wait silently for others to develop software and algorithms to take advantage of such things. Instead, HP can undertake major engineering efforts of its own, use partnerships to make the rest happen, and bundle the results with its current computer hardware offerings to make them more attractive. HP's expertise in making computers also means that it can relatively easily carry this technology into new domains such as the home entertainment center, the kitchen, or the wrist watch, where computers are little used today. If HP comes up with some application that requires a significant augmentation of the computational power of some everyday object, it has the know-how to build this power into the object in an affordable, high-quality way. HP's leadership in the imaging and printing businesses also makes it an ideal candidate for developing state-of-the-art video-based communication and understanding. The wealth of image processing and image manipulation expertise that HP has built over the years would give it a strong advantage in any attempt to help computers "see" the world. Finally, HP has a strong culture of invention, an impressive collection of research labs and talent, and a healthy existing business to fund exploration into new areas; this combination is rivaled by few companies and is perfect for developing new paradigms of computing.
 
 

Michael Harville
February 20, 2001

I think I've managed to generate some good raw material here, but this is still a work in progress. I would appreciate hearing any feedback you might have: what you liked, what turned you off, things that were left out, and edits of any kind. If you have the time, I would be very grateful for any comments you might be able to send via email to michael_harville@hp.com .