Description

In the early days of our community, when the idea of intelligently managing visual information was becoming quite exciting, text was downplayed and audio was largely ignored as belonging to speech processing, an already established field of study. We strongly believe that now is the time to introduce and formalize a more general field of study, which we call sensor-based data management, and to bring the multimedia and natural language processing communities closer together.

Most multimedia objects are spatio-temporal simulacra of the real world. This supports our view that the next grand challenge for our community will be understanding and formally modeling the flow of life around us, across many modalities and scales. As technology advances, the nature of these simulacra will evolve as well, becoming more detailed and revealing more about the nature of reality.

Currently, the Internet of Things (IoT) is the state-of-the-art organizational approach for constructing complex representations of the flow of life around us. Various, perhaps pervasive, sensors, working collectively, will broadcast representations of real events to us in real time. It will be our task to continuously extract the semantics of these representations and possibly react to them by injecting response actions into the mix to ensure a desired outcome.

Linguistics is divided into three broad areas: syntax, semantics, and pragmatics. The multimedia community in computer science is very well represented in the first two areas. However, work in the pragmatics area is in its infancy.

Pragmatics studies context and how it affects semantics. Context is sometimes culturally, socially, and historically based. For example, pragmatics would encompass a speaker's intent, body language, and penchant for sarcasm, as well as other signs, usually culturally based, such as the speaker's type of clothing, any of which could influence a statement's meaning. Generic signal/sensor-based retrieval should likewise use syntactic, semantic, and pragmatic approaches. If we are to understand and model the flow of life around us, this will be a necessity.

Our community has successfully developed various approaches to decoding the syntax and semantics of these artifacts, or at least their dominant semantics, since image snippets (bags of visual words) are more polysemous than text. The development of techniques that use contextual information, however, is in its infancy. Artistic media, such as painting, sculpture, performance art, and film, carry more contextual baggage than other kinds of media. With the expansion of the data horizon, through the ever-increasing use of metadata, we can certainly put all media on a more equal footing.

The NLP community has its own set of approaches to semantics and pragmatics. Natural language is certainly an excellent exemplar of multimedia, and the use of audio and text features has played a part in the development of our field. However, if we are to develop more unified approaches to modeling the flow of life around us, both communities can benefit from examining in detail what the other has to offer. Many approaches are the same, but many are different. Research from the NLP community in areas such as word2vec, for instance, can certainly benefit the multimedia community.
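As a minimal illustrative sketch of such cross-fertilization, and not a prescription, word2vec-style skip-gram embeddings can be trained directly on sequences of "visual words" (quantized local image or video descriptors) so that visual words occurring in similar contexts receive nearby vectors. The toy sequences and token names below are assumptions for illustration only; a real corpus would come from a visual-vocabulary pipeline, and the gensim hyperparameters are chosen arbitrarily.

```python
# Illustrative sketch: word2vec-style skip-gram embeddings over sequences
# of visual-word identifiers, analogous to word embeddings over text.
from gensim.models import Word2Vec

# Each "sentence" is the visual-word sequence for one image or frame,
# read in scan order or along a temporal track (toy data here).
visual_word_sequences = [
    ["vw_12", "vw_87", "vw_87", "vw_3", "vw_45"],
    ["vw_3", "vw_45", "vw_12", "vw_9"],
    ["vw_87", "vw_9", "vw_3", "vw_45", "vw_45"],
]

# sg=1 selects the skip-gram architecture, mirroring the original word2vec
# setup; vector_size and window are ordinary hyperparameters.
model = Word2Vec(
    sentences=visual_word_sequences,
    vector_size=64,
    window=3,
    min_count=1,
    sg=1,
)

# Visual words that co-occur in similar contexts end up with similar vectors,
# the same distributional principle exploited for words in text.
print(model.wv.most_similar("vw_45", topn=3))
```

The point of the sketch is only that the distributional machinery developed for text transfers with almost no change once multimedia content is expressed as token sequences; the hard, community-specific work lies in choosing the tokens and contexts, not in the embedding step itself.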

Now is the perfect time to actively promote this cross-fertilization of our ideas in order to solve some very hard and important problems.