Brandon Birmingham writes about fine-tuning deep learning models for image captioning by mapping visual and linguistic data.

“We may hope that machines will eventually compete with men in all purely intellectual fields. But which are the best ones to start with? Even this is a difficult decision. Many people think that a very abstract activity, like the playing of chess, would be best. It can also be maintained that it is best to provide the machine with the best sense organs that money can buy, and then teach it to understand and speak English. This process could follow the normal teaching of a child. Things would be pointed out and named, etc. Again I do not know what the right answer is, but I think both approaches should be tried.” So wrote English computer science pioneer Alan Turing in Computing Machinery and Intelligence (1950).

Following Turing’s inspiring vision, one of the dreams of the field of artificial intelligence (AI) is to empower computers with human abilities: the visual ability to see and make sense of the world around us, and the ability to communicate in natural language.

The complexity behind visual recognition, and the way visual perception is translated into and communicated through natural language, is intrinsic to the human being. Through an incremental and hierarchical learning experience, humans learn to describe the visual world in natural language by being exposed to an environment that offers a direct mapping between what is seen and what is spoken. This process builds up knowledge that combines visual perception with human language, rooted in our word-learning ability.

This starts at an early stage, when children naturally begin parsing utterances into distinct words during their interactive experiences. They gradually become better at segmenting speech into words, at dividing the world into possible referents for those words, and at figuring out what people are referring to when they speak.

After their first few spoken words, children acquire enough understanding of the language to be able to learn from linguistic context.

At this point, children start exploiting the syntactic and semantic properties of the utterances in which new words appear as they build their initial vocabulary. This enables them to learn many more words, including those that could only be acquired through this linguistic scaffolding. Through a continuous learning experience that constantly matures, improves and adapts, humans eventually learn how to communicate with others and how to build meaningful representations of the world we live in.

For most human beings, describing the visual world in natural language is a straightforward and effortless task. With a quick glance at an image or any visual representation, a person can easily identify its constituent entities and describe the main activity. Although this ability is an offhand task for most people, artificial visual interpretation is very complex and challenging. Digital images are encoded as large matrices of numerical data representing colour brightness at each point. From these thousands or even millions of colour-coded pixels, computer vision algorithms are designed to translate patterns of pixel values into a semantic meaning that can effectively describe the content of an image.
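To make this concrete, the short sketch below (in Python, using the NumPy and Pillow libraries; “photo.jpg” is a placeholder filename, not a file referenced in this article) shows that, to a program, an image really is nothing more than a grid of numbers.

```python
# A minimal illustration: a digital image is just a matrix of colour values.
# Assumes NumPy and Pillow are installed; "photo.jpg" is a placeholder filename.
import numpy as np
from PIL import Image

image = Image.open("photo.jpg").convert("RGB")
pixels = np.asarray(image)       # shape: (height, width, 3 colour channels)

print(pixels.shape)              # e.g. (480, 640, 3)
print(pixels[0, 0])              # top-left pixel brightness, e.g. [142  98  61]
```

Everything a captioning system knows about a scene has to be recovered from this raw grid of values.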

The recognition of image objects, which plays a vital role in visually analysing an image, has seen major and rapid advancements in recent years. In fact, the current state-of-the-art image recognition models, built on top of deep neural architectures, are now capable of classifying thousands of object classes with human-comparable accuracy. While object detectors can be portrayed as the fundamental building blocks for automatic image captioning, such detectors can only produce descriptions as a laundry list of object categories.
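To illustrate what such a “laundry list” looks like in practice, here is a hedged sketch using a pretrained classifier from the torchvision library (version 0.13 or later is assumed, and “photo.jpg” is again a placeholder): it returns ranked object categories, not a sentence.

```python
# A sketch of the "laundry list" output of a pretrained image classifier.
# Assumes torchvision >= 0.13; "photo.jpg" is a placeholder filename.
import torch
from PIL import Image
from torchvision.models import resnet50, ResNet50_Weights

weights = ResNet50_Weights.DEFAULT
model = resnet50(weights=weights).eval()     # pretrained on ImageNet
preprocess = weights.transforms()

image = preprocess(Image.open("photo.jpg").convert("RGB")).unsqueeze(0)

with torch.no_grad():
    probs = model(image).softmax(dim=1)[0]

top5 = probs.topk(5)
for p, idx in zip(top5.values, top5.indices):
    print(f"{weights.meta['categories'][idx]}: {p:.2f}")
# Prints labels such as "tabby: 0.62" -- informative, but far from a caption.
```

The output is a set of category names with confidence scores; turning that list into a fluent sentence is where captioning models take over.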

Such descriptions pale in comparison with the linguistic structure and naturalness of a typical human description. Advances in computing power and the sheer amount of data that is continuously being generated have made it more feasible than ever to fulfil the goal of artificially describing the natural world with human-like captions. The problem of automatically generating concise image captions has recently gained immense popularity in both the computer vision and natural language processing communities.

The conventional process of automatically describing an image fundamentally involves analysing the visual content so that a succinct natural language statement, verbalising the most salient image features, can be generated. The generation of image captions also depends extensively on natural language generation methods for constructing linguistically and grammatically correct sentences. Describing image content is very useful in applications such as image retrieval based on detailed and specific descriptions, caption generation for enhancing the accessibility of existing image collections, human-robot interaction, and assistive technology for visually impaired people.

Arguably, the most difficult aspect of image description generation is bridging the gap between the visual analysis of an image and its corresponding linguistic description. This is an active research problem that attempts to provide more relevant and human-like image descriptions. State-of-the-art models in this area are trained on large amounts of data composed of images and their corresponding captions. These models, which are designed to mimic the human brain, are trained to learn the mapping between visual and linguistic data.

Specifically, these AI models are designed to automatically find the relationship between the two modalities. The best performing models follow an encoder-decoder pipeline: images are first encoded into a stream of representations by a neural network that operates on pixel data, and these representations are then passed through a decoder trained to translate them into the linguistic domain. The current best performing models are based on supervised learning and therefore require a significant amount of data. No human has ever needed to see thousands of images containing cats to learn the concept of a cat; current best AI models, however, do.
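As a rough sketch of this encoder-decoder idea, the code below (in PyTorch; the layer sizes, vocabulary size and toy data are illustrative assumptions, not the architecture of any particular published model) pairs a convolutional encoder that turns pixels into a feature vector with a recurrent decoder that turns that vector into word predictions.

```python
# A minimal sketch of the encoder-decoder idea behind neural image captioning.
# Dimensions, vocabulary and the toy forward pass are illustrative assumptions.
import torch
import torch.nn as nn
from torchvision.models import resnet18, ResNet18_Weights


class Encoder(nn.Module):
    """Turns raw pixels into a fixed-length feature vector."""
    def __init__(self, feature_dim=256):
        super().__init__()
        backbone = resnet18(weights=ResNet18_Weights.DEFAULT)
        self.cnn = nn.Sequential(*list(backbone.children())[:-1])  # drop classifier
        self.project = nn.Linear(backbone.fc.in_features, feature_dim)

    def forward(self, images):                   # images: (batch, 3, H, W)
        features = self.cnn(images).flatten(1)   # (batch, 512)
        return self.project(features)            # (batch, feature_dim)


class Decoder(nn.Module):
    """Translates image features into a sequence of word predictions."""
    def __init__(self, vocab_size, feature_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feature_dim)
        self.lstm = nn.LSTM(feature_dim, hidden_dim, batch_first=True)
        self.to_vocab = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_features, captions):  # captions: (batch, seq_len)
        words = self.embed(captions)
        # Prepend the image features as the first "word" the decoder sees.
        inputs = torch.cat([image_features.unsqueeze(1), words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.to_vocab(hidden)               # scores over the vocabulary


# Toy forward pass with random data, just to show the two halves fit together.
encoder, decoder = Encoder(), Decoder(vocab_size=10000)
images = torch.randn(2, 3, 224, 224)
captions = torch.randint(0, 10000, (2, 12))
scores = decoder(encoder(images), captions)
print(scores.shape)   # (2, 13, 10000): one word prediction per step
```

In a real system the decoder is trained on hundreds of thousands of image-caption pairs, and the word scores are turned into an actual sentence by a search procedure such as greedy or beam search.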

Although the learning process depends on large amounts of high-quality and unbiased data, the current best performing automatic image captioning models are still prone to both visual and linguistic errors. One fundamental reason is the non-cumulative and non-hierarchical learning principle underpinning current deep learning based models, which contradicts the life-long learning and reasoning ability that humans possess. The latter could be the next step towards general AI, which would not just solve the problem of automatic image captioning but could also be applied to a “very abstract activity, like the playing of chess”.

Brandon Birmingham B.Sc. (Hons.), M.Sc. (Melit.) is currently researching artificial intelligence with a focus on machine learning and image understanding.
