How to avoid the pitfalls of Visual Search

John Cocco
CloudSight · 3 min read · Jun 17, 2019


Photo by Gyorgy Bakos on Unsplash

Matching one image to another in a database has become commonplace in e-commerce, social media, and almost every other industry that touches digital media. From a development perspective, the goal has always been: “how do we take the image itself and find commonalities between it and other images?” The industry calls this “visual similarity.” It’s true that this way of thinking may be beneficial when handling large amounts of data or in instances where exact image matches are required. However, as image recognition gains popularity in business and society, visual similarity’s flaws are becoming more exposed. Fortunately, semantic matching can help.

The idea behind semantic matching is to use computer vision trained to respond to an image with a natural language description, or in other words, to describe the image the way a human would. Let’s say, for example, a user takes a picture of a coffee mug and the computer vision responds with: “Black and green Starbucks coffee mug.” From there, the caption generated by the computer vision (not the image data, or the pixels themselves) is matched against the captions of images already in the database. So what’s so special about matching the text rather than the images themselves?
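To make this concrete, here is a minimal sketch of what caption-based matching could look like, assuming the captions have already been produced by an image captioning service. The catalog entries, SKUs, and the bag-of-words cosine similarity used here are illustrative assumptions, not the implementation of any particular product.

```python
# Minimal sketch of caption-based (semantic) matching. The captions below are
# hypothetical stand-ins for the output of an image captioning model.
from collections import Counter
import math


def tokenize(caption: str) -> Counter:
    """Lowercase the caption and count its words (bag-of-words vector)."""
    return Counter(caption.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two bag-of-words caption vectors."""
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Captions previously stored for the catalog images (hypothetical examples).
catalog = {
    "sku-101": "black and green Starbucks coffee mug",
    "sku-102": "wooden coffee table with metal legs",
    "sku-103": "white ceramic coffee mug with lid",
}

# Caption generated for the user's photo of the mug on a coffee table.
query_caption = "black and green Starbucks coffee mug on a wooden table"
query_vec = tokenize(query_caption)

ranked = sorted(
    catalog.items(),
    key=lambda item: cosine_similarity(query_vec, tokenize(item[1])),
    reverse=True,
)
print(ranked[0][0])  # sku-101: the mug outranks the coffee table
```

Because the comparison happens over words like “mug” and “Starbucks” rather than pixels, the wooden table in the background carries little weight in the ranking.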

While visual similarity may work fine with clean photos (i.e., formal product photos, stock photography, etc.), it frequently fails when matching user-generated content to either clean photos or other user-generated content. For example, let’s say our coffee mug from earlier was sitting on a wooden coffee table when the user took the photo. There’s a chance, when using visual similarity, that the photo could match an image of a coffee table. Whenever the object isn’t prominently featured in the photo, or simply doesn’t exist in the database, visual similarity will incorrectly return other objects in the image or match the background instead.

This brings us to yet another issue: context. Let’s say our user’s photo of a coffee mug, while correctly focused, had a big Starbucks logo on it, and let’s assume the visual similarity engine was well equipped with logo detection. Most of the time, a search through the database will return Starbucks merchandise (coffee grounds, tee shirts, etc.). However, what if our user wasn’t looking for Starbucks products, but rather just other mugs? Visual similarity wrongly assumes the intent of the user, while semantic matching makes an accurate generalization about what the user was searching for.

Although it’s sometimes taboo to talk about inaccuracy in technology, the topic is warranted in the realm of image matching. When using images to search the web, the accuracy of the returned results depends entirely on either the relevance of the image data (visual similarity) or the accuracy of the caption (semantic matching). But what about a business use case where user-generated photos are matched against a small, specific catalog? In this case, semantic matching can sidestep the biases of pixel-level matching almost entirely.

Let’s take the example of a small e-commerce company that recently added visual search to its application. Imagine a user puts a friend’s shoes on their desk and snaps a picture to see what similar items the company carries. Let’s also assume the company doesn’t carry shoes, but has lots of other items photographed on desks in its catalog. In this case, visual similarity might not return a result at all, or, at best, it will return the wrong result of other items on desks, like speakers. On the other hand, semantic matching would at least identify the shoes and could return related footwear or apparel, or correctly report that nothing relevant is available, as in the sketch below. While this is an extreme example, it clearly illustrates that the errors in visual similarity are greatly reduced with semantic matching.
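As a rough sketch of that behavior, a caption-matching system can also notice when its best candidate is a poor fit and say so, rather than returning whatever happens to share a background with the query. The captions, catalog entries, and threshold below are hypothetical.

```python
# Minimal sketch of a "no confident match" guard for a small catalog.
# Captions, SKUs, and the threshold are hypothetical illustrations.
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two captions."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0


catalog = {
    "sku-201": "black bluetooth speaker on a wooden desk",
    "sku-202": "silver desk lamp with adjustable arm",
}

query_caption = "pair of white running shoes on a desk"

best_sku, best_score = max(
    ((sku, jaccard(query_caption, cap)) for sku, cap in catalog.items()),
    key=lambda pair: pair[1],
)

MATCH_THRESHOLD = 0.5  # hypothetical cut-off; would be tuned per catalog
if best_score < MATCH_THRESHOLD:
    # The caption tells us the photo shows shoes, so the application can say
    # "nothing relevant" instead of returning the speaker that shares a desk.
    print("No relevant items for:", query_caption)
else:
    print("Best match:", best_sku)
```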

At the end of the day, the type of system that works well for your business depends on the use case itself and the size of the database. It is important to note, though, that as the number of users taking photos rises, so does the need for a computer to implicitly understand the context of those images.
