What is Visual Search?
Visual search (VS) makes it easy to sift through content without manually watching videos to find the subject of interest. VS uses computer vision (CV) to extract, label, and identify objects from digital media. The functional components of VS are:
Figure 1: A Functional Overview of Visual Search
Models: A model is an artificial neural network that helps identify objects embedded within the media.
Services: Applications can incorporate VS by integrating with APIs provided via three major services—media capture, CV, and search and indexing.
Computer Vision: CV is a field of artificial intelligence that trains computers to interpret and understand the visual world. The media goes through a standard set of stages to identify and verify extracted objects using the models.
Actions: Several actions may result from processing media or searching within media:
- Metadata labels extracted as part of media processing are indexed to support future searches for objects within the media.
- Results from search queries that span several media files are collated and presented on a web page or as a notification.
- Results of image-based similarity searches are presented with an image representing each matching object, as in a shopping experience.
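As a rough illustration of the first two actions, the sketch below indexes extracted labels against the media file and timestamp where they occur, then collates hits across files for a query. The `LabelIndex` class, its method names, and the sample file names are hypothetical and exist only for this example; a production search and indexing service would persist the index and rank the results.

```python
from collections import defaultdict


class LabelIndex:
    """Hypothetical in-memory index: label -> [(media_id, timestamp_seconds)]."""

    def __init__(self):
        self._hits = defaultdict(list)

    def add(self, media_id, label, timestamp_s):
        # Normalize case and whitespace so "Forklift" and "forklift" match.
        self._hits[label.strip().lower()].append((media_id, timestamp_s))

    def search(self, label):
        # Collate hits from all media files, ordered by file and then by time.
        return sorted(self._hits.get(label.strip().lower(), []))


# Labels as a CV service might emit them (sample data, assumed for the sketch).
index = LabelIndex()
index.add("safety_training.mp4", "forklift", 42.0)
index.add("plant_tour.mp4", "Forklift", 310.5)
index.add("plant_tour.mp4", "hard hat", 12.0)

results = index.search("forklift")
for media_id, timestamp_s in results:
    print(f"{media_id} at {timestamp_s:.1f}s")
```

Storing the timestamp alongside each label is what lets a search result point to the moment an object appears rather than to the start of the file.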
What Are the Benefits of Visual Search?
Organizations tend to accumulate large amounts of images and videos, which makes it challenging to find specific objects embedded in the media. VS enables granular discovery of visual media by recognizing objects and noting the timestamps at which they occur, so that search results can link directly to the relevant moment.
The benefits of Visual Search include:
- Increase productivity by 25% to 50% for personnel searching media, as searches now return results that link directly to instances of the searched item in the video.
- Improve sharing of video content across the organization by 25% to 50%, because search results now include a timestamp that directs recipients to watch the pertinent portions without scanning the whole video.
- Improve conversion rates by 20% to 30% for e-commerce and retail via similarity search for out-of-stock items by presenting similar items using image matching rather than just metadata matching.
- Reduce costs attributed to redoing tasks or procedures, especially in the healthcare and manufacturing industries, by up to 30%. In healthcare, for example, analysis of pictures of wound dressings sent by patients can determine when post-operative care is needed. Likewise, pictures can be analyzed on assembly lines for quality control.
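The similarity search behind the e-commerce benefit above can be sketched as a nearest-neighbor lookup over image feature vectors. This is a minimal sketch under stated assumptions: the three-dimensional vectors and item names are invented for illustration, and in practice a CV model would produce much longer vectors. Cosine similarity is one common choice of measure, not necessarily what a given vendor uses.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # Treat an all-zero vector as matching nothing.
    return dot / (norm_a * norm_b)


def most_similar(query, catalog, top_k=2):
    """Return the ids of the top_k catalog items closest to the query vector."""
    scored = [(cosine_similarity(query, vec), item_id)
              for item_id, vec in catalog.items()]
    scored.sort(reverse=True)  # Highest similarity first.
    return [item_id for _, item_id in scored[:top_k]]


# Hypothetical image feature vectors for in-stock items.
catalog = {
    "red_sneaker": [0.9, 0.1, 0.0],
    "crimson_runner": [0.8, 0.2, 0.1],
    "blue_boot": [0.0, 0.2, 0.9],
}
out_of_stock_item = [1.0, 0.0, 0.0]  # Feature vector of the unavailable item.

suggestions = most_similar(out_of_stock_item, catalog)
print(suggestions)
```

Because the comparison uses features derived from the image itself, visually similar items can surface even when their metadata differs.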
What Are the Scenarios of Use?
The growth of video data is astronomical: approximately 500 hours of video are created each minute, making it difficult to find content. Manually created metadata is structured and limiting because it fails to describe the objects within the video. CV extracts objects like faces, text, surveillance information, and manufacturing defects and applies labels to the objects it finds. The labeled objects are associated with a timestamp within the media, facilitating VS.
VS can be used in the following core scenarios:
- Enterprise Search: Augment search results by looking for text, faces, and product occurrences in video recordings.
- Security: Surveil spaces and facilities to support forensics.
- E-Commerce: Recommend items similar to those shoppers have browsed and offer additional shopping choices.
- Document Analysis and Tracking: Extract text from media for search, compliance, and automatic verification of documents such as passports and driver's licenses.
- Manufacturing: Perform quality inspections by identifying visual defects like cracks or blemishes in recorded videos.
VS cannot exist on its own and needs to be a part of a larger solution. It can be included in a solution in the following ways:
- Custom integration with AI platforms and services: Integrate media capture and CV services into the application. The CV service offers search and indexing.
- Extend an enterprise video platform (EVP): Integrate a video platform as part of your media solution and integrate with a CV service if the platform supports custom metadata.
- Use CV services provided by the video platform: Some video platform vendors provide additional services for CV. The platform’s search and indexing service indexes extracted objects. This is an up-and-coming trend from video platform vendors.
What Are the Alternatives?
The only alternative to implementing VS is to add the metadata for search purposes manually. This entails significant labor costs for personnel to perform the job, invites inconsistency due to the varying knowledge and diligence of personnel, and lengthens the turnaround time for labeling content because of human productivity limits.
What Are the Costs and Risks?
From a cost standpoint, organizations can expect to spend between $5,000 and $6,000 monthly to process two million images a month for visual search when using a service offering CV. Optical character recognition (OCR) and facial recognition cost less than object recognition. The costs vary based on how visual search is incorporated into the solution. Note: Cost is based on GigaOm’s research and may vary.
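To put the monthly figure in per-image terms, the sketch below divides the quoted range by the two million images it covers and scales it to another volume. The linear scaling is an assumption made for illustration only. Vendors often use tiered pricing, so actual costs at other volumes may differ.

```python
# Figures quoted in the text above.
LOW_MONTHLY_USD = 5_000
HIGH_MONTHLY_USD = 6_000
IMAGES_PER_MONTH = 2_000_000


def per_image_cost(monthly_usd):
    """Implied cost of processing one image at the quoted volume."""
    return monthly_usd / IMAGES_PER_MONTH


def estimate_monthly_cost(images):
    """Scale the quoted range to another volume.

    Assumes cost grows linearly with volume, which is a simplification.
    """
    low = images * LOW_MONTHLY_USD / IMAGES_PER_MONTH
    high = images * HIGH_MONTHLY_USD / IMAGES_PER_MONTH
    return low, high


low_unit = per_image_cost(LOW_MONTHLY_USD)
high_unit = per_image_cost(HIGH_MONTHLY_USD)
print(f"Implied cost per image: ${low_unit:.4f} to ${high_unit:.4f}")

low, high = estimate_monthly_cost(500_000)
print(f"Estimate for 500,000 images: ${low:,.0f} to ${high:,.0f}")
```

At the quoted volume, the range works out to roughly a quarter to a third of a cent per image.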
Integration and maintenance costs are a factor, and organizations must plan for video storage costs when media is acquired as video. EVPs tend to base pricing on video hours or storage used, and some include OCR services in their base pricing. Additional CV services like facial and object recognition may incur additional charges from EVP vendors. Developer training and consulting costs should be factored in when skills are unavailable internally.
The most significant VS risks are security, system performance, and operations. Security risks include cyber intrusion, unauthorized access to data, privacy violations, software problems, and configuration issues. System performance risks center on incorrectly labeled objects, which cause rework and a high volume of transactions. CV systems are commonly measured by false positives, in which an object is mistakenly identified, and false negatives, in which an object that is present goes undetected or its label fails verification. When the false positive and false negative rates are unacceptable, the CV systems need retuning and retraining, and the models may need to be updated. Operations risks are loss of regulatory compliance and inadequate business processes.
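The false positive and false negative rates mentioned above can be computed from a labeled evaluation set. The following is a minimal sketch, assuming each frame yields a yes or no decision for a single label. The counts and the acceptance thresholds (`max_fpr`, `max_fnr`) are hypothetical values chosen for illustration; each organization would set its own.

```python
def error_rates(tp, fp, fn, tn):
    """Return (false positive rate, false negative rate) from outcome counts.

    tp: object present and detected      fp: object absent but detected
    fn: object present but missed        tn: object absent and not detected
    """
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    fnr = fn / (fn + tp) if (fn + tp) else 0.0
    return fpr, fnr


def needs_retraining(fpr, fnr, max_fpr=0.05, max_fnr=0.10):
    """Flag the model when either rate exceeds its (assumed) threshold."""
    return fpr > max_fpr or fnr > max_fnr


# Hypothetical evaluation of 200 frames for one label.
fpr, fnr = error_rates(tp=90, fp=8, fn=10, tn=92)
print(f"False positive rate: {fpr:.2%}")
print(f"False negative rate: {fnr:.2%}")
print("Retune or retrain:", needs_retraining(fpr, fnr))
```

Tracking both rates matters because lowering one often raises the other, and the acceptable balance depends on the scenario.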
Any CV application should also consider the following risks:
- Ethical: Does the solution violate business ethics by using data for improper purposes? Is the data biased with respect to demographics? Is the application legal?
- Economic: Does the solution have a potential for lawsuits, or could it impact the organization’s reputation?
- Cultural: Is there close cooperation between teams to take advantage of the automation and prevent undesired outcomes that go undetected for long periods?
30/60/90 Plan
The path to adopting a SaaS-based visual search solution should be relatively quick. For general guidance, we highlight the following roadmap for 30/60/90-day deployment and adoption:
30 Days: Prepare
Identify VS workflow scenarios and the size of your video archive. Define the metadata of interest, seek appropriate CV platform vendors supporting visual search, and perform a cost-benefit analysis.
60 Days: Evaluate
Understand workflow integration challenges. Shortlist vendors, sign up for trials, and conduct proofs of concept (POCs) with a cross-functional team. Explain use scenarios to vendors, use vendor guidance for optimizing the solution, and understand how long it may take to index the video archive. Verify whether the vendor supports the metadata model and custom metadata.
90 Days: Implement
Purchase the platform and set up sample workflows. Draw up plans for expanded use of VS. Set up metrics that drive key performance indicators to show benefits. Determine the need for a custom model, pre-processing, or post-processing to optimize the computer vision workflows for VS. Run tests to understand and document turnaround times for extracting and indexing metadata. Draw up plans to go live in production.