By Nick Gold
In Hollywood, the promise of artificial intelligence is all the rage. Who wouldn’t want a technology that promises instant solutions to tedious, time-intensive problems? With artificial intelligence, anyone with abundant rich media assets can easily churn out more revenue or cut costs while simplifying operations … or so we’re told.
If you attended IBC, you probably already heard the pitch: “It’s an ‘easy’ button that’s simple to add to the workflow and foolproof to operate, turning your massive amounts of uncategorized footage into metadata.”
But should you take the leap? Before you sign on the dotted line, take a closer look at the technology behind AI and what it can — and can’t — do for you.
First, it’s important to understand the bigger picture of artificial intelligence in today’s marketplace. Taking unstructured data and generating relevant metadata from it is something other industries have been doing for some time; in fact, many of the tools we embrace today started out there. But unlike banking, finance or healthcare, ours is an industry that prioritizes creativity, which is why we have always shied away from tools that automate. The idea that we can rely on the same technology as a hedge fund manager just doesn’t sit well with many people in our industry, and for good reason.
In the media and entertainment industry, we’re looking for various types of metadata: a transcript of spoken words, important events within a period of time, or information about the production (e.g., people, location, props). Currently, no single machine-learning algorithm will extract all of these. For that reason, the best starting point is to define your problems and identify which machine-learning tools may be able to solve them. Expecting to parse reams of untagged, uncategorized and unstructured media data is unrealistic until you know what you’re looking for.
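As a concrete illustration, here is a minimal sketch (in Python, with hypothetical names throughout) of what “defining your problems first” might look like: each metadata target is written down alongside the class of engine that could plausibly address it, before any vendor is evaluated.

```python
# A minimal sketch (hypothetical names throughout): write down each metadata
# target and the class of ML engine that could plausibly address it, before
# evaluating any vendor. No single engine covers the whole list.
from dataclasses import dataclass

@dataclass
class MetadataTarget:
    name: str         # what you want to extract
    example: str      # a concrete instance from your own library
    engine_type: str  # class of machine-learning tool that addresses it

targets = [
    MetadataTarget("dialogue transcript", "interview audio", "speech-to-text"),
    MetadataTarget("on-screen people", "cast close-ups", "face recognition"),
    MetadataTarget("brand appearances", "product placement shots", "logo detection"),
]

for t in targets:
    print(f"{t.name} -> evaluate {t.engine_type} engines")
```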
What works for M&E?
AI has become pretty good at solving some specific problems for our industry. Speech-to-text is one of them. An AI engine can produce a generally accurate transcription automatically, saving hours of manual logging. However, it’s important to note that these tools still have limitations. A tool known as “sentiment analysis” could theoretically identify the emotional undertones in the spoken word, but it first requires another tool to generate a transcript for it to analyze.
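To make that dependency concrete, here is a minimal sketch. The transcribe() function is a hypothetical stand-in for whatever speech-to-text engine you license; the sentiment step uses NLTK’s off-the-shelf VADER analyzer, which can only score the words it is handed.

```python
# Sketch of the two-stage dependency: sentiment analysis can only score a
# transcript that some speech-to-text engine has already produced.
# transcribe() is a hypothetical stand-in for your licensed STT service.
from nltk.sentiment.vader import SentimentIntensityAnalyzer  # pip install nltk
# One-time setup: import nltk; nltk.download("vader_lexicon")

def transcribe(audio_path: str) -> str:
    """Placeholder for a vendor speech-to-text call."""
    raise NotImplementedError("wire your STT engine in here")

def sentiment_of(audio_path: str) -> dict:
    text = transcribe(audio_path)        # stage 1: speech-to-text
    sia = SentimentIntensityAnalyzer()
    return sia.polarity_scores(text)     # stage 2: sentiment on the words alone

# The scores reflect only the words: tone of voice, pacing and facial
# expression never survive into the transcript.
```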
But no matter how good the algorithms are, they won’t give you the qualitative data a human observer would provide, such as the emotions expressed through body language. They won’t tell you the facial expressions of the people being spoken to, the tone of voice, pacing and volume of the speaker, or what is conveyed by a sarcastic tone or a wry expression. There are sentiment analysis engines that attempt this, but breaking the problem down into its components ensures the specific parameters you need are actually addressed.
Another task at which machine learning has progressed significantly is logo recognition. Certain engines are good at finding, for example, all the images with a Coke logo in 10,000 hours of video. That’s impressive and quite useful, but it’s another story if you also want to find footage of two people drinking from what are clearly Coke-shaped bottles where the logo is obscured. That’s because machine-learning engines tend to have a narrow focus, which goes back to the need to define very specifically what you hope to get from them.
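A short sketch of that narrowness, with detect_logos() as a hypothetical wrapper around whichever vision engine you license: frames where the product appears but the logo is hidden simply never match.

```python
# Why logo detection is narrow: the engine only fires on an explicit logo,
# so a Coke-shaped bottle with the label turned away never matches.
# detect_logos() is a hypothetical wrapper around your licensed vision API.
from typing import Dict, List

def detect_logos(frame_path: str) -> List[Dict]:
    """Placeholder returning e.g. [{'brand': 'Coca-Cola', 'confidence': 0.94}]."""
    raise NotImplementedError("call your logo-detection engine here")

def frames_with_brand(frame_paths: List[str], brand: str,
                      min_conf: float = 0.8) -> List[str]:
    hits = []
    for path in frame_paths:
        if any(d["brand"] == brand and d["confidence"] >= min_conf
               for d in detect_logos(path)):
            hits.append(path)
    return hits

# Frames where the product is visible but the logo is obscured never enter
# `hits`; finding those would take a separately trained object-recognition model.
```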
There is a bevy of algorithms and engines out there. If you license a service that finds a specific logo, you haven’t solved the problem of finding objects that merely represent the product. And even with the right engine, you’ve got to think about how this information fits into your pipeline; there are a lot of workflow questions to be explored.
Let’s say you’ve generated speech-to-text from your audio media, but have you figured out how someone can search the results? There are several options. Some vendors have their own front end for searching. Others offer an export from their engine into a MAM that you either already have on premises or plan to purchase. There are also vendors that don’t provide machine learning themselves but act as third-party services that organize and broker the various engines.
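For a vendor-neutral picture of what “searchable” means at a minimum, here is a sketch that indexes timecoded transcript segments (hypothetical data) in SQLite’s built-in FTS5 full-text engine, standing in for the search layer a MAM or vendor front end would provide.

```python
# Vendor-neutral sketch: timecoded transcript segments (hypothetical data)
# indexed in SQLite's FTS5 full-text engine, standing in for the search
# layer a MAM or vendor front end would provide.
import sqlite3  # requires an SQLite build with FTS5 (standard in most Python distributions)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE transcript USING fts5(clip_id, timecode, text)")

segments = [
    ("clip_001", "00:00:12", "welcome back to the studio"),
    ("clip_001", "00:01:45", "the director wanted a wider shot"),
    ("clip_002", "00:00:03", "rolling on take two"),
]
conn.executemany("INSERT INTO transcript VALUES (?, ?, ?)", segments)

# Full-text query: every clip and timecode where "director" was spoken.
for clip_id, timecode, text in conn.execute(
        "SELECT clip_id, timecode, text FROM transcript WHERE transcript MATCH ?",
        ("director",)):
    print(clip_id, timecode, text)
```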
It’s important to remember that none of these AI solutions is accurate all the time. A nudity detection filter, for example, returns probabilistic results. If having one nude image slip through is a huge problem for your company, then machine learning alone isn’t the right solution for you. You need to understand whether occasional inaccuracies will be acceptable or deal breakers. Testing samples of your core content against the scenarios you need to solve for becomes another crucial step, and many vendors are happy to test footage in their systems.
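One common way to live with probabilistic output is to threshold on the engine’s confidence score and route anything borderline to a human reviewer. A minimal sketch, with hypothetical detection records and thresholds that should really come from testing your own footage:

```python
# Sketch of living with probabilistic output: act automatically only above a
# high confidence threshold and route borderline hits to a human reviewer.
# Detection records are hypothetical; calibrate thresholds on your own footage.
AUTO_FLAG = 0.95   # confident enough to act on without review
MIN_REVIEW = 0.50  # below this, treat as a non-detection

def triage(detections):
    flagged, needs_review = [], []
    for d in detections:
        if d["confidence"] >= AUTO_FLAG:
            flagged.append(d)
        elif d["confidence"] >= MIN_REVIEW:
            needs_review.append(d)  # a human makes the final call
    return flagged, needs_review

flagged, review = triage([
    {"frame": "f0123", "label": "nudity", "confidence": 0.97},
    {"frame": "f0456", "label": "nudity", "confidence": 0.62},
])
print(len(flagged), "auto-flagged;", len(review), "sent to review")
```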
Although machine learning is still in its nascent stages, there is a lot of interest in learning how to make it work in the media workflow. It can do some magical things, but it’s not a magic “easy” button (yet, anyway). Exploring the options and understanding in detail what you need goes hand-in-hand with finding the right solution to integrate with your workflow.
Nick Gold is lead technologist for Baltimore’s Chesapeake Systems, which specializes in M&E workflows and solutions for the creation, distribution and preservation of content. Active in both SMPTE and the Association of Moving Image Archivists (AMIA), Gold speaks on a range of topics. He also co-hosts the Workflow Show Podcast.