
YU News

Researchers Build Smarter System to Judge How Well AI Edits Video

Clockwise from left: Varun Biyyala, lead author of the study, a 2024 graduate of the M.S. in Artificial Intelligence and a Katz School industry professor; Jialu Li, a co-author of the study and an artificial intelligence master's student; and Bharat Chanderprakash Kathuria, an artificial intelligence master's student and one of the lead researchers.

By Dave DeFusco

In artificial intelligence, creating a flawless video edit, whether swapping skies in a sunset scene or generating a realistic animation from a still image, is no longer the main challenge. The real test is figuring out how well those edits were done. Until now, the industry has relied on tools that compare frames to captions or measure pixel changes, but they often fall short, especially when videos change over time. 

Enter SST-EM, a new evaluation framework developed by Katz School researchers who presented their work in February at the prestigious IEEE/CVF Winter Conference on Applications of Computer Vision (WACV). SST-EM stands for Semantic, Spatial and Temporal Evaluation Metric. It's not a video editing tool itself, but a way to measure how well AI models perform the task of editing videos, something no current method does effectively on its own.

"We realized that the tools currently used to evaluate video editing models are fragmented and often misleading," said Varun Biyyala, a lead author of the study, a 2024 graduate of the M.S. in Artificial Intelligence and an industry professor at the Katz School. "You could have a video that looks great in still frames, but the moment you play it, the movement looks unnatural or the story doesn't match the editing prompt. SST-EM fixes that."

Most existing evaluation systems rely on a model called CLIP, which compares video frames to text prompts to measure how closely an edited video matches a desired concept. While useful, these tools fall short in a key area: time. 
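
For context on how frame-level CLIP scoring works, here is a minimal sketch using the openly available Hugging Face CLIP checkpoint; this is an assumption for illustration, not the researchers' exact evaluation code. Because each frame is scored against the prompt independently, the result says nothing about motion between frames.

```python
# Minimal sketch of frame-level CLIP scoring (assumes the Hugging Face `transformers`
# and `Pillow` packages and the public "openai/clip-vit-base-patch32" checkpoint;
# not the study's own evaluation code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_frame_scores(frames: list[Image.Image], prompt: str) -> list[float]:
    """Cosine similarity between the editing prompt and each individual frame."""
    inputs = processor(text=[prompt], images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return (img @ txt.T).squeeze(-1).tolist()  # one score per frame, no notion of time
```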

"CLIP scores are good for snapshots, not for full scenes," said Bharat Chanderprakash Kathuria, an artificial intelligence master's student and one of the lead researchers. "They don't evaluate how the video flows, whether objects move naturally or if the main subjects stay consistent. And that's exactly what human viewers care about."

The team also pointed out that CLIP's training data is often outdated or biased, limiting its usefulness in modern, dynamic contexts. To fix these shortcomings, the team designed SST-EM as a four-part evaluation pipeline (a simplified sketch of how the parts fit together follows the list):

  • Semantic Understanding: First, a Vision-Language Model (VLM) analyzes each frame to see if it matches the intended story or editing prompt.
  • Object Detection: Using state-of-the-art object detection algorithms, the system tracks the most important objects across all frames to ensure continuity.
  • Refinement by AI Agent: A large language model (LLM) helps refine which objects are the focus, similar to how a human might identify the main subject of a scene.
  • Temporal Consistency: A Vision Transformer (ViT) checks that frames flow smoothly into each other: no jerky movements, disappearing characters or warped backgrounds.
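
To make the pipeline concrete, here is a simplified, hypothetical sketch of the aggregation step only. It assumes the four component scores have already been produced by the VLM, object detector, LLM agent and ViT described above; the names and weight values are illustrative placeholders, not the team's released implementation.

```python
# Hypothetical aggregation of the four SST-EM components; names and weights
# are placeholders for illustration, not the authors' released code.
from dataclasses import dataclass

@dataclass
class ComponentScores:
    semantic: float     # VLM agreement between each frame and the editing prompt, averaged
    spatial: float      # object-detector continuity of the key objects across frames
    refinement: float   # LLM-agent adjustment emphasizing the scene's main subject
    temporal: float     # ViT smoothness between consecutive frames

def sst_em_style_score(s: ComponentScores, w=(0.4, 0.2, 0.1, 0.3)) -> float:
    """Weighted sum of the four components; the paper derives its weights from
    regression on human evaluations rather than fixing them by hand."""
    return w[0] * s.semantic + w[1] * s.spatial + w[2] * s.refinement + w[3] * s.temporal

# Example: an edit that matches the prompt well but has slight temporal jitter.
print(sst_em_style_score(ComponentScores(0.91, 0.84, 0.88, 0.72)))
```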

"All four parts feed into a single final score," said Jialu Li, a co-author of the study and an artificial intelligence master's student. "We didn't just guess how to weight them. We used regression analysis on human evaluations to determine what mattered most."
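
As a rough illustration of that weighting step, the sketch below fits non-negative regression coefficients that map component scores onto human ratings and normalizes them into weights. The scores and ratings are invented; the study's actual data and regression setup are the authors' own.

```python
# Illustrative weight fitting via regression against human ratings (made-up data).
import numpy as np
from sklearn.linear_model import LinearRegression

# Rows are edited videos; columns are semantic, spatial, refinement, temporal scores.
X = np.array([
    [0.91, 0.84, 0.88, 0.72],
    [0.75, 0.90, 0.80, 0.95],
    [0.60, 0.55, 0.58, 0.40],
    [0.85, 0.70, 0.77, 0.81],
])
human_ratings = np.array([0.80, 0.88, 0.50, 0.79])  # averaged human score per video

reg = LinearRegression(positive=True, fit_intercept=False).fit(X, human_ratings)
weights = reg.coef_ / reg.coef_.sum()  # normalize so the learned weights sum to 1
print(weights)
```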

That human-centered approach is what sets SST-EM apart. The team ran side-by-side comparisons between SST-EM and other popular metrics across multiple AI video editing models like Video-P2P, TokenFlow, Control-A-Video and FateZero. The results were clear: SST-EM came closest to human judgment every time.

The SST-EM score achieved near-perfect correlation with expert ratings, beating out every other metric on measures like imaging quality, object continuity and overall video coherence. On Pearson correlation, which measures linear similarity, SST-EM scored 0.962, higher than even metrics designed solely for image quality. In Spearman and Kendall rank correlation tests, which judge how closely the system's rankings match human rankings, SST-EM scored a perfect 1.0.
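
For reference, the three agreement measures named here can be computed with SciPy as follows; the numbers in the example are invented and are not the paper's data.

```python
# Pearson, Spearman and Kendall correlation on invented example scores.
from scipy.stats import pearsonr, spearmanr, kendalltau

human  = [0.92, 0.78, 0.85, 0.60, 0.70]  # hypothetical human ratings, one per model
metric = [0.90, 0.75, 0.83, 0.58, 0.69]  # hypothetical SST-EM scores for the same models

print("Pearson: ", pearsonr(human, metric)[0])   # linear agreement
print("Spearman:", spearmanr(human, metric)[0])  # agreement of rankings
print("Kendall: ", kendalltau(human, metric)[0]) # pairwise rank agreement
```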

"These numbers are not just good; they're remarkable," said Dr. Youshan Zhang, assistant professor of artificial intelligence and computer science. "They show that SST-EM evaluates video editing the way humans do: by considering not just what's in a frame, but how frames connect to tell a coherent story."

The team has made their code openly available on GitHub, inviting researchers and developers worldwide to test and refine the system. They're also planning enhancements, including better handling of fast scene changes, cluttered backgrounds and subtle object movements.

"We see SST-EM as the foundation for the next generation of video evaluation tools," said Dr. Zhang. "As video content grows more complex and ubiquitous, we need metrics that reflect how people actually watch and judge video. This is a step in that direction."
