AI Killed The Video Star
Exploring Sora's impact, from revolutionizing filmmaking to potential misuse in misinformation, and why AI's future still holds transformative promise.
If you found yourself questioning your place in the world after OpenAI demoed its new text-to-video model, Sora, earlier this week, you weren't alone. I was right there with you. Even YouTube’s MrBeast questioned his potential future.
For those who might have missed it, and I'm not sure how you could have, OpenAI announced Sora: a groundbreaking model that takes text, image, or video inputs and generates 60-second photorealistic videos or cute 3D animations. The demo footage is so convincing that, for the most part, you'd think it was real (or created by humans) unless you dove in frame by frame.
YouTuber MKBHD has said it pretty well a number of times, including in the video above: “just remember, this is the worst this technology will be from here on out.”
So, then, what could the future look like?
Even More Compute
Generative models like OpenAI's GPT-4 require substantial computing power, both for training and for generating outputs for their users, and those models work solely with text. Even so, their reliance on advanced GPUs has helped drive stock in companies like Nvidia up over 200% in the last year.
Even just this past month, companies like SMCI are up over 100% on the back of news that AI server demand will soar.
Well, buckle in, because that demand isn't likely to go anywhere. In a white paper published alongside Sora, OpenAI calls out that throwing more compute at the problem is all that's needed to generate better, higher-quality videos. Pretty obvious, right?
In the video shown above, you can see that "base compute" gives us an unrecognizable blob of a video, something we'd know for sure was computer generated. At 4x compute, the video begins to take on recognizable shapes, hinting at the potential for realism. By the time we reach 32x compute, the video, albeit slightly imperfect, mimics reality convincingly enough to pass a Turing test for video authenticity.
Getting to this level of “real”
Sora has achieved unprecedented levels of realism by venturing beyond conventional approaches and, undoubtedly, at significant expense.
Competitors in the video generation space, like Runway or Pika, use diffusion models to generate video. Diffusion models work like an artist sketching a portrait: starting with broad strokes and gradually refining the details until the picture comes to life.
OpenAI's Sora uses these same diffusion models, but blends them with transformers. Just as LEGO bricks connect to build complex structures, transformers assess the "blueprint" of preceding frames to construct the logical next step in the video sequence. They look at what came before and ask, "OK, what's the next best thing I can do?" In essence, and yes, I am going to write it: a predictive text generator, but for video.
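To make that pairing a little more concrete, here's a minimal sketch of the denoise-and-refine loop, with a toy function standing in for the transformer backbone. Everything here is illustrative; Sora's actual architecture isn't public beyond OpenAI's report, so the names, shapes, and numbers are all assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def predict_noise(noisy_video, step):
    """Hypothetical stand-in for the transformer backbone.

    A real model would attend across space-time patches (conditioning on
    the timestep `step`) and predict the noise to strip away; here we
    just return a fixed fraction of the input so the loop below runs.
    """
    return noisy_video * 0.1

# Start from pure static: a tiny "video" of 8 frames, 4x4 pixels each.
video = rng.normal(size=(8, 4, 4))

# The diffusion loop: repeatedly subtract the predicted noise, refining
# static toward a clean signal (in Sora's case, a coherent clip).
for step in range(50):
    video -= predict_noise(video, step)

print(np.abs(video).max())  # near zero: the "noise" has been stripped away
```

Per OpenAI's report, the real thing works on compressed space-time patches rather than raw pixels, and the transformer is what lets each patch account for the frames around it, which is where the temporal coherence in those demo clips comes from.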
By combining diffusion models with transformers, we got Sora. From here, it only gets better.
Cheaper, Faster Filmmaking
The obvious beneficiaries of this technology are filmmakers, but not necessarily everyone engaged in the making of films.
Technology like Sora, while new, will play a massive role in how films are made; there's no doubt about that. Special effects? Those will become cheaper. Need a scene shot at the Eiffel Tower? Why not create it from your studio in New York?
Lighting, actors, makeup: all of these, too, could be made cheaper or potentially eliminated by the technology.
These are not changes that will come overnight, but they will come, and they will shape the entertainment business in a massive way over the next decade. Much like YouTube and TikTok have shaped media consumption over the last two decades, I believe we can expect the same from these models.
Companies like Runway have already had an AI hand in generating a number of effects for feature films (Everything Everywhere All at Once), and have even started their own AI Film Festival for short films of 1-10 minutes.
Those 1-10 minutes will soon grow to 20-30, and eventually we will be sitting down to movies that utilize AI from start to finish. The days of mega-budget movies are likely numbered.
What's unlikely to happen: "Hey AI, make me a movie about X doing Y." There has been a lot of speculation that AI could allow for on-demand movie generation, but storytelling is so nuanced, and so much a shared experience, that it seems very far off, and not really something human nature wants. Should it come to pass, people will start to crave "human-made" media.
A question will, and should, arise here: what about the jobs? I don't think it's even speculative at this point to suggest we won't need as many animators, special effects artists, concept artists, or even writers working on movies. We likely won't need as much crew either: fewer lighting technicians, makeup artists, and even actors on set.
These have been viable career paths for millions, and adapting will take time. We will not find answers immediately, but humans have always found ways to handle progress, and I'd hope we can stand up to the challenge this time around too.
Effective Misinformation
As AI revolutionizes filmmaking, it also revolutionizes the crafting of believable misinformation, presenting unprecedented challenges.
Today, people believe all manner of text-generated content, no matter how outlandish it may seem (hello, flat-earthers). Imagine now a video to go along with it.
A video, produced from a few lines of text, showing a flat world. Or one showing the sitting President of a nation engaging in illegal activities while on vacation. I don't think these are a case of if, but a matter of when. With a major election in the United States less than 9 months away, I imagine AI, and the ethics surrounding it, will be dragged right to the forefront of public discussion.
As we stand on the brink of a new era in content creation, the imperative for ethical guidelines and robust safeguards against misuse has never been clearer.
Simulated Worlds
Another potential future for Sora is simulated worlds. As computing power grows and these models become more efficient, simulated worlds could support many use cases.
What is a simulated world? Well, in Sora's case, it can take footage and use the same diffusion and transformer models we spoke about above to "predict" the many possible paths for what could come next. This means Sora could analyze existing footage and generate multiple future scenarios branching from any given moment. Thinking of Minority Report? You're on the right path.
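As a back-of-the-napkin illustration, here's what that branching could look like. Note that `sample_continuation` is entirely hypothetical, a random-walk stand-in for a real video model:

```python
import numpy as np

rng = np.random.default_rng(42)

def sample_continuation(clip, steps=16):
    """Hypothetical rollout from a video world model.

    A real model would denoise space-time patches conditioned on the
    input clip; a random walk from the last frame keeps this runnable.
    """
    drift = rng.normal(scale=0.1, size=(steps,) + clip[-1].shape)
    future = clip[-1] + np.cumsum(drift, axis=0)
    return np.concatenate([clip, future])

# One observed clip: 8 frames of 4x4 "pixels".
observed = rng.normal(size=(8, 4, 4))

# Branch the same moment into several plausible futures.
for i in range(3):
    scenario = sample_continuation(observed)
    print(f"scenario {i}: {scenario.shape[0]} frames")  # 8 observed + 16 predicted
```

Each call conditions on the same observed frames but rolls the dice differently, which is all "multiple future scenarios" really means here.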
Steering away from the dystopia of it, there are great use cases for this type of technology. In filmmaking, directors could explore alternative storylines at the push of a button; game developers could offer players dynamically evolving worlds that respond to their actions in real time. In engineering, you could simulate how various structures or components might behave and actually get to visualize it.
A Bright Future
Technology like Sora will transform jobs; that much is certain. It will shape how we make and consume media; there's no doubt about that either. Still, I don't think we need to be scared of it just yet.
It's unproven in the field, and it takes a tremendous amount of computing power to create those 60-second videos. That will shift, and a decade from now I may look back on these words and laugh at just how naive I was.
For now, take a deep breath and do what’s seemed to work for everyone so far: buy Nvidia (I kid, of course, make sure you do some due diligence there).
AI may kill the video star as we know it. While we don’t know what the other side looks like just yet, here’s hoping it’s positive for all.