First text, then images – and now videos too. AI video generators are still in their infancy in spring 2025. And yet, despite occasional glitches, shaky transitions and sometimes inconsistent logic, they are already producing impressive results. Generated videos are considered the next major milestone in AI because of their potential: models such as Runway Gen-4 and OpenAI Sora are expected to enable so-called ‘general world models’. These are AI systems that not only generate content, but also have a deep, physics-based understanding of the world.
How is AI changing video production?
AI video generators are the next step beyond image generation: they combine images with movement. You can either create AI videos from scratch or modify existing videos. But how does this work technically, and what mechanisms are behind it?
Early approaches to AI video generation are based on image generation and simply string individual images together. Modern systems, by contrast, strive for a physical understanding of the world and recreate it in line with its physical principles.
AnimateDiff: one of the early approaches, it is a further development of text-to-image diffusion models such as Stable Diffusion, in which the generated still images are set in motion. This is how it works:
By training on real video data, the AI learns how to derive subsequent frames from a starting image. It then strings the generated frames together to create a video.
The catch? Frame 2 is simply appended to frame 1 without following an overarching script or the physical principles of our world. This often results in slightly psychedelic effects: objects morph into one another or irritate the viewer with their slight jerkiness. The look varies from frame to frame, and the movement often simply does not feel real when viewed.
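If you would like to see what this frame-chaining looks like in practice, here is a minimal sketch using the AnimateDiff pipeline from the Hugging Face diffusers library – the model names and parameters are illustrative assumptions, not a recommended setup:

```python
# Minimal AnimateDiff sketch (assumes the Hugging Face diffusers library and a GPU).
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter
from diffusers.utils import export_to_gif

# The motion adapter was trained on real video data: it is what lets a
# still-image model predict how one frame follows on from the previous one.
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2")

# Attach the adapter to an ordinary text-to-image checkpoint (illustrative model ID).
pipe = AnimateDiffPipeline.from_pretrained(
    "emilianJR/epiCRealism",
    motion_adapter=adapter,
    torch_dtype=torch.float16,
).to("cuda")

# Generate a short series of frames and string them together into a clip.
result = pipe(
    prompt="a red apple falling from a tree, summer light",
    num_frames=16,             # roughly two seconds at 8 frames per second
    num_inference_steps=25,
)
export_to_gif(result.frames[0], "apple.gif")
```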
Video examples and further information about AnimateDiff(opens in new tab)
Sora (OpenAI): belongs to the category of ‘world building’. This means that AI video generators such as Sora combine transformer and diffusion models on the one hand and incorporate the space-time component (we explain what this is below) on the other. Here's how it works: with the help of the transformer model, the AI predicts the most likely next words. This allows it to develop your original prompt into a more detailed, technically feasible prompt for the video.
The diffusion model then turns this refined prompt into images. It works up the individual frames in several steps, starting from noise – up to this point, the process is the same as for image generation.
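If you want to try the first stage yourself – letting a language model expand a rough idea into a detailed video prompt – it can be sketched in a few lines. A minimal example, assuming the OpenAI Python client (the model name and instructions are illustrative; the diffusion stage runs inside the video generator itself):

```python
# Stage 1 sketch: a transformer model rewrites a short idea into a detailed video prompt.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

rough_idea = "an apple falls from a tree"

response = client.chat.completions.create(
    model="gpt-4o-mini",  # illustrative choice; any capable chat model will do
    messages=[
        {
            "role": "system",
            "content": (
                "Rewrite the user's idea as a detailed video prompt: describe "
                "subject, setting, lighting, camera movement and visual style "
                "so that a text-to-video model can follow it."
            ),
        },
        {"role": "user", "content": rough_idea},
    ],
)

video_prompt = response.choices[0].message.content
print(video_prompt)  # this richer prompt is what the diffusion stage then renders
```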
In order for these generated images to be assembled logically and harmoniously, the AI needs a basic understanding of physics. Sora achieves this through so-called space-time patches.
What are space-time patches?
The AI develops its understanding of space-time by breaking down billions of videos into their smallest units (tokens) and analysing them: videos become individual images, individual images become areas of colour, areas of colour become colour pixels, and colour pixels become numbers. From the regularities in these numbers, the AI learns the laws of our physical world – and how to apply them itself.
Too abstract? The AI learns that when an apple (like other objects) falls, it always moves in a straight line towards the ground due to gravity. With this trained knowledge, OpenAI Sora can now drop the apple in the video onto the ground in a deceptively realistic manner.
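To make this tokenisation a little more concrete, here is a toy sketch (using NumPy, with made-up sizes) of how a clip – ultimately just a block of numbers – can be cut into space-time patches:

```python
# Toy sketch: cutting a video tensor into non-overlapping space-time patches.
import numpy as np

frames, height, width, channels = 16, 64, 64, 3
video = np.random.rand(frames, height, width, channels)  # stand-in for a real clip

t, p = 4, 16  # each patch covers 4 frames and a 16 x 16 pixel area

patches = video.reshape(frames // t, t, height // p, p, width // p, p, channels)
patches = patches.transpose(0, 2, 4, 1, 3, 5, 6)      # group the patch axes together
patches = patches.reshape(-1, t * p * p * channels)   # flatten each patch into a vector

print(patches.shape)  # (64, 3072): 64 tokens, each described by 3,072 numbers
```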
Video examples and further information about Sora from OpenAI
Frame-chaining approach (e.g. AnimateDiff): ‘I think about how the dog will move on in this one image and generate similar, sequential images.’
World-model approach (e.g. Sora): ‘I have learned the physical principles that govern how the world works and generate a video based on my knowledge of how a dog moves when jumping.’
Have you acquired a taste for it and would like to generate a video yourself? You can find the most popular models and what makes them special here:
| Model | Provider | Clip length | Special characteristics |
| --- | --- | --- | --- |
| Veo 2(opens in new tab) | Google DeepMind | 8s (720p-4K) | Best motion physics, detailed scenes and variable styles, integration with Gemini and Vertex AI |
| OpenAI Sora(opens in new tab) | OpenAI | 20s (1080p) | Storyboard editor, ChatGPT integration |
| Runway Gen-4(opens in new tab) | Runway | 10s (30s Render) | High cinematic quality, fast processing, 4K export, consistent characters |
| Pika 2.2(opens in new tab) | Pika | 3-15s | Inpainting functions with creative effects for scene transitions |
| WAN 2.1(opens in new tab) | Wan AI, Alibaba | 2-3s (720p) | Open-source/free model, can display Chinese and English text well in videos |
Due to high demand, the market for video generators is also developing rapidly. There are already numerous video AIs available, and more are being added every day.
But with so much choice, it can be difficult to decide, right? That's why we recommend finding out about the specific capabilities and typical areas of application of the different models (you can also ask AI chatbots such as ChatGPT or Perplexity for advice) and then choosing the model that's right for you.
Incidentally, the Video Generation Arena Leaderboard(opens in new tab) provides an ongoing performance comparison.
When generating videos, you proceed in a similar way to prompting images. However, there are a few additional things to consider to ensure that you end up with the videos you want.
The generated videos can be used for a variety of purposes.
When using video generators such as OpenAI Sora or Runway Gen-4, describe the desired scene in detail and tell the AI about everything that matters for the shot.
Tip: You can also enlist the help of a text AI and ask it to optimise your prompt for the video conversion.
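As an illustration of what ‘describe the scene in detail’ can mean in practice, here is a small sketch that assembles a prompt from typical building blocks – the categories are common suggestions, not an official checklist of any particular tool:

```python
# Sketch: assembling a detailed video prompt from typical building blocks.
scene = {
    "subject": "a ripe red apple hanging from a branch",
    "setting": "an orchard at the height of summer",
    "lighting": "warm late-afternoon sunlight, soft shadows",
    "camera": "slow upward pan along the tree trunk",
    "style": "photorealistic, shallow depth of field",
    "motion": "leaves sway gently in a light breeze",
}

# Join the pieces into a single prompt string for the video generator.
prompt = ", ".join(f"{key}: {value}" for key, value in scene.items())
print(prompt)
```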
Think of your video as a series of mini scenes with transitions in between. To ensure that the AI knows exactly what you expect from it, create a storyboard with clear directing instructions for each mini scene and transition. The storyboard function in OpenAI Sora helps you with this scene division.
Tip: Describe only one movement per scene. AI will adhere better to your specifications if you don't specify too many changes at once. If a lot is happening in the scene, ask yourself: Can I subdivide the scene further? This makes it easier for the AI and, in return, you get better results.
An example? Let's take our apple example again (a small sketch of the resulting storyboard follows the scene list):
Scene 1: Summer atmosphere
Scene 2: Camera movement along the tree
Scene 3: The apple comes loose
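One way to keep such a storyboard tidy is to note it down as structured data before you start prompting – a minimal sketch with made-up wording:

```python
# Sketch: a storyboard as structured data – one movement per mini scene.
storyboard = [
    {
        "scene": "Summer atmosphere",
        "prompt": "an orchard in bright summer light, static wide shot",
        "transition": "slow cross-fade",
    },
    {
        "scene": "Camera movement along the tree",
        "prompt": "the camera pans slowly up the trunk of an apple tree towards the crown",
        "transition": "cut",
    },
    {
        "scene": "The apple comes loose",
        "prompt": "a single red apple detaches from its branch and falls towards the ground",
        "transition": "none",
    },
]

# Prompt each mini scene separately, then join the generated clips afterwards.
for shot in storyboard:
    print(f"{shot['scene']}: {shot['prompt']} (transition: {shot['transition']})")
```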
Depending on the model, different aspect ratios (e.g. 9:16 or 16:9) are available. Since editing the video afterwards can reduce its quality, it is best to decide on the final format at the outset and let the AI generate the video in that format directly.
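If you want to pin the format down before generating, a tiny helper like this (an illustrative sketch, not part of any tool) derives the output resolution from the aspect ratio and a target height:

```python
# Sketch: derive the output resolution from an aspect ratio and a target height.
def resolution_for(aspect_ratio: str, height: int) -> tuple[int, int]:
    w, h = (int(x) for x in aspect_ratio.split(":"))
    width = round(height * w / h)
    return width, height

print(resolution_for("16:9", 1080))  # (1920, 1080) – landscape, e.g. for YouTube
print(resolution_for("9:16", 1920))  # (1080, 1920) – portrait, e.g. for Reels or Shorts
```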
AI video generation is not an exact science, but rather a creative process. And creative processes rarely run smoothly. So if it takes two or three attempts per scene to get the video to meet your expectations, be patient with the AI – and with yourself.
Tip: Small changes to the prompt can sometimes have a big impact.
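For instance (a purely hypothetical illustration), a single added phrase can turn a neutral wide shot into a dramatic slow-motion close-up:

```python
# Two prompts for the same scene – they differ only in the final phrase.
prompt_a = "a red apple falls from the tree onto the grass"
prompt_b = "a red apple falls from the tree onto the grass, slow motion, extreme close-up"
```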
If you are satisfied with the generated video, you can edit it one last time. You can use additional tools for this, e.g.: Recut(opens in new tab) allows you to shorten AI-generated videos or export specific sections. With Remix AI Video & Images from Google, you can edit specific elements in your video – for example, replace a person, change the background or generate a new movement.
Are you a visual learner or want to learn more? Then we recommend the AI tutorials from Futurepedia(opens in new tab).
Even if you did not generate a video yourself, you are still part of its distribution chain once you share it. Always be aware of this responsibility.
Author Dan Taylor Watt has compared numerous AI video generators in his blog, always using the same prompt to test the capabilities of the different systems. Here is an overview of five of the most popular generators.
The test prompt used for all systems: ‘A woman pushing a buggy across a zebra crossing whilst talking on her phone and walking her whippet.’
(Video examples generated with VEO 2, Sora, Runway Gen-4, PIKA 2 and WAN 2.)
Newer models achieve higher quality through physical understanding. Both images and videos in photorealistic style can appear deceptively real as a result. This brings with it both opportunities and risks.
We also consider ethical and social issues in our digital guide to generative image AI.
Video deepfakes are videos that have been manipulated using AI. This involves falsifying statements or misusing personal data to superimpose one face onto another. Celebrities are particularly affected, as a lot of digital data for face generation is available on the internet.
What exactly is a deepfake? Datenschutzgesetze.eu defines deepfakes as follows:
The term ‘deepfake’ refers to AI-generated or manipulated image, audio or video content that resembles real people, objects, places, facilities or events and would falsely appear to a person to be genuine or truthful.
Deepfakes are characterised by the use of AI for manipulation. Shallowfakes, by contrast, are fakes created using traditional editing and image-processing programs.
As AI continues to improve, it is becoming increasingly difficult to detect deepfakes. A few characteristics you can look out for to expose video deepfakes are:
Look at the proportions of the face and head – do they fit together? With deepfakes, the head is sometimes slightly twisted or sits unnaturally on the body. The transitions from face to neck may also be worth a second look.
Pay attention to sudden jumps in the image, illogical camera angles or abrupt cuts. Look closely, especially during scene changes.
Are the image and sound synchronised? Especially in earlier deepfakes, the lip movements often do not match the spoken text perfectly. Check whether the mouth is forming correctly (especially for difficult words).
Our body language is complex and context-dependent. Deepfakes lack the natural connection between mind and body that intuitively controls our movements. The movements in deepfakes can therefore appear uniform or simply not quite match what is being said or a particular emotion.
A person's gaze reveals a lot, because even a glance can be a form of communication. So check: do the eyes appear lively? In deepfakes, the eyes are often fixed, empty or unnaturally shiny. Sometimes the blinking is also irritating because it is robotic or completely absent.
Are the light sources in the image logical and consistent? Do the shadows fall correctly and in the same direction everywhere on the face and body? This can be a valuable clue, as deepfakes can often be exposed by inconsistencies in the shadows.
The representation of hands is still a weak point in many models. Therefore, take a close look at the hands and fingers of the people in the video: are there any strange finger positions or unrealistic moments, such as fingers overlapping or appearing to pass through an object?
As with fake news, check the source of the video. Watch the video in full screen mode to see as many details as possible. And always remain sceptical and cautious: if you are unsure whether the content is true, it is better not to share the video.
Incidentally, there are now platforms that can help you expose deepfakes: Deepware scanner(opens in new tab), Deepfake-o-meter(opens in new tab), etc. However, depending on the technical sophistication of the platform, the results should be treated with caution (see this study from February 2025(opens in new tab)). Ultimately, the best tool is and remains common sense.
Test yourself in SRF's deepfake quiz: How good are you at recognising deepfakes?(opens in new tab)
We have compiled further information and content on the topic of ‘AI video generators’ here.