OpenAI shared its research on Sora, a text-to-video AI model. Sora generates videos up to one minute long from text prompts. Example videos in a range of styles, including live-action film, drone footage, and 3D animation, were released on the webpage.
The output is smoother than that of any AI video tool released so far. There are no jittery, flickering frames, and moving and static elements are clearly distinguished. Subject and background separate according to focus, and the level of detail rivals a still image. It feels like watching the 3D special effects that first appeared in live-action films years ago.
Physical interactions between objects and causal relationships over time are not yet depicted accurately. As a scene gains more elements and grows more complex, more awkward artifacts appear.
Sora converts videos and images into small pieces of data called patches. By unifying how visual data is represented, much as computers reduce everything to bytes, the model can train on and scale to a far wider range of data. It understands and renders text prompts by drawing on the visual training accumulated through earlier research on the DALL·E and GPT models. Beyond text, it can also generate video from a still image or extend an existing video.
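To make the patch idea concrete, here is a minimal sketch of carving a video tensor into "spacetime patches". The array layout and the patch sizes (`t`, `p`) are assumptions for illustration; OpenAI has not published Sora's actual patching code.

```python
import numpy as np

def video_to_patches(video: np.ndarray, t: int = 2, p: int = 16) -> np.ndarray:
    """Split a video of shape (T, H, W, C) into flattened spacetime patches.

    Each patch covers `t` consecutive frames and a `p` x `p` pixel region,
    so the whole clip becomes a sequence of patch vectors, analogous to
    the token sequence a language model consumes.
    """
    T, H, W, C = video.shape
    assert T % t == 0 and H % p == 0 and W % p == 0, "dimensions must divide evenly"
    # Carve the clip into a grid of (t, p, p, C) blocks...
    blocks = video.reshape(T // t, t, H // p, p, W // p, p, C)
    blocks = blocks.transpose(0, 2, 4, 1, 3, 5, 6)
    # ...then flatten each block into one vector per patch.
    return blocks.reshape(-1, t * p * p * C)

# Example: a 16-frame 256x256 RGB clip becomes 2048 patch vectors.
clip = np.zeros((16, 256, 256, 3), dtype=np.float32)
print(video_to_patches(clip).shape)  # (2048, 1536)
```

Because every image or video, whatever its resolution or length, reduces to the same kind of patch sequence, one architecture can train on all of it, which is the scaling advantage the analogy to bytes points at.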
OpenAI also recently shared measures against harmful uses of AI such as deepfakes. To prevent extreme content, the company restricts text input prompts and has developed an image classifier that reviews frames for compliance with its usage policies. What gets blocked and what gets through? It feels similar to when social media spread and everyone became a publisher.
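A frame-reviewing classifier of the kind described could plausibly work like the sketch below. The `policy_classifier` function, the sampling stride, and the threshold are all hypothetical; OpenAI has not published its moderation pipeline.

```python
from typing import Callable, List
import numpy as np

def review_video(frames: List[np.ndarray],
                 policy_classifier: Callable[[np.ndarray], float],
                 stride: int = 8,
                 threshold: float = 0.5) -> bool:
    """Return True if every sampled frame passes the usage-policy check.

    `policy_classifier` is assumed to map a frame to a violation score
    in [0, 1]; sampling every `stride`-th frame keeps review cheap while
    still catching content that persists across the clip.
    """
    return all(policy_classifier(frame) < threshold
               for frame in frames[::stride])
```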