We've discovered neurons in CLIP that respond to the same concept whether presented literally, symbolically, or conceptually. This may explain CLIP's accuracy in classifying surprising visual renditions of concepts, and is also an important step toward understanding the associations and biases that CLIP and similar models learn.

Fifteen years ago, Quiroga et al.1 discovered that the human brain possesses multimodal neurons. These neurons respond to clusters of abstract concepts centered around a common high-level theme, rather than any specific visual feature. The most famous of these was the "Halle Berry" neuron, a neuron featured in both Scientific American and The New York Times, that responds to photographs, sketches, and the text "Halle Berry" (but not other names).

Two months ago, OpenAI announced CLIP, a general-purpose vision system that matches the performance of a ResNet-50,2 but outperforms existing vision systems on some of the most challenging datasets. Each of these challenge datasets, ObjectNet, ImageNet Rendition, and ImageNet Sketch, stress-tests the model's robustness not just to simple distortions or changes in lighting or pose, but also to complete abstraction and reconstruction: sketches, cartoons, and even statues of the objects.

Now, we're releasing our discovery of the presence of multimodal neurons in CLIP. One such neuron, for example, is a "Spider-Man" neuron (bearing a remarkable resemblance to the "Halle Berry" neuron) that responds to an image of a spider, an image of the text "spider," and the comic book character "Spider-Man" either in costume or illustrated.

Our discovery of multimodal neurons in CLIP gives us a clue as to what may be a common mechanism of both synthetic and natural vision systems: abstraction. We discover that the highest layers of CLIP organize images as a loose semantic collection of ideas, providing a simple explanation for both the model's versatility and the representation's compactness.
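To make the idea of probing a single neuron concrete, here is a minimal sketch (not the analysis code behind this work) of how one might record one unit's activation for several renditions of the same concept, using the open-source `clip` package and PyTorch. The image filenames, the choice of `layer4` as the hooked layer, and the unit index are all hypothetical placeholders; a multimodal neuron would be one whose activation stays high across the photo, the rendered text, and the illustration.

```python
# Sketch: probe one CLIP neuron's response across different renditions of a concept.
# Assumes the open-source `clip` package (github.com/openai/CLIP) and PyTorch.
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("RN50", device=device)  # ResNet-50 CLIP backbone

# Capture activations from the last residual stage of the visual encoder.
activations = {}

def hook(module, inputs, output):
    # output shape: (batch, channels, h, w); average over spatial positions
    activations["layer4"] = output.mean(dim=(2, 3))

handle = model.visual.layer4.register_forward_hook(hook)

# Hypothetical inputs: a spider photo, the word "spider" rendered as an image,
# and a Spider-Man illustration.
paths = ["spider_photo.png", "spider_text.png", "spiderman_drawing.png"]
unit = 550  # hypothetical neuron index within layer4's channels

with torch.no_grad():
    for path in paths:
        image = preprocess(Image.open(path)).unsqueeze(0).to(device)
        model.encode_image(image)  # forward pass triggers the hook
        value = activations["layer4"][0, unit].item()
        print(f"{path}: unit {unit} activation = {value:.3f}")

handle.remove()
```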