Perceiver: General Perception with Iterative Attention (Google DeepMind Research Paper Explained)


#perceiver #deepmind #transformer

Inspired by the fact that biological creatures attend to multiple modalities at the same time, DeepMind releases its new Perceiver model. Based on the Transformer architecture, the Perceiver makes no assumptions about the modality of the input data and also addresses the long-standing quadratic bottleneck of self-attention. It does so by running a Transformer over a small latent array, into which the input data is fed multiple times via cross-attention, so the cost of attending to the input grows linearly rather than quadratically with input size. The Perceiver's weights can also be shared across layers, making it very similar to an RNN. Perceivers achieve competitive performance on ImageNet and state-of-the-art results on other modalities (AudioSet), all without architectural adjustments for the type of input data.
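
To make the idea above concrete, here is a minimal NumPy sketch of the loop: a small latent array repeatedly cross-attends to the full input and then self-attends, with the same weights reused in every iteration. This is not the paper's implementation. The function names (`attention`, `perceiver_sketch`), the sizes, and the random weights standing in for learned parameters are all assumptions made for illustration, and layer norm, MLP blocks, multiple heads and positional encodings are left out.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q_in, kv_in, Wq, Wk, Wv):
    """Single-head attention: queries come from q_in, keys/values from kv_in."""
    q = q_in @ Wq                                # (num_queries, d)
    k = kv_in @ Wk                               # (num_keys, d)
    v = kv_in @ Wv                               # (num_keys, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (num_queries, num_keys)
    return softmax(scores) @ v                   # (num_queries, d)

def perceiver_sketch(inputs, num_latents=8, d=16, num_iters=3, seed=0):
    """Hypothetical toy version of the latent loop (illustration only)."""
    rng = np.random.default_rng(seed)
    num_inputs, channels = inputs.shape

    # Random values stand in for the learned latent array and learned weights.
    latents = rng.normal(size=(num_latents, d))
    cross_w = (
        rng.normal(scale=d ** -0.5, size=(d, d)),                # queries from latents
        rng.normal(scale=channels ** -0.5, size=(channels, d)),  # keys from inputs
        rng.normal(scale=channels ** -0.5, size=(channels, d)),  # values from inputs
    )
    self_w = tuple(rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(3))

    # The same weights are reused every iteration, which is what makes the
    # unrolled model resemble an RNN.
    for _ in range(num_iters):
        # Cross-attention: score matrix is (num_latents, num_inputs),
        # so the cost is linear in the number of inputs.
        latents = latents + attention(latents, inputs, *cross_w)
        # Latent self-attention: score matrix is (num_latents, num_latents),
        # independent of the input size.
        latents = latents + attention(latents, latents, *self_w)
    return latents

if __name__ == "__main__":
    pixels = np.random.default_rng(1).normal(size=(1000, 5))  # 1000 inputs, 5 channels
    print(perceiver_sketch(pixels).shape)
```

The input is only ever touched through the cross-attention step, so nothing in the loop depends on the inputs forming a grid, a sequence or any other particular structure.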

OUTLINE:
0:00 - Intro & Overview
2:20 - Built-In Assumptions of Computer Vision Models
5:10 - The Quadratic Bottleneck of Transformers
8:00 - Cross-Attention in Transformers
10:45 - The Perceiver Model Architecture & Learned Queries
20:05 - Positional Encodings via Fourier Features
23:25 - Experimental Results & Attention Maps
29:05 - Comments & Conclusion

Paper: https://arxiv.org/abs/2103.03206

My Video on Transformers (Attention is All You Need): https://youtu.be/iDulhoQ2pro

Abstract:
Biological systems understand the world by simultaneously processing high-dimensional inputs from modalities as diverse as vision, audition, touch, proprioception, etc. The perception models used in deep learning on the other hand are designed for individual modalities, often relying on domain-specific assumptions such as the local grid structures exploited by virtually all existing vision models. These priors introduce helpful inductive biases, but also lock models to individual modalities. In this paper we introduce the Perceiver - a model that builds upon Transformers and hence makes few architectural assumptions about the relationship between its inputs, but that also scales to hundreds of thousands of inputs, like ConvNets. The model leverages an asymmetric attention mechanism to iteratively distill inputs into a tight latent bottleneck, allowing it to scale to handle very large inputs. We show that this architecture performs competitively or beyond strong, specialized models on classification tasks across various modalities: images, point clouds, audio, video and video+audio. The Perceiver obtains performance comparable to ResNet-50 on ImageNet without convolutions and by directly attending to 50,000 pixels. It also surpasses state-of-the-art results for all modalities in AudioSet.
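
A rough back-of-the-envelope calculation shows why the latent bottleneck matters at this input size. The sketch below counts entries in the attention score matrix only. The 224 x 224 image size (about 50,000 pixels) and the latent count of 512 are assumptions chosen for illustration, and `attention_score_entries` is a hypothetical helper, not something from the paper's code.

```python
def attention_score_entries(num_inputs, num_latents):
    """Count score-matrix entries for one attention layer of each kind."""
    full_self_attention = num_inputs * num_inputs      # every input attends to every input
    cross_attention = num_latents * num_inputs         # latents attend to inputs
    latent_self_attention = num_latents * num_latents  # latents attend to latents
    return full_self_attention, cross_attention, latent_self_attention

num_pixels = 224 * 224   # assumed ImageNet crop size, roughly the 50,000 pixels mentioned
num_latents = 512        # assumed latent array size, for illustration only

full, cross, latent = attention_score_entries(num_pixels, num_latents)
print(f"full self-attention:   {full:,}")
print(f"cross-attention:       {cross:,}")
print(f"latent self-attention: {latent:,}")
print(f"full / cross ratio:    {full // cross}")
```

Under these assumptions, one cross-attention layer needs about 98 times fewer score entries than full self-attention over the pixels, and the latent self-attention layers do not depend on the input size at all.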

Authors: Andrew Jaegle, Felix Gimeno, Andrew Brock, Andrew Zisserman, Oriol Vinyals, Joao Carreira

Links:
TabNine Code Completion (Referral): http://bit.ly/tabnine-yannick
YouTube: https://www.youtube.com/c/yannickilcher
Twitter: https://twitter.com/ykilcher
Discord: https://discord.gg/4H8xxDF
BitChute: https://www.bitchute.com/channel/yann...
Minds: https://www.minds.com/ykilcher
Parler: https://parler.com/profile/YannicKilcher
LinkedIn: https://www.linkedin.com/in/yannic-ki...
BiliBili: https://space.bilibili.com/1824646584

If you want to support me, the best thing to do is to share out the content :)

If you want to support me financially (completely optional and voluntary, but a lot of people have asked for this):
SubscribeStar: https://www.subscribestar.com/yannick...
Patreon: https://www.patreon.com/yannickilcher
Bitcoin (BTC): bc1q49lsw3q325tr58ygf8sudx2dqfguclvngvy2cq
Ethereum (ETH): 0x7ad3513E3B8f66799f507Aa7874b1B0eBC7F85e2
Litecoin (LTC): LQW2TRyKYetVC8WjFkhpPhtpbDM4Vw7r9m
Monero (XMR): 4ACL8AGrEo5hAir8A9CeVrW8pEauWvnp1WnSDZxW7tziCDLhZAGsgzhRQABDnFy8yuM9fWJDviJPHKRjV4FWt19CJZN9D4n
