Although it is most prominently featured in large language models such as GPT, the transformer architecture can be adapted to almost any structured data. It has become clear that if data can be structured to fit this architecture, there will likely be performance gains compared to non-transformer-based approaches. The biggest limiting factor, however, is the size of the input that transformers can handle. Researchers at DeepMind have been developing an approach over several papers that introduces a new, general transformer architecture to avoid this issue, which they call the Perceiver. Read on to learn more.
The Attention Mechanism: As Data Inputs Become Larger, Compute Time Explodes
Transformers, while a powerful tool for artificial intelligence (AI), unfortunately face an intractable problem when it comes to large input sizes or multi-modal data. In general, the larger the input or context, the more nuanced the predictions can be. This is a direct consequence of the attention mechanism within the model, which was the motivation and main contribution of the original transformer paper and was shown to be crucial for its performance gains [1]. This attention mechanism, however, comes with a cost: compute time grows as the square of the input size, sometimes called the quadratic bottleneck of the transformer. While this is less of a problem in language domains where input sizes are on the order of thousands of tokens, it quickly becomes intractable in domains with large or multi-dimensional inputs such as images, audio, and video, where input sizes approach hundreds of thousands of tokens. Multi-modal or multi-domain data, such as text paired with images, is problematic for the same reasons.
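To get a rough sense of that scaling, note that the attention-weight matrix has one entry for every pair of tokens, so

\[M = 1{,}000 \;\Rightarrow\; M^2 = 10^6 \text{ entries}, \qquad M = 100{,}000 \;\Rightarrow\; M^2 = 10^{10} \text{ entries}.\]

In other words, a 100x longer input means 10,000x more pairwise attention entries to compute and store.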
Using Cross-Attention to Overcome the Transformer Architecture's Limitations
Researchers at DeepMind have been developing an approach, over the course of several papers, that introduces a new, general transformer architecture to avoid this issue, which they call the Perceiver [2-4]. The idea behind it is relatively straightforward, and not entirely new, but they apply it in a clean and complete way and use it to define a clear path toward much larger input sizes in transformers.
The key to their approach is the use of cross-attention as opposed to self-attention. Self-attention, which is used ubiquitously in modern transformers, is the mechanism that leads to the dreaded quadratic bottleneck, as it allows full attention from every input token to every other input token. Cross-attention, historically, is simply attention from one input space onto a separate input space. It is typically used to connect the encoder and decoder branches of transformer-based language-translation models, where the encoder and decoder may contain different numbers of tokens because they operate on different languages. In the Perceiver, with a bit of overloading of terms, cross-attention is instead used to reduce the amount of self-attention that needs to be applied.
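To make the difference concrete, here is a minimal NumPy sketch contrasting the shape of the attention-weight matrix in self-attention versus cross-attention from a smaller set of queries. The learned Q/K/V projections are omitted for brevity, and the sizes M, N, and D are illustrative, not taken from any paper.

```python
import numpy as np

rng = np.random.default_rng(0)

M, N, D = 1024, 64, 32   # M input tokens, N query tokens, D channels (illustrative sizes)
inputs = rng.standard_normal((M, D))
queries_small = rng.standard_normal((N, D))

def attention(q, k, v):
    """Scaled dot-product attention; the weight matrix has shape (len(q), len(k))."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Self-attention: queries, keys, and values all come from the input -> (M x M) weights.
_, self_weights = attention(inputs, inputs, inputs)

# Cross-attention: queries come from a separate, smaller space -> (N x M) weights.
_, cross_weights = attention(queries_small, inputs, inputs)

print(self_weights.shape)   # (1024, 1024) -- grows quadratically with M
print(cross_weights.shape)  # (64, 1024)   -- grows only linearly with M for fixed N
```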
Perceiver Architecture Scales Large Input Sizes by Learning Latent Queries
As can be seen in the figure above, attention modules require (K)ey, (Q)uery, and (V)alue matrices as input, which are themselves just learned transforms of the original input. Attention is generally calculated as:
\[\mathrm{Attention}(Q, K, V) = \mathrm{softmax}(QK^T)V\]
As is the case with standard transformers, if the input is M tokens long and each token has C channels, the attention mechanism ends up multiplying an (M x M) matrix with an (M x C) matrix, leading to the problematic quadratic complexity in the input size. If you are trying to use very large inputs, this quickly becomes intractable. The Perceiver architecture solves the problem of large input sizes (M) by instead learning a smaller query matrix Q of size (N x D), leading to an attention compute complexity of N*M, i.e. linear in the input size M. Here, N is fixed and can be chosen to make the compute time feasible given the size of M.
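A minimal sketch of a single Perceiver-style cross-attention step is shown below. The array sizes, the random latent array, and the random projection matrices are illustrative stand-ins, not the published model; the point is that the attention-weight matrix is (N x M) and the output has the latent shape (N, D), independent of the input size M.

```python
import numpy as np

rng = np.random.default_rng(0)

M, C = 50_000, 3     # large byte-array-style input: M tokens with C channels each
N, D = 256, 128      # much smaller latent array (sizes are illustrative)

inputs = rng.standard_normal((M, C))
latents = rng.standard_normal((N, D))    # stand-in for the learned latent queries

# Stand-ins for the learned projection weights.
W_q = rng.standard_normal((D, D))
W_k = rng.standard_normal((C, D))
W_v = rng.standard_normal((C, D))

Q = latents @ W_q    # (N, D): queries come from the latent array
K = inputs @ W_k     # (M, D): keys and values come from the raw input
V = inputs @ W_v     # (M, D)

scores = Q @ K.T / np.sqrt(D)                    # (N, M): linear in M for fixed N
weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
weights /= weights.sum(axis=-1, keepdims=True)
latents_out = weights @ V                        # (N, D): independent of the input size M

print(scores.shape, latents_out.shape)           # (256, 50000) (256, 128)
```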
Below is a plot showing the difference in compute time for standard transformer attention and Perceiver attention across various context sizes, illustrating how poorly compute time scales in standard transformers. I was only able to calculate transformer compute time up to a context size of about 10,000 tokens before running out of RAM on a standard Google Colab node, while the Perceiver computation ran up to about 10 million tokens before hitting RAM limits. That suggests roughly a 1,000x increase in maximum context size when using the Perceiver architecture!
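In the spirit of the comparison in the referenced Quadradtic_compare.ipynb notebook, here is a back-of-the-envelope estimate (not the actual notebook) of the memory needed just for the attention-weight matrix in each architecture. The latent size N and float32 storage are assumptions, and real usage is several times higher once intermediate copies are included, but the scaling behavior is the point.

```python
# Rough memory estimate for the attention-weight matrix alone (float32).
BYTES_PER_FLOAT = 4
N_LATENTS = 512          # illustrative latent size for the Perceiver column

print(f"{'context M':>12} {'self-attn (GB)':>16} {'Perceiver (GB)':>16}")
for M in [1_000, 10_000, 100_000, 1_000_000, 10_000_000]:
    self_attn_gb = M * M * BYTES_PER_FLOAT / 1e9          # (M x M) weights
    perceiver_gb = N_LATENTS * M * BYTES_PER_FLOAT / 1e9  # (N x M) weights
    print(f"{M:>12,} {self_attn_gb:>16,.2f} {perceiver_gb:>16,.2f}")
```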
The authors describe the smaller Q in this case as a latent representation of the input, or of the current state of the model, similar to the latent representations used in auto-encoder models. Essentially it is a compression or clustering of the input into a smaller matrix or subspace, one that hopefully does not destroy the essential details of the original input space. The result is an attention mapping from the input tokens onto a latent/compressed/clustered version of those tokens, which is then applied to the model's current latent representation, itself a learned compression of the original input space.
Over the course of at least three papers, the DeepMind authors applied this architecture very broadly (to images, audio, music, text, and multi-modal and multi-task problems) with notable success, including some state-of-the-art results. Critically, as shown in the figure above, compute time scales with the size of the smaller latent representation. There is therefore a performance trade-off between the size of the latent representation and the input or context length. With the Perceiver architecture this trade-off is a tunable parameter, allowing an empirically demonstrated, optimal balance between the two.
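As a hypothetical illustration of that tunability (the budget and sizes below are made up, not from the papers), holding the cross-attention cost N*M roughly constant shows how a longer context forces a smaller latent array, and vice versa:

```python
# Hypothetical fixed attention budget: N * M held constant across configurations.
ATTENTION_BUDGET = 512 * 100_000   # e.g. N = 512 latents at a 100k-token context

for context_m in [50_000, 100_000, 200_000, 400_000]:
    n_latents = ATTENTION_BUDGET // context_m
    print(f"context {context_m:>7,} tokens -> latent size N = {n_latents}")
```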
Leveraging the Perceiver for AAV Constructs: Pairing Complex, High-Dimensional Companion Data with DNA Sequences
There are many other interesting aspects of the Perceiver model that can be addressed in future write-ups, including a recurrent-like use of transformers, configurable compute costs at inference, and NeRF-like query vectors for requesting specific information during decoding. However, the main implication of this new model is that we can begin approaching problems with context sizes on the order of 100,000 input tokens within high-performance transformer architectures. This allows, for example, modeling whole AAV constructs as a single input.
Also, we can start pairing complex, high-dimensional companion data with our standard nucleotide-based inputs. This could include tertiary folding patterns, text-based annotations on function, enhancer/promoter networks, and bio-reactor and manufacturing parameters. This broader context can open up transformer approaches to problem domains that were previously intractable.
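One plausible way to assemble such a multi-modal input, loosely following the modality-tagging idea in the Perceiver IO work, is sketched below. The array sizes, the random projections, and the one-hot modality tag are illustrative assumptions, not a prescribed pipeline; the papers use learned embeddings rather than random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

CHANNELS = 64   # shared channel width for all modalities (illustrative)

def embed(tokens_2d, modality_id, n_modalities=3):
    """Project a modality to the shared width and append a one-hot modality tag."""
    m, c = tokens_2d.shape
    proj = rng.standard_normal((c, CHANNELS - n_modalities))  # stand-in for a learned projection
    tag = np.zeros((m, n_modalities))
    tag[:, modality_id] = 1.0
    return np.concatenate([tokens_2d @ proj, tag], axis=-1)

nucleotides = embed(rng.standard_normal((100_000, 4)), modality_id=0)   # stand-in for one-hot A/C/G/T
annotations = embed(rng.standard_normal((200, 16)), modality_id=1)      # text/function annotations
process_params = embed(rng.standard_normal((50, 8)), modality_id=2)     # bio-reactor parameters

# Concatenate along the token axis into a single flat input for the cross-attention step.
inputs = np.concatenate([nucleotides, annotations, process_params], axis=0)
print(inputs.shape)   # (100250, 64)
```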
Want to stay apprised of all AI advances and how they can be applied to gene therapy development programs?
Subscribe to our newsletter.
References
1. Vaswani, A., et al. Attention Is All You Need. arXiv. Submitted 12 Jun 2017 (v1), last revised 6 Dec 2017 (v5). Accessed March 6, 2023.
2. Jaegle, A., et al. Perceiver: General Perception with Iterative Attention. arXiv. Submitted 4 Mar 2021 (v1), last revised 23 Jun 2021 (v2). Accessed.
3. Hawthorne, C., et al. General-Purpose, Long-Context Autoregressive Modeling with Perceiver AR. arXiv. Submitted 15 Feb 2022 (v1), last revised 14 Jun 2022 (v2). Accessed March 6, 2023.
4. Jaegle, A., et al. Perceiver IO: A General Architecture for Structured Inputs & Outputs. arXiv. Submitted 30 Jul 2021 (v1), last revised 15 Mar 2022 (v3). Accessed March 6, 2023.
- Quadradtic_compare.ipynb