PerceiverIO Paper
#work/patientsim 2024-02-05
# on NLP
They seem to pretty much match transformers, flop-for-flop.
We first compare Perceiver IO to standard Transformers for language. Although Transformers were originally developed for language, their quadratic complexity makes them difficult to use on language inputs without tokenization, which typically shortens the length of input sequences by a factor of ~4. But unlike Transformer-based models such as BERT (Devlin et al., 2019) or XLNet (Yang et al., 2019), Perceiver IO scales linearly with input length. Our experiments focus on showing that Perceiver IO performs as well as or better than Transformers for masked language modeling (MLM) while removing tokenization (which is hard to maintain, introduces engineering overhead, and adds needless complexity to language models (Bostrom & Durrett, 2020; Clark et al., 2022)). We compare results for a given FLOPs budget rather than a given parameter budget, as the former grows quadratically with sequence length while the latter is essentially independent of it (except for positional encodings). From a practitioner's perspective, FLOPs matter more than parameters since FLOPs directly relate to training time. We evaluate the quality of the learned representation on the GLUE benchmark (Wang et al., 2019) and report our results in Tab. 1. We find that at a given FLOPs budget, Perceiver IO trained without tokenization matches the performance of a strong Transformer-based model trained with SentencePiece tokenization (Sennrich et al., 2016; Kudo & Richardson, 2018).
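A rough back-of-the-envelope sketch of why the quadratic-vs-linear distinction matters at matched FLOPs. This is not from the paper: the width `d`, latent count `num_latents`, and the sequence lengths below are made-up illustrative values, and only the two attention matmuls are counted (projections and MLPs ignored).

```python
# Toy FLOPs comparison: Transformer self-attention (quadratic in sequence length)
# vs Perceiver-style cross-attention onto a fixed latent array (linear in sequence length).
# Assumed values (illustrative only): d = 768, num_latents = 256.

def self_attention_flops(seq_len: int, d: int = 768) -> int:
    """Self-attention over the full input: QK^T and softmax(QK^T)V each cost
    ~2 * seq_len * seq_len * d multiply-adds, so total is O(seq_len^2 * d)."""
    return 2 * 2 * seq_len * seq_len * d

def cross_attention_flops(seq_len: int, num_latents: int = 256, d: int = 768) -> int:
    """Cross-attention from num_latents queries to seq_len input keys/values:
    cost is O(num_latents * seq_len * d), i.e. linear in seq_len."""
    return 2 * 2 * num_latents * seq_len * d

# e.g. a SentencePiece-tokenized sequence vs the same text as raw UTF-8 bytes (~4x longer)
for n in (512, 2048, 8192):
    print(f"seq_len={n:5d}  "
          f"self-attn GFLOPs={self_attention_flops(n) / 1e9:8.2f}  "
          f"cross-attn GFLOPs={cross_attention_flops(n) / 1e9:8.2f}")
```

Under these assumptions, going from tokens to bytes (~4x longer inputs) multiplies self-attention cost by ~16 but cross-attention cost by only ~4, which is the intuition behind comparing at a fixed FLOPs budget rather than a fixed parameter count.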