Perceiver IO Statistics [User Trends In 2026]

DeepMind’s Perceiver IO achieved state-of-the-art optical flow results with an average end-point error of 2.42 on the Sintel.final benchmark while scaling linearly with input size rather than quadratically like standard transformers. The architecture contains 201 million parameters and handles input sequences of up to 2,048 bytes through 26 processing layers. With a score of 81.8 on the GLUE benchmark using raw byte inputs, Perceiver IO matches BERT performance without tokenization preprocessing.

Perceiver IO Key Statistics

  • Perceiver IO contains 201 million parameters and processes information through 26 layers using a latent array of 256 to 512 variables as of 2024.
  • The model scored 81.8 on the GLUE benchmark with byte-level inputs, surpassing BERT Base’s 81.1 score without requiring tokenization.
  • Perceiver IO achieved state-of-the-art optical flow accuracy with 1.81 average end-point error on Sintel.clean and 2.42 on Sintel.final.
  • Hugging Face hosts 36 Perceiver models including 7 official DeepMind checkpoints with over 2,520 language model downloads and 1,740 vision model downloads.
  • The architecture scales linearly with input and output sizes, compared to quadratic scaling in standard transformers, enabling efficient processing of inputs with 50,000+ pixels.

Perceiver IO Architecture and Parameters

The model employs a latent bottleneck design that decouples computational requirements from data dimensionality. Cross-attention mechanisms enable linear scaling with input and output sizes.

| Architecture Component | Specification |
| --- | --- |
| Total Parameters | 201 million |
| Latent Variables | 256 to 512 |
| Processing Layers | 26 layers |
| Vocabulary Size | 262 tokens (byte-level) |
| Maximum Input Sequence | 2,048 bytes |
| Input Processing Capability | 50,000+ pixels |

The 26 processing layers represent more than double BERT Base’s 12 layers. The reduced latent size of 256 keeps computation tractable while maintaining performance across multiple domains.
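
The encode-process-decode structure behind these numbers can be sketched in a few lines. The PyTorch code below is a minimal illustration, not the DeepMind implementation: a learned latent array attends to the inputs once, a stack of self-attention layers operates only on the latents, and task-specific output queries cross-attend to the latents to produce outputs. The layer widths, head counts, and the omission of layer norms, MLP blocks, and most residual connections are simplifying assumptions.

```python
import torch
import torch.nn as nn

class PerceiverIOSketch(nn.Module):
    """Minimal encode/process/decode sketch (illustrative, not the DeepMind code)."""

    def __init__(self, input_dim=768, latent_dim=1280, num_latents=256,
                 num_self_attn_layers=26, output_query_dim=768, output_dim=262):
        super().__init__()
        # Learned latent array: N latents, independent of the input length M.
        self.latents = nn.Parameter(torch.randn(num_latents, latent_dim))
        # Encode: cross-attention from latents (queries) to inputs (keys/values).
        self.encode = nn.MultiheadAttention(latent_dim, num_heads=1,
                                            kdim=input_dim, vdim=input_dim,
                                            batch_first=True)
        # Process: a stack of self-attention layers over the latents only.
        self.process = nn.ModuleList(
            nn.MultiheadAttention(latent_dim, num_heads=8, batch_first=True)
            for _ in range(num_self_attn_layers))
        # Decode: cross-attention from output queries to the latents.
        self.decode = nn.MultiheadAttention(output_query_dim, num_heads=1,
                                            kdim=latent_dim, vdim=latent_dim,
                                            batch_first=True)
        self.out_proj = nn.Linear(output_query_dim, output_dim)

    def forward(self, inputs, output_queries):
        # inputs: (B, M, input_dim); output_queries: (B, O, output_query_dim)
        b = inputs.shape[0]
        z = self.latents.unsqueeze(0).expand(b, -1, -1)   # (B, N, latent_dim)
        z, _ = self.encode(z, inputs, inputs)              # cost O(M*N), not O(M^2)
        for layer in self.process:
            attn, _ = layer(z, z, z)                       # cost O(N^2), N << M
            z = z + attn
        out, _ = self.decode(output_queries, z, z)         # cost O(O*N)
        return self.out_proj(out)
```

Because self-attention only ever runs over the fixed-size latent array, the number of processing layers can grow without the cost being tied to the input length.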

Perceiver IO Language Understanding Performance

On natural language understanding, the model performs competitively on the General Language Understanding Evaluation (GLUE) benchmark. It eliminates traditional tokenization requirements while maintaining accuracy.

| Configuration | GLUE Score | Input Type |
| --- | --- | --- |
| Perceiver IO (High FLOPs) | 81.8 | UTF-8 Bytes |
| Perceiver IO (SentencePiece) | 81.2 | Tokenized |
| BERT Base | 81.1 | Tokenized |

The benchmark results show Perceiver IO matches BERT performance while processing raw byte inputs, which eliminates tokenizer engineering overhead and vocabulary maintenance requirements.
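
For illustration, the snippet below shows what byte-level input preparation amounts to: the UTF-8 bytes of a string become integer IDs directly, shifted past a handful of special tokens so the vocabulary totals 262 entries. The specific special-token IDs and count here are assumptions for the sketch, not the actual layout used by any released checkpoint.

```python
# Illustrative byte-level "tokenization": no vocabulary files or merge rules needed.
# The 6 special-token IDs below are an assumption for this sketch
# (262 = 256 byte values + a handful of special tokens).
NUM_SPECIAL_TOKENS = 6          # e.g. [PAD], [BOS], [EOS], [MASK], [CLS], [SEP]
PAD_ID, BOS_ID, EOS_ID = 0, 1, 2
MAX_LEN = 2048                  # maximum input sequence in bytes

def encode_utf8(text: str, max_len: int = MAX_LEN) -> list[int]:
    """Map text to integer IDs: raw UTF-8 bytes shifted past the special tokens."""
    ids = [BOS_ID] + [b + NUM_SPECIAL_TOKENS for b in text.encode("utf-8")] + [EOS_ID]
    ids = ids[:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))   # pad to a fixed length

ids = encode_utf8("Perceiver IO reads raw bytes.")
print(len(ids), ids[:12])   # 2048 [1, 86, 107, 120, ...]
```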

Perceiver IO Image Classification Results

For visual understanding, the model processes images without relying on specialized 2D convolutional architectures. It learns spatial relationships from data alone.

| Variant | ImageNet Top-1 Accuracy | Preprocessing Method |
| --- | --- | --- |
| Conv+MaxPool Preprocessing | 84.5% | 2D Convolution |
| 2D Fourier Features | 79.0% | Fourier Encoding |
| Learned 1D Position | 72.7% | No 2D Information |

The learned 1D position variant achieves 72.7% accuracy despite receiving no information about 2D image structure. The conv+maxpool variant reaches 84.5% after large-scale pretraining on JFT.
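
The 2D Fourier feature variant in the table tags each pixel with sine and cosine functions of its (row, column) position before the model sees it. The sketch below is an illustrative version of that encoding; the number of frequency bands and the sampling of frequencies up to the image's Nyquist limit are assumptions chosen to mirror common practice, not the exact configuration behind the 79.0% result.

```python
import numpy as np

def fourier_position_features(h: int, w: int, num_bands: int = 64) -> np.ndarray:
    """Sine/cosine features of (row, column) positions in [-1, 1], per pixel.

    Returns an array of shape (h * w, 2 * (2 * num_bands + 1)): for each of the
    two coordinates, the raw position plus sin/cos at `num_bands` frequencies.
    """
    ys = np.linspace(-1.0, 1.0, h)
    xs = np.linspace(-1.0, 1.0, w)
    pos = np.stack(np.meshgrid(ys, xs, indexing="ij"), axis=-1).reshape(-1, 2)

    # Frequencies from 1 up to the Nyquist limit of the larger image side.
    freqs = np.linspace(1.0, max(h, w) / 2.0, num_bands)
    angles = np.pi * pos[:, :, None] * freqs[None, None, :]   # (h*w, 2, bands)
    feats = np.concatenate(
        [pos[:, :, None], np.sin(angles), np.cos(angles)], axis=-1)
    return feats.reshape(pos.shape[0], -1)

feats = fourier_position_features(224, 224)
print(feats.shape)   # (50176, 258) -> concatenated with the RGB values per pixel
```

A 224×224 image already yields the 50,000+ pixel inputs quoted earlier, which is exactly the regime where the latent bottleneck pays off.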

Perceiver IO Optical Flow Benchmarks

Optical flow estimation is a challenging computer vision task on which the model achieved state-of-the-art results. The architecture predicts a 2D displacement for each pixel between consecutive video frames.

| Benchmark Dataset | Average End-Point Error | Performance Ranking |
| --- | --- | --- |
| Sintel.clean | 1.81 | State-of-the-Art |
| Sintel.final | 2.42 | Best Overall |
| KITTI | 4.98 | Competitive |

The model achieved state-of-the-art results on Sintel.final without cost volumes or explicit warping mechanisms. Training occurred on AutoFlow, a synthetic dataset with 400,000 annotated image pairs.
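
The end-point error numbers in the table are simply the mean Euclidean distance between predicted and ground-truth flow vectors, averaged over all pixels. A minimal computation looks like this:

```python
import numpy as np

def average_endpoint_error(pred_flow: np.ndarray, gt_flow: np.ndarray) -> float:
    """Mean Euclidean distance between predicted and ground-truth flow.

    Both arrays have shape (H, W, 2): a 2D displacement (dx, dy) per pixel.
    """
    return float(np.linalg.norm(pred_flow - gt_flow, axis=-1).mean())

# Toy example: a uniform 1-pixel error in x across the whole frame gives EPE 1.0.
gt = np.zeros((436, 1024, 2))            # Sintel frames are 1024 x 436 pixels
pred = gt + np.array([1.0, 0.0])
print(average_endpoint_error(pred, gt))  # 1.0
```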

Perceiver IO Computational Efficiency

Computational complexity metrics differentiate the architecture from standard transformers. Linear scaling with sequence length enables efficient processing of longer inputs.

Perceiver IO scales linearly with both input and output sizes compared to quadratic scaling in standard transformers. The latent bottleneck ensures self-attention computation remains independent of input dimensionality.

The architecture processes inputs approximately 4 times longer than BERT when comparing bytes to tokens. The bulk of processing occurs in a compressed latent space of size N, which is much smaller than the input size M.
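
A back-of-the-envelope count of attention-map entries makes the scaling difference concrete. The sketch below uses the figures quoted in this article (M ≈ 50,000 input elements, N = 256 latents, 26 latent layers) and ignores feature dimensions and constant factors; it illustrates the asymptotics rather than giving a FLOP count.

```python
def attention_entries(seq_q: int, seq_kv: int) -> int:
    """Number of query-key score entries in one attention map (ignores channels)."""
    return seq_q * seq_kv

M = 50_176   # input size, e.g. 224 x 224 pixels
N = 256      # latent array size
L = 26       # latent self-attention layers

standard = attention_entries(M, M) * L          # self-attention over the raw input
perceiver = (attention_entries(N, M)            # encode cross-attention
             + attention_entries(N, N) * L      # latent self-attention
             + attention_entries(M, N))         # decode with per-pixel queries
print(f"standard transformer: {standard:.2e}")  # ~6.55e+10
print(f"perceiver io:         {perceiver:.2e}") # ~2.74e+07
```

Doubling M doubles the Perceiver IO total but quadruples the standard transformer's, which is the linear-versus-quadratic gap described above.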

Perceiver IO Research Impact and Platform Adoption

Developer adoption metrics indicate practical utility across research and production environments. The model family maintains an active presence on the Hugging Face platform with 36 total models.

| Platform Metric | Count |
| --- | --- |
| Total Models on Hugging Face | 36 |
| Official DeepMind Models | 7 |
| Language Model Downloads | 2,520+ |
| Vision Model Downloads | 1,740+ |
| Supported Task Types | 6+ |

The original paper appeared at ICLR 2022 with an arXiv release on July 30, 2021. Research extensions include Graph Perceiver IO published in February 2025 and stress detection applications in July 2025.

Perceiver IO Multimodal Processing Capabilities

Multimodal autoencoding capabilities distinguish the architecture from single-domain models. The system simultaneously processes video frames, audio samples, and classification labels within a unified framework.

The model handles 16 frames at 224×224 resolution alongside 30,720 audio samples and a class label spanning the 700 classes of the Kinetics-700 dataset. Inputs receive modality-specific embeddings and are serialized into a single 2D input array.

When the class label is masked during evaluation, the autoencoding model functions as a video classifier. This demonstrates architectural flexibility across diverse machine learning tasks without domain-specific modifications.
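
A rough sketch of that serialization, using the sizes quoted above, is shown below. The shared channel width and plain zero padding are simplifications made for this sketch; the actual model uses learned, modality-specific embeddings and position encodings, and the label row is zeroed out (masked) when the autoencoder is used as a classifier.

```python
import numpy as np

# Illustrative sizes from the Kinetics-700 setup described above.
VIDEO_ELEMS = 16 * 224 * 224     # 16 frames, one element per pixel
AUDIO_ELEMS = 30_720             # raw audio samples
LABEL_ELEMS = 1                  # one one-hot class slot (700 classes)
CHANNELS = 704                   # shared channel width after padding (assumption)

def serialize_multimodal(video_feat, audio_feat, label_feat, channels=CHANNELS):
    """Pad each modality's features to a shared channel width and concatenate
    along the sequence axis, giving one 2D array of shape (M, channels)."""
    def pad(x):
        return np.pad(x, ((0, 0), (0, channels - x.shape[1])))
    return np.concatenate([pad(video_feat), pad(audio_feat), pad(label_feat)], axis=0)

video = np.random.randn(VIDEO_ELEMS, 3)   # RGB per pixel (position features omitted)
audio = np.random.randn(AUDIO_ELEMS, 1)
label = np.zeros((LABEL_ELEMS, 700))      # masked label: all zeros at evaluation time

inputs = serialize_multimodal(video, audio, label)
print(inputs.shape)   # (833537, 704) -> video + audio + label elements in one array
```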

FAQ

What is the total parameter count for Perceiver IO language models?

The Perceiver IO language model contains 201 million parameters when configured for UTF-8 byte tokenization with a vocabulary size of 262 tokens.

How does Perceiver IO compare to BERT on the GLUE benchmark?

Perceiver IO achieves 81.8 on the GLUE benchmark with byte-level inputs, outperforming BERT Base’s 81.1 score while eliminating tokenization preprocessing requirements.

What optical flow accuracy does Perceiver IO achieve?

Perceiver IO achieves an average end-point error of 1.81 on Sintel.clean and 2.42 on Sintel.final, representing state-of-the-art performance on the Sintel.final benchmark.

How many Perceiver models are available on Hugging Face?

Hugging Face hosts 36 Perceiver models including 7 official DeepMind checkpoints covering language, vision, optical flow, and multimodal applications with over 4,260 total downloads.

What is the computational complexity of Perceiver IO?

Perceiver IO scales linearly with both input and output sizes compared to quadratic scaling in standard transformers, enabling efficient processing of 50,000+ pixel inputs.