CodeT5 User Statistics [2026 Updated]


Salesforce Research’s CodeT5 reached 22,172 monthly downloads on Hugging Face as of December 2025, establishing itself as a leading open-source code intelligence model. The encoder-decoder transformer family spans 60 million to 16 billion parameters, with the InstructCodeT5+ 16B variant achieving 35.0% pass@1 on the HumanEval benchmark. The CodeT5+ variants were pre-trained on 51.5 billion tokens and support nine programming languages, and the family has drawn over 1,500 research citations across the NLP and software engineering communities.

CodeT5 Statistics: Key Highlights

  • CodeT5 recorded 3,100+ GitHub stars and 487 forks as of December 2025, reflecting strong developer engagement with the open-source codebase.
  • The model family ranges from CodeT5-small at 60 million parameters to CodeT5+ 16B at 16 billion parameters, offering a 267x scale difference for diverse deployment scenarios.
  • InstructCodeT5+ 16B achieved 35.0% pass@1 on HumanEval, rising to 42.9% with CodeT test-case augmentation and surpassing OpenAI’s code-cushman-001 at evaluation time.
  • Community developers created 86 fine-tuned CodeT5 variants on Hugging Face for specialized tasks including vulnerability detection and code review automation.
  • Pre-training CodeT5-base emitted 49.25 kg of CO2, fully offset through the cloud provider’s carbon credit program.

CodeT5 Model Architecture and Parameter Scale

The CodeT5 family builds on the T5 encoder-decoder architecture with code-specific enhancements. Salesforce released six primary variants to accommodate computational constraints across edge devices, standard workstations, and cloud infrastructure.

Parameter counts span three orders of magnitude, from lightweight deployment models to state-of-the-art generation systems. The architecture supports flexible operation modes including encoder-only for code understanding tasks and full encoder-decoder for generation workflows.

Model Variant Parameters Primary Application
CodeT5-small 60 million Lightweight deployment
CodeT5-base 220 million Standard code tasks
CodeT5-large 770 million Enhanced generation
CodeT5+ 2B 2 billion Advanced understanding
CodeT5+ 6B 6 billion Complex code synthesis
CodeT5+ 16B 16 billion State-of-the-art generation

CodeT5+ 16B delivers a 72x parameter increase over CodeT5-base, enabling capture of nuanced code semantics and higher-quality outputs across programming paradigms.
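
For reference, the released checkpoints load with the standard Hugging Face transformers API. The snippet below is a minimal sketch of masked-span prediction with the 220M CodeT5-base checkpoint, following the pattern on the model card; the example input string is illustrative.

```python
# Minimal sketch: loading CodeT5-base (220M parameters) for masked-span
# prediction with Hugging Face transformers. The input string is illustrative.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base")

# CodeT5 uses T5-style sentinel tokens (<extra_id_0>, ...) for span infilling.
text = "def greet(user): print(f'hello <extra_id_0>!')"
input_ids = tokenizer(text, return_tensors="pt").input_ids

generated_ids = model.generate(input_ids, max_length=10)
print(tokenizer.decode(generated_ids[0], skip_special_tokens=True))
```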

CodeT5 Training Dataset Composition

Pre-training data composition directly impacts downstream task performance across code intelligence applications. CodeT5 leverages the CodeSearchNet dataset alongside supplementary repositories to establish multilingual code representations.

The original CodeT5 was pre-trained on 8.35 million instances, while CodeT5+ expanded the pre-training corpus to 51.5 billion tokens (the two releases report data volume in different units, so the figures are not directly comparable). Training drew on permissively licensed repositories, filtered to the MIT, Apache-2.0, BSD-3-Clause, BSD-2-Clause, CC0-1.0, Unlicense, and ISC licenses.

Training Metric CodeT5 (original) CodeT5+
Training instances 8.35 million not reported
Total tokens not reported 51.5 billion
Programming languages 8 9
GPU configuration 16x NVIDIA A100 16x NVIDIA A100
Vocabulary size 32,000 tokens 32,000 tokens

CodeT5-base required 12 days of training time on 16 NVIDIA A100 GPUs, while CodeT5-small completed pre-training in 5 days on the same hardware configuration.
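
The license filter described above amounts to an allow-list check over repository metadata. The sketch below is illustrative only; the record format and field names are hypothetical and do not reflect the actual CodeT5 preprocessing pipeline.

```python
# Illustrative sketch of permissive-license filtering as described above.
# Repository records and field names are hypothetical, not the actual
# CodeT5 data pipeline.
PERMISSIVE_LICENSES = {
    "mit", "apache-2.0", "bsd-3-clause", "bsd-2-clause",
    "cc0-1.0", "unlicense", "isc",
}

def keep_repo(repo: dict) -> bool:
    """Return True if the repository's license is on the permissive allow-list."""
    license_id = (repo.get("license") or "").lower()
    return license_id in PERMISSIVE_LICENSES

repos = [
    {"name": "example/utils", "license": "MIT"},
    {"name": "example/engine", "license": "GPL-3.0"},  # excluded by the filter
]
print([r["name"] for r in repos if keep_repo(r)])  # ['example/utils']
```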

Programming Language Coverage

CodeT5 models deliver multilingual code intelligence spanning interpreted and compiled paradigms. The identifier-aware pre-training achieved over 99% F1 score for identifier tagging across all supported languages.

CodeT5+ added C++ support to address systems programming requirements, expanding from eight languages in the original release to nine in the enhanced variant.

CodeT5 Benchmark Performance Results

The HumanEval benchmark evaluates functional correctness through unit test passage rather than surface-level similarity metrics. CodeT5+ variants demonstrate scaling benefits with larger parameter counts.

InstructCodeT5+ 16B reached 35.0% pass@1 without augmentation, surpassing OpenAI’s code-cushman-001 and leading open-source code LLMs at evaluation time. The CodeT strategy, which generates test cases to filter candidate solutions, pushed pass@1 to 42.9%.

Model Configuration pass@1 pass@10
CodeT5+ 220M 12.1% 20.4%
CodeT5+ 770M 15.5% 27.8%
CodeT5+ 2B 24.2% 38.5%
CodeT5+ 6B 28.6% 45.3%
InstructCodeT5+ 16B 35.0% 54.5%
InstructCodeT5+ 16B + CodeT 42.9% 67.8%
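
pass@k in the table is conventionally computed with the unbiased estimator from the HumanEval evaluation protocol: sample n completions per problem, count the c that pass the unit tests, and estimate pass@k = 1 - C(n - c, k) / C(n, k). A short sketch with made-up counts:

```python
# Unbiased pass@k estimator from the HumanEval evaluation protocol:
# given n samples per problem of which c pass the unit tests,
# pass@k = 1 - C(n - c, k) / C(n, k).
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples (drawn from n) passes."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Example with made-up counts: 200 samples per problem, 70 passing.
print(round(pass_at_k(200, 70, 1), 3))   # 0.35 (equals c / n for k = 1)
print(round(pass_at_k(200, 70, 10), 3))  # ~0.99
```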

Code Summarization Metrics

Code summarization generates natural language descriptions from source code functions for automated documentation workflows. CodeT5 established state-of-the-art BLEU-4 scores across six programming languages from CodeSearchNet.

Ruby demonstrated the largest relative improvement at 9.7% over previous best results, while PHP showed the smallest gain at 3.5%. Average improvement across all evaluated languages exceeded 7%.
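
As an illustration of the task itself, the sketch below runs code-to-text summarization with Salesforce’s fine-tuned summarization checkpoint (assumed here to be Salesforce/codet5-base-multi-sum); the input function and generation settings are illustrative.

```python
# Minimal sketch: code summarization with a fine-tuned CodeT5 checkpoint.
# Assumes the Salesforce/codet5-base-multi-sum checkpoint fine-tuned on
# CodeSearchNet summarization; the input function is illustrative.
from transformers import RobertaTokenizer, T5ForConditionalGeneration

tokenizer = RobertaTokenizer.from_pretrained("Salesforce/codet5-base-multi-sum")
model = T5ForConditionalGeneration.from_pretrained("Salesforce/codet5-base-multi-sum")

code = """def add_tensors(t, t1):
    return t + t1"""
input_ids = tokenizer(code, return_tensors="pt").input_ids
summary_ids = model.generate(input_ids, max_length=20)
print(tokenizer.decode(summary_ids[0], skip_special_tokens=True))
```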

CodeT5 Community Adoption Metrics

GitHub and Hugging Face engagement metrics reflect practical developer adoption and experimentation rates. The repository accumulated 3,100+ stars and 487 forks as of December 2025.

Monthly Hugging Face downloads reached 22,172 in December 2025, indicating sustained interest in the model family. Community developers created 86 fine-tuned variants spanning vulnerability detection, code review automation, and specialized summarization tasks.

Platform Metric Value Last Updated
GitHub stars 3,100+ December 2025
GitHub forks 487 December 2025
Hugging Face downloads (monthly) 22,172 December 2025
Fine-tuned model variants 86 December 2025
Hugging Face Spaces 36+ December 2025

The 36+ Hugging Face Spaces using CodeT5 demonstrate integration into interactive demos and production applications across code generation, documentation, and analysis workflows.
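
Figures of this kind can be re-checked programmatically against the Hugging Face Hub. A rough sketch with the huggingface_hub client follows; counts returned at query time will naturally differ from the December 2025 snapshot above.

```python
# Rough sketch: querying the Hugging Face Hub for CodeT5 adoption metrics.
# Values returned at query time will differ from the December 2025 snapshot.
from huggingface_hub import HfApi

api = HfApi()

# Recent download count for the base checkpoint.
info = api.model_info("Salesforce/codet5-base")
print("codet5-base downloads:", info.downloads)

# Community models whose names mention CodeT5 (fine-tunes, ports, etc.).
community = list(api.list_models(search="codet5"))
print("models matching 'codet5':", len(community))
```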

CodeT5 Task Performance Improvements

CodeT5+ delivers quantifiable gains across code understanding and generation benchmarks compared to prior state-of-the-art baselines. Text-to-code retrieval improved by 3.2 MRR points across eight evaluation tasks.

Line-level code completion showed 2.1 point gains in average exact match scores across two benchmark tasks. Retrieval-augmented generation recorded the largest improvement at 5.8 BLEU-4 points across two evaluation datasets.

Task Category Metric Result
Text-to-code retrieval Average MRR +3.2 points
Line-level completion Average Exact Match +2.1 points
Retrieval-augmented generation Average BLEU-4 +5.8 points
MathQA-Python pass@80 87.4% (new SOTA)
GSM8K-Python pass@100 73.8%

Math programming benchmarks demonstrate that CodeT5+ models below billion-parameter scale outperform alternatives with up to 137 billion parameters, highlighting encoder-decoder architecture efficiency for mathematical reasoning tasks.
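
For context, MRR (mean reciprocal rank), the retrieval metric in the table above, averages the reciprocal rank of the first correct code snippet returned for each query. A small illustrative sketch with made-up ranks:

```python
# Mean reciprocal rank (MRR), the metric behind the text-to-code retrieval
# row above: average of 1 / rank of the first correct hit for each query.
# The ranks below are made up for illustration.
def mean_reciprocal_rank(first_correct_ranks: list[int]) -> float:
    """first_correct_ranks[i] is the 1-based rank of the correct snippet for
    query i; use 0 when no correct result was retrieved for that query."""
    reciprocal = [1.0 / r for r in first_correct_ranks if r > 0]
    return sum(reciprocal) / len(first_correct_ranks)

print(mean_reciprocal_rank([1, 2, 1, 4]))  # (1 + 0.5 + 1 + 0.25) / 4 = 0.6875
```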

Environmental Impact and Carbon Footprint

Salesforce documented computational costs and carbon emissions for model pre-training to promote transparency in AI development practices. CodeT5-base training produced 49.25 kg CO2 on 16 NVIDIA A100 GPUs.

Google Cloud Platform’s carbon credit program fully offset emissions from pre-training. Public release of pre-trained checkpoints eliminates the need for community members to repeat computationally expensive training procedures.

Environmental Metric CodeT5-base Value
Carbon emissions 49.25 kg CO2
Training hardware 16x NVIDIA A100 (40GB)
Training epochs (denoising) 100
Training epochs (bimodal) 50
Cloud provider offset 100% carbon credits
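
Emissions figures like this are typically estimated from GPU-hours, per-device power draw, datacenter PUE, and grid carbon intensity, in the style of the ML CO2 Impact methodology. The sketch below shows only that arithmetic with placeholder inputs; it is not the exact calculation behind the 49.25 kg figure.

```python
# Sketch of the standard training-emissions arithmetic:
#   kg CO2 ~= GPU-hours x per-GPU power (kW) x PUE x grid intensity (kg CO2/kWh)
# All inputs below are placeholders for illustration, not the exact values
# behind the 49.25 kg CO2 figure reported for CodeT5-base.
def training_emissions_kg(gpu_hours: float, gpu_power_kw: float,
                          pue: float, intensity_kg_per_kwh: float) -> float:
    return gpu_hours * gpu_power_kw * pue * intensity_kg_per_kwh

# e.g. 16 GPUs for 12 days, 0.4 kW per GPU, PUE 1.1, low-carbon region grid.
gpu_hours = 16 * 12 * 24
print(round(training_emissions_kg(gpu_hours, 0.4, 1.1, 0.025), 1), "kg CO2")
```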

Research Citations and Academic Impact

Academic adoption provides insight into CodeT5’s influence on subsequent code intelligence research. The model family’s papers had accumulated over 1,500 citations as of December 2025.

The CodeT5 research lineage spans three major conference publications: the original CodeT5 paper at EMNLP 2021, CodeRL (which builds on CodeT5) at NeurIPS 2022, and CodeT5+ at EMNLP 2023.

These publications established methodological foundations for identifier-aware pre-training, reinforcement learning for code generation, and flexible encoder-decoder architectures for code LLMs.

FAQ

How many parameters does CodeT5 have?

CodeT5 models range from 60 million parameters in CodeT5-small to 16 billion parameters in CodeT5+ 16B. The base variant contains 220 million parameters for standard code tasks, while CodeT5-large has 770 million parameters for enhanced generation capabilities.

What programming languages does CodeT5 support?

CodeT5 supports nine programming languages: Python, Java, JavaScript, PHP, Ruby, Go, C, C#, and C++. The original version covered eight languages, with CodeT5+ adding C++ support for systems programming applications.

How does CodeT5 perform on HumanEval benchmark?

InstructCodeT5+ 16B achieves 35.0% pass@1 on HumanEval without augmentation. With CodeT generation strategy, performance increases to 42.9% pass@1 and 67.8% pass@10, surpassing code-cushman-001 among open-source models at evaluation time.

How many downloads does CodeT5 receive monthly?

CodeT5 recorded 22,172 monthly downloads on Hugging Face as of December 2025. The model also accumulated 3,100+ GitHub stars and 487 forks, with 86 fine-tuned variants created by community developers for specialized applications.

What training data does CodeT5 use?

CodeT5 was pre-trained on 8.35 million instances drawn from the CodeSearchNet dataset and supplementary repositories, while CodeT5+ expanded to a 51.5-billion-token corpus. Training uses permissively licensed code filtered to MIT, Apache-2.0, BSD, CC0-1.0, Unlicense, and ISC licenses for commercial compliance.