

Poster

Over-Tokenized Transformer: Vocabulary is Generally Worth Scaling

Hongzhi Huang · Defa Zhu · Banggu Wu · Zeng · Ya Wang · Qiyang Min · Xun Zhou

Wed 16 Jul 4:30 p.m. PDT — 7 p.m. PDT

Abstract:

Tokenization is a fundamental component of large language models (LLMs), yet its influence on model scaling and performance remains underexplored. In this paper, we introduce Over-Tokenized Transformers, a novel framework that decouples input and output vocabularies to improve language modeling performance. Specifically, our approach expands input vocabularies to leverage multi-gram tokens. Through extensive experiments, we uncover a log-linear relationship between input vocabulary size and training loss, demonstrating that larger input vocabularies consistently enhance model performance, regardless of model size. Our findings establish tokenization as a critical factor in scaling laws and offer new insights into tokenizer design, paving the way for more efficient and powerful LLMs.
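The decoupled-vocabulary idea in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table sizes, embedding dimension, n-gram orders, and the hash-bucket lookup are all our assumptions. The input embedding for each position sums embeddings of the 1-, 2-, and 3-grams ending there, drawn from an enlarged input vocabulary, while the output (softmax) vocabulary would stay at its ordinary size.

```python
import numpy as np

# Hedged sketch: sizes and the hashing scheme are illustrative assumptions,
# not the paper's exact design.
rng = np.random.default_rng(0)

D = 8             # embedding dimension (illustrative)
INPUT_VOCAB = 97  # enlarged input vocabulary: hash buckets per n-gram order
NGRAM_ORDERS = (1, 2, 3)

# One embedding table per n-gram order. The decoupled *output* vocabulary
# (used by the softmax head) would remain the ordinary 1-gram vocabulary.
tables = {n: rng.normal(size=(INPUT_VOCAB, D)) for n in NGRAM_ORDERS}

def ngram_id(token_ids, i, n):
    """Hash the n-gram ending at position i into the input vocabulary."""
    gram = tuple(token_ids[max(0, i - n + 1): i + 1])
    return hash(gram) % INPUT_VOCAB

def over_tokenized_embed(token_ids):
    """Per position, sum the embeddings of its 1-, 2-, and 3-gram tokens."""
    out = np.zeros((len(token_ids), D))
    for i in range(len(token_ids)):
        for n in NGRAM_ORDERS:
            out[i] += tables[n][ngram_id(token_ids, i, n)]
    return out

emb = over_tokenized_embed([5, 12, 7, 7])
print(emb.shape)  # (4, 8)
```

Note that enlarging `INPUT_VOCAB` grows only the embedding tables, not the output projection, which is what lets the input vocabulary scale cheaply.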
