Token-free LMs enable faster and fairer NLP
Token-free language models that operate on raw bytes instead of subword tokens can remove tokenization bias, improve computational efficiency, and enable faster text generation.
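To make "operating on raw bytes" concrete, here is a minimal illustration in plain Python (my own example, not code from the paper): byte-level modeling is just UTF-8 encoding, so every language passes through the same fixed vocabulary of 256 symbols with no learned tokenizer, at the cost of longer sequences, especially for non-Latin scripts.

```python
# Minimal illustration (not from the paper): byte-level "tokenization"
# is just UTF-8 encoding -- a fixed vocabulary of 256 symbols, no merges,
# no out-of-vocabulary words, but longer sequences.

def to_bytes(text: str) -> list[int]:
    """Map a string to its UTF-8 byte IDs (each in 0..255)."""
    return list(text.encode("utf-8"))

english = "language model"
japanese = "言語モデル"  # "language model" in Japanese

print(len(english), len(to_bytes(english)))    # 14 characters -> 14 bytes
print(len(japanese), len(to_bytes(japanese)))  # 5 characters  -> 15 bytes (3 bytes per character)
```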
Summary
- The paper introduces MambaByte, a token-free selective state space model (SSM) that operates on raw byte sequences instead of subword tokens.
- Current large language models rely on subword tokenization, which is not robust to typos, spelling variations, and other character-level noise. Subword vocabularies also make many non-English languages much more expensive to model, since their text gets split into far more tokens.
- Operating on raw bytes removes tokenization bias and treats every language's text the same way, but it also makes sequences several times longer than their subword counterparts. Autoregressive transformers, whose attention cost grows quadratically with sequence length, do not scale well to such long byte sequences (see the back-of-the-envelope arithmetic after this list).
- MambaByte combines the strengths of SSMs, namely fast inference and efficient linear-time modeling of long sequences, with autoregressive modeling directly over raw bytes (a simplified sketch of the underlying recurrence follows this list).
- Experiments show that MambaByte requires less compute and outperforms transformers on metrics such as bits-per-byte and perplexity across several datasets.
- MambaByte also generates text significantly faster than transformers, since each generation step only updates a fixed-size recurrent state instead of attending over the entire preceding byte sequence.
- The paper establishes the promise of token-free language modeling, which can remove tokenization bias and improve efficiency.
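To see why long byte sequences strain transformers, here is a back-of-the-envelope comparison using assumed numbers (my own arithmetic, not figures from the paper): English subword tokens average roughly 4 bytes, so the same document is about 4x longer in bytes, and self-attention's quadratic cost grows about 16x, while a recurrent SSM's cost grows only about 4x.

```python
# Back-of-the-envelope scaling (assumed numbers, not from the paper).
subword_len = 2_000          # document length in subword tokens
bytes_per_token = 4          # rough average for English text
byte_len = subword_len * bytes_per_token

# Self-attention does O(L^2) pairwise interactions per layer;
# a recurrent SSM does O(L) state updates per layer.
attn_cost_subword = subword_len ** 2
attn_cost_bytes = byte_len ** 2
ssm_cost_bytes = byte_len

print(attn_cost_bytes / attn_cost_subword)  # 16.0 -> attention cost grows ~16x on bytes
print(attn_cost_bytes / ssm_cost_bytes)     # 8000.0 -> per-layer gap at this length
```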
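The sketch below is a heavily simplified, NumPy-only illustration of the kind of selective state space recurrence that Mamba-family models build on; the parameter shapes, discretization details, and hardware-aware parallel scan of the actual MambaByte model are omitted, and all names and values here are hypothetical.

```python
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm_scan(x, A, W_B, W_C, W_dt):
    """
    Toy single-channel selective SSM scan (illustrative only).

    x    : (L,)  scalar input per step (stand-in for a byte embedding channel)
    A    : (N,)  fixed negative state-transition parameters
    W_B  : (N,)  maps the input to an input-dependent B_t (the "selective" part)
    W_C  : (N,)  maps the input to an input-dependent readout C_t
    W_dt : ()    maps the input to a positive step size dt_t
    """
    L, N = x.shape[0], A.shape[0]
    h = np.zeros(N)          # recurrent state: fixed size, independent of L
    y = np.zeros(L)
    for t in range(L):       # O(L) time, O(N) memory -- no attention over the past
        dt = softplus(W_dt * x[t])       # input-dependent step size
        A_bar = np.exp(dt * A)           # discretized state transition
        B_bar = dt * (W_B * x[t])        # discretized, input-dependent input matrix
        C_t = W_C * x[t]                 # input-dependent readout
        h = A_bar * h + B_bar * x[t]     # state update
        y[t] = C_t @ h                   # output at step t
    return y

# Tiny usage example with random parameters (all values hypothetical).
rng = np.random.default_rng(0)
L, N = 16, 8
x = rng.normal(size=L)
A = -np.abs(rng.normal(size=N))          # negative values keep the state stable
y = selective_ssm_scan(x, A, rng.normal(size=N), rng.normal(size=N), rng.normal())
print(y.shape)  # (16,)
```

Because generation only carries the fixed-size state `h` from step to step, the per-byte cost stays constant as the sequence grows, which is the intuition behind the faster generation reported above.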