Tokenization (2)

Token-free LMs enable faster and fairer NLP

Token-free language models that operate on raw bytes instead of subword tokens can remove tokenization bias, improve computational efficiency, and enable faster text generation.
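To make "operating on raw bytes" concrete, here is a minimal Python sketch (not from the article): a byte-level model takes UTF-8 byte values directly as input IDs, so there is no learned vocabulary and nothing is ever out-of-vocabulary.

```python
# Minimal sketch: a byte-level model consumes UTF-8 bytes directly,
# so its "vocabulary" is fixed at 256 values and any string, in any
# script, maps to valid input IDs.
text = "naïve café 🙂"
byte_ids = list(text.encode("utf-8"))  # input IDs for a byte-level model
print(len(byte_ids), byte_ids)

# A subword tokenizer instead maps text to IDs from a learned vocabulary
# (often 30k-250k entries); rare words and non-Latin scripts can fragment
# into many pieces, which is one source of the tokenization bias that
# byte-level models avoid.
```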

Do Text Embeddings Truly Understand Meaning or Just Tokenize?

The key takeaway is that embeddings are vector representations that capture the meaning of the text, going beyond tokenization alone. Passing text through the model compresses it into a vector that encodes semantic relationships, which downstream tasks can then use for prediction. Better data quality can improve embeddings, but normalization is…
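As a hedged illustration of "semantics beyond tokenization" (not from the article), the sketch below uses the sentence-transformers library with an illustrative model choice: a paraphrase that shares almost no tokens with the original still scores as a close neighbor, while a topically unrelated sentence scores lower.

```python
from sentence_transformers import SentenceTransformer

# Illustrative model choice; any sentence-embedding model would do.
model = SentenceTransformer("all-MiniLM-L6-v2")

sentences = [
    "The cat sat on the mat.",
    "A feline rested on the rug.",   # same meaning, almost no shared tokens
    "Quarterly revenue rose sharply.",  # unrelated meaning
]
emb = model.encode(sentences, normalize_embeddings=True)

# With unit-normalized vectors, cosine similarity is a plain dot product.
print(float(emb[0] @ emb[1]))  # high: paraphrase, despite different tokens
print(float(emb[0] @ emb[2]))  # low: semantically unrelated
```

If embeddings merely reflected tokenization, the token-disjoint paraphrase could not outscore the unrelated sentence; the fact that it does is the behavior the takeaway describes.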