Do Text Embeddings Truly Understand Meaning or Just Tokenize?
The key takeaway is that embeddings are vector representations that capture the semantics of text, going beyond mere tokenization. The embedding process runs the text through the model, and the representations the model builds to make predictions encode semantic relationships into the vector. Better data quality can improve embeddings, and normalization matters as well.
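Those semantic relationships are typically compared with cosine similarity between vectors. Below is a minimal sketch; the hand-picked 4-dimensional vectors are toy stand-ins for real model outputs, which usually have hundreds of dimensions.

```python
import math

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: values near 1.0 mean
    # the texts they represent point in a similar semantic direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings" (hypothetical values chosen for illustration).
cat = [0.9, 0.1, 0.0, 0.2]
kitten = [0.8, 0.2, 0.1, 0.3]
invoice = [0.0, 0.1, 0.9, 0.1]

print(cosine_similarity(cat, kitten))   # high: related concepts
print(cosine_similarity(cat, invoice))  # low: unrelated concepts
```

With real embeddings, the same comparison ranks text chunks by semantic closeness rather than by shared surface tokens.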
Summary
- Embeddings are vectors capturing semantics of text, not just tokenized words.
- The embedding process runs the text through the model, whose internal predictions infuse the vector with semantic meaning.
- Vector similarities show semantic relationships between text chunks.
- Cleaner, normalized data can improve embedding quality by reducing noise, though minor surface variation rarely changes the meaning much.
- The model has a limited effective attention span: concepts at the start and end of a chunk are prioritized, while content in the middle can effectively black out.
- Cross-referencing concepts over longer distances doesn't seem to work well.
- For retrieval, focus on indexing entities, events, people, etc. upfront, rather than relying on embedding quality alone.
- It's unclear if embeddings alone can connect entities to their references in text.
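The normalization point above can be sketched as a small preprocessing step applied before embedding. The `normalize` helper here is hypothetical and deliberately conservative: it only unifies case and whitespace, the kind of surface variation that adds noise without changing meaning.

```python
import re

def normalize(text: str) -> str:
    # Unify case and collapse runs of whitespace so superficial
    # variation does not create spurious differences between chunks.
    text = text.lower()
    text = re.sub(r"\s+", " ", text)
    return text.strip()

print(normalize("  The   QUICK\nFox "))  # -> "the quick fox"
```

A real pipeline might also strip markup or boilerplate, but aggressive rewriting risks destroying the very cues the embedding model uses.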
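The suggestion to index entities upfront can be sketched as a lightweight entity-to-chunk map built alongside the embeddings. Both `build_entity_index` and the capitalization-based `extract_entities` below are toy assumptions; a real system would use a proper NER step.

```python
from collections import defaultdict

def extract_entities(text):
    # Toy extractor: treat capitalized words as entity mentions.
    # A real pipeline would use NER (e.g. spaCy) here instead.
    return {w.strip(".,") for w in text.split() if w[:1].isupper()}

def build_entity_index(chunks):
    # Map each entity mention to the ids of the chunks it appears in,
    # so retrieval can match names directly, not via embeddings alone.
    index = defaultdict(set)
    for chunk_id, text in chunks.items():
        for entity in extract_entities(text):
            index[entity].add(chunk_id)
    return index

chunks = {
    1: "Ada Lovelace wrote the first program.",
    2: "The program ran on the Analytical Engine.",
}
index = build_entity_index(chunks)
print(sorted(index["Ada"]))  # chunks mentioning "Ada"
```

At query time, such an index can be consulted first to narrow candidates, with embedding similarity used to rank within the matching chunks.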