Text Embeddings (3/3) - Conclusion

2 minute read

Introduction

This is the third of three articles on text embeddings for beginners:

  1. Text Embeddings (1/3) - Explanation
  2. Text Embeddings (2/3) - Computation
  3. Text Embeddings (3/3) - Conclusion (you are here)

Here are my concluding thoughts on text embeddings.

Pros and Cons of Text Embeddings

Pros

  1. Text embeddings capture the semantic relationship between texts far better than keyword search or word-statistics comparisons do. This makes it possible to search for similar texts and to compare texts directly.

  2. Feature extraction. Text embeddings serve as effective input features for traditional machine learning algorithms, e.g. for text classification or clustering.

Cons

  1. All the text embedding methods I have studied so far struggle with negation. For example, "I like coffee" and "I do not like coffee" can produce embeddings that lie very close to each other in the embedding space, despite having opposite meanings (a short sketch after this list illustrates the effect).

  2. The dimensionality of embeddings is fixed. We still do not know the optimal number of dimensions needed to describe the world, nor how to add new dimensions to already pre-trained embeddings if that number turns out to be insufficient. A typical NLP system has no tools to change embedding dimensionality.

  3. Text embeddings themselves are fixed, because they are produced by a model with frozen weights on top of word embeddings that are also pre-trained. A typical NLP system has no tools to adjust the model weights to new information. This is unlike the human brain, which changes as it grasps new ideas and concepts.

  4. As a consequence, text embeddings may not perform well on certain specialized tasks: the models are trained on diverse datasets and might not capture the nuances of specific domains.

  5. Poor interpretability. Interpreting the meaning of individual dimensions in an embedding space can be challenging, making it difficult to understand why certain comparisons yield specific results.
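
A minimal sketch of the negation problem from point 1, using the Sentence Transformers library; the model name "all-MiniLM-L6-v2" is an assumption made for this example, not necessarily the model used elsewhere in this series.

    # Sketch: a sentence and its negation end up with very similar embeddings.
    # Assumes the sentence-transformers package; "all-MiniLM-L6-v2" is an
    # arbitrary model choice for illustration.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    emb = model.encode([
        "I like coffee",
        "I do not like coffee",
        "I like tea",
    ])

    # Cosine similarity between the sentence and its negation is typically
    # surprisingly high, given that the meanings are opposite.
    print("like vs. not like:", util.cos_sim(emb[0], emb[1]).item())
    print("coffee vs. tea:   ", util.cos_sim(emb[0], emb[2]).item())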


Conclusion

Text embeddings have revolutionized text comparison and processing. They represent entire documents as vectors in a continuous vector space, which makes it possible to compute the similarity of texts, find the "mean value" of two texts, and do other vector tricks.
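
As a small illustration of such vector tricks, the sketch below averages the embeddings of two texts and compares the result to a third text; the library and model name are assumptions made for the example.

    # Sketch of the "mean value" trick: average two text embeddings and use the
    # result as a regular query vector. The model choice is arbitrary.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    emb_a = model.encode("The cat sat on the mat.")
    emb_b = model.encode("A kitten was sleeping on the rug.")
    mean_vec = (emb_a + emb_b) / 2  # element-wise average of the two vectors

    # The averaged vector can be compared to any other embedding as usual.
    emb_c = model.encode("A small cat resting on a carpet.")
    print("similarity to the mean:", util.cos_sim(mean_vec, emb_c).item())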

Yet I believe this is not the final stop, as there is still room for improvement. A better, adjustable solution is yet to be discovered.

From a practical point of view, embeddings computed with Sentence Transformers and with the OpenAI API produce the best results. Yet, as of today, the embeddings from Sentence Transformers are slightly better.

If you are deciding which method to use to compute text embeddings, I would recommend Sentence Transformers (a minimal usage sketch follows the list below):

  • its models typically run very fast;
  • they can be used without a GPU (CPU only);
  • they are free to use for commercial purposes;
  • the embeddings they produce are very good for text comparison.
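
A minimal usage sketch on CPU only; the model name "all-MiniLM-L6-v2" is again an assumption, and any model from the library's catalogue works the same way.

    # Minimal Sentence Transformers usage without a GPU.
    # The model name is an assumed example, not a specific recommendation.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

    texts = [
        "Text embeddings make it easy to compare documents.",
        "Comparing documents is simple with text embeddings.",
    ]
    embeddings = model.encode(texts)  # one embedding vector per input text

    print(util.cos_sim(embeddings[0], embeddings[1]).item())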

You can choose a particular Sentence Transformer model using this benchmark.


The source code for this article is freely available here: text-embeddings.ipynb

  1. Key Word in Context (KWIC)
  2. Vector Space Models (VSM)
  3. Term Frequency-Inverse Document Frequency (TF-IDF)
  4. Latent Semantic Analysis (LSA)
  5. Word2Vec Wiki
  6. GloVe Wiki
  7. Transformer Wiki
  8. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  9. Natural Language Inference (NLI) Wiki
  10. HuggingFace - Text Classification
  11. Sentence Transformers Library
  12. Reddit: comparing OpenAI and Sentence Transformers embeddings
  13. Benchmark table of different models
  14. Private opinion on OpenAI embeddings
  15. Embeddings - OpenAI API