Text Embeddings (3/3) - Conclusion

2 minute read

Introduction

This is the third of three articles on text embeddings for beginners:

  1. Text Embeddings (1/3) - Explanation
  2. Text Embeddings (2/3) - Computation
  3. Text Embeddings (3/3) - Conclusion (you are here)

Here are my concluding thoughts on text embeddings.

Pros and Cons of Text Embeddings

Pros

  1. Text embeddings capture the semantic relationship between texts far better than keyword search or word-statistics comparisons do. This makes it possible to search for similar texts and to compare texts directly.

  2. Feature extraction. Text embeddings serve as effective input features for traditional machine learning algorithms, e.g. for text classification or clustering.

Cons

  1. All the text embedding methods I have studied so far struggle with negation. For example, "I like coffee" and "I do not like coffee" can produce embeddings that lie very close to each other in the embedding space, despite having opposite meanings (a short sketch after this list illustrates the effect).

  2. The dimensionality of embeddings is fixed. We still do not know the optimal number of dimensions needed to describe the world, nor how to add new dimensions to already pre-trained embeddings if that number turns out to be insufficient. A typical NLP system has no tools to change embedding dimensionality.

  3. Text embeddings themselves are fixed, because they are produced by a model with frozen weights on top of word embeddings that are also pre-trained. A typical NLP system has no tools to adjust the model weights to new information. This is unlike the human brain, which changes as it grasps new ideas and concepts.

  4. As a consequence, text embeddings may not perform well on certain specialized tasks: the models are trained on diverse datasets and might not capture the nuances of specific domains.

  5. Poor interpretability. Interpreting the meaning of individual dimensions in an embedding space can be challenging, making it difficult to understand why certain comparisons yield specific results.
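
A minimal sketch of the negation problem from point 1, using the Sentence Transformers library; the model name "all-MiniLM-L6-v2" is an assumption made for this example, not necessarily the model used elsewhere in this series.

    # Sketch: a sentence and its negation end up with very similar embeddings.
    # Assumes the sentence-transformers package; "all-MiniLM-L6-v2" is an
    # arbitrary model choice for illustration.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    emb = model.encode([
        "I like coffee",
        "I do not like coffee",
        "I like tea",
    ])

    # Cosine similarity between the sentence and its negation is typically
    # surprisingly high, given that the meanings are opposite.
    print("like vs. not like:", util.cos_sim(emb[0], emb[1]).item())
    print("coffee vs. tea:   ", util.cos_sim(emb[0], emb[2]).item())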


Conclusion

Text embeddings have revolutionized text comparison and processing. They represent entire documents as vectors in a continuous vector space, which makes it possible to compute the similarity of texts, find the "mean value" of two texts, and do other vector tricks.
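
As a small illustration of such vector tricks, the sketch below averages the embeddings of two texts and compares the result to a third text; the library and model name are assumptions made for the example.

    # Sketch of the "mean value" trick: average two text embeddings and use the
    # result as a regular query vector. The model choice is arbitrary.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2")

    emb_a = model.encode("The cat sat on the mat.")
    emb_b = model.encode("A kitten was sleeping on the rug.")
    mean_vec = (emb_a + emb_b) / 2  # element-wise average of the two vectors

    # The averaged vector can be compared to any other embedding as usual.
    emb_c = model.encode("A small cat resting on a carpet.")
    print("similarity to the mean:", util.cos_sim(mean_vec, emb_c).item())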

Yet I believe this is not the final stop, as there is still room for improvement. A better, adjustable solution is yet to be discovered.

From a practical point of view, embeddings computed with Sentence Transformers and with the OpenAI API produce the best results. Yet, as of today, the embeddings from Sentence Transformers are slightly better.

If you are deciding which method to use to compute text embeddings, I would recommend Sentence Transformers (a minimal usage sketch follows the list below):

  • its models typically run very fast;
  • they can be used without a GPU (CPU only);
  • they are free to use for commercial purposes;
  • the embeddings they produce are very good for text comparison.
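
A minimal usage sketch on CPU only; the model name "all-MiniLM-L6-v2" is again an assumption, and any model from the library's catalogue works the same way.

    # Minimal Sentence Transformers usage without a GPU.
    # The model name is an assumed example, not a specific recommendation.
    from sentence_transformers import SentenceTransformer, util

    model = SentenceTransformer("all-MiniLM-L6-v2", device="cpu")

    texts = [
        "Text embeddings make it easy to compare documents.",
        "Comparing documents is simple with text embeddings.",
    ]
    embeddings = model.encode(texts)  # one embedding vector per input text

    print(util.cos_sim(embeddings[0], embeddings[1]).item())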

You can choose a particular Sentence Transformer model using this benchmark.


The source code for this article is freely available here: text-embeddings.ipynb

  1. Key Word in Context (KWIC)
  2. Vector Space Models (VSM)
  3. Term Frequency-Inverse Document Frequency (TF-IDF)
  4. Latent Semantic Analysis (LSA)
  5. Word2Vec Wiki
  6. GloVe Wiki
  7. Transformer Wiki
  8. Sentence-BERT: Sentence Embeddings using Siamese BERT-Networks
  9. Natural Language Inference (NLI) Wiki
  10. HuggingFace - Text Classification
  11. Sentence Transformers Library
  12. Reddit: comparing OpenAI and Sentence Transformers embeddings
  13. Benchmark table of different models
  14. Private opinion on OpenAI embeddings
  15. Embeddings - OpenAI API