Have you heard about embedding? If the answer is ‘no’, then this post is for you. Simply put, AI without embedding is like a Ferrari with a wheel clamp: it can't reach its full potential. Now, let's dig deeper and find out why.
GPT-4 (like most models) has a context limit. If you exceed this limit, you must rephrase your input or summarize previous conversations so GPT-4 knows what you are talking about. There is prompt compression for that, but let's leave it for another post. Embedding, on the other hand, opens multiple doors that were closed in the past. From classification and clustering to semantic search and recommendation algorithms, embedding has revolutionized the way we handle and interpret data.
It eliminates almost all problems you would face in the above use cases. In short, embedding is the transformation of information into a vector space where similar pieces of data are closer to each other. I know this sentence sounds a bit frightening at first, but I promise that you will clearly understand what I'm talking about by the end of this post.
Imagine looking at a hypermarket from above. Let's say you want to find some fruits. You are a frequent customer here, so you instantly head to the upper left corner, since the fruits are always there. Bakery products will also be close, maybe one or two rows away. This makes sense because hypermarkets usually store related products, like food, close to each other. Computer parts, however, will be located on the opposite side of the store, and in yet another direction you will find the clothes section.
These are just examples I made up, but you get the idea. You can consider the above example as a 2D vector space with X and Y coordinates. In this space, everything related is close to each other, so you more or less instantly know where to look for something. Since hypermarkets are built in the 3D world, and each of their shelves consists of multiple rows stacked above each other, we can quickly extend our example to a 3D vector space with X, Y, and Z coordinates.
Even in this 3D space, related products remain in proximity. We can increase the dimension count if we want to capture even more properties. As we add more and more dimensions, more and more characteristics can be stored in our multidimensional space. This means that we have more and more ways to categorize the products.
In the case of embedding, we have many more dimensions than 2 or 3. OpenAI's vector space uses 1536. Their API makes it very easy to embed any piece of information, be it a word, a sentence, or a paragraph. They provide a dedicated model for embeddings: text-embedding-ada-002.
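As a quick illustration, here is a minimal sketch of requesting an embedding through the OpenAI Python client; the example text is made up, and the exact client version on your machine may expose a slightly different interface:

```python
# pip install openai
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.embeddings.create(
    model="text-embedding-ada-002",
    input="Fresh fruits are in the upper left corner of the store.",
)

vector = response.data[0].embedding
print(len(vector))  # 1536
```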
At this point, you might ask yourself: how did they come up with that number? And the answer is the same as with other models: with lots of trial and error. They had earlier models with more than 10k dimensions, but it turned out that their newest model actually performs much better than those large models and costs 99.8% less. However, this dimension count is not a universal standard; other companies and providers might use a different number of dimensions.
It's important to mention that dimension counts are only meaningfully comparable within a single provider's models. Since we don't know how information is actually stored in the vectors, comparing a 1000-dimension model from provider A to a 2000-dimension model from provider B doesn't tell you much. They might use completely different methods to search for relevance.
Just like other language models, embedding models are trained. However, their primary objective isn't to replicate natural language; rather, it's to translate information into a numerical format for categorization. These models are specifically trained to find relevance between multiple pieces of data. They are not proficient in chatting, though. When you integrate embedding into your application, you use one of these models to find relevant information, but you use a different one (GPT-4, for example) to communicate the result to the user.
Each model is responsible for the task that it is most proficient in. GPT-4 won't give you embedding vectors, and text-embedding-ada-002 won't speak with you like ChatGPT does.
So what does it mean to embed something? It means that your input is transformed into the vector space mentioned above. The model will return a 1536-dimensional vector full of numbers. Think of it as inserting a new product into the hypermarket. In the analogy, the hypermarket is the database, and the product is a single record. You know where the product should be placed because the company trained you earlier to do so, and you have a guideline that helps determine the correct location.
With the vectorial representation in hand, you go to a database that supports vectors and insert it. Postgres with the pgvector extension is perfectly fine for this purpose. A single database record consists of the text you wanted to embed and its vectorial representation. Let's see what you can do with a database full of vectors.
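To make this concrete, here is a rough sketch of storing such a record with Postgres and pgvector; the table layout, column names, and connection string are assumptions made up for the example:

```python
# pip install psycopg2-binary
import psycopg2

conn = psycopg2.connect("dbname=embeddings user=postgres")  # hypothetical connection string
cur = conn.cursor()

# one-time setup: enable pgvector and create a table for the records
cur.execute("CREATE EXTENSION IF NOT EXISTS vector;")
cur.execute("""
    CREATE TABLE IF NOT EXISTS items (
        id bigserial PRIMARY KEY,
        content text,
        embedding vector(1536)
    );
""")

# 'vector' stands in for the 1536-dimensional list returned by the embedding model
vector = [0.0] * 1536
cur.execute(
    "INSERT INTO items (content, embedding) VALUES (%s, %s::vector)",
    ("Fresh fruits are in the upper left corner of the store.", str(vector)),
)
conn.commit()
```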
Up until this point, all of our examples have been about semantic search over embedded vectors. Although searching is one of the most common uses of embedding, it is far from the only use case that benefits from the vectorial representation of information. In the case of searching, you have an input string, and you want to find every record in your database that could be related to it. First, you calculate the distance between your question and each of your records; the results are then ranked from the most relevant to the least relevant, as the sketch below shows. After that, let's look at some other candidates on the list.
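Continuing the pgvector sketch from above, the ranking can be expressed as a single query; `<=>` is pgvector's cosine distance operator, so smaller values mean more related:

```python
# the question is embedded with the same model first; this is a stand-in vector
question_vector = str([0.0] * 1536)

cur.execute(
    """
    SELECT content, embedding <=> %s::vector AS distance
    FROM items
    ORDER BY distance
    LIMIT 5;
    """,
    (question_vector,),
)
for content, distance in cur.fetchall():
    print(distance, content)
```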
In the case of clustering, the goal is to search for groups of similar pieces of text in the vector space. Here, you don't have an input text to find relevance to. Instead, your goal is to discover hidden groupings in a very large text-based dataset. Since vector representations are semantically meaningful, with related terms close to each other in that space, embedding can help you find groupings that are not obvious at first. In this case, you run a different kind of query: instead of ranking everything against one input, you pull the vectors out and group them, as in the sketch below.
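Assuming the vectors have already been fetched into memory, a standard clustering algorithm can take it from there; this sketch uses scikit-learn's KMeans, and the texts, vectors, and cluster count are all made up for illustration:

```python
# pip install scikit-learn numpy
import numpy as np
from sklearn.cluster import KMeans

texts = ["apples", "bread rolls", "graphics cards", "winter jackets"]
vectors = np.random.rand(4, 1536)  # stand-ins for the real embeddings

# assign each text to one of two groups based on vector proximity
labels = KMeans(n_clusters=2, n_init=10).fit_predict(vectors)
for text, label in zip(texts, labels):
    print(label, text)
```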
The closer the vectors are to each other, the greater the similarity in the meaning of the underlying text. A recommendation algorithm works much like searching: you return a list of strings related to a 'source' string, and you can rank them by relevance, just as we did with searching.
To detect anomalies, you do the opposite of what you did in the previous use cases: you look for the outliers in the vector space, that is, the items that have very little relatedness to the other vectors.
Let's say you embed your computer's system log on every event. Normal operation will generate very similar vectors every time, since nothing suspicious is happening. But if there is a breach, those system logs will probably contain some unusual rows. When your script embeds the malicious records, they will most likely be located very far from the usual vectors.
And voilà, anomaly detection!
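In code, this can be as simple as measuring how far each log vector sits from the average of all vectors and flagging the most distant ones; the vectors and the cut-off below are arbitrary choices for the sake of the sketch:

```python
import numpy as np

log_vectors = np.random.rand(1000, 1536)  # stand-ins for embedded log lines
centroid = log_vectors.mean(axis=0)

# distance of every log line from the "normal" centre of the space
distances = np.linalg.norm(log_vectors - centroid, axis=1)
threshold = distances.mean() + 3 * distances.std()  # crude cut-off, needs tuning

anomalies = np.where(distances > threshold)[0]
print("suspicious log lines:", anomalies)
```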
Classification is pretty similar to clustering, where texts are grouped by similarity, but in this case you start from a predefined set of labels. The goal is to find the label whose meaning is the most similar to a given text and assign that label to it.
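One simple way to do this is to embed each label once and pick the label closest to the text's vector; the labels and vectors below are hypothetical, and in practice they would come from the embedding model:

```python
import numpy as np

def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# hypothetical label embeddings
label_vectors = {
    "sports": np.random.rand(1536),
    "politics": np.random.rand(1536),
    "technology": np.random.rand(1536),
}
text_vector = np.random.rand(1536)  # embedding of the text to classify

best_label = max(
    label_vectors,
    key=lambda label: cosine_similarity(text_vector, label_vectors[label]),
)
print(best_label)
```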
The above list is not exhaustive, but it gives a good overview of the most common embedding use cases. As a next step, let's see what distance and relevance mean in embedding.
There are multiple different ways to calculate relevance between two vectors. Some measure distance, and some measure similarity. Here is an example for each.
This is the most intuitive answer to the question. The Euclidean distance is the length of the shortest straight line you can draw between two points in space. Usually, this distance is calculated between the endpoints of two vectors. This method measures distance and is best used when both vectors are dense, meaning that they contain multiple non-zero attributes. When you think about vectorial distance, this is what intuitively comes to mind.
In this method, you calculate the cosine of the angle between the two vectors, which, unlike the Euclidean distance, is a measure of similarity. When using cosine similarity, the goal is to measure how similar the directions of the two vectors are, regardless of their magnitude. It is a common choice in natural language processing, since the direction of a word vector often captures semantic meaning.
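Both measures are easy to compute by hand. Here is a tiny sketch with NumPy, using made-up 3-dimensional vectors so the numbers stay readable; note how two vectors pointing in the same direction can be far apart in distance yet perfectly similar in angle:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])  # same direction as a, twice the magnitude

euclidean_distance = np.linalg.norm(a - b)  # about 3.74: the points are far apart
cosine_similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))  # 1.0: identical direction

print(euclidean_distance, cosine_similarity)
```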
How can you choose between the two? Well, it depends on the use case, but here is a representative example: imagine you want to build a movie recommendation algorithm. You want to utilize embedding for this purpose, but first, you have an important decision to make. Do you want to give general recommendations where only the relatedness of two movies matters, or do you also want to explain the specific characteristics based on which a new movie is recommended?
If you choose the first one, you are more interested in the movies simply being related (vectors pointing in the same direction), so the obvious choice is cosine similarity. However, if you are also interested in their specific characteristics (the magnitude of the vectors), you should go with the Euclidean distance.
Hopefully, you now understand the main embedding use cases and what distance and similarity calculations mean. But we still owe you one very important explanation: what should you do when you retrieve the vectors from your database? To answer that, let's continue with semantic search for the sake of simplicity.
As a recap, here's where you are: you have embedded your information, e.g., a book, documentation, or every page of your website. When you want to ask or search for something, the first step is to embed the question as well, but you won't store it. Instead, you calculate the vectorial distance or similarity between your question and every embedded item in your database. As a result, this operation returns the pieces of information most related to your question.
With all the related information in hand, you inject it into your prompt before the question as context. Then, in the last step, you ask GPT-4 to answer based on that context. You don't have to inject anything by hand; embedding enables you to craft the context dynamically based on your (or the user's) question.
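Putting the pieces together, the whole flow might look roughly like this; the table name, prompt wording, and retrieval limit are assumptions carried over from the earlier sketches, not a definitive implementation:

```python
from openai import OpenAI

client = OpenAI()

def answer(question: str, cur) -> str:
    """cur is a database cursor for the pgvector table from the earlier sketch."""
    # 1. Embed the question with the same model used for the stored records.
    q_vec = client.embeddings.create(
        model="text-embedding-ada-002", input=question
    ).data[0].embedding

    # 2. Retrieve the most related records from the vector database.
    cur.execute(
        "SELECT content FROM items ORDER BY embedding <=> %s::vector LIMIT 5",
        (str(q_vec),),
    )
    context = "\n".join(row[0] for row in cur.fetchall())

    # 3. Inject the retrieved context into the prompt and let GPT-4 answer.
    completion = client.chat.completions.create(
        model="gpt-4",
        messages=[
            {"role": "system", "content": "Answer based only on the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```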
To clear things up, this process doesn't apply to ChatGPT, which does not support supplying your own embeddings. You do the above from code by calling OpenAI's API.
The unbelievable thing here is that embedding will find relevant information even if it does not contain the search term itself. When you search for animals, it will find dogs, cats, birds, etc. It will even find words like rottweiler, hummingbird, or Garfield. There is no need to tag any record anymore; you just store your texts and let embedding do its job.
Imagine using the GPT-4 model with a 32k context size on your website as a general website assistant. A user's question comes in, and your backend searches for everything relevant to it in the vector database, where everything about your site was embedded beforehand. It builds a context of up to 31K tokens from your vector database, injects it into the prompt, and leaves 1K tokens for the user's question. Your language model (GPT-4) will answer based on this context, which contains everything that could interest the user.
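The only extra ingredient compared to the previous sketch is counting tokens while the context is assembled; tiktoken can do the counting, and the 31K budget below simply mirrors the numbers in the example above:

```python
# pip install tiktoken
import tiktoken

encoding = tiktoken.encoding_for_model("gpt-4")

def build_context(ranked_chunks, budget=31_000):
    """Concatenate the most relevant chunks until the token budget is spent."""
    context, used = [], 0
    for chunk in ranked_chunks:
        tokens = len(encoding.encode(chunk))
        if used + tokens > budget:
            break
        context.append(chunk)
        used += tokens
    return "\n".join(context)
```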
How perfect will that response be?
I feel one more question has to be answered. LongNet is a recent language model variant capable of scaling the context length to more than one billion tokens without sacrificing performance on shorter inputs. What will be the role of embedding in the future if context limits are growing this rapidly?
An infinite context is indeed very impressive and solves basically any problem that would arise from having a shorter limit. On the one hand, models will still use embedding inside because they have to translate tokens to vectors to understand them, but this isn't too interesting for you as a developer.
On the other hand, knowing how to implement embedding in your application can still come in handy. Considering the privacy concerns regarding model providers, it looks like smaller, self-hosted models will become more and more popular in the future. If this is the case, those models will definitely have a context limit, in which case embedding will still play a crucial role.
We have no idea what the future brings and how soon some well-known events like the singularity will happen. Whether embedding is a useful skill to master after that point is a good question, but for now, it looks like it will stay with us, and it is worth at least understanding it.
We covered the most common embedding use cases and showed the difference between distance and similarity. We also presented the hypermarket example to help you understand semantic search better.
Finally, we had a short outlook into the future and saw what role embedding will play in it. You should now have a rough understanding of what embedding is. Hopefully, it is enough to motivate you to start building an application that utilizes it. How about a personal assistant that runs offline on your computer?
Take the first step toward harnessing the power of AI for your organization. Get in touch with our experts, and let's embark on a transformative journey together.
Contact Us today