Searching Text in C# Using Embeddings with OpenAI

The problem we’re solving today is semantic text search in C#, using tools provided by OpenAI. A semantic search in this instance refers to searching on the searcher’s intent and the contextual meaning of the query, rather than doing a simple keyword lookup (for example, search results for “king” may include matches on the word “monarch”). Thanks to OpenAI’s “create embeddings” endpoint, this will take a surprisingly small amount of code.

In this example, we’ll search a subset of an old dataset of “fine-food reviews” from Amazon. In practice, you can search whatever you’d like.

I’m a big fan of OpenAI’s documentation and tutorials, but they lean heavily on Python code samples and packages, which is great if you’re using Python and less great if you want to use C#. In this post, I’ll walk through the details of their text embeddings guide and text-search tutorial, and provide a working example in C# that you can follow along with.

This post assumes familiarity with C# and OpenAI. We’ll discuss a few mathematical concepts, but you won’t need a PhD to make this work. If all you want is the finished product, you can find it here: https://github.com/tarrball/csharpsearchembeddings. You’ll also need an OpenAI API key to call their API.

The Process

There are two parts to the search process:

  1. Prepare text to be searched.
  2. Search text.

To prepare text to be searched, we will:

  1. Determine the text we want a record to be searched on.
  2. Convert that bit of text into embeddings. Embeddings in this context will refer to 512-dimensional vectors of floating-point numbers.
  3. Save those embeddings. Save in this context will refer to keeping those vectors in memory, though in practice you would likely use a vector database.

To search text, we will:

  1. Convert the search query text into embeddings.
  2. Compute the distance between the query’s embeddings and each record’s embeddings.

Prepare Your Text

If you’d like, you can follow along with this post using the sample project I’ve uploaded here: https://github.com/tarrball/csharpsearchembeddings. This repository contains two things: the first 1,000 records from an Amazon fine-food reviews dataset and a Jupyter Notebook. A Notebook is basically a Markdown document with executable blocks of code in it. It’s very handy for sharing working samples and tutorials. Notebooks are more commonly written in Python, but we’re using C# today, so you’ll need VS Code and Microsoft’s “Polyglot Notebooks” extension to run this thing. Don’t forget to select the .NET Interactive kernel if it isn’t detected for you.

First things first, we must read the dataset into memory. It’s a CSV, so I’m using the CsvHelper package to read it into memory and map it onto a class (in Python, you’d likely use a pandas DataFrame). With CsvHelper, we supply the file, a class to map rows onto, and a mapping configuration class. I’ve created an AmazonReview class to mirror what’s in the CSV, and an AmazonReviewMap class that lets me add a couple of properties that are not in the CSV (“Combined” and “Embeddings”). The Combined property will be a combination of “Summary” and “Text”, and Embeddings will contain the embeddings we compute from Combined. Here we go:
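Here’s a sketch of that setup. The column names, the file name, and the “Title: …; Content: …” format for Combined are assumptions based on the dataset’s description and OpenAI’s tutorial; adjust them to match your CSV:

```csharp
using System.Globalization;
using System.Linq;
using CsvHelper;
using CsvHelper.Configuration;

public class AmazonReview
{
    // Columns assumed to exist in the CSV.
    public string ProductId { get; set; }
    public string UserId { get; set; }
    public int Score { get; set; }
    public string Summary { get; set; }
    public string Text { get; set; }

    // Computed later; not present in the CSV.
    public string Combined { get; set; }
    public float[] Embeddings { get; set; }
}

public sealed class AmazonReviewMap : ClassMap<AmazonReview>
{
    public AmazonReviewMap()
    {
        Map(m => m.ProductId);
        Map(m => m.UserId);
        Map(m => m.Score);
        Map(m => m.Summary);
        Map(m => m.Text);

        // Tell CsvHelper to skip the two properties we compute ourselves.
        Map(m => m.Combined).Ignore();
        Map(m => m.Embeddings).Ignore();
    }
}

using var reader = new StreamReader("fine_food_reviews_1k.csv"); // hypothetical file name
using var csv = new CsvReader(reader, CultureInfo.InvariantCulture);
csv.Context.RegisterClassMap<AmazonReviewMap>();

var reviews = csv.GetRecords<AmazonReview>().ToList();

// Combine the fields we want to search on into a single string.
foreach (var review in reviews)
{
    review.Combined = $"Title: {review.Summary}; Content: {review.Text}";
}
```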

In this example, we’ve combined everything we want to search our reviews on into the Combined property. We haven’t discarded any of the original values, as we’ll want to reference those to make our search results more meaningful. We wouldn’t want our semantic search to match on ProductId, for instance, but we’d still want it around for any follow-on actions we’d take in practice.

The next thing we’ll need is, of course, the embeddings. The embeddings endpoint can accept a batch of inputs, but to keep this sample simple we’ll loop through every review one at a time, grab its embeddings using the OpenAI SDK, and store them in our collection of reviews. This loop will take 3-4 minutes, require an API key, and cost you about $0.01 USD. Here’s the code; I’ll elaborate on a few details next:
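The post doesn’t say which SDK it uses, so here’s a minimal sketch assuming the community Betalgo.OpenAI package (the GetEmbeddings helper name is mine, and the exact namespaces and the Dimensions property may differ across SDK versions):

```csharp
using System;
using System.Linq;
using System.Threading.Tasks;
using OpenAI;
using OpenAI.Managers;
using OpenAI.ObjectModels.RequestModels;

var openAiService = new OpenAIService(new OpenAiOptions
{
    ApiKey = Environment.GetEnvironmentVariable("OPENAI_API_KEY")
});

// Hypothetical helper; we'll reuse it later to embed search queries.
async Task<float[]> GetEmbeddings(string text)
{
    var response = await openAiService.Embeddings.CreateEmbedding(new EmbeddingCreateRequest
    {
        Model = "text-embedding-3-small",
        Input = text,
        Dimensions = 512 // shrink from the model's default; quality holds up per OpenAI
    });

    // This SDK returns doubles; cast to floats for the
    // System.Numerics.Tensors distance function we'll use later.
    return response.Data.First().Embedding
        .Select(d => (float)d)
        .ToArray();
}

foreach (var review in reviews)
{
    review.Embeddings = await GetEmbeddings(review.Combined);
}
```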

I’m using the “small” and currently most recent embeddings model here (text-embedding-3-small) because it’s the cheaper of the two and will likely not change the results of this exercise. There is a larger model, text-embedding-3-large, available at a higher price point. The most obvious difference between the two models is the number of dimensions in the vectors they return (1536 and 3072, respectively). I’m instructing the embeddings endpoint to limit the dimensions to 512, which OpenAI claims can be done without a noticeable dip in quality. Shrinking down the vectors will save us compute time and resources later when we want to store and search these things. As with other OpenAI products, their general guidance is to start as cheap as possible and scale up only when that solution isn’t working. If you’re wondering why I’m casting the results to floats (this SDK returns doubles), it’s for a distance function that we’ll use shortly.

One thing I’m not doing in this sample, but that does appear in OpenAI’s tutorials and samples, is tokenizing my text prior to creating the embeddings. OpenAI uses their tiktoken Python package to count tokens and filter out reviews which contain too many for the API to handle. I couldn’t find an amazing solution for this in C#, and it didn’t seem all that important for this proof of concept, so I skipped that step. I don’t think any reviews in this dataset exceed the API’s token limit, but you might need to treat it as a reality in practice.
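If you do need it, one option worth a look is SharpToken, a C# port of tiktoken. A rough pre-filter might look like this sketch (8191 is the documented input-token limit for the text-embedding-3 models, and “cl100k_base” is the encoding those models use):

```csharp
using System.Linq;
using SharpToken;

var encoding = GptEncoding.GetEncoding("cl100k_base");
const int maxTokens = 8191;

// Keep only reviews that fit within the model's input limit.
var searchableReviews = reviews
    .Where(r => encoding.Encode(r.Combined).Count <= maxTokens)
    .ToList();
```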

Search Your Text

Hang in there, we’re almost at the payoff. To search the embeddings we created and now have at the ready, first convert the query text into embeddings using the same method as above, then calculate the similarity between the query’s embeddings and each entry in your “database” of embeddings using the cosine similarity function (from the “System.Numerics.Tensors” package). That similarity will be a value between 0 and 1 (technically between -1 and 1, though I haven’t observed any negatives yet). Order the results descending by similarity and you’ll get the best results at the top.
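To make the similarity function concrete, here’s a tiny standalone example with hand-made vectors. TensorPrimitives.CosineSimilarity takes two spans of floats (arrays convert implicitly), and nearly parallel vectors score close to 1:

```csharp
using System;
using System.Numerics.Tensors;

float[] a = { 1f, 0f, 0f };
float[] b = { 0.9f, 0.1f, 0f };

// Roughly 0.994: the vectors point in nearly the same direction.
float similarity = TensorPrimitives.CosineSimilarity(a, b);
Console.WriteLine(similarity);
```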

In my sample, I’ve created a new class to hold a reference to the review as well as the similarity value, or relatedness. I’ve also created a Search method which returns the top five results by default:
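A minimal sketch of those two pieces, reusing the GetEmbeddings helper from earlier (the SearchResult and Search names are mine and may not match the repository exactly):

```csharp
using System.Collections.Generic;
using System.Linq;
using System.Numerics.Tensors;
using System.Threading.Tasks;

public record SearchResult(AmazonReview Review, float Relatedness);

async Task<List<SearchResult>> Search(string query, int resultCount = 5)
{
    // Embed the query with the same model and dimension count as the reviews.
    var queryEmbeddings = await GetEmbeddings(query);

    return reviews
        .Select(review => new SearchResult(
            review,
            TensorPrimitives.CosineSimilarity(queryEmbeddings, review.Embeddings)))
        .OrderByDescending(result => result.Relatedness)
        .Take(resultCount)
        .ToList();
}

var results = await Search("really negative reviews");
```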

If you’ve got the Notebook up and running and you’re still with me, pop in a query and give it a whirl. If you search for “really negative reviews”, you should get results similar to:

| Relatedness | Summary |
| --- | --- |
| 0.4616008400917053 | OMG DO NOT BUY!!! |
| 0.4523474872112274 | ABSOLUTELY VILE!!! |
| 0.4389459192752838 | Reeks like chemicals |

Wow

The first time I read through the documentation on embeddings, I thought I understood it in the abstract well enough. I then tried to explain it to another person or two and realized I wasn’t quite there yet. It took some Python, a bit of GPT, and a sprinkle of Google, but I got there. Now it all seems elementary. That said, it still amazes me how little code on my part can be used to make something this cool.

I’m very thankful to the army of others out there doing the difficult stuff.
