How to quickly implement a text search system using pyvespa
By Francisco Pijuan
Intro
In modern information retrieval systems, machine learning models are increasingly leveraged. These models are computationally heavy and challenging to deploy, which makes custom solutions hard to scale and distribute. Existing solutions like Elasticsearch lack functionality here, as they were not designed around deep learning models. This is where Vespa comes in to make things easier.
This blog is for those curious about innovative solutions to the problems that arise from implementing an information retrieval system. Here we will cover a quick and easy way to implement a text search system using Vespa’s python API, pyvespa.
What is Vespa
Vespa is a platform for applications that need low-latency computation over large data sets. It stores and indexes your structured, text and vector data so that queries, selection, processing and machine-learned model inference over the data can be performed quickly at serving time, at any scale. It allows you to write and persist any amount of data, and execute high volumes of queries over that data, typically completing in tens of milliseconds.
Queries can use nearest neighbor vector search, text, and filter conditions to select data. All the matching data is then ranked according to ranking functions – typically machine learned – to implement such use cases as search relevance, recommendation, targeting and personalization.
All the matching data can also be grouped into groups and subgroups where data is aggregated for each group to implement features like graphs, tag clouds, navigational tools, result diversity and so on.
Working with python
Vespa is a scalable, open-source serving engine to store, compute over and rank big data at user serving time. pyvespa provides a python API to Vespa: use it to create, modify, deploy and interact with running Vespa instances. The library's primary goal is to allow for faster prototyping and to facilitate Machine Learning experiments for Vespa applications:
- Build and deploy a Vespa application using pyvespa API.
- Connect to an existing Vespa application and run queries from python.
- Import a Vespa application package from files and use pyvespa to access it.
What is Vespa used for
With low-latency computation over large data sets, a new world of possibilities opens up for applications and features. These are some of the best-known problems people use Vespa to solve:
- Text Search: Vespa has a full-featured text search engine with full support for traditional information retrieval as well as modern embedding based techniques.
- Recommendation and personalization: Recommendation, content personalization and ad targeting are all the same thing when it comes to implementation: For a given user or context, evaluate machine-learned content recommender models to find the best items and show them to the user.
- Question answering: Question answering provides direct answers to the user’s question.
- Semi-structured navigation: Applications that use semi-structured data – that is a combination of data-base like data and plain text – usually benefit from allowing users to navigate in the data using both structured navigation and text search.
- Personal search: Personal search is to provide search in personal collections of data where there is never a need to search across many collections in a single query.
- Typeahead suggestions: applications that take textual input often present a number of suggested completions while the user is typing. This usually involves searching and ranking matching candidate completions with very low latency, a suitable job for Vespa.
In this post we will focus on text search with Vespa, so let’s check what Vespa can do for us.
Text search with Vespa
“Vespa is a full-featured text search engine with full support for traditional information retrieval as well as modern embedding based techniques. … these approaches can be combined efficiently in the same query and ranking model…”
Search applications usually make use of these features of Vespa:
- Full text search with support for word position based matching and relevance, and advanced operators like WAND over text terms.
- Fast approximate nearest neighbor search (ANN) in vector spaces, based on the HNSW algorithm.
- Matching by structured metadata.
- Combining multiple of the above matching operators freely in the same query by AND and OR.
- A large set of relevance features including bm25 and more advanced text features using positional information, geo features, time and so on.
- Ranking by arbitrary mathematical expressions over scalar and tensor features, as well as by models created in LightGbm, XGBoost, TensorFlow or any tool supporting ONNX. This includes support for modern transformer based language models, which are evaluated on the content nodes like other ranking expressions for scalability.
- Support for 2-phase ranking.
- Surfacing any document data in results, with optional support for static or dynamic snippeting with highlighting.
- Deploy application specific query, result and document processors in Java deployed as part of the application.
- Grouping, deduplication and aggregation over all matches.
In this tutorial we will cover most of these Vespa functions, leaving the import of models into Vespa out of scope. Now that we know what Vespa is and what it is used for, let’s dive into how to quickly implement a text search system using pyvespa.
First things first: Data
In this example of using pyvespa we will be using data from amazon products. Our dataset has the following format:
asin | title | description | brand | main_cat | price | embeddings |
B00002N62Y | Eureka 54312-12 Vacuum Cleaner Belt | Eureka Replacement Vacuum Belt | Eureka | Amazon Home | $4.36 | {‘values’: [1.3838, 0.5255, -1.0725, -1.1171, … |
B00002N8CX | Eureka Mighty Mite 3670G Corded Canister Vacuu… | The Mighty Mite canister vacuum is equipped wi… | Eureka | Amazon Home | $12.97 | {‘values’: [1.79, 0.4431, -0.3274, -1.2027, 1…. |
A couple of things to take into account when uploading data to the Vespa app: we need a unique identifier for each row (in this case ‘asin’, the product identifier), and the embedding data should be a dictionary with a key named ‘values’ whose value is the tensor.
Before importing the data, it has to be in JSON format, like this:
[{'asin': 'B00002N62Y', 'title': 'Eureka 54312-12 Vacuum Cleaner Belt', 'category': 'Home & Kitchen Vacuums & Floor Care', 'description': 'Eureka Replacement Vacuum Belt', 'feature': 'Limit 1 per order Returns will not be honored on this closeout item', 'rank': '>#1,098,930 in Home & Kitchen (See Top 100 in Home & Kitchen)>#17,327 in Home & Kitchen > Vacuums & Floor Care', 'brand': 'Eureka', 'main_cat': 'Amazon Home', 'price': '$4.36', 'embeddings': {'values': [1.3838, 0.5255, -1.0725, ..., 0.456, 0.8432, 0.3859]}}]
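As a sketch of that preparation step, assuming the raw rows come as plain dicts with a list-valued embedding column (mirroring the table above; in practice they might come from a pandas DataFrame via to_dict("records")), the records could be shaped like this:

```python
# Hypothetical raw rows for illustration, with a short toy embedding.
raw_rows = [
    {"asin": "B00002N62Y", "title": "Eureka 54312-12 Vacuum Cleaner Belt",
     "brand": "Eureka", "embedding": [1.3838, 0.5255, -1.0725]},
]

def to_feed_record(row):
    # keep all fields except the raw embedding column
    record = {k: v for k, v in row.items() if k != "embedding"}
    # Vespa expects the tensor wrapped in a dict under the key 'values'
    record["embeddings"] = {"values": row["embedding"]}
    return record

feedable = [to_feed_record(r) for r in raw_rows]
print(feedable[0]["embeddings"])  # {'values': [1.3838, 0.5255, -1.0725]}
```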
Now we are ready to create our text search system with vespa.
How to quickly implement a text search system using pyvespa:
- Create an application package
- Add fields to the schema
- Search multiple fields when querying and define how to rank the documents matched
- Build the application package
- Deploy the vespa application
- Feed and update the data
- Query the text search app using the Vespa Query language
1. Create an application package
Create an application package (do not use a hyphen in the name):
from vespa.package import ApplicationPackage

app_package = ApplicationPackage(name="completePipeline")
2. Add fields to the schema
Add fields to the application’s schema:
In terms of data, Vespa operates with the notion of documents. A document represents a single, searchable item in your system, e.g., a news article, a photo, or a user. Each document type must be defined in the Vespa configuration through a schema. The data fed into Vespa must match the structure of the schema, and the results returned when searching will be in this format as well. There is no dynamic field creation support in Vespa; one can say Vespa document schemas are strongly typed.
from vespa.package import Document, Field, HNSW

products_document = Document(
    fields=[
        Field(name="id", type="string", indexing=["attribute", "summary"]),
        Field(name="asin", type="string", indexing=["attribute", "summary"],
              attribute=["fast-search", "fast-access"]),
        Field(name="title", type="string", indexing=["index", "summary"],
              index="enable-bm25"),
        Field(name="description", type="string", indexing=["index", "summary"],
              index="enable-bm25"),
        Field(name="brand", type="string", indexing=["attribute", "summary"]),
        Field(name="main_cat", type="string", indexing=["attribute", "summary"]),
        Field(name="price", type="string", indexing=["attribute", "summary"]),
        Field(
            name="embeddings",
            type="tensor<float>(x[512])",
            indexing=["attribute", "index"],
            ann=HNSW(
                distance_metric="euclidean",
                max_links_per_node=16,
                neighbors_to_explore_at_insert=500,
            ),
        ),
    ]
)
The document is wrapped inside another element called schema.
This document contains several fields. Each field has a type, such as string, int, or tensor. Fields also have properties. For instance, property indexing configures the indexing pipeline for a field, which defines how Vespa will treat input during indexing.
- index: Create a search index for this field.
- attribute: Store this field in memory as an attribute — for sorting, querying, ranking and grouping.
- summary: Lets this field be part of the document summary in the result set.
Here, we also use the index property, which sets up parameters for how Vespa should index the field. For some fields, we configure Vespa to set up an index compatible with bm25 ranking for text search.
We add the embeddings field defined to hold a one-dimensional tensor of floats of size 512. We will store the field as an attribute in memory and build an ANN index using the HNSW (hierarchical navigable small world) algorithm.
Some useful information when defining a document:
Matching
Consider the title field from our schema, and the document for the product with title “Eureka 54312-12 Vacuum Cleaner Belt”. In the original input, the value for title is a string of 5 words, with a single white space character between them. How should we be able to search this field?
For string fields with index, which default to match:text, Vespa performs linguistic processing of the string. This includes tokenization, normalization and language dependent stemming of the string.
In our example, this means that the string above is split into 5 tokens, enabling Vespa to match this document for:
- single-term queries such as “Eureka”, “Vacuum” and “Cleaner”,
- the exact phrase query “Eureka 54312-12 Vacuum Cleaner Belt”,
- a query with two or more tokens in either order (e.g. “Eureka Vacuum”).
This is how we all have come to expect normal free text search to work.
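A rough, pure-python sketch of this matching behavior (Vespa’s real linguistic processing also normalizes and stems the tokens, which this toy version skips):

```python
def tokenize(text):
    # toy tokenizer: lowercase and split on whitespace
    return text.lower().split()

title_tokens = set(tokenize("Eureka 54312-12 Vacuum Cleaner Belt"))

def matches_all(query):
    # 'all' semantics: every query token must appear in the field
    return set(tokenize(query)) <= title_tokens

print(matches_all("Eureka Vacuum"))  # True, regardless of token order
print(matches_all("Dyson Vacuum"))   # False
```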
However, string fields with indexing: attribute do not support match:text, only exact matching or prefix matching. Exact matching is the default, and, as the name implies, it requires you to search for the exact contents of the field in order to get a match. See supported match modes and the differences in support between attribute and index.
Memory usage
Attributes are stored in memory, as opposed to fields with index, where the data is mostly kept on disk but paged in on-demand and cached by the OS buffer cache. Even with large flavor types, one will notice that it is not practical to define all the document type fields as attributes, as it will heavily restrict the number of documents per search node.
When to use attributes
There are both advantages and drawbacks of using attributes — it enables sorting, ranking and grouping, but requires more memory and does not support match:text capabilities.
When to use attributes depends on the application; in general, use attributes for:
- fields used for sorting, e.g. a last-update timestamp,
- fields used for grouping, e.g. category, and
- fields accessed in ranking expressions
Finally, all numeric and tensor fields used in ranking must be defined as attributes.
Combining index and attribute
Combining both index and attribute for the same field is supported. In this case, we can sort and group on the category, while search or matching will be using index matching with match:text, which will tokenize and stem the contents of the field.
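For illustration, such a combined field could be declared like this in pyvespa (a sketch; the category field here is hypothetical and not part of our products schema):

```python
from vespa.package import Field

# hypothetical field combining index and attribute:
# searchable with match:text via the index, sortable and groupable via the attribute
category = Field(
    name="category",
    type="string",
    indexing=["index", "attribute", "summary"],
)
```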
3. Search multiple fields when querying and define how to rank the documents matched
A FieldSet groups fields together for searching. Here it configures queries to look for matches in both the title and the description of the documents:
Specify how to rank the matched documents by defining a RankProfile. Here, the bm25 rank profile combines BM25 scores from title and description:
from vespa.package import Schema, FieldSet, RankProfile

products_schema = Schema(
    name="products",
    document=products_document,
    fieldsets=[FieldSet(name="default", fields=["title", "description"])],
    rank_profiles=[
        RankProfile(name="bm25", inherits="default",
                    first_phase="bm25(title) + bm25(description)"),
        RankProfile(name="nativeRank", inherits="default",
                    first_phase="nativeRank(title,description)"),
        RankProfile(name="embedding_similarity", inherits="default",
                    first_phase="closeness(embeddings)"),
        RankProfile(name="bm25_embedding_similarity", inherits="default",
                    first_phase="bm25(title) + bm25(description) + closeness(embeddings)"),
    ]
)
For the products schema, we define four rank profiles. The embedding_similarity profile uses the Vespa closeness ranking feature, which is defined as 1/(1 + distance), so that products with embeddings closer to the query embedding will be ranked higher than products that are far apart. The bm25 profile is an example of a term-based rank profile, and bm25_embedding_similarity combines both term-based and semantic-based signals as an example of a hybrid approach.
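To make the closeness feature concrete, here is a small pure-python sketch of 1/(1 + distance) with Euclidean distance, using toy 3-dimensional vectors instead of the 512-dimensional embeddings:

```python
import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def closeness(query_vec, doc_vec):
    # Vespa's closeness feature: 1 / (1 + distance)
    return 1.0 / (1.0 + euclidean(query_vec, doc_vec))

query = [1.0, 0.0, 0.0]
near_doc = [0.9, 0.1, 0.0]
far_doc = [-1.0, 0.5, 2.0]

# a nearer document gets a higher closeness score
print(closeness(query, near_doc) > closeness(query, far_doc))  # True
```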
4. Build the application package
The tensors used in queries must have their type declared in a query profile in the application package. The code below declares the text embedding that will be sent through the Vespa query. It has the same size and type as the product embedding.
from vespa.package import ApplicationPackage, QueryProfile, QueryProfileType, QueryTypeField

app_package = ApplicationPackage(
    name="myapp",
    schema=[products_schema],
    query_profile=QueryProfile(),
    query_profile_type=QueryProfileType(
        fields=[
            QueryTypeField(
                name="ranking.features.query(embedding_text)",
                type="tensor<float>(x[512])",
            )
        ]
    )
)
Here we define the schemas to load into the application. In this example we added only the products schema, but there can be more than one. Once the application package is built, we can visualize its components, like the schema:
print(app_package.get_schema('products').schema_to_text)
schema products {
    document products {
        field id type string {
            indexing: attribute | summary
        }
        field asin type string {
            indexing: attribute | summary
            attribute {
                fast-search
                fast-access
            }
        }
        field title type string {
            indexing: index | summary
            index: enable-bm25
        }
        field description type string {
            indexing: index | summary
            index: enable-bm25
        }
        field brand type string {
            indexing: attribute | summary
        }
        field main_cat type string {
            indexing: attribute | summary
        }
        field price type string {
            indexing: attribute | summary
        }
        field embeddings type tensor<float>(x[512]) {
            indexing: attribute | index
            attribute {
                distance-metric: euclidean
            }
            index {
                hnsw {
                    max-links-per-node: 16
                    neighbors-to-explore-at-insert: 500
                }
            }
        }
    }
    fieldset default {
        fields: title, description
    }
    rank-profile bm25 inherits default {
        first-phase {
            expression: bm25(title) + bm25(description)
        }
    }
    rank-profile nativeRank inherits default {
        first-phase {
            expression: nativeRank(title,description)
        }
    }
    rank-profile embedding_similarity inherits default {
        first-phase {
            expression: closeness(embeddings)
        }
    }
    rank-profile bm25_embedding_similarity inherits default {
        first-phase {
            expression: bm25(title) + bm25(description) + closeness(embeddings)
        }
    }
}
If you decide to build the most basic application directly using Vespa configuration files you will end up with something like this:
myapp/
├── schemas
│   └── schema_file.sd
└── services.xml
Visualizing this data can help you build the Vespa application outside pyvespa faster. In the above example, to create the schema_file under the schemas folder you only need to copy and paste the output you got from the get_schema() and schema_to_text calls.
5. Deploy the vespa application
The text search app, with fields, a fieldset to group fields together, and rank profiles to rank matched documents, is now defined and ready to deploy. Deploy app_package on the local machine using Docker, without leaving the notebook, by creating an instance of VespaDocker:
from vespa.deployment import VespaDocker

vespa_docker = VespaDocker()
app = vespa_docker.deploy(application_package=app_package)
app now holds a Vespa instance, used to interact with the application. pyvespa provides an API to define Vespa application packages from python. vespa_docker.deploy exports Vespa configuration files to disk_folder; going through these is a good way to learn about Vespa configuration.
6. Feed and update the data
We can either feed a batch of data for convenience or feed individual data points for increased control. In this post we will be feeding with a batch of data.
Batch
Feed data
We need to prepare the data as a list of dicts with an id key holding a unique id of the data point and a fields key holding a dict with the data fields.
batch_feed = [
    {
        "id": product["asin"],
        "fields": product,
    }
    for product in merged_data_json
]
We then feed the batch to the desired schema using the feed_batch method.
response = app.feed_batch(schema="products", batch=batch_feed)
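Before feeding, it can be worth sanity-checking the batch. A small sketch (pure python, independent of pyvespa) that verifies the ids are unique and each embedding has the expected number of values; the toy batch below uses 2-dimensional embeddings for brevity:

```python
def validate_batch(batch, embedding_size=512):
    ids = [item["id"] for item in batch]
    assert len(ids) == len(set(ids)), "duplicate ids in batch"
    for item in batch:
        values = item["fields"]["embeddings"]["values"]
        assert len(values) == embedding_size, f"bad embedding size for {item['id']}"
    return True

# toy batch for illustration
toy_batch = [
    {"id": "A1", "fields": {"embeddings": {"values": [0.1, 0.2]}}},
    {"id": "A2", "fields": {"embeddings": {"values": [0.3, 0.4]}}},
]
print(validate_batch(toy_batch, embedding_size=2))  # True
```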
Update data
Similarly to the examples about feeding, we can update a batch of data for convenience or update individual data points for increased control.
We need to prepare the data as a list of dicts with an id key holding a unique id of the data point, a fields key holding a dict with the fields to be updated, and an optional create key with a boolean value to indicate if the data point should be created in case it does not exist (defaults to False).
batch_update = [
    {
        "id": product["asin"],  # data_id
        "fields": product,      # fields to be updated
        "create": True,         # optional; create the data point if it does not exist (defaults to False)
    }
    for product in merged_data_json
]
We then update the batch on the desired schema using the update_batch method.
response = app.update_batch(schema="products", batch=batch_update)
7. Query the text search app using the Vespa Query language
Query
Query the text search app using the Vespa Query language by sending the parameters to the body argument of app.query:
query = {
    'yql': 'select * from sources products where userQuery();',
    'query': 'best vacuum cleaner',
    'ranking': 'nativeRank',
    'type': 'all',
    'hits': 5
}
The default query type is all, requiring that all the terms match the document.
results = app.query(body=query)
print('Number of documents retrieved: ' + str(results.number_documents_retrieved))
print('Number of documents returned: ' + str(len(results.hits)))
Number of documents retrieved: 104
Number of documents returned: 5
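The difference between the all semantics and the any semantics used below can be sketched in plain python over toy tokenized documents:

```python
# toy corpus: document id -> set of tokens
docs = {
    "d1": {"best", "vacuum", "cleaner"},
    "d2": {"vacuum", "belt"},
    "d3": {"coffee", "maker"},
}
query_tokens = {"best", "vacuum", "cleaner"}

# type=all: every query token must match the document
all_hits = [d for d, tokens in docs.items() if query_tokens <= tokens]
# type=any: at least one query token must match
any_hits = [d for d, tokens in docs.items() if query_tokens & tokens]

print(all_hits)  # ['d1']
print(any_hits)  # ['d1', 'd2']
```

This is why any retrieves far more documents than all against the same corpus.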
We can then retrieve specific information from the hit list through results.hits, or access the entire Vespa response.
[hit["fields"]["documentid"] for hit in results.hits]
['id:products:products::B00LUQEWMK', 'id:products:products::B00HL0L8FS', 'id:products:products::B00D6CNEQQ', 'id:products:products::B00J3031B8', 'id:products:products::B00LERGLR4']
results.hits[1]
{'id': 'id:products:products::B00HL0L8FS', 'relevance': 0.21797787435902613, 'source': 'myapp_content', 'fields': {'sddocname': 'products', 'documentid': 'id:products:products::B00HL0L8FS', 'asin': 'B00HL0L8FS', 'title': 'Dyson Dc40 Animal Upright Best Rated Bagless Vacuum Cleaner', 'description': 'The Dyson DC40 Animal upright vacuum cleaner is designed for all floor types including carpet, tile, vinyl and wood. The DC40 Animal comes with a mini turbine head tool to clean pet hair and dirt from tight and hard to reach places. If you liked the DC24,DC25,DC28,DC33,DC35,DC44 series then you will definitely love this Dyson Radial Root Cyclone technology with remodeled airflows to maximize suction power. Most customers reviews claim that this to be the best rated bagless portable vacuum cleaner even better from commercial upright,Bosch,sharp,riccar and hepa cleaners. It is self-adjusting cleaner head maintains optimal contact, even on hard floors and your car inside. Ball all floors technology makes it easy to steer right up to edges and tight spots. Instant release wand stretches up to 5 times in length for stair and high reach cleaning. Fingertip controls allow you to instantly turn the motorized brush bar off for hard or delicate floors and rugs. Washable lifetime filter captures allergens and expels cleaner air. Includes a combination tool with soft bristles for gentle dusting. Stair tool for removing dirt and dust from corners and vertical edges of stairs. Materials: ABS plastic, metal, electronic components. Dimensions: 12.2 inches wide x 13.8 inches deep x 42.1 inches high, Weight: 14.51 pounds. Included parts: Combination tool, stair tool, mini turbine head tool. Motor: 200 suction power / Capacity: 0.42 gallons / Cord reach: 24 feet / Hose reach: 15.3 feet / Model: 22913-02 / Dyson parts,accessories,bags.', 'brand': 'Dyson', 'main_cat': 'Amazon Home', 'price': '$339.99'}}
We can change the retrieval mode from all to any:
query = {
    'yql': 'select * from sources products where userQuery();',
    'query': 'best vacuum cleaner',
    'ranking': 'nativeRank',
    'type': 'any',
    'hits': 5
}
results = app.query(body=query)
print('Number of documents retrieved: ' + str(results.number_documents_retrieved))
print('Number of documents returned: ' + str(len(results.hits)))
[hit["fields"]["documentid"] for hit in results.hits]
Number of documents retrieved: 1938
Number of documents returned: 5
['id:products:products::B00LUQEWMK', 'id:products:products::B00HL0L8FS', 'id:products:products::B00ATSRQXW', 'id:products:products::B0042X5RBI', 'id:products:products::B007IX0OGC']
This will retrieve and rank all documents that match any of the query terms. As can be seen from the result, almost all documents matched the query. These types of queries can be performance optimized using the Vespa weakAnd query operator:
query = {
    'yql': 'select * from sources products where userQuery();',
    'query': 'best vacuum cleaner',
    'ranking': 'nativeRank',
    'type': 'weakAnd',
    'hits': 5
}
results = app.query(body=query)
print('Number of documents retrieved: ' + str(results.number_documents_retrieved))
print('Number of documents returned: ' + str(len(results.hits)))
[hit["fields"]["documentid"] for hit in results.hits]
Number of documents retrieved: 1835
Number of documents returned: 5
['id:products:products::B00LUQEWMK', 'id:products:products::B00HL0L8FS', 'id:products:products::B00ATSRQXW', 'id:products:products::B0042X5RBI', 'id:products:products::B007IX0OGC']
In this case, a much smaller set of documents were completely ranked due to the use of weakAnd instead of any and we got the same top 5 ranked products.
In any case, the retrieved documents are ranked by the relevance score, which here is delivered by the nativeRank rank feature that we requested via the ranking parameter of the query.
Query to embedding and ANN
To create embeddings from the queries, we’ll use a BERT-based model from the HuggingFace transformers library:
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertModel.from_pretrained('google/bert_uncased_L-8_H-512_A-8')

def query_to_embedding(query):
    tokens = tokenizer(query, return_tensors="pt", max_length=100,
                       truncation=True, padding=True)
    outputs = model(**tokens)
    # take the [CLS] token embedding from the last hidden state
    embedding_query = outputs[0].tolist()[0][0]
    return embedding_query
Here, we use a medium-sized BERT model with 8 layers and a hidden dimension size of 512. This means that the embedding will be a vector of size 512.
Vespa allows you to load models created in LightGBM, XGBoost, TensorFlow or any tool supporting ONNX. For simplicity, we decided to run our BERT model outside Vespa for this tutorial.
Query level retrieval
The query below sends the query text embedding (query_to_embedding(query_text)) through the ranking.features.query(embedding_text) parameter and uses the nearestNeighbor search operator to retrieve the 100 closest products in embedding space, using Euclidean distance as configured in the HNSW settings. The products returned will be ranked by the embedding_similarity rank profile defined in the products schema.
query_text = 'Eureka Light Speed 200 No-touch Bagless Upright Vacuum Cleaner - Red'
query = {
    'yql': 'select * from sources products where ({targetHits:100}nearestNeighbor(embeddings,embedding_text));',
    'hits': 5,
    'query': query_text,
    'ranking.features.query(embedding_text)': query_to_embedding(query_text),
    'ranking.profile': 'embedding_similarity'
}
results = app.query(body=query)
print('Number of documents retrieved: ' + str(results.number_documents_retrieved))
print('Number of documents returned: ' + str(len(results.hits)))
[hit["fields"]["documentid"] for hit in results.hits]
Number of documents retrieved: 100
Number of documents returned: 5
['id:products:products::B003RL86FU', 'id:products:products::B00D41LZFQ', 'id:products:products::B003JKH5MY', 'id:products:products::B00GAWPRUA', 'id:products:products::B00008OIG6']
Query level hybrid retrieval
In addition to sending the query embedding, we can send the query string (query_text) via the query parameter and use the or operator to retrieve documents that satisfy either the semantic operator nearestNeighbor or the term-based operator userQuery. Setting type equal to any means that the term-based operator will retrieve all the documents that match at least one query token. The retrieved documents will be ranked by the hybrid rank profile bm25_embedding_similarity.
query_text = 'Eureka Light Speed 200 No-touch Bagless Upright Vacuum Cleaner - Red'
query = {
    'yql': 'select * from sources products where ({targetHits:100}nearestNeighbor(embeddings,embedding_text)) or userQuery();',
    'hits': 5,
    'query': query_text,
    'type': 'any',
    'ranking.features.query(embedding_text)': query_to_embedding(query_text),
    'ranking.profile': 'bm25_embedding_similarity'
}
results = app.query(body=query)
print('Number of documents retrieved: ' + str(results.number_documents_retrieved))
print('Number of documents returned: ' + str(len(results.hits)))
[hit["fields"]["documentid"] for hit in results.hits]
Number of documents retrieved: 1938
Number of documents returned: 5
['id:products:products::B003RL86FU', 'id:products:products::B00IE3OC1I', 'id:products:products::B00CTQ8RTY', 'id:products:products::B00B34TW4M', 'id:products:products::B008MM5LRA']
8. Cleanup
vespa_docker.container.stop(timeout=600)
vespa_docker.container.remove()
Final thoughts
Today you learned the basics of Vespa, from setting up a document schema to making hybrid queries using the query as text and as a tensor. Take into account that pyvespa is meant to be used as an experimentation tool for Information Retrieval (IR), not for building production-ready applications. If your application requires functionality or fine-tuning not available in pyvespa, build it directly using Vespa configuration files, as shown in many examples in the Vespa docs.
Here at Marvik we are always looking to apply these types of innovative solutions. If you are looking to learn more about how to implement Vespa in your project, reach out to [email protected] and we can help you out.