Introduction
This article introduces Couchbase, a non-relational database: its architecture, its data structure, and the indexes that speed up query execution.
Couchbase concepts and architecture
Couchbase is an open-source, distributed, document-oriented NoSQL database designed for low-latency data management in large web, mobile, and IoT applications.
Couchbase services
Before you start studying indexes, you need to identify the services that make up Couchbase, because the process of indexing involves different services.
Figures 1 and 2 show an example architecture of a Couchbase cluster.
A cluster is a configuration that combines multiple Couchbase servers into a single distributed data store. When Couchbase servers are clustered, they all provide the same functionality and the same interfaces for accessing the same data. Each member of the cluster (a node) can manage the entire cluster, and each node holds statistics about the operation of the entire cluster.
A cluster scales horizontally: to increase its size, you add more servers (nodes). The cluster manager is a program that coordinates the actions of all nodes and provides a simple interface for managing the entire cluster.
Couchbase Server provides several services. They can be deployed, maintained, and provisioned independently (on different nodes) to enable multidimensional scaling. For example, a single node can host several services (Figure 1) or just one service (Figure 2). The services are:
- Data Service: stores, sets, and retrieves data items by key.
- Query Service: parses queries written in N1QL, executes them, and returns the results. Query Service interacts with both Data Service and Index Service.
- Index Service: creates indexes for use by Query Service and Analytics Service, and supports the creation of primary and secondary indexes on items stored in Data Service (more information is given below).
- Search Service: creates indexes designed for full-text search (more information is given below).
- Analytics Service: processes data in parallel, which makes it possible to execute complex analytical queries (more information is given below).
- Eventing Service: supports near-real-time handling of data changes, either in response to document changes or on a timer schedule.
- Backup Service: supports scheduling of full and incremental data backups, either for individual buckets or for all buckets in a cluster.
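To make the idea of multidimensional scaling more concrete, here is a minimal Python sketch that models which services run on which node. It does not call Couchbase; the node names, the layout, and the helpers `nodes_running` and `add_node` are hypothetical and exist only for illustration.

```python
# Hypothetical three-node layout; node names are made up for illustration.
cluster_layout = {
    "node1": {"data", "query", "index"},   # several services on one node
    "node2": {"data"},                     # a single service on a node
    "node3": {"analytics"},                # a node dedicated to one workload
}


def nodes_running(layout, service):
    """Return the sorted names of the nodes that host the given service."""
    return sorted(node for node, services in layout.items() if service in services)


def add_node(layout, name, services):
    """Scale one dimension by adding a node that runs only the listed services."""
    layout[name] = set(services)


# Scale the indexing workload without touching the data or analytics nodes.
add_node(cluster_layout, "node4", ["index"])
print(nodes_running(cluster_layout, "index"))
```

In this toy layout, adding `node4` increases indexing capacity only, which is the point of scaling each service independently.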
Data can be replicated between cluster nodes to ensure that the loss of a node does not result in data loss. Information describing how to set up a cluster, nodes, and services is given in the official documentation.
Couchbase data architecture
Indexes are created at specific levels of the data structure, so it is worth reviewing that structure first.
The Couchbase data structure is shown in Figure 3.
Couchbase distinguishes the following levels in its data storage structure:
- cluster;
- bucket;
- scope;
- collection;
- document.
The cluster is described above.
A bucket is the primary storage space on a server. Each bucket contains a hierarchy of scopes and collections for logical grouping of keys and values (i.e., documents). You can learn more about them here and here. Information on configuring the data storage structure is available here.
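The hierarchy can be pictured as nested containers. The following Python sketch uses plain dictionaries as a stand-in for Couchbase storage; the helper `get_document` and the dotted path format are assumptions made for this illustration, not a Couchbase API.

```python
# Toy model of the bucket -> scope -> collection -> document hierarchy.
# Plain dictionaries stand in for Couchbase storage; nothing here calls Couchbase.
cluster = {
    "travel-sample": {                      # bucket
        "inventory": {                      # scope
            "airline": {                    # collection
                "airline_10": {"name": "40-Mile Air", "type": "airline"},
            },
            "airport": {},                  # an empty collection
        },
    },
}


def get_document(cluster, keyspace, key):
    """Resolve a 'bucket.scope.collection' path (hypothetical helper) and return a document."""
    bucket, scope, collection = keyspace.split(".")
    return cluster[bucket][scope][collection].get(key)


doc = get_document(cluster, "travel-sample.inventory.airline", "airline_10")
print(doc["name"])
```

Each level narrows the lookup: the bucket selects the storage space, the scope and collection select the logical group, and the key selects the document.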
Couchbase Indexes
An index is a data structure that provides a quick and efficient way to access data as opposed to scanning a large number of documents. Indexes improve query performance.
During indexing, a separate data structure is created. It maps the values of the stored data (column values in a relational database, attribute values in documents) to the corresponding storage locations, which lets the database quickly find the matching records for a query without scanning the entire table or collection.
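As a rough illustration of this idea, the Python sketch below compares a full scan with a lookup in a separate index structure. The documents other than "airline_10" are made up, and the dictionary-based index is a simplification, not how Index Service stores its data.

```python
from collections import defaultdict

# Illustrative documents keyed by document key (only airline_10 comes from the article).
documents = {
    "airline_10": {"name": "40-Mile Air", "country": "United States"},
    "airline_11": {"name": "Sample Air", "country": "France"},
    "airline_12": {"name": "Example Jet", "country": "United States"},
}


def full_scan(docs, field, value):
    """Check every document: the cost grows with the number of documents."""
    return sorted(key for key, doc in docs.items() if doc.get(field) == value)


def build_index(docs, field):
    """Build a separate structure that maps each field value to document keys."""
    index = defaultdict(list)
    for key, doc in docs.items():
        if field in doc:
            index[doc[field]].append(key)
    return index


country_index = build_index(documents, "country")
indexed = sorted(country_index["United States"])   # one lookup in the index
scanned = full_scan(documents, "country", "United States")
print(indexed == scanned, indexed)
```

Both approaches return the same keys; the difference is that the index answers with a single lookup, while the scan has to inspect every document.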
Indexes are used by the following services for search procedures: Query, Analytics and Search. Query Service uses the indexes provided by Index Service. Search Service and Analytics Service provide their own internal indexes. Indexes are created based on different components of Couchbase data, so let’s consider them in detail.
Couchbase Server supports a flexible data model using JSON. Data is stored as documents with unique keys created by users:
[
  {
    "id": "airline_10",   (1)
    "travel": {
      "callsign": "MILE-AIR",
      "country": "United States",
      "iata": "Q5",
      "icao": "MLA",
      "id": 10,
      "name": "40-Mile Air",
      "type": "airline"
    }   (2)
  }
]
(1) — Document key.
(2) — Document body.
The following types of indexes are available in Couchbase:
- Primary.
- Secondary.
- Full Text (Search Service index).
- Analytics (Analytics Service indexes).
- View: deprecated since version 7.0 and scheduled for removal in a later version. More information is available in the official documentation.
Let’s consider the Primary, Secondary, Full Text, and Analytics indexes in detail.
Primary index
A primary index is based on the unique key of each item (document) in the specified collection. It is intended for simple queries that have no filters (conditions that select a subset of the data, for example, WHERE t1.type = "airline") and no predicates (logical expressions). It is provided by Index Service.
It is a sorted list of all document keys.
Relying on the primary index can degrade query performance: the query first retrieves all documents (the analogue of a full table scan, that is, a scan of all documents from start to end) and only then filters them by the specified attribute. To improve query performance, we recommend using a secondary index.
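The following Python sketch models the primary index as a sorted list of document keys, assuming in-memory data. Counting documents or scanning a key range needs only the keys, while filtering by an attribute forces a fetch of every document. The helper names and the documents other than "airline_10" are hypothetical.

```python
import bisect

documents = {
    "airline_10": {"type": "airline", "country": "United States"},
    "airport_1254": {"type": "airport", "country": "France"},   # made-up document
    "airline_137": {"type": "airline", "country": "France"},    # made-up document
}

# The primary index modeled as a sorted list of document keys.
primary_index = sorted(documents)


def keys_with_prefix(index, prefix):
    """Range-scan the sorted keys; no documents are read."""
    start = bisect.bisect_left(index, prefix)
    end = bisect.bisect_left(index, prefix + "\uffff")
    return index[start:end]


def filter_via_primary(index, docs, field, value):
    """Filter on an attribute: every key is visited and every document is fetched."""
    fetched = 0
    matches = []
    for key in index:
        fetched += 1
        if docs[key].get(field) == value:
            matches.append(key)
    return matches, fetched


print(len(primary_index))                           # counting needs only the keys
print(keys_with_prefix(primary_index, "airline_"))
matches, fetched = filter_via_primary(primary_index, documents, "country", "France")
print(matches, fetched)
```

Here the attribute filter touches all three documents to return two of them, which is the cost that a secondary index on the filtered attribute avoids.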
Primary indexes are optional and are needed only for special queries that no secondary index supports. For example, a primary index can be used when a full scan of a specific bucket is required, such as counting all documents:
SELECT COUNT(*) FROM `bucket-name`
Example of creating a primary index:
CREATE PRIMARY INDEX idx_airport_primary ON `travel-sample`.inventory.airport USING GSI;
where:
- CREATE PRIMARY INDEX — an operator that allows creating a primary index.
- idx_airport_primary — index name.
- `travel-sample`.inventory.airport — reference to the data key space in which the index is created: `travel-sample` — bucket, inventory — scope, airport — collection.
- USING GSI — specifies the index type. In Couchbase Server 6.5 and later, the primary index type must be a Global Secondary Index (GSI). These keywords are optional and may be omitted.
Examples of creating primary indexes are available here.
Secondary index (GSI)
A secondary index is often referred to as a Global Secondary Index (GSI). It is based on an attribute inside the document. The value of the attribute can be of any type: a scalar, an object, or an array. It is provided by Index Service.
In the previous example, we highlighted the document key (1), on which the primary index is based. The following example highlights a document attribute (3), on which a secondary index is based.
[
  {
    "id": "airline_10",   (1)
    "travel": {
      "callsign": "MILE-AIR",
      "country": "United States",
      "iata": "Q5",
      "icao": "MLA",
      "id": 10,
      "name": "40-Mile Air",   (3)
      "type": "airline"
    }   (2)
  }
]
Example of a secondary index based on the ‘name’ attribute:
CREATE INDEX `idx-name` ON airline(name);
The ‘name’ attribute contains a scalar value: the string “40-Mile Air”. An indexed attribute can also hold an object or an array.
A secondary index can also be created for multiple attributes (composite secondary index):
CREATE INDEX travel_info ON airline(name, id, icao, iata);
Or it can include different functions (functional index):
CREATE INDEX travel_cx1 ON airline (LOWER(name));
More examples of the secondary index are available here.
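The three kinds of secondary index described above differ only in the expression used to compute the index key. The Python sketch below illustrates this with an in-memory list of (index key, document key) pairs; it is a simplified model, and `build_secondary_index`, `lookup`, and the "airline_99" document are hypothetical.

```python
airlines = {
    "airline_10": {"name": "40-Mile Air", "id": 10, "icao": "MLA", "iata": "Q5"},
    "airline_99": {"name": "Sample Air", "id": 99, "icao": "SMP", "iata": "S1"},  # made-up document
}


def build_secondary_index(docs, key_func):
    """Return sorted (index key, document key) pairs for the given key expression."""
    return sorted((key_func(doc), doc_key) for doc_key, doc in docs.items())


# Analogue of an index on (name): a single attribute.
by_name = build_secondary_index(airlines, lambda d: d["name"])
# Analogue of a composite index on (name, id, icao, iata): a tuple of attributes.
composite = build_secondary_index(
    airlines, lambda d: (d["name"], d["id"], d["icao"], d["iata"])
)
# Analogue of a functional index on LOWER(name): an expression over an attribute.
by_lower_name = build_secondary_index(airlines, lambda d: d["name"].lower())


def lookup(index, index_key):
    """Linear lookup for brevity; a real index would use a tree or a similar structure."""
    return [doc_key for key, doc_key in index if key == index_key]


print(lookup(by_lower_name, "40-mile air"))
print(lookup(composite, ("40-Mile Air", 10, "MLA", "Q5")))
```

The functional variant matches regardless of the original letter case because the index stores the lowercased name, so a query must apply the same expression to benefit from it.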
In Couchbase Server 7.0 and later, a global secondary index is created for a single collection, not for the entire bucket (but a single collection can have multiple indexes).
Why is it better to use a secondary index instead of a primary index?
Every document has a unique key, on which the primary index is built. But most applications need to query by specific data fields, for example, by the name of an establishment, the city, the available payment options, and so on, that is, by secondary attributes.
Customers can then submit queries such as getting the names of all restaurants in a given city, provided that a secondary index has been created on the city attribute. The database simply goes to the index entries for that city and reads only the matching documents. Without a secondary index, the database has to go through all documents one by one, check whether each restaurant is in the given city, and only then return the name. In most cases, secondary indexes therefore make queries much more efficient than primary indexes or queries without any index.
Primary and secondary (global) indexes are maintained by a dedicated service, Index Service, which must be enabled on at least one node in the cluster. When a query uses an index, the following occurs:
- An N1QL query arrives at one of the nodes that run Query Service.
- Query Service contacts Index Service, which has information about all indexed documents on all nodes in the cluster.
- Query Service aggregates the result (data obtained from different cluster nodes) and returns a response.
The following diagram shows the workflow of a non-indexed query:
The following diagram shows the query execution workflow with indexes:
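The second diagram shows that a query using the index avoids the additional steps of retrieving data from Data Service (this is possible when the index itself contains every field the query needs, which is called a covering index). This results in a significant performance improvement.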
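The difference between the two workflows can be sketched in Python by counting how many documents have to be fetched from the data store. This is a toy model under stated assumptions: the index is a hypothetical composite index on (country, name), the "airline_137" document is made up, and `run_query` only imitates what Query Service does.

```python
# Toy model: documents live in the "Data Service", index entries in the "Index Service".
documents = {
    "airline_10": {"name": "40-Mile Air", "country": "United States", "iata": "Q5"},
    "airline_137": {"name": "Sample Air", "country": "France", "iata": "S1"},  # made-up document
}

# Hypothetical composite index on (country, name): ((country, name), document key).
index_country_name = sorted(
    ((doc["country"], doc["name"]), key) for key, doc in documents.items()
)


def run_query(select_field, country):
    """Imitate SELECT <select_field> ... WHERE country = <country>.

    Returns the result rows and the number of documents fetched from the data store.
    """
    rows, data_fetches = [], 0
    for (idx_country, idx_name), doc_key in index_country_name:
        if idx_country != country:
            continue
        if select_field == "name":
            rows.append(idx_name)           # answered from the index alone
        else:
            data_fetches += 1               # extra step: fetch the document
            rows.append(documents[doc_key][select_field])
    return rows, data_fetches


covered = run_query("name", "France")
not_covered = run_query("iata", "France")
print(covered)
print(not_covered)
```

Selecting `name` is answered entirely from the index, whereas selecting `iata` requires one fetch per matching document because that field is not part of the index.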
Full Text
A Full Text index contains targets derived from the text content of documents in one or more keyspaces. It is provided by Search Service.
A Full Text index enables full-text search.
Full-text search refers to methods of finding text in a document or document collection. It involves searching for some text in extensive text data and returning results containing some or all of the words from the query.
Full-text search offers rich support for natural-language queries. Text queries can carry special search constraints (for example, matches of varying degrees of exactness, similar to Google search).
Natural language support allows searching for the word ‘traveling’ and additionally getting results for ‘travel’ and ‘traveler’. Full-text search also scores search results by relevance, so users can get result sets containing only the documents with the highest scores (the best matches). Thanks to this, the number of search results remains small, even if the total number of matching documents is extremely large. It is possible to search for text matches of varying degrees of exactness: exact text match, phrase search, regular expression, logical expression. Learn more here.
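A rough Python sketch of both ideas, word-form matching and relevance ranking, is given below. The suffix-stripping `stem` function and the term-count score are deliberate simplifications chosen for this example; they are not the algorithms Search Service uses, and the sample texts are made up.

```python
import re


def stem(word):
    """Very rough suffix stripping, for illustration only."""
    for suffix in ("ing", "er", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def analyze(text):
    """Lowercase the text, split it into words, and reduce each word to a stem."""
    return [stem(token) for token in re.findall(r"[a-z0-9]+", text.lower())]


def search(docs, query, limit=2):
    """Score each document by the number of matching stems and keep the best ones."""
    query_terms = set(analyze(query))
    scored = []
    for key, text in docs.items():
        score = sum(1 for token in analyze(text) if token in query_terms)
        if score:
            scored.append((score, key))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for _, key in scored[:limit]]


docs = {
    "doc_1": "A traveler guide: travel light when traveling.",
    "doc_2": "Travel insurance basics.",
    "doc_3": "Cooking at home.",
}
print(search(docs, "traveling"))
```

The query 'traveling' matches 'traveler' and 'travel' as well, and the `limit` parameter keeps the result set small by returning only the highest-scoring documents.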
Both full-text search and the creation of a full-text index rely on Couchbase components called analyzers. Analyzers convert text into tokens, which makes full-text search more flexible.
Analyzers go through each text field of the dataset and break the text into a list of words — tokens.
An index is created by adding each of these words (tokens) with a reference to the document in which the given word (token) can be found.
For example, Figure 6 shows source text documents:
Then the index will look like this (Figure 7):
You can learn more about the operation of Analyzers here and here.
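The tokenize-then-index process described above can be sketched in a few lines of Python. The `analyze` function here is a toy analyzer (lowercasing and splitting on non-alphanumeric characters), and the two sample documents are made up; real analyzers are configurable and considerably more capable.

```python
import re
from collections import defaultdict


def analyze(text):
    """Toy analyzer: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map every token to the set of documents in which it occurs."""
    index = defaultdict(set)
    for key, text in docs.items():
        for token in analyze(text):
            index[token].add(key)
    return index


def search_all(index, query):
    """Return the documents that contain every token of the query."""
    tokens = analyze(query)
    if not tokens:
        return []
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return sorted(result)


docs = {
    "doc_1": "Couchbase stores JSON documents",
    "doc_2": "Indexes speed up JSON queries",
}
inverted_index = build_inverted_index(docs)
print(sorted(inverted_index["json"]))
print(search_all(inverted_index, "JSON queries"))
```

A search never reads the source texts: it looks up each query token in the index and intersects the resulting document sets.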
Full-text index scope: it can be created for several collections located in a single scope.
Full-text indexes can be created either in the Couchbase Web Console or through the REST API. You can learn more here.
Full-text indexes can be created for one document field or for several fields:
- index of one field;
- dynamic index;
- Geopoint index, etc.
Learn more here.
Advantages:
- The full-text index supports various search methods, such as full text matching, phrase search, and regular expression search. This allows for greater flexibility and customizable search options.
- The full-text index is typically used in applications or platforms that rely heavily on textual content, such as content-rich websites, articles, and documentation. It ensures an efficient search for text data and provides users with fast and accurate results.
Analytics index
An Analytics index is provided by Analytics Service. It is a materialized access path to the shadow data of Analytics Service. If changes in operational data result in corresponding changes in shadow data, the Analytics indexes are updated automatically.
Analytics Service is best suited to running large, complex queries over large amounts of data.
It helps users analyze their application data. To do this, users create shadow copies of the data, typically on a separate node dedicated to Analytics Service (although a separate node is not strictly required). Once shadow datasets are created, they are connected to Data Service, and any changes to the live data in Data Service are reflected in Analytics Service in near real time. Users can therefore query the analytics data with SQL++ for Analytics without slowing down the operational data or the query services.
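The relationship between live data and its shadow copy can be sketched as a change feed that the shadow dataset replays. The Python classes below are a toy model invented for this illustration; the class names, the `sync` method, and the in-memory change list are assumptions and do not reflect the actual replication protocol.

```python
class DataService:
    """Toy operational store that records every change in an ordered feed."""

    def __init__(self):
        self.documents = {}
        self.changes = []   # ordered change feed read by subscribers

    def upsert(self, key, doc):
        self.documents[key] = doc
        self.changes.append(("upsert", key, dict(doc)))

    def remove(self, key):
        del self.documents[key]
        self.changes.append(("remove", key, None))


class ShadowDataset:
    """Toy shadow copy that replays changes it has not seen yet."""

    def __init__(self):
        self.documents = {}
        self.position = 0   # index of the next unread change

    def sync(self, data_service):
        for operation, key, doc in data_service.changes[self.position:]:
            if operation == "upsert":
                self.documents[key] = doc
            else:
                self.documents.pop(key, None)
        self.position = len(data_service.changes)


data_service = DataService()
shadow = ShadowDataset()
data_service.upsert("airline_10", {"name": "40-Mile Air"})
data_service.upsert("airline_137", {"name": "Sample Air"})   # made-up document
data_service.remove("airline_137")
shadow.sync(data_service)
print(shadow.documents)
```

Analytical queries in this model read only `shadow.documents`, so they never compete with operational reads and writes on the live store.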
Figures 8 and 9 show query interaction with Data Service and Analytics Service.
As the figures show, regular queries are processed by three services: Query, Index, and Data. Large analytical queries are processed by Analytics Service, located on separate servers.
Analytics Service is a separate service that stores its own data (synchronized with the main data stores), including its own indexes.
The Analytics Service repository includes:
- The primary index, which holds the primary keys and the data (as described earlier: (1) document key, (2) document body).
- Secondary indexes, which hold secondary keys together with primary keys. Analytics Service uses a local indexing strategy, so each secondary index is co-located with the primary index: every secondary index entry refers (by key) to a locally stored primary object. A secondary index can be created on any field(s) of the dataset objects.
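The local indexing strategy can be illustrated with a small Python model in which every partition keeps its own primary index and its own secondary index. The partition count, the hash partitioning by `zlib.crc32`, and all documents except "airline_10" are assumptions made for this sketch.

```python
import bisect
import zlib


class Partition:
    """One partition holding a primary index and a local secondary index."""

    def __init__(self):
        self.primary = {}     # primary key -> document
        self.secondary = []   # sorted (secondary key, primary key) pairs

    def insert(self, key, doc, field):
        self.primary[key] = doc
        bisect.insort(self.secondary, (doc[field], key))

    def lookup(self, value):
        """Find local primary keys via the secondary index, then read the local documents."""
        position = bisect.bisect_left(self.secondary, (value,))
        results = []
        while position < len(self.secondary) and self.secondary[position][0] == value:
            primary_key = self.secondary[position][1]
            results.append(self.primary[primary_key])
            position += 1
        return results


def partition_for(key, count):
    """Deterministic hash partitioning on the primary key (an assumption of this sketch)."""
    return zlib.crc32(key.encode("utf-8")) % count


partitions = [Partition() for _ in range(2)]
airlines = {
    "airline_10": {"name": "40-Mile Air", "country": "United States"},
    "airline_137": {"name": "Sample Air", "country": "France"},           # made-up document
    "airline_200": {"name": "Example Jet", "country": "United States"},   # made-up document
}
for key, doc in airlines.items():
    partitions[partition_for(key, len(partitions))].insert(key, doc, "country")

# The query fans out to every partition; each one answers from its own local indexes.
names = sorted(doc["name"] for p in partitions for doc in p.lookup("United States"))
print(names)
```

Because each secondary entry points to a document stored in the same partition, a lookup never has to cross partitions to resolve a primary key.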
Secondary indexes can be used to accelerate several kinds of queries:
- simple selection queries;
- queries with joins (as in SQL);
- queries containing arrays, etc.
You can read how indexes can speed up an analytical query here.
Conclusion
In conclusion, Couchbase is a high-performance distributed database built from several services, which makes queries and the system as a whole more efficient and flexible.
Couchbase offers several types of indexes, created by different services, and properly designed indexes noticeably improve query performance.
List of references
- https://docs.couchbase.com/server/current/learn/architecture-overview.html
- https://habr.com/ru/companies/ruvds/articles/724066/
- https://docs.couchbase.com/server/current/learn/services-and-indexes/indexes/indexing-and-query-perf.html
- https://docs.couchbase.com/server/current/learn/services-and-indexes/indexes/indexes.html
- https://docs.couchbase.com/server/current/n1ql/n1ql-language-reference/covering-indexes.html
- https://blog.knoldus.com/do-we-really-need-primary-index-in-couchbase/
- https://www.couchbase.com/blog/primary-uses-for-couchbase-primary-index/
- http://geekrai.blogspot.com/2017/08/couchbase-primary-vs-secondary-indexes.html
- https://habr.com/ru/articles/738734/
- https://www.couchbase.com/blog/full-text_search_text_analysis/
- https://docs.couchbase.com/server/current/fts/fts-index-analyzers.html
- https://www.mongodb.com/basics/full-text-search
- https://docs.couchbase.com/c-sdk/current/concept-docs/full-text-search-overview.html#6.5@server:fts:full-text-intro.adoc
- https://medium.com/@erayaraz10/full-text-index-in-sql-explained-with-examples-892d88e357d5
- https://habr.com/ru/companies/oleg-bunin/articles/528346/
- https://kovardin.ru/articles/go/ispolzuem-polnotekstovyi-poisk-fts-s-couchbase-a-go-prilozhenii/
- https://www.couchbase.com/blog/web-console-for-full-text-indexes/
- https://www.linkedin.com/pulse/couchbase-analytics-customers-moments-truth-revealed-idris-motiwala
- https://www.vldb.org/pvldb/vol12/p2275-hubail.pdf
- https://www.youtube.com/watch?v=aCfjeTm2r2Q