Apr 12, 2018



Index Queries in FaunaDB

Most applications need to look up database records by attribute, for instance to find all customers in a given postal code, or to create a game leaderboard by listing players by high score. Without database indexes, these queries would have to examine every record in order to return the correct result. Indexes are storage structures maintained by the database system to facilitate efficient retrieval of subsets of the data.

(Image: a card puncher entering data into a card index)

There are several types of indexes; in this post we’ll focus on two: term indexes, which support lookup by a secondary key, and range indexes, which support sorted pagination.

Term-based indexes can be used to match a login token to a user record when processing an authentication request, or for other simple queries like listing all customers in a given postal code, or all content with a particular tag. Beyond conceptual simplicity, an advantage of term-based indexes is that they can be backed by lightweight data structures which scale out easily to support big data without performance degradation. The downside is that they are limited to finding items by exact match on the term, and don’t typically support paginating across terms. As such, they are a useful part of the toolkit when building high-scale applications, as they segment the data in a way that is fast to work with.
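To make the idea concrete, here is a minimal sketch of a term index as a hash map from term to matching record ids. The data and function names are illustrative, not FaunaDB internals:

```javascript
// Minimal sketch of a term index: a hash map from term to the list of
// matching record ids. Lookup by term is a single hash probe.
function buildTermIndex(records, field) {
  const index = new Map();
  for (const [id, record] of Object.entries(records)) {
    const term = record[field];
    if (!index.has(term)) index.set(term, []);
    index.get(term).push(id);
  }
  return index;
}

const customers = {
  c1: { name: "Ada", postalCode: "94107" },
  c2: { name: "Grace", postalCode: "10001" },
  c3: { name: "Edsger", postalCode: "94107" },
};

const byPostalCode = buildTermIndex(customers, "postalCode");
console.log(byPostalCode.get("94107")); // → [ 'c1', 'c3' ]
console.log(byPostalCode.get("10001")); // → [ 'c2' ]
```

Note that this structure answers "all customers in 94107" instantly, but offers no way to walk postal codes in order, which is exactly the limitation described above.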

Range indexes support sorted pagination, so you can browse items by price, songs by artist, albums by year released, players by high score, etc. They are typically backed by something like a B-tree storage engine to allow for efficient scanning. This is fast on a single machine, but scalable systems must partition ranges to let an individual index grow beyond what one machine can hold. As the number of partitions increases, so does the probability of a query encountering a slow-responding node. This means developers of large-scale systems should prefer secondary key lookups to range queries when working with big data. But that isn’t always a tradeoff one can make.
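The essential property of a range index is that covered values are kept sorted, so a page of results is just a contiguous slice. A minimal sketch, with illustrative data:

```javascript
// Minimal sketch of a range index: covered values kept in sorted order,
// so a page of results is a contiguous slice of the array.
const prices = [3, 14, 15, 9, 26, 5].sort((a, b) => a - b);

// Return up to `size` values starting at the first value >= `from`.
function rangePage(sorted, from, size) {
  const start = sorted.findIndex(v => v >= from);
  return start === -1 ? [] : sorted.slice(start, start + size);
}

console.log(rangePage(prices, 9, 3)); // → [ 9, 14, 15 ]
```

A real storage engine uses a B-tree rather than an array so inserts stay cheap, but the scan-a-sorted-slice access pattern is the same.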

FaunaDB offers a developer-friendly way to combine the strengths of both approaches when indexing your data. Each record in FaunaDB can be indexed by term for fast, scalable lookups. Within each term, records are sorted into ranges according to covered value. This allows your indexing scheme to evolve as your app grows. For small apps where ease of development is the primary concern, sorting all events by timestamp might make sense. But as your app grows, it could make more sense to group events by day, sorting by timestamp within each day. That way each day’s index has a manageable, fixed size, and older index data becomes read-mostly. Large-scale apps usually adopt these patterns the hard way; in FaunaDB it’s as simple as using the event’s calendar day as the index term.
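The day-bucketing pattern can be sketched in a few lines: derive a term from each event's timestamp so every day gets its own small, sorted bucket. The field names here are illustrative:

```javascript
// Sketch of day bucketing: the index term is the event's calendar day,
// so each day's bucket stays small and old buckets become read-mostly.
function dayTerm(timestampMs) {
  return new Date(timestampMs).toISOString().slice(0, 10); // "YYYY-MM-DD"
}

const events = [
  { id: "e1", ts: Date.UTC(2018, 3, 11, 9, 30) },  // Apr 11
  { id: "e2", ts: Date.UTC(2018, 3, 12, 8, 0) },   // Apr 12
  { id: "e3", ts: Date.UTC(2018, 3, 12, 17, 45) }, // Apr 12
];

const byDay = new Map();
for (const e of events) {
  const term = dayTerm(e.ts);
  if (!byDay.has(term)) byDay.set(term, []);
  byDay.get(term).push(e.id); // events arrive in timestamp order here
}

console.log([...byDay.keys()]); // → [ '2018-04-11', '2018-04-12' ]
```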

Every scalable database takes its own approach to offering developers control over index locality. I like the way FaunaDB balances giving control over data layout with a simple programmatic API that requires no administrative support to use, so even FaunaDB Cloud customers (or application developers connected to your FaunaDB Enterprise cluster) can control index layout in a scalable way.

Using Term Indexes

Here is an example of a simple term index definition in FaunaDB, looking up a user by email address. Note that this index includes a uniqueness constraint, so attempting to create multiple users with duplicate email addresses will fail:

q.CreateIndex(
    {
      name: "users_by_email",
      source: q.Class("users"),
      terms: [{ field: ["data", "email"] }],
      unique: true
    })

Here is a query using that index to load the user with address “foo@example.com”:

q.Paginate(q.Match(q.Index("users_by_email"), "foo@example.com"))

The result of this query is a reference to the user record with the matching email address. Because of the uniqueness constraint, there can only be one result. Other use cases, such as listing customers by postal code, could return multiple results for a term. Queries that return multiple results from a single term are also efficient, as the values within a term are stored together. In the next section we’ll talk about ordering results within a term.
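The behavior of the uniqueness constraint can be simulated in a few lines: before writing a record, check whether the term already maps to another record. This is a sketch of the observable behavior, not FaunaDB's implementation:

```javascript
// Simulation of a unique term index: a write whose term is already
// present is rejected, so at most one record exists per email.
const emailIndex = new Map(); // email → user id

function createUser(id, email) {
  if (emailIndex.has(email)) {
    throw new Error(`unique constraint violated: ${email}`);
  }
  emailIndex.set(email, id);
  return id;
}

createUser("u1", "foo@example.com");
let failed = false;
try {
  createUser("u2", "foo@example.com"); // duplicate email
} catch (e) {
  failed = true;
}
console.log(failed); // → true
```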

Because terms are distributed evenly across the cluster, but the data for each term is stored together, this pattern avoids hot spots while keeping cluster crosstalk to a minimum. FaunaDB’s index partitioning spreads the load across the cluster, so there’s no limit to the number of results that a term can contain.

Using Range Queries

Here is an example of a range query using an index to load all blog posts by their publishing date. The index definition does not specify a term, so all records in the class are indexed under the class ref. At FaunaDB we like to call this a “class index” because it allows pagination of all members of the class. In this case, the covered values we’ll paginate over are stored in the “published_on” field. It’s also possible to omit the value specification as well as the term, in which case you can still paginate over all members of the class.

q.CreateIndex(
    {
      name: "posts_by_date",
      source: q.Class("posts"),
      values: [{ field: ["data", "published_on"] }]
    })

To query this index for the most recent posts, we retrieve the last page by date. Note the synthetic cursor passed as the second argument to Paginate: `{before: null}` refers to the position after the last page, so we’ll get the last page here, along with a marker to the page before it.

q.Paginate(q.Match(q.Index("posts_by_date")), {before: null})

Our results look something like this:

{
  "before": [
    1502539900905543,
    q.Ref("classes/posts/175743538973114881")
  ],
  "after": [
    null
  ],
  "data": [
    1502539900905344,
    1502539900905545
  ]
}

If we use the new “before” cursor (a combination of a timestamp and a FaunaDB Ref) in place of `null` in the query we just issued, we’ll get the previous page of results. Requesting the page of older results looks like this:

q.Paginate(q.Match(q.Index("posts_by_date")), {before: [
    1502539900905543,
    q.Ref("classes/posts/175743538973114881")
  ]})

The before cursor comes directly from the previous response, and we can continue to chain requests, using each response’s “before” cursor. Recall that in this example we are browsing blog posts, most recent first. If we wanted to traverse the index in the standard order, we could make our first request without a synthetic cursor, and then use the “after” cursor to chain each page to the next. See the pagination documentation for details.
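The cursor-chaining loop can be sketched over a plain sorted array, with a cursor that is simply the value at the page boundary. The shapes here are deliberately simplified relative to FaunaDB's `[value, ref]` cursors:

```javascript
// Sketch of before-cursor pagination over sorted covered values.
// A cursor of null means "the position after the last page".
const values = [10, 20, 30, 40, 50, 60, 70];

function pageBefore(sorted, cursor, size) {
  const end = cursor === null ? sorted.length : sorted.findIndex(v => v >= cursor);
  const start = Math.max(0, end - size);
  return {
    data: sorted.slice(start, end),
    before: start > 0 ? sorted[start] : null, // cursor for the previous page
  };
}

let page = pageBefore(values, null, 3);
console.log(page.data); // → [ 50, 60, 70 ]
page = pageBefore(values, page.before, 3);
console.log(page.data); // → [ 20, 30, 40 ]
```

Each response carries the cursor needed for the next request, so walking the whole index is a simple loop that stops when `before` comes back null.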

In my development career, I’ve always found the best way to learn a new pagination API is to play with it by hand. To facilitate this, I find it’s easier to work with small pages. To set the page size smaller, try adding a `{size: 3}` option to your Paginate call, and then play around with synthetic cursors like `{before: null}` and `{after: 0}`.

Hybrid Term and Range Queries

FaunaDB’s power really shines when you combine both approaches, and break up a range index into logical segments. The FaunaDB index system groups items by term, and within a term sorts them by value. For instance, you can index all posts in a topic by date within their topic. That index definition would look like this:

q.CreateIndex(
    {
      name: "posts_by_topic_and_date",
      source: q.Class("posts"),
      terms: [{ field: ["data", "topic"] }],
      values: [{ field: ["data", "published_on"] }]
    })

This allows efficient retrieval of posts from any particular topic, and an unlimited number of topics. The number of posts in the popular topics won’t impact the performance of queries in the less popular topics, and the whole system scales fine no matter how many topics exist.
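The hybrid layout amounts to a map from term to an independently sorted list of covered values, which is why one hot topic can't slow down queries on the others. A sketch with illustrative data:

```javascript
// Sketch of the hybrid layout: term (topic) → sorted covered values
// (publish timestamps). Each topic's list is looked up by term, then
// scanned as a range, independently of every other topic.
const postsByTopic = new Map([
  ["databases", [1001, 1005, 1009]], // sorted publish timestamps
  ["cooking", [1002, 1003]],
]);

function latestInTopic(topic, size) {
  const sorted = postsByTopic.get(topic) || [];
  return sorted.slice(-size); // last `size` values, i.e. most recent
}

console.log(latestInTopic("databases", 2)); // → [ 1005, 1009 ]
console.log(latestInTopic("gardening", 2)); // → []
```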

It’s also worth noting that both the terms and values of an index can be compound, pulling data from more than one field. Combined with the ability to index some covered values in reverse order, this allows for complex real-world indexing schemes.
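A compound sort with one component reversed is equivalent to a comparator that negates that component, as in this sketch (field names are illustrative):

```javascript
// Sketch of a compound sort: topic ascending, then published_on
// descending, by reversing the second component in the comparator.
const posts = [
  { topic: "a", published_on: 100 },
  { topic: "b", published_on: 300 },
  { topic: "a", published_on: 200 },
];

posts.sort((x, y) =>
  x.topic < y.topic ? -1 : x.topic > y.topic ? 1
  : y.published_on - x.published_on // reversed second component
);

console.log(posts.map(p => `${p.topic}:${p.published_on}`));
// → [ 'a:200', 'a:100', 'b:300' ]
```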

Other use cases for hybrid index styles include serving online catalogs, gathering features for machine learning, segmenting networks by city or neighborhood, browsing calendar event entries, etc.

If you made it this far, you’re probably ready to check out the reference documentation to the FaunaDB query API.