Putting a full power search engine in Ecto

Posted: 2024-09-25T03:14:05Z
Updated: 2024-09-25T05:20:08Z

Preface: Existing options for search

I'd like to start by not-so-briefly outlining the existing approaches to text search and their use-cases, primarily around Postgres.

Basic string comparisons

These methods are ideal for searching short strings such as product codes, addresses, or names:

  • LIKE and ILIKE - Operators that provide basic string and sub-string matching.
  • Fuzzy matching / fuzzystrmatch - Including Soundex, Metaphone, and Levenshtein distance.
  • Trigrams / pg_trgm - Measures the similarity of words based on 3-character segments.

They all have certain advantages when it comes to handling partial matches, misspellings, and phonetically similar names, but aren't suitable for searching longer text passages.
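
For illustration, here's roughly what the first and third of these look like from Ecto, assuming a hypothetical Product schema with a name field and the pg_trgm extension installed:

import Ecto.Query

search = "wrench"

# Case-insensitive substring match via ILIKE.
from(p in Product, where: ilike(p.name, ^"%#{search}%"))
|> Repo.all()

# Trigram similarity via pg_trgm, ordered by closeness to the search term.
from(p in Product,
  where: fragment("similarity(?, ?) > 0.3", p.name, ^search),
  order_by: [desc: fragment("similarity(?, ?)", p.name, ^search)]
)
|> Repo.all()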

Postgres Full Text Search

Postgres' Full Text Search is best characterized by its convenience-to-capability ratio. It accommodates basic document retrieval without the need to synchronize or Extract-Transform-Load (ETL) your data to another service, but falls short when it comes to more advanced query capabilities and ranking of results.
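
As a rough sketch of that convenience, here's a websearch-style query and rank from Ecto, assuming a hypothetical Article schema with a body column. (In practice you'd usually keep a precomputed tsvector column with a GIN index, but this shows the shape of it.)

import Ecto.Query

search = "stolen vehicle"

from(a in Article,
  where:
    fragment(
      "to_tsvector('english', ?) @@ websearch_to_tsquery('english', ?)",
      a.body,
      ^search
    ),
  order_by: [
    desc:
      fragment(
        "ts_rank(to_tsvector('english', ?), websearch_to_tsquery('english', ?))",
        a.body,
        ^search
      )
  ]
)
|> Repo.all()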

Dedicated search engines

Dedicated search engines are standalone applications characterized by the following:

  • They implement BM25 text search and ranking, the gold standard for search results (the formula is sketched after this list).
  • They support a variety of data types for querying, sorting, and analysis.
  • They make it reasonably straightforward to perform complex queries.
  • They maintain fast response times over massive datasets using specialized data structures.
  • They offer some form of horizontal scaling or sharding.
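
For reference, the standard BM25 formulation scores a document D against a query Q like so, where k_1 and b are tuning constants (typically around 1.2 and 0.75):

\mathrm{score}(D, Q) = \sum_{q_i \in Q} \mathrm{IDF}(q_i) \cdot \frac{f(q_i, D) \cdot (k_1 + 1)}{f(q_i, D) + k_1 \cdot \left(1 - b + b \cdot \frac{|D|}{\mathrm{avgdl}}\right)}

Here f(q_i, D) is how often the term q_i appears in D, |D| is the document's length, and avgdl is the average document length in the collection. Repeated terms boost the score with diminishing returns, rare terms count for more via IDF, and longer-than-average documents are penalized.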

Incumbent offerings in the search engine space include Apache Solr, Elasticsearch, and more recently OpenSearch. All of these are built on top of Apache Lucene and the JVM. A newer wave of search engines written in Rust has also come about; two prominent examples are Meilisearch (focused on simplicity) and Quickwit (focused on logs and object storage backing, and built on Tantivy).

A major drawback of search engines is that they must be kept synchronized with your database:

  • Your application must reflect any creation, update, or deletion in your database to the search engine.
  • Small and large data pushes to search engines often require separate code paths. Small updates call incremental procedures, whereas large updates often necessitate rebuilding search indexes entirely (see the sketch after this list).
  • Friction arises when you need to combine query functionality from both systems in the same context.
  • You must manage deploying and hosting any search engine, in addition to your existing infrastructure.
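
To make the first two points concrete, here's a minimal sketch of the usual dual-write approach, assuming Call has a standard changeset/2. MyApp.SearchClient and its index_document/3, delete_document/2, and bulk_index/2 functions are hypothetical stand-ins for whatever client library your search engine provides:

defmodule MyApp.CallSync do
  alias MyApp.{Repo, Call, SearchClient}

  # Incremental path: every write must also be mirrored into the index.
  def create_call(attrs) do
    with {:ok, call} <- %Call{} |> Call.changeset(attrs) |> Repo.insert() do
      # Hypothetical client call - the real API depends on your engine.
      SearchClient.index_document("calls", call.id, %{transcript: call.transcript})
      {:ok, call}
    end
  end

  # Deletions have to be mirrored too, or the index serves stale results.
  def delete_call(%Call{} = call) do
    Repo.delete!(call)
    SearchClient.delete_document("calls", call.id)
  end

  # Bulk path: rebuilding the index wholesale is its own, separate procedure.
  def reindex_all do
    Repo.transaction(fn ->
      Call
      |> Repo.stream()
      |> Stream.chunk_every(500)
      |> Enum.each(fn batch ->
        docs = Enum.map(batch, &%{id: &1.id, transcript: &1.transcript})
        SearchClient.bulk_index("calls", docs)
      end)
    end)
  end
end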

Introducing ParadeDB (in Ecto)

ParadeDB is a set of extensions that add pretty amazing search and analytics features to Postgres. In particular, ParadeDB embeds Tantivy as an extension via pgrx.

A few weeks ago I began exploring what it’d take to support ParadeDB in Ecto. My primary goal was to figure out how much work it'd take, and what the developer experience would feel like. I've created an initial implementation and sample project to demonstrate. The sample project contains transcripts of publicly available police scanner calls generated by OpenAI's Whisper.

A simple search query for all calls with the word "drive" or "driving" in them would look like this:

from(
  c in Call,
  search: parse(c, "transcript:drive")
) |> Repo.all()

The above example uses ParadeDB's mini query language, which can be accessed via parse/2. More advanced and structured queries are also available.

We could limit our search query to calls between 5 and 10 seconds long, and retrieve fewer fields using Ecto's existing select/3:

from(
  c in Call,
  search: parse(c, "transcript:drive"),
  search: int4range(c.call_length, 5, 10, "[]"),
  select: {c.id, c.transcript}
) |> Repo.all()
[
  {423,
   "Ambulance 725 probably was responding interperson from a fall. PLS, 15,320 Pine Orchard Drive, unit number 1 at Foxtrot."},
  {172,
   "Charlie here for truck inspection. We have on track transportation driving a white van with no logo. This is the first time. Please look out for us."},
  {37,
   "Morning, Sam. Go ahead. I have a hit and run on Woodfield and Deanna Drive and the Mary 2. I only have Mary 33 right now. All right. You can hold it."}
]

For a more maximal demonstration, here's a rather complex query:

slop = 1

query =
  from(
    c in Call,
    search: boolean([
      must: [
        parse(c, "transcript:ambulance"),
        disjunction_max([
          boost(parse(c, "transcript:(BLS ALS)"), 10),
          parse(c, "transcript:(fall)"),
          phrase(c.transcript, ["routine", "response"], ^slop)
        ])
      ]
    ])
  )

Repo.all(query)

This query will:

  • Return calls with the word "ambulance" in them
  • Include at least one of: "ALS", "BLS", "fall", or the phrase "routine response" (with its two words appearing within one word of each other).
  • Rank calls that mention "ALS" or "BLS" higher, having them appear first.
  • Map the results to %Call{} structs, just like any other Ecto query.

To show how the search query can compose with the rest of SQL and Ecto, let's narrow the results somewhat to suit this post:

query
|> join(:inner, [c], tg in assoc(c, :talk_group))
|> where([_, tg], ilike(tg.tag, "%dispatch%"))
#  work-alike to: where([c, _], c.call_length <= 20)
|> search([c, _], int4range(c.call_length, nil, 20, "(]"))
|> select([c, tg], %{
  id: c.id,
  text: c.transcript,
  duration: c.call_length,
  talk_group: tg.description
})
|> Repo.all()
[
  %{
    id: 170,
    text: "Trader Joe's, 6831 Wisconsin Avenue, Suite 400, Cross Street, Stanford Street, Decrease LLC, ALS 1, Paramedic in 706, Ambulance 741, Bravo, Response 7, Alpha 1, Box 061, 1524.",
    duration: 13,
    talk_group: "7A2 Dispatch"
  },
  %{
    id: 375,
    text: "Are the mailboxes 4302 Randolph Road, Crosstown, Veras, Miller Road, Interperson, Firm of Fall, BOS, Ambulance 705, Response 701, Vox Air 21-2, 1545.",
    duration: 11,
    talk_group: "7A2 Dispatch"
  }
]

If you're absolutely curious what the original audio sounds like, here you are: 170 and 375. Be aware that they start with loud alert tones.

Overall I'm quite pleased with the results thus far!

  • Search indexes are handily created in migration files (a sketch follows this list). ParadeDB transparently updates the search index when rows are inserted, updated, or deleted, while retaining ACID compliance. This obviates the need to maintain any synchronization or ETL pipeline between your database and search service.
  • Search queries compose quite fluently with the rest of SQL. This flexibility does introduce some performance considerations, but it's nice to even have the discretion available.
  • Ecto provides tons of features that search engine clients often lack, making for an unparalleled (to my knowledge) developer experience.
  • Although not demonstrated here, it's possible to perform search queries across separate indexes and JOIN them as you would any other SQL query.
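
To give a rough idea of what the first point looks like in practice, here's a hedged migration sketch. The CREATE INDEX ... USING bm25 form shown here matches recent ParadeDB releases, but the exact statement (and which fields you index) varies by version, so treat it as an approximation rather than copy-paste material:

defmodule MyApp.Repo.Migrations.AddCallsSearchIndex do
  use Ecto.Migration

  def up do
    # Approximate ParadeDB BM25 index DDL - consult the ParadeDB docs
    # for the exact syntax your version expects.
    execute """
    CREATE INDEX calls_search_idx ON calls
    USING bm25 (id, transcript, call_length)
    WITH (key_field = 'id')
    """
  end

  def down do
    execute "DROP INDEX calls_search_idx"
  end
end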

Current implementation and next steps

The current Ecto implementation needs a lot more work before it’s production-ready. ParadeDB already has several especially compelling features that I’ve yet to expose:

  • BM25 scores can be returned from ranked queries.
  • Hybrid search is supported when combined with pgvector and AI-generated embeddings.
  • Aggregates and facets are available, although limited to enterprise customers for now. (This will probably change at some point)
  • ParadeDB 0.10.0 just dropped, and adds lots of optimizations and features I've not been able to touch yet.

There’s also lots more work in getting the implementation 'generally correct', well tested, and working with all of Ecto’s existing features.

Looking forward

My tentative plan's to continue working on this with the goal of maintaining a downstream fork of :ecto and :ecto_sql others can use. That said, I’m just one person, so it’ll probably be a while before I’m able to fully realize that. If any folks are interested in contributing to accelerate the process, drop a PR, or message me on the Elixir Slack or Discord, @Moosieus.

ParadeDB itself is only about a year old, so it's early days yet. The core team's small but talented, and they've fostered a community with lots of active contributions. Tantivy itself is also seeing continued improvement which will benefit ParadeDB in turn.

The code I'm working on is located here: moosieus/ecto and moosieus/ecto_sql.

Q&A

Why not just use fragments?

ParadeDB introduces the need to compose its own queries within Ecto, which is a more comprehensive problem than fragments can address.

Doesn't Postgrex support extensions?

Postgrex extensions allow users to encode and decode additional data types. ParadeDB's API uses Postgres' existing data types, so the extension API isn't of use here.

Wouldn't a fork miss out on updates from Ecto?

It'd be a downstream fork, meaning changes from Ecto would be merged in and published regularly. The versions might have to look a bit funny though, something like 3.10.2-paradedb-v1. I haven't fully thought this part out.