Distributed search with Apache Cassandra and/or Apache Solr?

One is a noSQL solution, the other one is primarily a search engine, each also adding up search or scaling functions thus increasing their potential use cases. Talking about the integration of Cassandra and Solr and how is best to perform a distributed search, we (Valentina Crisan & Radu Gheorghe) decided that this might be a good topic for one of Bucharest Big Data meetup presentations thus on June 7th we will debate the use cases in which these 2 powerful solutions might be best on their own or integrated.        

Below interview article is the base of the presentation and discussion in the upcoming June 7th Bucharest Big Data meetup: Distributed search with Apache Cassandra and/or Apache Solr.  

Best use cases for Cassandra + Solr/Lucene or Solr + Cassandra/noSQL?

[Valentina] From Cassandra perspective is pretty simple, any use case that needs as well – besides its main use cases  – queries with full-text search capabilities ( facets, highlights, geo-location…), also requiring one interface (CQL) for launching all queries, then Solr integration or any Lucene index will be the solution. The main data store would be Cassandra and the indexes would be handled by Lucene/Solr engine. Please also see below answer about Cassandra search capabilities.     

[Radu] Solr is great for either full text search (think books, blog posts) or structured search (products, logs, events) and for doing analytics on top of those search results (facets on an online shop, log/metric analytics, etc). Backups aside, you would benefit from a dependable datastore like Cassandra to be the source of truth. Also, some use-cases have big field values that don’t need to be shown with results (e.g. Email attachments), so they might as well be stored in Cassandra and only pulled when the user selects that result – leaving Solr with only the index to deal with for that field.

What are the optimal use cases for Apache Cassandra?

[Valentina]  When your requirement is to have very heavy write, distributed and highly scalable system and you might want as well to have quite responsive reporting system on top of that stored data, then Cassandra is your answer. Consider use case of Web analytics where log data is stored for each request and you want to built analytical platform around it to count hits by hour, by browser, by IP, etc in real time manner. Thumb rule of performing real time analytics is that you should have your data already calculated and should persist in the database. If you know the reports you want to show in real time, you can have your schema defined accordingly and generate your data at real time. Cassandra is not an OLAP system but can handle simple queries with a pre-defined data model and pre-aggregation of data, supporting integration with Spark for more complex analytics: joins, groupby,window functions.

Most of Cassandra uses cases are around storing data from next domains: Product Catalog/Playlist, Recommendation/Personalization Engine, Sensor Data/Internet of Things, Messaging, Fraud Detection.

Which are the Cassandra search capabilities?

[Valentina]  Traditionally Cassandra was not designed to support search cases (it can only do matches on partition key), but since v3.4 there has been some improvement, through the introduction of SASI feature (see below).

In general the standard option in Cassandra when dealing with search type of queries is to use an index, Cassandra supporting a pluggable interface for defining indexes, and thus there are multiple implementations available:

  • The native solution: SASI ( SSTable-Attached Secondary Index), implemented over the native secondary indexes in Cassandra, this feature is not only an extension of the native secondary indexes in Cassandra, but an improved one in the sense that shortens access time to data versus the standard indexes implementation. The are still some things to consider with SASI when compared with a search engine ( see Ref 1):
    • It is not possible to ask for total ordering of the result (even when LIMIT clause is used), SASI returns result in token range order;
    • SASI allows full text search, but there is no scoring applied to matched terms;
    • being a secondary index, SASI requires 2 passes on disk to fetch data: first to get the index data and another read in order to fetch the rest of the data;
    • native secondary indexes in Cassandra are locally stored in each node, thus all searches will need to hit all cluster nodes in order to retrieve the final data – even if sometimes this means only a node will have the relevant information;

  Through SASI Apache Cassandra supports:

– search queries including more than 1 column (multiple columns indexes)

– for data types (text, varchar & ascii) supports queries on:

– prefix using the LIKE ‘prefix%’ syntax

– suffix using the LIKE ‘%suffix’ syntax

– substring using the LIKE ‘%substring%’ syntax

exact match using equality (=)

– For other data types (int, date, uuid …) supports queries on:

– equality (=)

              – range ( <, ≤, >, ≥ )

  • Stratio’s Cassandra Lucene Index:  is an open plugin for Apache Cassandra that extends its index functionality to provide near real time search such as ElasticSearch or Solr, including full text search capabilities and free multivariable, geospatial search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. Through Cassandra Lucene index next queries are supported (see Ref 2):
    • Full text search (language-aware analysis, wildcard, fuzzy, regexp)
    • Boolean search (and, or, not)
    • Sorting by relevance, column value, and distance
    • Geospatial indexing (points, lines, polygons and their multiparts)
    • Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
    • Geospatial operations (intersects, contains, is within)
    • Bitemporal search (valid and transaction time durations)
  • DSE ( Datastax Enterprise Search  – commercial edition): it’s a Datastax proprietary implementation ( you can get access to it only through DSE ) which uses Apache Solr and Lucene on top of Cassandra. It maintains search cores co-located with each Cassandra node, updating them as part of Cassandra’s write path so that data is indexed in real-time. In addition to supporting the text and geospatial search features, DSE Search also supports advanced capabilities like search faceting.
  • Other solutions: 
    • Of course you can integrate separately Cassandra and Solr (application layer integration) by using an analytical tool that joins the data from both systems and a solution like Kafka in order to synchronize the data. This has nothing to do with Cassandra search capabilities, just pointing the option. Spark also could be use as an aggregation and analytics layer for the data.       

Why would anyone, in your opinion, use Cassandra for search instead of just using Solr?

[Valentina]  The best implementation always depends on the use case and its history. If the primary use case is building an OLTP system, storing the data reliably, writing it really fast in distributed environments with simple queries that usually rely on the partition key and just sporadically support more complex queries: text search, joins, groupby, moving average, then having Cassandra as a base system and use it’s SASI or Lucene index or even DSE Solr implementation plus Spark for full analytics might be a solution. Depending on the search complexity you can decide on the 3 alternatives provided by Cassandra.

If your use case is a heavy analytics one that includes as well evolved search capabilities then having an analytics tool that combines the fast retrieval of data in Cassandra (queries by partition) and the indexes in Solr might be a solution.    

So, depends which is your primary use case for the solution you’re trying to build. For sure comparing Cassandra and Solr as noSQL or search engines is not optimal since Cassandra was not meant to be search engine and Solr a noSQL system ( although since SolrCloud there have been use cases for Solr as a data storage system) thus your main requirements should be the ones that drive the solution design not just a bare comparison between these 2.

How can Solr be used as a data storage system (besides a search engine)? What features are available for data redundancy and consistency?

[Radu] Solr supports CRUD operations, though updates tend to be expensive (see below) and a get by ID is in fact a search by ID: a very cheap search, but usually slower than getting documents by primary key from a more traditional storage system – especially when you want N documents back.

For redundancy, replicas can be created on demand and if leader goes down, a replica is automatically promoted. Searches are round-robined between replicas, and because replication is based on the transaction log (on SolrCloud at least), we’re talking about the same load no matter if a shard is a leader or a replica. Lastly, active-passive cross-DC replication is supported.

As for consistency, Zookeeper holds the cluster state, and writes are acknowledged when all replicas write to transaction log: the leader writes a new document to its transaction log and replicates it to all the replicas, which index the same document in parallel.

What are the preferred types of input data for Solr use cases (e.g. documents)? Which type of data is the most optimal for its functionality?  

[Radu]  Structured documents (e.g. products) and/or free text (books, events) are ideal use-cases. Solr doesn’t work well with huge documents (GB) or huge number of fields (tens of thousands).

Some solutions owners do have a separate storage system connected to Solr. How the integration looks in this case (all data is replicated twice or only the indexes reside in Solr)?

[Radu]  Usually Solr indexes data (for searching) and some fields have docValues (for sorting and facets). Most also store the original value for fields where docValues aren’t enabled (so one could get the whole document without leaving Solr). Storing a field is also required for highlighting. If you need all these features, you need to replicate the data twice (the worst case is times:separate storage, docValues, stored fields. Plus the inverted index). If you don’t, you can skip some of those, and rely on the separate storage to get the original data. That said, fetching data from a different storage is often slower because it requires another [network] call.

How does data update look like in Solr? Is the data updated or rather added up (with different versions)?

[Radu]  There are three main options:

  • Index a new document over the old one. Versioning can be used here for concurrency control.
  • Update a value of a document. This will trigger Solr to fetch the document internally, apply the change and index the resulting document over the old one. You can still use versions here, either assigned by Solr or provided by you (e.g. a timestamp).
  • For fields that only have docValues (not indexed, not stored), an in-place update of that value is possible.

When might be ok to use Solr as an analytics tool?

[Radu]  When you filter (search) on data before analyzing, because that’s where Solr shines. For example, live dashboards on logs, events, metric data. Or faceting for product search (e.g. Amazon).

What are Solr’s capabilities when it comes to full text search?

[Radu] I can’t be exhaustive here, but let me go over the main categories:

  • Fast retrieval of top N documents matching a search. It can also go over all matches of a search if a cursor is used.
  • Capable and pluggable text analysis. For example, language detection, stemming, synonyms, ngrams and so on. This is all meant to break text into tokens so that “analysis” in your query matches “analyze” in your document, for example.
  • Sorting by a field works, of course, but Solr is also good at ranking results by a number of criteria. For example, per-field boosting (including boosting more exact matches over more lenient ones, like those that are stemmed), TF/IDF or BM25, value of a field, potentially with a multiplier. Multiple criteria can be combined as well, so you can really tweak your relevancy scores.
  • Highlighting matches. It doesn’t have to go over the text again, which is useful for large fields like books.

 

This discussion will be be continued on June 7th, Bucharest Big Data Meetup.  See you there. 

Ref 1: http://www.doanduyhai.com/blog/?p=2058#sasi_perf_benchmarks

Ref 2: https://github.com/Stratio/cassandra-lucene-index 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s