Distributed search with Apache Cassandra and/or Apache Solr?

One is a NoSQL solution, the other primarily a search engine, and each has been adding search or scaling features that widen its potential use cases. Talking about the integration of Cassandra and Solr and how best to perform a distributed search, we (Valentina Crisan & Radu Gheorghe) decided that this might be a good topic for one of the Bucharest Big Data meetup presentations, so on June 7th we will debate the use cases in which these two powerful solutions might work best on their own or integrated.

The interview below forms the basis of the presentation and discussion at the upcoming June 7th Bucharest Big Data meetup: Distributed search with Apache Cassandra and/or Apache Solr.

Best use cases for Cassandra + Solr/Lucene or Solr + Cassandra/noSQL?

[Valentina] From the Cassandra perspective it is pretty simple: any use case that, besides its main workload, also needs queries with full-text search capabilities (facets, highlighting, geo-location…) while keeping a single interface (CQL) for all queries will be served by a Solr integration or a Lucene index. The main data store would be Cassandra and the indexes would be handled by the Lucene/Solr engine. Please also see the answer below about Cassandra's search capabilities.

[Radu] Solr is great for either full-text search (think books, blog posts) or structured search (products, logs, events), and for doing analytics on top of those search results (facets on an online shop, log/metric analytics, etc). Backups aside, you would benefit from a dependable datastore like Cassandra as the source of truth. Also, some use cases have big field values that don't need to be shown with results (e.g. email attachments), so they might as well be stored in Cassandra and only pulled when the user selects that result – leaving Solr with only the index to deal with for that field.

What are the optimal use cases for Apache Cassandra?

[Valentina] When you need a write-heavy, distributed and highly scalable system, possibly with quite a responsive reporting layer on top of the stored data, then Cassandra is your answer. Consider a web analytics use case where log data is stored for each request and you want to build an analytical platform around it to count hits by hour, by browser, by IP, etc. in real time. The rule of thumb for real-time analytics is that your data should already be calculated and persisted in the database: if you know the reports you want to show in real time, you can define your schema accordingly and generate the data as it arrives. Cassandra is not an OLAP system, but it can handle simple queries with a pre-defined data model and pre-aggregated data, and it supports integration with Spark for more complex analytics: joins, group by, window functions.
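The pre-aggregation idea above can be sketched in CQL. This is a minimal illustration, assuming a hypothetical web analytics workload; the table and column names are made up, not from a real deployment:

```sql
-- Hypothetical pre-aggregated table: one counter per site/hour/browser.
CREATE TABLE hits_by_hour (
    site_id   text,
    hour      timestamp,
    browser   text,
    hit_count counter,
    PRIMARY KEY ((site_id, hour), browser)
);

-- Each incoming request increments the pre-computed counter,
-- so the "report" is already materialized at read time:
UPDATE hits_by_hour
   SET hit_count = hit_count + 1
 WHERE site_id = 'example.com'
   AND hour    = '2017-06-07 10:00:00'
   AND browser = 'Firefox';
```

Reading the hourly report is then a cheap query by partition key, which is exactly the access pattern Cassandra is built for.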

Most Cassandra use cases involve storing data in the following domains: product catalogs/playlists, recommendation/personalization engines, sensor data/Internet of Things, messaging, fraud detection.

Which are the Cassandra search capabilities?

[Valentina] Traditionally Cassandra was not designed to support search use cases (it can only match on the partition key), but since v3.4 there has been some improvement through the introduction of SASI (see below).

In general, the standard option in Cassandra for search-type queries is to use an index. Cassandra supports a pluggable interface for defining indexes, so there are multiple implementations available:

  • The native solution: SASI (SSTable-Attached Secondary Index), implemented on top of Cassandra's native secondary index interface. It is not just an extension of the native secondary indexes but an improvement, in the sense that it shortens access time to data versus the standard index implementation. There are still some things to consider with SASI when compared with a search engine (see Ref 1):
    • It is not possible to ask for total ordering of the result (even when LIMIT clause is used), SASI returns result in token range order;
    • SASI allows full text search, but there is no scoring applied to matched terms;
    • being a secondary index, SASI requires 2 passes on disk to fetch data: first to get the index data and another read in order to fetch the rest of the data;
    • native secondary indexes in Cassandra are stored locally on each node, thus all searches need to hit all cluster nodes in order to retrieve the final data – even if sometimes only one node has the relevant information;

  Through SASI, Apache Cassandra supports:

  • search queries including more than one column (multi-column indexes)
  • for text data types (text, varchar & ascii), queries on:
    • prefix, using the LIKE ‘prefix%’ syntax
    • suffix, using the LIKE ‘%suffix’ syntax
    • substring, using the LIKE ‘%substring%’ syntax
    • exact match, using equality (=)
  • for other data types (int, date, uuid …), queries on:
    • equality (=)
    • range ( <, ≤, >, ≥ )

  • Stratio’s Cassandra Lucene Index: an open-source plugin for Apache Cassandra that extends its index functionality to provide near-real-time search similar to Elasticsearch or Solr, including full-text search capabilities and free multivariable and geospatial search. It is achieved through an Apache Lucene based implementation of Cassandra secondary indexes, where each node of the cluster indexes its own data. The Cassandra Lucene index supports the following query types (see Ref 2):
    • Full text search (language-aware analysis, wildcard, fuzzy, regexp)
    • Boolean search (and, or, not)
    • Sorting by relevance, column value, and distance
    • Geospatial indexing (points, lines, polygons and their multiparts)
    • Geospatial transformations (bounding box, buffer, centroid, convex hull, union, difference, intersection)
    • Geospatial operations (intersects, contains, is within)
    • Bitemporal search (valid and transaction time durations)
  • DSE (DataStax Enterprise Search – commercial edition): a DataStax proprietary implementation (you can get access to it only through DSE) which uses Apache Solr and Lucene on top of Cassandra. It maintains search cores co-located with each Cassandra node, updating them as part of Cassandra's write path so that data is indexed in real time. In addition to text and geospatial search features, DSE Search also supports advanced capabilities like search faceting.
  • Other solutions: 
    • Of course you can integrate Cassandra and Solr separately (application-layer integration) by using an analytical tool that joins the data from both systems and a solution like Kafka to synchronize the data. This has nothing to do with Cassandra's search capabilities; we are just pointing out the option. Spark could also be used as an aggregation and analytics layer for the data.
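To make the native SASI option above concrete, here is a minimal CQL sketch; the table, columns and index name are hypothetical. Creating the index in CONTAINS mode is what enables the suffix and substring LIKE queries listed earlier (the default PREFIX mode supports only ‘prefix%’):

```sql
-- Hypothetical table; names are illustrative.
CREATE TABLE users (
    id        uuid PRIMARY KEY,
    last_name text,
    age       int
);

-- SASI index in CONTAINS mode, allowing prefix, suffix and substring search.
CREATE CUSTOM INDEX users_last_name_idx ON users (last_name)
USING 'org.apache.cassandra.index.sasi.SASIIndex'
WITH OPTIONS = { 'mode': 'CONTAINS' };

SELECT * FROM users WHERE last_name LIKE 'Smi%';   -- prefix
SELECT * FROM users WHERE last_name LIKE '%ith';   -- suffix
SELECT * FROM users WHERE last_name LIKE '%mit%';  -- substring
```

Keep in mind the caveats above: results come back in token range order, without scoring, and the query still fans out to all nodes.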

Why would anyone, in your opinion, use Cassandra for search instead of just using Solr?

[Valentina] The best implementation always depends on the use case and its history. If the primary use case is building an OLTP system – storing the data reliably, writing it really fast in distributed environments, with simple queries that usually rely on the partition key and only sporadically need more complex ones (text search, joins, group by, moving averages) – then having Cassandra as the base system and using its SASI or Lucene index, or even the DSE Solr implementation, plus Spark for full analytics, might be a solution. Depending on the search complexity you can decide between the three alternatives listed above.

If your use case is a heavy analytics one that also needs advanced search capabilities, then an analytics tool that combines the fast retrieval of data in Cassandra (queries by partition) with the indexes in Solr might be a solution.

So, it depends which is the primary use case for the solution you're trying to build. Comparing Cassandra and Solr head-to-head as NoSQL stores or search engines is certainly not optimal, since Cassandra was not meant to be a search engine nor Solr a NoSQL system (although since SolrCloud there have been use cases for Solr as a data storage system); your main requirements should drive the solution design, not a bare comparison between these two.

How can Solr be used as a data storage system (besides a search engine)? What features are available for data redundancy and consistency?

[Radu] Solr supports CRUD operations, though updates tend to be expensive (see below) and a get by ID is in fact a search by ID: a very cheap search, but usually slower than getting documents by primary key from a more traditional storage system – especially when you want N documents back.

For redundancy, replicas can be created on demand, and if the leader goes down, a replica is automatically promoted. Searches are round-robined between replicas, and because replication is based on the transaction log (in SolrCloud at least), we're talking about the same load no matter whether a shard is a leader or a replica. Lastly, active-passive cross-DC replication is supported.

As for consistency, Zookeeper holds the cluster state, and writes are acknowledged when all replicas write to transaction log: the leader writes a new document to its transaction log and replicates it to all the replicas, which index the same document in parallel.

What are the preferred types of input data for Solr use cases (e.g. documents)? Which type of data is the most optimal for its functionality?  

[Radu]  Structured documents (e.g. products) and/or free text (books, events) are ideal use-cases. Solr doesn’t work well with huge documents (GB) or huge number of fields (tens of thousands).

Some solution owners have a separate storage system connected to Solr. How does the integration look in this case (is all data replicated twice, or do only the indexes reside in Solr)?

[Radu] Usually Solr indexes data (for searching) and some fields have docValues (for sorting and facets). Most setups also store the original value for fields where docValues aren't enabled (so one can get the whole document without leaving Solr). Storing a field is also required for highlighting. If you need all these features, you end up keeping the data several times: in the separate storage, in docValues and in stored fields, plus the inverted index. If you don't, you can skip some of those and rely on the separate storage to get the original data. That said, fetching data from a different storage is often slower because it requires another [network] call.
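These trade-offs show up directly in the schema. A hedged sketch of a schema.xml fragment, with hypothetical field names:

```xml
<!-- Illustrative fields only; names are made up. -->

<!-- indexed + stored: searchable, and Solr can return/highlight it -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<!-- docValues without stored: sortable/facetable inside Solr, while the
     original value lives only in the external store (e.g. Cassandra) -->
<field name="category" type="string" indexed="true" stored="false"
       docValues="true"/>
```

Each option you enable is one more copy of the data Solr has to keep; each one you disable is a feature you trade for a lookup in the separate storage.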

What does a data update look like in Solr? Is the data updated in place or rather added (with different versions)?

[Radu]  There are three main options:

  • Index a new document over the old one. Versioning can be used here for concurrency control.
  • Update a value of a document. This will trigger Solr to fetch the document internally, apply the change and index the resulting document over the old one. You can still use versions here, either assigned by Solr or provided by you (e.g. a timestamp).
  • For fields that only have docValues (not indexed, not stored), an in-place update of that value is possible.
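For illustration, here is what the second option (an atomic update) might look like as a request body POSTed to Solr's update handler; the document id, field names and version value are hypothetical. "set" overwrites a field, "inc" increments one (in place when the field is docValues-only), and _version_ enables optimistic concurrency control:

```json
[
  {
    "id": "book-42",
    "price": { "set": 9.99 },
    "views": { "inc": 1 },
    "_version_": 1563234567890123456
  }
]
```

If the stored document's version no longer matches, Solr rejects the update, which is how the concurrency control mentioned above works in practice.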

When might it be OK to use Solr as an analytics tool?

[Radu]  When you filter (search) on data before analyzing, because that’s where Solr shines. For example, live dashboards on logs, events, metric data. Or faceting for product search (e.g. Amazon).
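The "filter first, then analyze" pattern can be sketched with Solr's JSON Facet API. This is an illustrative request body, assuming a hypothetical log collection with level and timestamp fields: it narrows down to ERROR events, then counts them per hour over the last day:

```json
{
  "query": "level:ERROR",
  "facet": {
    "per_hour": {
      "type": "range",
      "field": "timestamp",
      "start": "NOW-1DAY",
      "end": "NOW",
      "gap": "+1HOUR"
    }
  }
}
```

Because the search narrows the dataset before the aggregation runs, this kind of dashboard query stays fast even on large collections.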

What are Solr’s capabilities when it comes to full text search?

[Radu] I can’t be exhaustive here, but let me go over the main categories:

  • Fast retrieval of top N documents matching a search. It can also go over all matches of a search if a cursor is used.
  • Capable and pluggable text analysis. For example, language detection, stemming, synonyms, ngrams and so on. This is all meant to break text into tokens so that “analysis” in your query matches “analyze” in your document, for example.
  • Sorting by a field works, of course, but Solr is also good at ranking results by a number of criteria. For example, per-field boosting (including boosting more exact matches over more lenient ones, like those that are stemmed), TF/IDF or BM25, value of a field, potentially with a multiplier. Multiple criteria can be combined as well, so you can really tweak your relevancy scores.
  • Highlighting matches. It doesn’t have to go over the text again, which is useful for large fields like books.
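The ranking criteria above can be combined in a single request. A hedged example of edismax query parameters, with made-up field names and boost values:

```
q=distributed search
defType=edismax
qf=title^5 body
bf=log(popularity)
```

Here matches in title weigh five times more than matches in body, and log(popularity) adds a score boost based on a numeric field, so popular documents rank slightly higher among equally good text matches.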


This discussion will be continued on June 7th at the Bucharest Big Data Meetup. See you there!

Ref 1: http://www.doanduyhai.com/blog/?p=2058#sasi_perf_benchmarks

Ref 2: https://github.com/Stratio/cassandra-lucene-index 

Modeling your data for analytics with Apache Cassandra and Spark SQL

This session is intended for those looking to better understand how to model data for queries in Apache Cassandra and in Apache Cassandra + Spark SQL. The session will help you understand the concept of secondary indexes and materialized views in Cassandra and the way Spark SQL can be used in conjunction with Cassandra in order to run complex analytical queries. We assume you are familiar with Cassandra & Spark SQL (but it's not mandatory, since we will explain the basic concepts behind data modeling in Cassandra and Spark SQL). The whole workshop will be run in Cassandra Query Language and SQL, and we will use Zeppelin as the interface towards Cassandra + Spark SQL.

Date: 10 June, 9:00 – 13:30 – this workshop will be rescheduled
Trainers: Felix Crisan, Valentina Crisan
Location: eSolutions Academy, Budişteanu Office Building, strada General Constantin Budişteanu Nr. 28C, etaj 1, Sector 1, Bucureşti.
Number of places:  15
Price: 150 RON (including VAT)

Check out the agenda and register for future session here.

Let’s talk BigQuery

Not all big data projects need a complex architecture and an engineering team in order to start making sense of the data. So what should you do if you need some good old analysis and just want to get started right away? Assume, for example, that you're part of a small company starting up a project and you need to analyze lots of data without spending additional time planning an architecture, hiring an architect/engineer or managing an infrastructure – you just need to see through your data and make sense of it. This is where Google's BigQuery comes into play (of course there are many other potential uses, but let's stick with this for the moment). Called (a bit pretentiously, maybe) an Enterprise Cloud Data Warehouse solution – thus, in my opinion, scaring off many potential users upfront – BigQuery in fact helps many to at least quick-start their path in the big data world.

As part of the preparation of our next workshop, Data Analytics with BigQuery, we interviewed Gabriel Preda – trainer for the workshop but most importantly enthusiastic user of the solution for the last couple of years – to give us a glimpse of what we should expect from this solution.   

Why BigQuery, why did it make sense to you?

Usually in a startup each person wears more than one hat. You put on the sysadmin hat… you're the sysadmin. Later you might need to wear the hat which says „innovation”… and start collecting GBs of daily data and, of course, process them in a timely fashion. Being short on people, it was clear that we needed a SaaS solution.

In which use cases should we use BigQuery (analytical, data migration, cloud requirements)?

BigQuery is designed for OLAP (Online Analytical Processing) or BI. You should not use BigQuery for OLTP. The best use cases for BigQuery are ad hoc and trial-and-error interactive queries of large datasets for quick analysis and troubleshooting.
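Such an ad hoc query is just standard SQL. A minimal sketch, assuming a hypothetical project, dataset and table of web request logs:

```sql
-- Illustrative only: project, dataset, table and column names are made up.
SELECT
  browser,
  COUNT(*) AS hits
FROM `my_project.weblogs.requests`
WHERE DATE(request_time) = '2017-06-01'
GROUP BY browser
ORDER BY hits DESC
LIMIT 10;
```

No cluster to provision, no index to build: you point the query at the table and BigQuery scans it for you, which is what makes the trial-and-error style of analysis practical.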

Can you list the best fit scenarios for it?

I have used it successfully for in-house analytics solutions. But I think it's one of the best candidates on the market for data fishing because of its ability to perform ad hoc queries on large amounts of data…

Is it more feasible to be used in projects where the data has been already natively stored in the cloud (e.g. Google Cloud Storage)?

Data transfer towards BigQuery is free. You might have some costs in transforming the data, as there are some requirements on the data BigQuery can ingest. If you already have data in CSV or Avro (and soon Parquet) you can import it directly.

Which are the BigQuery alternatives/competitors?

I don’t know what to say about this… it is quite a unique beast of a product!

Can you control where your data is, in case you have some requirements regarding location of your data?

You can choose between the US and the EU. But that is where it ends. Though there is some awesome news… there is an experimental extension to the BigQuery client that offers client-side encryption (homomorphic encryption) for a subset of query types… that is: you can encrypt your data, upload the encrypted data to BigQuery, run queries, fetch the results and decrypt them locally. It's magic!

How do you visualize the results of the analysis or the correlations of the data in BigQuery?

In the worst-case scenario, when you can't use the existing integrations, you can retrieve the results and use any visualization tool you are accustomed to. There are now a lot of available integrations, like Tableau, Qlik, Talend, Informatica, SnapLogic, newcomers like Chartio, or even free & open-source BI tools like Metabase. There is also a Google solution (for now in beta) called Data Studio which covers more than BigQuery. I'll do my best to add details about Data Studio during the workshop.

Interview by: Valentina Crisan – bigdata.ro