Search NoSQL Solution

Situation so far (5 weeks in):

1 December 2010

First Idea: Solr

Why: We have used it before for another project. Solr 4 provides simple grouping in the form of group.field / group.sort. Solr 4 also provides faceted search, including a really nice facet.pivot method that we can take advantage of to provide decision-tree-style information. It has a nice HTTP interface. Solr supports sharding and replication for splitting up indexes (but there are quite a few gotchas).
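
To make that concrete, here is roughly what one of these queries looks like, sketched in Python with the requests library. The host, core, and field names (route_id, price, the origin fields) are made up for illustration, not our actual schema.

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/select"   # hypothetical host/core

    params = {
        "q": "origin_country:UK",                       # hypothetical field
        "wt": "json",
        # grouping: one group per route, cheapest flight first
        "group": "true",
        "group.field": "route_id",                      # hypothetical field
        "group.sort": "price asc",
        # pivot facet for the origin decision tree
        "facet": "true",
        "facet.pivot": "continent,country,region,city,airport",
    }

    response = requests.get(SOLR_SELECT, params=params)
    print(response.json()["grouped"]["route_id"]["matches"])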

Links:
General Solr information: Solr
Solr field collapsing / grouping: Solr FieldCollapsing
Solr facets: Solr Facets

How is that shaping up?
Well, that's a hard one to call. During indexing we have run into a lot of problems that we didn't anticipate (we have never used Solr with this many documents before), and they don't seem to have easy answers.

Issues:

  1. With an average churn of 25 percent of documents (price changes), we have an awful lot of updates. Lucene really isn't designed for this kind of thing, as it doesn't do updates in place: instead, the existing document in the index is marked as deleted and a new one is added, and a lookup table maps the old document IDs to the new ones for the deleted documents. This means that the index grows as you update, and eventually Lucene needs to clean up the index (Solr will eventually issue a commit with expungeDeletes), which locks out the update thread. Also, for each index Solr will only use one physical CPU core to do this job (if anyone reads this and can correct me on this, please do). This means that the process that passes documents to Solr needs to wait for it to finish, which leads to delay in the update queue. By sharding the index across different cores and different machines we can mostly alleviate this problem and stay up to date with data changes (to within 10 minutes). There is a sketch of this update/commit flow after the list.
  2. Query performance: we render each page of our example website with one query. This query uses a group (to get the minimum price) and three facet.pivots (the origin tree [continent -> country -> region -> city -> airport], the destination tree, and airline facets per destination). Performance is heavily cache-dependent, which is a problem because of the number of commits we need to do. We've had to tune the caches aggressively and warm the searchers with sample queries to get adequate performance (see the warming sketch after this list). Even so, each query is still an aggregate over lots of documents, so we are going to need to do a lot of performance testing (probably using Apache Bench).
  3. Sharding: we'd like to use a load-balancing sharding setup (based on a hash of the document ID, for example) but there are problems. First, you can't aggregate over shards, which is a big killer for us. Therefore the plan is to shard by some logical separation of the data, splitting by origin country for example; this is the way we are going forward at the moment. It should at least give us the option of adding more replicated shards for the heavily used origin countries. On average this would give us a shard index size of around 8 million documents, which seems quite manageable. The USA is a problem, as we have 100+ million documents representing flights to deal with, but we can probably split into USA -> USA flights and USA -> rest-of-the-world flights as a starting point (the routing sketch after the list shows this split).
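
For issue 1, a minimal sketch of the update path, posting batched documents to Solr's XML update handler and then issuing the cleanup commit. The URL is a placeholder, not our production setup.

    import requests
    from xml.sax.saxutils import escape

    UPDATE_URL = "http://localhost:8983/solr/update"    # hypothetical host/core

    def post_batch(docs):
        # Solr's XML update format: a single <add> can carry many <doc>s,
        # so we batch instead of posting one document per request
        body = "<add>%s</add>" % "".join(
            "<doc>%s</doc>" % "".join(
                '<field name="%s">%s</field>' % (name, escape(str(value)))
                for name, value in doc.items())
            for doc in docs)
        requests.post(UPDATE_URL, data=body.encode("utf-8"),
                      headers={"Content-Type": "text/xml"})

    def commit_and_expunge():
        # expungeDeletes forces Solr to physically remove deleted documents
        # during the merge: this is the cleanup that blocks the update thread
        requests.post(UPDATE_URL, data='<commit expungeDeletes="true"/>',
                      headers={"Content-Type": "text/xml"})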
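For issue 2, a rough sketch of the warming we mean: after each commit, replay a few representative page queries so the new searcher's caches are populated before real traffic hits it. Queries and field names are placeholders.

    import requests

    SOLR_SELECT = "http://localhost:8983/solr/select"   # hypothetical host/core

    # a handful of representative page queries; field names are invented
    WARMING_QUERIES = [
        {"q": "origin_country:%s" % country,
         "group": "true", "group.field": "route_id", "group.sort": "price asc",
         "facet": "true", "facet.pivot": "continent,country,region,city,airport"}
        for country in ("UK", "DE", "FR")
    ]

    def warm_searcher():
        # fire the sample queries so the caches are hot before user queries
        for params in WARMING_QUERIES:
            requests.get(SOLR_SELECT, params=dict(params, wt="json"))

(Solr can also do this internally with newSearcher listeners in solrconfig.xml. For the load testing, something like ab -n 2000 -c 20 '<query url>' gives a quick throughput number.)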
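And for issue 3, a sketch of the origin-country routing with the USA split in two. All host and core names are invented.

    import requests

    # origin country -> the core holding that country's flights (invented names)
    CORES = {
        "UK": "solr1:8983/solr/flights_uk",
        "DE": "solr1:8983/solr/flights_de",
        "US_DOMESTIC": "solr2:8983/solr/flights_us_dom",
        "US_INTERNATIONAL": "solr2:8983/solr/flights_us_intl",
    }

    def core_for(doc):
        # route a document to its shard; the USA gets split in two
        if doc["origin_country"] == "US":
            key = "US_DOMESTIC" if doc["dest_country"] == "US" else "US_INTERNATIONAL"
        else:
            key = doc["origin_country"]
        return "http://" + CORES[key]

    # plain queries can still fan out over every shard with the shards
    # parameter, but grouped / pivot-facet queries cannot aggregate across
    # shards, so those must stay on a single origin-country core
    params = {"q": "airline:BA", "shards": ",".join(CORES.values()), "wt": "json"}
    requests.get("http://solr1:8983/solr/flights_uk/select", params=params)
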
Alternatives:
  • I've been doing some research and I think MongoDB sounds like a better fit for our requirements. It has a really nice sharding model, and you can do some really funky things with map/reduce to get aggregates. We could probably also compute the aggregates at commit time (every 10 minutes or so) and then use the results to avoid aggregating on the fly; a rough sketch of that idea follows. More about MongoDB in a future post.
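
A very rough sketch of that idea: precompute the minimum price per route with a map/reduce and serve page queries from the output collection. Database, collection, and field names are invented, and this assumes the map_reduce helper that older pymongo drivers expose (it was removed in pymongo 4).

    from pymongo import MongoClient
    from bson.code import Code

    db = MongoClient()["flights"]          # hypothetical database

    mapper = Code("""
        function () { emit(this.route_id, this.price); }
    """)

    reducer = Code("""
        // min is safe to re-reduce, so it works as a map/reduce aggregate
        function (routeId, prices) { return Math.min.apply(Math, prices); }
    """)

    # materialise the aggregate into its own collection every ~10 minutes
    db.flights.map_reduce(mapper, reducer, out="min_price_by_route")

    # page rendering then becomes a cheap lookup instead of a live aggregate
    cheapest = db.min_price_by_route.find_one({"_id": "LHR-JFK"})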
 
