May 6, 2014

Lucene vs. Solr vs. ElasticSearch vs. Sphinx? How to do real-time full-text search properly?


I am currently researching a solution for a real-time search engine that scours user-submitted content. The user pool consists of 250+ private institutions, which translates into 10K-15K users. I/O will be relatively low, but the data size could vary widely, since users upload Microsoft Word, Excel, and PDF files.

The index is built from the user-uploaded files (mostly Word, Excel, PDF, and PowerPoint documents, plus ASCII files). I/O is expected to be only 10-20 IOPS, but it can vary depending on the date; the maximum could be 100 IOPS. The current database is 4 months old and is approaching 10GB.

For the real-time search server, I'm considering Solr / Lucene and probably ElasticSearch. The challenge is how to index these files FAST, so that the search server can query the index in real time.

I have found some similar questions on how to index .doc/.xls/.pdf files, but they did not mention how to ensure indexing performance:

How to build the index FAST?

Any suggestions on the architecture? Should I focus on building fast infrastructure (i.e. RAID, SSD, more CPU, network bandwidth) or on the indexing tools and algorithms?
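To make the "tools & algorithm" side of the question concrete, here is a minimal sketch of one common approach: extract text from several files in parallel, then send documents to the search server in batches rather than one at a time. Everything named here is an assumption for illustration: `extract_text` is a hypothetical stand-in for a real parser, `send` is a hypothetical callback that would post the body to the server's bulk endpoint, and the `uploads` index and `content` field are made up. The body follows the newline-delimited JSON shape used by Elasticsearch's bulk API, though the exact action metadata depends on the version.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from itertools import islice


def extract_text(path):
    """Hypothetical extractor: stands in for a real document parser."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        return handle.read()


def batches(items, size):
    """Yield successive lists of at most `size` items."""
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def build_bulk_body(docs, index_name="uploads"):
    """Serialize (doc_id, text) pairs as newline-delimited JSON."""
    lines = []
    for doc_id, text in docs:
        # Action line first, then the document source on the next line.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        lines.append(json.dumps({"content": text}))
    return "\n".join(lines) + "\n"


def index_files(paths, send, batch_size=100, workers=4):
    """Extract text in parallel, then pass one bulk body per batch to `send`."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for chunk in batches(paths, batch_size):
            texts = list(pool.map(extract_text, chunk))
            send(build_bulk_body(zip(chunk, texts)))
```

Threads suit the case where extraction waits on an external parser service; if extraction were CPU-bound inside Python itself, a process pool would likely be the better fit.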

P.S. This question has been asked on Stack Overflow as well.

4 comments:

  1. I can't speak for Sphinx, but I'd choose Elasticsearch with the attachments plugin over Solr for scalability and real-time requirements.

  2. Cool, I'll try it over the weekend and update this blog with the result.

  3. I got Solr working already. It took me a week to get it running.
    During this period I read a lot of material: books, manuals.

    One thing you should note for your case: Apache Tika is well integrated with Solr.

    I am seriously considering ElasticSearch, though, because of positive community feedback and my good impression of its docs and supplementary technologies (see their website).

    In any case, I have several months ahead of me to tune the search component for the business, and I will make decisions as I go.

  4. I had a look at the Apache Tika API and the supported formats. It looks promising :)

    Let me know about your experience with ElasticSearch. I'm also sold on the convenience of writing JSON queries for ElasticSearch; it makes testing much easier.

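To illustrate the JSON query convenience mentioned in the last comment, here is a minimal sketch that builds a full-text `match` query body in the shape of the Elasticsearch query DSL. The `content` field name and the helper name `build_match_query` are assumptions for illustration; the sketch only builds the JSON and does not talk to a server.

```python
import json


def build_match_query(text, field="content", size=10):
    """Return an Elasticsearch-style full-text query body as a JSON string."""
    body = {
        # `match` analyzes the text and searches the given field.
        "query": {"match": {field: text}},
        # Cap the number of hits returned.
        "size": size,
    }
    return json.dumps(body, indent=2)


print(build_match_query("quarterly budget"))
```

Because the query is plain JSON, the same body can be pasted into a command-line HTTP client or a test fixture unchanged, which is what makes quick experiments easy.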