Hive plays well with ElasticSearch

Using the Amazon Elasticsearch Service with Hive

Amazon launched the Amazon Elasticsearch Service less than a month ago to enable their clients to spin up scalable Elasticsearch clusters directly from the AWS Management Console and forget about about managing these clusters by themselves. While you can spin up and use an Elasticsearch cluster in several minutes, this ease of use comes with a small disadvantage: as opposed to a classic Elasticsearch setup, the Elasticsearch service only exposes the publicly accessible client gateway, making it impossible for Hadoop applications to connect to the nodes behind this gateway using discovery mechanisms.

Hive and Elasticsearch

To connect to the ElasticSearch service from any popular Hadoop applications (Hive, Pig, Spark etc.) you need to use the Elasticsearch Hadoop connector. This can be imported into your Java/Scala application using build tools such as Maven and sbt respectively. To use the connector in Hive though, you need to download the standalone jar package available on the Elasticsearch website.

Note that support for the Amazon version (and other cloud based versions) of Elasticsearch was only added in Hadoop 2.2.0 beta 1. Older versions will not work with the restrictive environment of the cloud and you may experience one of the following errors when trying to connect to the Elasticsearch cluster, depending on your configuration:

  • Cannot find node with id
  • Cluster state volatile; cannot find node backing shards – please check whether your cluster is stable
  • Client-only routing specified but no client nodes with HTTP-enabled available
  • timeout

From this point on, I am assuming you are running a relatively fresh version of Hive on an EMR cluster and that you have already spun up an Elasticsearch service cluster with the correct access policy to allow EMR clusters to connect to it.

Once you downloaded the standalone JAR, make it available to your Hive cluster by storing it at an S3 location. For example, if I am hosting the file in the ana bucket with key emr/elasticsearch-hadoop-2.2.0-beta1.jar; I would call:

Now create a new external table with the schema that you want to push to ElasticSearch as follows:

By setting es.nodes.wan.only to true, the connector will disable discovery and its typical peer-to-peer connections and use only the nodes indicates in es.nodes. For Amazon this would be the publicly accessible gateway. Don’t forget to specify port 80 to override the default 9200. You can see an explanation of all other configuration options in the official documentation.

Disadvantages of the approach

Since all connections to and from Elasticsearch will be made through this gateway, I believe this solution disables the inherent parallelism of Hive and might be slower than connecting to clusters which have all the Elasticsearch nodes exposed. However, this is the only solution that works for now.

I did a test of uploading 200,000 documents from a 2 x c3.xlarge EMR cluster to a t2.micro Elasticsearch cluster and it took around 80 seconds. I am satisfied with the performance for now. I am planning to do a test with about 1 million documents in the following day and I will update this article.

References:

Ana Todor is a Computer Scientist with a playful and literary twist. She has a Bachelor of Science degree in Engineering, a Bachelor of Arts degree in Cultural Studies and a Master of Science degree in Computer Science, Digital Interactive Entertainment.

Leave a Reply

Next ArticleHappy birthday dear Romania!