Accumulo® on AWS’s Elastic MapReduce

One of the great things about working for a small company is that you never know which technology you might need to work on next. We don't have a database team, a network team, or a hardware team (we don't have any hardware!); there are just a few of us doing everything.

The latest thing to drop into my lap is Gaffer, a piece of software written by my government. Gaffer runs on Accumulo®; Accumulo runs on Hadoop® and needs ZooKeeper™ to run.

One of the great things about AWS is its managed service offerings, in this case "Elastic MapReduce" (EMR). EMR provides a very quick way to stand up a cluster of servers ready with Hadoop and other optional software.

Accumulo is not an application that Amazon will install for you, but there is a guide, Running Accumulo on EMR, written by Amazon consultants almost two years ago. Needless to say, it has not aged well and sadly has not been maintained. The script it references is no longer available in S3 (I found a copy on GitHub), and the cluster-building instructions no longer work: extra parameters are now required, and the instance type it suggests is no longer supported.

I thought it would be a useful contribution to bring the Accumulo install up to date and add the Gaffer setup.

Technical Details

I felt that the best way to proceed was to build this up a step at a time.

Building the cluster at 1.6.1

To build the cluster, you now need to use an m4.large instance type, which must run in a VPC. AWS now builds you a default VPC, but that means you need to specify a security group and subnet that you previously didn't need. Those will have to exist before you can build your cluster.

If you are starting from scratch, you can shortcut the manual creation of the groups and subnet by creating a cluster in the AWS console. After you create a cluster, there is an "AWS CLI Export" button that shows you the command used to create it; from there you can extract the subnet and security group IDs. I recommend adding an SSH rule for your own IP address while you're in the console, as it makes your cluster accessible for commands later.
(Allow the cluster creation to complete, then terminate it to avoid paying for something you don't need. You will be charged for an hour of the instance size you requested, multiplied by the number of instances in your cluster. Use spot pricing to bring that price down if you're on a tight budget.)
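If you would rather stay on the command line, the same IDs can be dug out with the AWS CLI. This is just a sketch using stock `aws ec2` describe commands; it needs your own configured credentials:

```shell
# List subnets with their CIDR ranges and availability zones
aws ec2 describe-subnets \
    --query 'Subnets[].{ID:SubnetId,CIDR:CidrBlock,AZ:AvailabilityZone}' \
    --output table

# List security groups so you can pick out the EMR-managed ones
aws ec2 describe-security-groups \
    --query 'SecurityGroups[].{ID:GroupId,Name:GroupName}' \
    --output table
```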

You also need to specify some roles which were not required when the original command was written.

The command now looks like this:

aws emr create-cluster --name Accumulo --no-auto-terminate \
   --bootstrap-actions Path=s3://BUCKET/1.6.1/install-accumulo_mj,Name=Install_Accumulo \
   --applications Name=Hadoop Name=ZooKeeper \
   --release-label emr-5.3.0 \
   --instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=m4.large,BidPrice=0.05 \
                     InstanceGroupType=CORE,InstanceCount=2,InstanceType=m4.large,BidPrice=0.05 \
   --ec2-attributes KeyName=KEY,InstanceProfile=EMR_EC2_DefaultRole,EmrManagedSlaveSecurityGroup=sg-f1444444,EmrManagedMasterSecurityGroup=sg-fe444444,SubnetId=subnet-5000000a \
   --service-role EMR_DefaultRole

 

You will need to change BUCKET to the bucket containing your install script, KEY to your SSH key name, and the security group and subnet IDs to your own values. You may want to remove "BidPrice=0.05" if cost isn't an issue. And don't forget to add SSH access to the security groups.
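Adding that SSH rule can also be done from the CLI. A sketch, where the group ID is a placeholder for your master security group from the command above:

```shell
# Find your current public IP, then allow SSH to the master group from it
MY_IP=$(curl -s https://checkip.amazonaws.com)
aws ec2 authorize-security-group-ingress \
    --group-id sg-fe444444 \
    --protocol tcp --port 22 \
    --cidr "${MY_IP}/32"
```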

If you run that command now, it will fail, because you haven’t yet sorted out the install-accumulo_mj script.

You can get that script from my fork of the AWS repository here.

The latest version of Accumulo

Now that I have some idea of how to install Accumulo, I can set about the script and rewrite it in a way that makes me a bit happier. Ultimately I'd consider replacing everything with Ansible, but for now I just want something that works.

You can find that script in my repository here. It has had quite a bit of a rewrite; I hoped to make it clearer which bits run as hadoop, root, or accumulo without a total rewrite. I think I failed. If we progress this work, I will probably redo the whole thing with Ansible: Ansible would set up the subnets, security groups, etc., build the cluster without a bootstrap, and then use a slightly less loopy script to install Accumulo.

Upload the new script to your bucket, change the bootstrap file location in your "aws emr" command to point to it, run the command, and wait for your cluster to build.
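The upload itself is a one-liner; the bucket and path here just mirror the example `--bootstrap-actions` value from the command above:

```shell
# Copy the bootstrap script to the S3 path referenced by --bootstrap-actions
aws s3 cp install-accumulo_mj s3://BUCKET/1.6.1/install-accumulo_mj
```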

What if I already have a Hadoop cluster?

The script should be runnable as the hadoop user on any cluster with at least ZooKeeper and HDFS. It does some slightly funky things to make it run after Hadoop is up on EMR, but as long as you have sudo, you should be able to run it. Unfortunately, you'll need to run it on each node in your cluster yourself.
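One way to fan the script out, assuming every node is reachable over SSH with your EMR key (the cluster ID and key filename below are placeholders):

```shell
# Push and run the install script on every node in the cluster
CLUSTER_ID=j-XXXXXXXXXXXXX   # placeholder: your EMR cluster ID
for host in $(aws emr list-instances --cluster-id "$CLUSTER_ID" \
        --query 'Instances[].PrivateIpAddress' --output text); do
    scp -i KEY.pem install-accumulo_mj hadoop@"$host":
    ssh -i KEY.pem hadoop@"$host" 'bash ./install-accumulo_mj'
done
```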

I had some issues with the CLASSPATH, which you may need to resolve by changing the “accumulo-site.xml” file or some of the paths in the script. If the accumulo binary fails to run, look for class path problems in the output.
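A quick way to see what Accumulo will actually load is its `classpath` subcommand, which prints each resolved entry:

```shell
# Print the class path Accumulo resolves from its configuration
${ACCUMULO_HOME}/bin/accumulo classpath
```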

Adding Gaffer

You'll be glad to know I have already done this for you. In the bootstrap script you will see the following; if you just want Accumulo, you can safely remove these lines:

wget -q 'https://search.maven.org/remotecontent?filepath=uk/gov/gchq/gaffer/accumulo-store/0.6.0/accumulo-store-0.6.0-iterators.jar' -O accumulo-store-0.6.0-iterators.jar


The wget downloads the jar onto all nodes. On the master, we create the user and table for Gaffer. You may want to customise the user's details:

${ACCUMULO_HOME}/bin/accumulo shell --user root -p secret << EOF
createuser myUser
myPassword
myPassword
grant -s System.CREATE_TABLE -u myUser
user myUser
myPassword
createtable gafferTable
EOF

 

I believe you can remove the create-table permission once the table is created, but if you're doing development work, other tables might be needed. For "production" you should probably remove the permission (and choose a much better password!).
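Revoking mirrors the grant above; run it in the shell as root once you're done creating tables:

```shell
# Revoke the system create-table permission from the Gaffer user
${ACCUMULO_HOME}/bin/accumulo shell --user root -p secret << EOF
revoke -s System.CREATE_TABLE -u myUser
EOF
```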

Then you need to supply a .properties file to your developers that looks like this:

gaffer.store.class=uk.gov.gchq.gaffer.accumulostore.AccumuloStore
gaffer.store.properties.class=uk.gov.gchq.gaffer.accumulostore.AccumuloProperties
accumulo.instance=instance
accumulo.zookeepers=172.1.1.1:2181,172.1.1.2:2181
accumulo.table=gafferTable
accumulo.user=myUser
accumulo.password=myPassword

 

(Replace those IP addresses with your actual ZooKeeper nodes' addresses.) There is no automation for this step; just copy, paste, and edit that chunk of text. If you need to automate it, you can have the install script write all that data out fairly easily. You would then still need to copy the file to wherever your application runs.
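If you do want to script it, something like this in the bootstrap (or any shell) will write the file. Every value below is the example data from above and should be replaced with your own:

```shell
# Write the Gaffer store properties file (all values are the examples from above)
cat > gaffer-accumulo.properties << 'EOF'
gaffer.store.class=uk.gov.gchq.gaffer.accumulostore.AccumuloStore
gaffer.store.properties.class=uk.gov.gchq.gaffer.accumulostore.AccumuloProperties
accumulo.instance=instance
accumulo.zookeepers=172.1.1.1:2181,172.1.1.2:2181
accumulo.table=gafferTable
accumulo.user=myUser
accumulo.password=myPassword
EOF
```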

Testing Gaffer

Once you've done all that, you probably want to know whether it worked. I suggest you pull the example code from GitHub. You can run it from anywhere with Java 1.8 and Maven installed that can make a network connection to your ZooKeeper server.

git clone https://github.com/gchq/Gaffer.git
cd Gaffer/example
vi ./example-graph/src/main/resources/example/gettingstarted/mockaccumulostore.properties
mvn clean install -Pquick -PexampleJar

 

On the third line, you use vi to replace the contents of the mock store properties file with your real properties, as above. If you want to rename the file to remove the "mock", you'll need to start changing source code; I don't recommend bothering unless you're good with Java and want to do more than this one quick example. Maven takes a few minutes to build. Then, to run an example:

java -cp example/example-graph/target/example-jar-with-dependencies.jar uk.gov.gchq.gaffer.example.gettingstarted.analytic.LoadAndQuery1

 

That should show you a whole load of helpful debug output about which server it is connecting to, then insert and query some edges in the database. You can check that it really did put data in the database by using the Accumulo shell:

accumulo/bin/accumulo shell --user myUser -p myPassword
table gafferTable
scan

And if that all worked and you see a load of rows, remove them with:

deletemany -f

 

Congratulations, you have just installed a Gaffer graph database.