This documents the process I have followed to create a Sahara cluster using CentOS-based Hadoop2 images.
Creating the Image
The images were created on a RHEL6 machine with SELinux disabled by changing
/etc/selinux/config to contain
SELINUX=disabled. This machine
has access to the Optional channel as well as EPEL.
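For reference, disabling SELinux on the build host amounts to editing that file and rebooting (setenforce 0 alone only drops the running system to Permissive):
$ sudo sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
$ sudo reboot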
The trunk of sahara-image-elements was used for image creation. The last commit
of the repo used was
28a76fd0c0e7b5431c26728fe60185d79d65eff6. The last
commit of the diskimage-builder repo used by sahara-image-elements was
6b2a78f3abdcb7133ded96324f30907739f8f855. I ran the following command to
create the image for testing:
$ sudo diskimage-create.sh -p hdp -i centos -v 2 -d
I realize the
-i centos isn’t strictly needed, but I wanted to be thorough
when testing the image creation.
NOTE: There is an issue currently being resolved with diskimage-builder when creating CentOS images without the base element; for more information see bug 1308224.
Loading the Image
I am using the tip of Devstack trunk, installed as per
the “Quick Start” instructions. This is the
local.conf file I am using:
All of the following steps were performed using the standard Horizon dashboard
interface in the
Demo project. I have registered a new keypair created with
$ ssh-keygen -t rsa.
Imported using the
Create Image button from the project’s
Images tab. QCOW2 format selected; no architecture, minimum disk, or minimum RAM were entered.
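For reference, the same import can be done from the command line with the glance client; something along these lines, where the image name and file are placeholders for whatever diskimage-create produced:
$ glance image-create --name centos-hdp2 --disk-format qcow2 \
    --container-format bare --file <generated image>.qcow2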
Image registered using the
Register Image button from the
Sahara > Image Registry tab. The user name ec2-user was entered.
Create Node Group Templates
Templates created using the
Create Template button from the
Sahara > Node Group Templates tab.
For this cluster I have created two node group templates, a “master” node and a “worker” node.
Both nodes use the
m1.small OpenStack flavor, ephemeral drive storage location, and the public floating IP pool.
The master node processes selected were:
The worker node processes selected were:
Create Cluster Template
Template created using the
Create Template button from the
Sahara > Cluster Templates tab.
The template was created with 1 master node and 2 worker nodes.
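For illustration, the same layout expressed as a Sahara REST API cluster template body would look roughly like the following; the template name, node group names, and IDs are placeholders, and the plugin name and version are my assumption for the HDP2 plugin:
{
    "name": "hdp2-cluster-template",
    "plugin_name": "hdp",
    "hadoop_version": "2.0.6",
    "node_groups": [
        {"name": "master", "count": 1, "node_group_template_id": "<master template id>"},
        {"name": "worker", "count": 2, "node_group_template_id": "<worker template id>"}
    ]
}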
Launching the Cluster
From the Sahara > Cluster Templates tab, I used the
Launch Cluster button on the freshly created template. I used the previously mentioned image as the base, the registered keypair, and the private network for management.
At this point the cluster stays in the
Spawning status for a few minutes before moving into the
Waiting state. It never seems to go past Waiting.
I can log into the instances using ssh and the keypair, but only as root.
If I try to ssh in as ec2-user I get disconnected immediately. This is
resolved by setting SELinux to
Permissive on the instance.
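Since the root login works, the change is easy to make over ssh; setenforce handles the running system and editing /etc/selinux/config makes it stick across reboots:
$ setenforce 0
$ sed -i 's/^SELINUX=.*/SELINUX=permissive/' /etc/selinux/config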
All the nodes produce the same errors at the end of boot:
The JAVA_HOME error is being addressed in review 89515.
The Ambari server error seems to be based around the fact that these instances
do not have access to the internet and the
ambari-server setup command
seems to want to download a JDK by default. If I run the setup from an
ssh shell I am able to select the JDK contained in
/opt/jdk1.6.0_31 and the
setup will complete.
The worker nodes appear to have an improper server hostname in their
/etc/ambari-agent/conf/ambari-agent.ini; they all contain
localhost as the server hostname. This may be due to the server not configuring properly,
but if the value is changed to the IP of the configured server then their
agents run properly.
Even with all the Ambari processes running, the cluster does not leave the
Waiting status. There may be additional steps required to get the cluster
into a working state. This is still being investigated.
Attaching the log files for the ambari-agent from the master node, and the
sahara log from the host machine. The file
/var/log/ambari-agent/ambari-agent.out was empty as was the
Setting the proper floating IP configuration in Sahara allowed me to get past
the Waiting status. This involved ensuring that the following were set:
In the version of Devstack I am using these are mostly preconfigured. In the past I had been able to use the namespaces setting, but apparently that is not working in my Devstack.
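For anyone reproducing this, the settings in question live in sahara.conf. A rough sketch, assuming the usual option names, looks like:
[DEFAULT]
use_floating_ips = True
use_neutron = True
# use_namespaces = True is the namespaces setting mentioned above; it is not working for me
Each node group template also needs a floating IP pool assigned, which I had already done through the dashboard.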
With these settings in place my cluster was able to move to the
At this point I was able to ssh into the master node and run
ambari-server setup as root. I chose the default options with the exception
of the JDK. For that I chose the
Custom JDK option and selected the
/opt/jdk1.6.0_31 directory. This allowed the ambari-server to finish setting up.
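If I am reading the ambari-server options correctly, the same setup can be done non-interactively by pointing it at the bundled JDK, something like:
$ ambari-server setup -s -j /opt/jdk1.6.0_31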
Next I ssh’d to the worker nodes and updated
/etc/ambari-agent/conf/ambari-agent.ini to have the proper server address.
All the workers had
localhost as their server addresses.
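On each worker the fix boils down to pointing the hostname value in that file at the master and restarting the agent; run as root, with the master address filled in, it is roughly:
$ sed -i 's/^hostname=localhost$/hostname=<master ip>/' /etc/ambari-agent/conf/ambari-agent.ini
$ ambari-agent restart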
With both of these fixes in place the cluster started moving along once again.
It became stuck in the
Looking at the Sahara log files, there is an exception happening with an attempt to get a repo file from the public internet. Here is the full exception stack:
Made more progress in getting the cluster to configure itself with the following changes:
Changed to using
cloud-user for logging into the instances, as updates
to sahara-image-elements have changed the default user.
Commented out the
install_rpms call in the
provision_ambari method in
sahara/plugins/hdp/hadoopserver.py. New code looks like:
Added a security group rule to allow ingress traffic on 8080 from 0.0.0.0/0.
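For reference, the equivalent rule from the command line, assuming the instances are in the default security group, is something like:
$ nova secgroup-add-rule default tcp 8080 8080 0.0.0.0/0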
The next issue was the Nagios server installation, which failed
multiple times. Fortunately it seems that the cluster doesn’t necessarily need
it for creation. Removing the
NAGIOS_SERVER from the master node allowed the
cluster creation to make it further.
After all the tasks were initialized the next step in the plugin is to install
the Hadoop Swift integration. This attempts to download an rpm from Amazon S3
which fails in disconnected mode. The Hadoop Swift integration rpm is loaded
on the image during creation from sahara-image-elements; it can be found in
/opt/hdp-local-repos/hadoop-swift/. It is not installed by default.
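Installing the local copy by hand is simple enough; the exact rpm name will differ, but it amounts to something like:
$ sudo yum install -y /opt/hdp-local-repos/hadoop-swift/hadoop-swift-*.rpm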
I have added 2 patches to work around the remote installs. In the case of the
rpm installed during
provision_ambari, I have added a small piece to detect
if the rpm is already installed. For the Hadoop Swift integration piece I have
added a patch to detect the local version of the rpm and install that instead
of hitting the internet.
Here is a summary of the patches.
With these patches in place the cluster has progressed to the