This documents the process I followed to create a Sahara cluster using CentOS-based Hadoop2 images.
Creating the Image
The images were created on a RHEL6 machine with SELinux disabled by changing /etc/selinux/config to contain SELINUX=disabled. This machine has access to the Optional channel as well as EPEL.
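As a quick sanity check before building, the SELinux setting can be read back from the config file. The helper below is only a sketch of mine, not part of any tooling mentioned here; the function name is made up, and it works on the file's text so it can be tried without touching a real machine.

```python
def selinux_mode(config_text):
    """Return the value of the SELINUX= line in the given config text, or None."""
    for line in config_text.splitlines():
        line = line.strip()
        # Skip comments and anything that is not a key=value pair.
        if line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        if key.strip() == "SELINUX":
            return value.strip()
    return None


# Sample text standing in for the contents of /etc/selinux/config.
sample = "# SELinux settings\nSELINUX=disabled\nSELINUXTYPE=targeted\n"
print(selinux_mode(sample))
```

On a real build host the text would come from reading /etc/selinux/config.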
The trunk of sahara-image-elements was used for image creation. The last commit of the repo used was 28a76fd0c0e7b5431c26728fe60185d79d65eff6. The last commit of the diskimage-builder repo used by sahara-image-elements was 6b2a78f3abdcb7133ded96324f30907739f8f855. I ran the following command to create the image for testing:
$ sudo diskimage-create.sh -p hdp -i centos -v 2 -d
I realize the -i centos isn’t strictly needed, but I wanted to be thorough when testing the image creation.
NOTE: There is an issue currently being resolved with diskimage-builder when creating CentOS images without the base element. For more information, see bug 1308224.
Loading the Image
I am using the tip of Devstack trunk, installed as per the “Quick Start” instructions. This is the local.conf file I am using:
All of the following steps were performed using the standard Horizon dashboard interface in the Demo project. I have registered a new keypair that was created with $ ssh-keygen -t rsa.
- Create Images
Imported using the Create Image button from the project’s Images tab. QCOW2 format was selected; no architecture, minimum disk, or minimum RAM were entered.
- Register Image
Image registered using the Register Image button from the Sahara > Image Registry tab. The user name ec2-user was entered.
- Create Node Group Templates
Templates created using the Create Template button from the Sahara > Node Group Templates tab.
For this cluster I have created 2 node group templates: a “master” node and a “worker” node.
Both nodes use the m1.small OpenStack flavor, ephemeral drive storage location, and the public floating IP pool.
The master node processes selected were: NAMENODE, SECONDARY_NAMENODE, ZOOKEEPER_SERVER, AMBARI_SERVER, PIG, HISTORYSERVER, RESOURCEMANAGER, NODEMANAGER, OOZIE_SERVER, GANGLIA_SERVER, and NAGIOS_SERVER.
The worker node processes selected were: DATANODE, HDFS_CLIENT, PIG, MAPREDUCE2_CLIENT, NODEMANAGER, and OOZIE_CLIENT.
- Create Cluster Template
Template created using the Create Template button from the Sahara > Cluster Templates tab.
The template was created with 1 master node and 2 worker nodes.
- Launching the Cluster
From the Sahara > Cluster Templates tab, I used the Launch Cluster button from the freshly created template. I used the previously mentioned image as the base, the registered keypair, and the private network for management.
At this point the cluster will stay in the Spawning status for a few minutes before moving into the Waiting state. It never seems to go past Waiting.
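For reference, the two node group templates and the cluster template from the steps above can be written out as plain data. This is only a sketch: the dictionary keys are my own shorthand and are not claimed to match the field names Sahara uses, and the template name hdp2-cluster is hypothetical.

```python
# Process lists as selected in the dashboard for each node group.
MASTER_PROCESSES = [
    "NAMENODE", "SECONDARY_NAMENODE", "ZOOKEEPER_SERVER", "AMBARI_SERVER",
    "PIG", "HISTORYSERVER", "RESOURCEMANAGER", "NODEMANAGER",
    "OOZIE_SERVER", "GANGLIA_SERVER", "NAGIOS_SERVER",
]
WORKER_PROCESSES = [
    "DATANODE", "HDFS_CLIENT", "PIG", "MAPREDUCE2_CLIENT",
    "NODEMANAGER", "OOZIE_CLIENT",
]


def node_group(name, processes, count):
    """Describe one node group; key names are illustrative shorthand only."""
    return {
        "name": name,
        "flavor": "m1.small",
        "node_processes": list(processes),
        "count": count,
    }


# Hypothetical template name; 1 master and 2 workers as described above.
cluster_template = {
    "name": "hdp2-cluster",
    "node_groups": [
        node_group("master", MASTER_PROCESSES, 1),
        node_group("worker", WORKER_PROCESSES, 2),
    ],
}


def total_instances(template):
    """Number of instances the template would launch."""
    return sum(group["count"] for group in template["node_groups"])


print(total_instances(cluster_template))
```

Writing the layout down this way makes it easy to see that launching the template asks for three m1.small instances in total.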
Debug Notes
I can log into the instances using ssh and the keypair, but only as root. If I try to ssh in as ec2-user I get disconnected immediately. This is resolved by setting SELinux to Permissive on the instance.
All the nodes produce the same errors at the end of boot:
The JAVA_HOME error is being addressed in review 89515.
The Ambari server error seems to stem from the fact that these instances do not have access to the internet, while the ambari-server setup command wants to download a JDK image by default. If I run the setup from an ssh shell I am able to select the JDK contained in /opt/jdk1.6.0_31 and the setup will complete.
The worker nodes appear to have an improper server hostname in their /etc/ambari-agent/conf/ambari-agent.ini. They all contain localhost for the server hostname. This may be due to the server not configuring properly, but if the value is changed to the IP of the configured server then their agents run properly.
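The hostname change on the workers could be scripted rather than edited by hand. The following is a rough sketch, assuming the agent file keeps the value as a hostname key inside a [server] section; the function name, the sample file contents, and the 10.0.0.5 address are placeholders of mine.

```python
import configparser
import io


def set_ambari_server(ini_text, server_address):
    """Return ini_text with the [server] hostname replaced by server_address."""
    # Interpolation is disabled so any literal % in the file is left alone.
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names exactly as written
    parser.read_string(ini_text)
    if not parser.has_section("server"):
        parser.add_section("server")
    parser.set("server", "hostname", server_address)
    out = io.StringIO()
    parser.write(out, space_around_delimiters=False)
    return out.getvalue()


# Sample text standing in for /etc/ambari-agent/conf/ambari-agent.ini.
SAMPLE = "[server]\nhostname=localhost\nurl_port=8440\n"
print(set_ambari_server(SAMPLE, "10.0.0.5"))
```

Note that configparser drops comments when it rewrites a file, so on a real node it would be worth keeping a backup copy of the original before writing the result back.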
Even with all the Ambari processes running, the cluster does not leave Waiting status. There may be additional steps required to get the cluster into a working state. This is still being investigated.
Updates
- 2014-04-25
Attaching the log files for the ambari-agent from the master node, and the sahara log from the host machine. The file /var/log/ambari-agent/ambari-agent.out was empty, as was the /var/log/ambari-server directory.
- 2014-05-01
Setting the proper floating IP configuration in Sahara allowed me to get past the Waiting status. This involved ensuring that the following were set: use_neutron=true, use_floating_ips=true, and use_namespaces=false.
In the version of devstack I am using these are mostly preconfigured. In the past I had been able to use the namespaces setting but apparently that is not working from my devstack.
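A small check like the one below can confirm the three values before launching another cluster. It is a sketch under the assumption that the options live in the [DEFAULT] section of the Sahara configuration file; the function name and the sample text are mine.

```python
import configparser

# The values that allowed the cluster to get past Waiting.
REQUIRED = {
    "use_neutron": "true",
    "use_floating_ips": "true",
    "use_namespaces": "false",
}


def mismatched_settings(conf_text, required=REQUIRED):
    """Return {option: current value} for each option that differs from required."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(conf_text)
    defaults = parser.defaults()  # contents of the [DEFAULT] section
    wrong = {}
    for option, expected in required.items():
        current = defaults.get(option)
        if current is None or current.strip().lower() != expected:
            wrong[option] = current
    return wrong


# Sample text standing in for the Sahara configuration file.
sample = "[DEFAULT]\nuse_neutron=true\nuse_floating_ips=true\nuse_namespaces=true\n"
print(mismatched_settings(sample))
```

An empty dictionary means all three options already have the values listed above; a missing option is reported with a value of None.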
With these settings in place my cluster was able to move to the Preparing status.
At this point I was able to ssh into the master node and run ambari-server setup as root. I chose the default options with the exception of the JDK. For that I chose the Custom JDK option and selected the /opt/jdk1.6.0_31 directory. This allowed the ambari-server to finish setting up.
Next I ssh’d to the worker nodes and updated /etc/ambari-agent/conf/ambari-agent.ini to have the proper server address. All the workers had localhost as their server addresses.
With both of these fixes in place the cluster started moving along once again. It then stalled in the Configuring status.
Looking at the Sahara log files, there is an exception happening during an attempt to get a repo file from the public internet. Here is the full exception stack:
- 2014-05-08
Made more progress in getting the cluster to configure itself with the following changes:
Changed to using the cloud-user for logging into the instances, as updates to sahara-image-elements have changed the default user.
Commented out the install_rpms call in the provision_ambari method in sahara/plugins/hdp/hadoopserver.py. New code looks like:
Added a security group rule to allow ingress traffic on 8080 from 0.0.0.0/0.
The next issue was the Nagios server installation, which failed multiple times. Fortunately it seems that the cluster doesn’t necessarily need it for creation. Removing NAGIOS_SERVER from the master node allowed the cluster creation to get further.
After all the tasks were initialized, the next step in the plugin is to install the Hadoop Swift integration. This attempts to download an rpm from Amazon S3, which fails in disconnected mode. The Hadoop Swift integration rpm is loaded onto the image during creation by sahara-image-elements and can be found in /opt/hdp-local-repos/hadoop-swift/. It is not installed by default.
- 2014-05-09
I have added 2 patches to work around the remote installs. In the case of the rpm installed during provision_ambari, I have added a small piece to detect if the rpm is already installed. For the Hadoop Swift integration piece I have added a patch to detect the local version of the rpm and install that instead of hitting the internet.
Here is a summary of the patches.
With these patches in place the cluster has progressed to the Starting status.
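To illustrate the idea behind the two workarounds from the 2014-05-09 update, here is a rough sketch of the shell commands such logic might run on an instance. This is not the actual patch code: the function names, the package name, and the URL are hypothetical, and only the standard rpm -q and rpm -ivh invocations are relied on.

```python
import shlex


def install_if_missing_cmd(package, source):
    """Shell command that installs source only when package is not already installed."""
    # rpm -q exits non-zero when the package is absent, so || triggers the install.
    return "rpm -q {pkg} || rpm -ivh {src}".format(
        pkg=shlex.quote(package), src=shlex.quote(source)
    )


def prefer_local_rpm_cmd(local_dir, remote_url):
    """Shell command that installs rpms found in local_dir, else falls back to remote_url."""
    directory = shlex.quote(local_dir.rstrip("/"))
    return (
        "if ls {d}/*.rpm >/dev/null 2>&1; "
        "then rpm -ivh {d}/*.rpm; "
        "else rpm -ivh {u}; fi"
    ).format(d=directory, u=shlex.quote(remote_url))


# Hypothetical package name and URL, used purely for illustration.
print(install_if_missing_cmd("example-pkg", "/tmp/example-pkg.rpm"))
print(prefer_local_rpm_cmd("/opt/hdp-local-repos/hadoop-swift/",
                           "http://example.invalid/hadoop-swift.rpm"))
```

Building the command as a string keeps the decision on the instance itself, which matters in disconnected mode because the host running Sahara cannot know in advance what is already present on the image.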