The Cult of Gary

24 Feb

Making PHP talk to Hive through Thrift

Hive is a data warehouse system build ontop of Hadoop. I’ve been experimenting with it for the past few days. Using the thrift service, I’ve been able to drive it from PHP. Here’s what I’ve done to get it going:

Launching a Cluster

Using the EC2 scripts, I launched a cluster of Hadoop servers on EC2. It’s straight forward to get up and running. It takes me about 5 minutes to get a cluster going, including boot time.

Once it’s up and I’ve connected to the master, I install some tools that I need for building:

yum install -y ant svn

I also set up my environment:


export JAVA_HOME=/usr/local/jdk1.6.0_10
export HADOOP_HOME=/usr/local/hadoop-0.19.0/
export DERBY_HOME=/usr/local/db-derby
alias h=$HADOOP_HOME/bin/hadoop

You may need to update version numbers as the images are updated.

Set Up Derby

You’ll want to use a stand alone metastore db. The default is to use an embedded version of derby, which locks the database files so you can only have one instance connecting to hive at a time.

You’ll need to set up and launch your derby server:


wget http://east.unified.net/apache/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
tar zxvf db-derby-10.4.2.0-bin.tar.gz
mv db-derby-10.4.2.0-bin $DERBY_HOME
pushd $DERBY_HOME
mkdir data
pushd data
nohup $DERBY_HOME/bin/startNetworkServer -h 0.0.0.0 &
popd
popd

Install Hive and the Thrift Service

The following chunk builds and installs the latest version of hive. It’s a simplified version of the Getting Started guide.


svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
pushd hive
ant package
mkdir /usr/local/hive
cp -r build/dist/* /usr/local/hive
popd
# setup required directories in hdfs
h fs -mkdir /tmp
h fs -mkdir /user/hive/warehouse
h fs -chmod g+w /tmp
h fs -chmod g+w /user/hive/warehouse
# copy the derby libs to connect to the server
cp $DERBY_HOME/lib/{derbyclient.jar,derbytools.jar} /usr/local/hive/lib/

I had a problem with the ant build process. My EC2 instances couldn’t download the hadoop source. My work around was to start the build on my mac and copy the ~/.ant directory to the server. Since this contains the complete ivy cache, it doesn’t need to download the files. I’m sure this was just a momentary glitch for me, but I figured I’d mention it in case someone else runs into it.

You’ll also need to update some hive config files. /usr/local/hive/conf/hive-site.xml should look like:


<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value>
<description>controls whether to connect to remove metastore server or open a new metastore server in Hive Client JVM</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
</configuration>

And /usr/local/hive/conf/jpox.properties should contain:


javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine
org.jpox.cache.level2=true
org.jpox.cache.level2.type=SOFT

At this point, you can safely start hive by running:


pushd /usr/local/hive
nohup /usr/local/hive/bin/hive --service hiveserver &
popd

To start the interactive shell, run /usr/local/hive/bin/hive and execute something like show tables. If everything is working, you should see a metastore_db directory in /usr/local/db-derby/data.

Connecting From PHP

You’ll need to assemble the PHP thrift libraries to make this go. Until THRIFT-347 and HIVE-299 are resolved, it’s no easy task.

  1. To get the base thrift libraries, you need to download thrift. I recommend getting the latest SVN version over the release. The files you need are located a lib/php/src.
  2. Copy this directory somewhere — this is your THRIFT_ROOT.
  3. If the bug hasn’t been resolved, apply the patch from THRIFT-347 to TSocket.php.
  4. Build thrift. This was not a lot of fun for me. You need to install the boost-devel, gcc-g++, byacc, flex, autoconf and automake packages.
  5. You’ll need to build a bunch of thrift interfaces:

    mkdir $THRIFT_ROOT/packages/
    # FB303 thrift IF
    cd $YOUR_THRIFT_SRCDIR
    thrift --gen php contrib/fb303/if/fb303.thrift
    mv gen-php $THRIFT_ROOT/packages/fb303/
    # Hive Metastore thrift IF
    cd $YOUR_HIVE_SRCDIR/metastore
    thrift --gen php -I include if/hive_metastore.thrift
    mv gen-php $THRIFT_ROOT/packages/hive_metastore
    # Hive Service IF
    cd $YOUR_HIVE_SRCDIR/service/
    thrift --gen php -I include -I ../ if/hive_service.thrift
    mv gen-php $THRIFT_ROOT/packages/hive_service

That’s it! you should have a THRIFT_ROOT that’s ready to go. If you’re having troubles assembling your own version, post a comment and I can send you a copy of my working one.

Connecting from PHP

To use the API, you need to load a bunch of classes and set up a bunch of objects:


// set your THRIFT_ROOT to the location of your code
$GLOBALS['THRIFT_ROOT'] = 'thriftroot/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';
// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();

Then you should be able to run queries:


$client->execute('SELECT * FROM some_table');
var_dump($client->fetchAll());

The Java Test Suite as a more in depth guide to using the API.

One Response to “Making PHP talk to Hive through Thrift”

  1. 1
    outsourced kitting and assembly Says:

    I searched on google and I had a hard time located the right info….until I found your blog.

Leave a Reply

© 2010 The Cult of Gary | Entries (RSS) and Comments (RSS)

GPSwordpress logo