The Cult of Gary

04 Mar

UPS Saves the Day

I’m coaching the Simon Fraser University Ultimate Frisbee team this year. We’re leaving for the Stanford Invite tomorrow. Our brand new jerseys were scheduled to arrive at my place today. I’ve been paranoid about getting these in on time. In fact, I was watching the UPS tracking page all morning long.

When I checked it around 11:30, I saw that they had attempted a delivery at 11:22.

For some reason, my door buzzer doesn’t work reliably and my phone didn’t go off. I had missed him by minutes. My apartment is on the ground floor, right next to the entrance way. He would have been no more than 15 feet away from me when he attempted the delivery.

I panicked for a minute, then hunted for the UPS phone number. After weeding through their IVR and getting a live person (it wasn’t hard; I just pressed 0), I explained the situation. She opened a ticket and said I’d be called back within an hour.

I got my callback 15 minutes later. They managed to get the driver to come back and attempt another delivery. The jerseys were delivered at about 12:30. UPS saved the day. We’ll actually look like a team at the tournament.

24 Feb

Making PHP talk to Hive through Thrift

Hive is a data warehouse system built on top of Hadoop. I’ve been experimenting with it for the past few days. Using the Thrift service, I’ve been able to drive it from PHP. Here’s what I’ve done to get it going:

Launching a Cluster

Using the EC2 scripts, I launched a cluster of Hadoop servers on EC2. It’s straightforward to get up and running. It takes me about 5 minutes to get a cluster going, including boot time.

Once it’s up and I’ve connected to the master, I install some tools that I need for building:

yum install -y ant svn

I also set up my environment:


export JAVA_HOME=/usr/local/jdk1.6.0_10
export HADOOP_HOME=/usr/local/hadoop-0.19.0/
export DERBY_HOME=/usr/local/db-derby
alias h=$HADOOP_HOME/bin/hadoop

You may need to update version numbers as the images are updated.

Set Up Derby

You’ll want to use a standalone metastore database. The default is an embedded version of Derby, which locks the database files so that only one instance can connect to Hive at a time.

You’ll need to set up and launch your derby server:


wget http://east.unified.net/apache/db/derby/db-derby-10.4.2.0/db-derby-10.4.2.0-bin.tar.gz
tar zxvf db-derby-10.4.2.0-bin.tar.gz
mv db-derby-10.4.2.0-bin $DERBY_HOME
pushd $DERBY_HOME
mkdir data
pushd data
nohup $DERBY_HOME/bin/startNetworkServer -h 0.0.0.0 &
popd
popd

Install Hive and the Thrift Service

The following chunk builds and installs the latest version of hive. It’s a simplified version of the Getting Started guide.


svn co http://svn.apache.org/repos/asf/hadoop/hive/trunk hive
pushd hive
ant package
mkdir /usr/local/hive
cp -r build/dist/* /usr/local/hive
popd
# setup required directories in hdfs
h fs -mkdir /tmp
h fs -mkdir /user/hive/warehouse
h fs -chmod g+w /tmp
h fs -chmod g+w /user/hive/warehouse
# copy the derby libs to connect to the server
cp $DERBY_HOME/lib/{derbyclient.jar,derbytools.jar} /usr/local/hive/lib/

I had a problem with the ant build process. My EC2 instances couldn’t download the Hadoop source. My workaround was to start the build on my Mac and copy the ~/.ant directory to the server. Since this contains the complete ivy cache, the build doesn’t need to download the files. I’m sure this was just a momentary glitch for me, but I figured I’d mention it in case someone else runs into it.

You’ll also need to update some hive config files. /usr/local/hive/conf/hive-site.xml should look like:


<configuration>
<property>
<name>hive.metastore.local</name>
<value>true</value>
<description>controls whether to connect to remote metastore server or open a new metastore server in Hive Client JVM</description>
</property>
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:derby://localhost:1527/metastore_db;create=true</value>
<description>JDBC connect string for a JDBC metastore</description>
</property>
<property>
<name>javax.jdo.option.ConnectionDriverName</name>
<value>org.apache.derby.jdbc.ClientDriver</value>
<description>Driver class name for a JDBC metastore</description>
</property>
</configuration>

And /usr/local/hive/conf/jpox.properties should contain:


javax.jdo.PersistenceManagerFactoryClass=org.jpox.PersistenceManagerFactoryImpl
org.jpox.autoCreateSchema=false
org.jpox.validateTables=false
org.jpox.validateColumns=false
org.jpox.validateConstraints=false
org.jpox.storeManagerType=rdbms
org.jpox.autoCreateSchema=true
org.jpox.autoStartMechanismMode=checked
org.jpox.transactionIsolation=read_committed
javax.jdo.option.DetachAllOnCommit=true
javax.jdo.option.NontransactionalRead=true
javax.jdo.option.ConnectionDriverName=org.apache.derby.jdbc.ClientDriver
javax.jdo.option.ConnectionURL=jdbc:derby://localhost:1527/metastore_db;create=true
javax.jdo.option.ConnectionUserName=APP
javax.jdo.option.ConnectionPassword=mine
org.jpox.cache.level2=true
org.jpox.cache.level2.type=SOFT

At this point, you can safely start hive by running:


pushd /usr/local/hive
nohup /usr/local/hive/bin/hive --service hiveserver &
popd

To start the interactive shell, run /usr/local/hive/bin/hive and execute something like show tables. If everything is working, you should see a metastore_db directory in /usr/local/db-derby/data.
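
If you’d rather script that last check than eyeball it, something like the following should do. This is only a sketch: the function name metastore_ready is made up for illustration, and the default path assumes the DERBY_HOME used earlier in this post.

```shell
# metastore_ready: hypothetical helper (not part of Hive or Derby).
# Succeeds if the given Derby data directory contains the metastore_db
# directory that shows up once Hive has talked to the metastore.
# The default path assumes DERBY_HOME=/usr/local/db-derby as set above.
metastore_ready() {
    [ -d "${1:-/usr/local/db-derby/data}/metastore_db" ]
}

# Usage:
#   metastore_ready && echo "metastore is there"
```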

Assembling the PHP Thrift Libraries

You’ll need to assemble the PHP thrift libraries to make this go. Until THRIFT-347 and HIVE-299 are resolved, it’s no easy task.

  1. To get the base thrift libraries, you need to download thrift. I recommend getting the latest SVN version over the release. The files you need are located at lib/php/src.
  2. Copy this directory somewhere — this is your THRIFT_ROOT.
  3. If the bug hasn’t been resolved, apply the patch from THRIFT-347 to TSocket.php.
  4. Build thrift. This was not a lot of fun for me. You need to install the boost-devel, gcc-g++, byacc, flex, autoconf and automake packages.
  5. You’ll need to build a bunch of thrift interfaces:

    mkdir $THRIFT_ROOT/packages/
    # FB303 thrift IF
    cd $YOUR_THRIFT_SRCDIR
    thrift --gen php contrib/fb303/if/fb303.thrift
    mv gen-php $THRIFT_ROOT/packages/fb303/
    # Hive Metastore thrift IF
    cd $YOUR_HIVE_SRCDIR/metastore
    thrift --gen php -I include if/hive_metastore.thrift
    mv gen-php $THRIFT_ROOT/packages/hive_metastore
    # Hive Service IF
    cd $YOUR_HIVE_SRCDIR/service/
    thrift --gen php -I include -I ../ if/hive_service.thrift
    mv gen-php $THRIFT_ROOT/packages/hive_service

That’s it! You should have a THRIFT_ROOT that’s ready to go. If you’re having trouble assembling your own version, post a comment and I can send you a copy of my working one.

Connecting from PHP

To use the API, you need to load a bunch of classes and set up a bunch of objects:


// set your THRIFT_ROOT to the location of your code
$GLOBALS['THRIFT_ROOT'] = 'thriftroot/';
// load the required files for connecting to Hive
require_once $GLOBALS['THRIFT_ROOT'] . 'packages/hive_service/ThriftHive.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'transport/TSocket.php';
require_once $GLOBALS['THRIFT_ROOT'] . 'protocol/TBinaryProtocol.php';
// Set up the transport/protocol/client
$transport = new TSocket('localhost', 10000);
$protocol = new TBinaryProtocol($transport);
$client = new ThriftHiveClient($protocol);
$transport->open();

Then you should be able to run queries:


$client->execute('SELECT * FROM some_table');
var_dump($client->fetchAll());

The Java test suite is a more in-depth guide to using the API.

18 Feb

Hadoop HDFS: Space Available

I’ve been playing around with Hadoop recently. It’s pretty slick. The EC2 scripts make it incredibly easy to set up and work with. HDFS is pretty neat too. I’ve been working with 10GB data sets, and moving the data around and working with it isn’t painful.

I was curious as to how much free space was available in my cluster. It took a bit of digging around to figure out how to get this information, so I figured I’d post it here.

Running bin/hadoop dfsadmin -report will give a breakdown of your node storage:

[root@ip-10-250-147-64 ~]# h dfsadmin -report
Configured Capacity: 718626299904 (669.27 GB)
Present Capacity: 681713795072 (634.9 GB)
DFS Remaining: 681713745920 (634.9 GB)
DFS Used: 49152 (48 KB)
DFS Used%: 0%
-------------------------------------------------
Datanodes available: 2 (2 total, 0 dead)
Name: 10.250.106.176:50010
Decommission Status : Normal
Configured Capacity: 359313149952 (334.64 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 18456252416 (17.19 GB)
DFS Remaining: 340856872960(317.45 GB)
DFS Used%: 0%
DFS Remaining%: 94.86%
Last contact: Wed Feb 18 12:28:16 EST 2009
Name: 10.250.146.127:50010
Decommission Status : Normal
Configured Capacity: 359313149952 (334.64 GB)
DFS Used: 24576 (24 KB)
Non DFS Used: 18456252416 (17.19 GB)
DFS Remaining: 340856872960(317.45 GB)
DFS Used%: 0%
DFS Remaining%: 94.86%
Last contact: Wed Feb 18 12:28:18 EST 2009
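
If all you want is the cluster-wide number, the summary line can be pulled out of the report with awk. This is a rough sketch that assumes the report format shown above; the function name dfs_remaining is my own and not part of Hadoop.

```shell
# dfs_remaining: hypothetical helper (not a Hadoop command).
# Reads a `dfsadmin -report` on stdin and prints the byte count from the
# first "DFS Remaining:" line, which is the cluster-wide total.
# The "DFS Remaining%:" lines do not match the pattern.
dfs_remaining() {
    awk -F': *' '/^DFS Remaining:/ { split($2, parts, "[ (]"); print parts[1]; exit }'
}

# Usage (assumes HADOOP_HOME is set as in the Hive post):
#   $HADOOP_HOME/bin/hadoop dfsadmin -report | dfs_remaining
```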

13 Feb

I’ve Never Used a Phone Book in My Life

Darren Barefoot has a post about Yellow Pages (phone books not NIS).

A few years back my wife and I moved from a basement suite we were renting to our current apartment. Since we had never gotten our own phone book, I remember the first time the yellow pages showed up.

One day, I was packing in some groceries and I grabbed one from the entrance foyer. I put it on the table that we stored our keys on. It was the same place my parents kept theirs. I did it out of habit.

As soon as I put it down, my wife picked it up and took it back to the entrance way. I started to protest, but then I realized that I’ve never looked in a phone book in my life. I’m 27 and I’ve always had the internet to look up businesses and phone numbers when I needed them.

There’s always a pile of the directories in the garbage room a week after they’ve been delivered. I’d wager 80% of them end up there in my building.

There should be a law that YP has to collect the unclaimed books — that would motivate them to make the list opt-in.

08 Jan

Time Machine, Netatalk and Error Code -6602

Over the Christmas break, I set up Time Machine to back up my laptop to a USB drive attached to my Mythbuntu backend server. There are plenty of instructions out there on how to do this. I used this one.

The machine I’m using as the server is old. It doesn’t have onboard USB 2.0 ports, so the backups were really slow. I grabbed a USB 2.0 card to try and speed up the backups. The long and the short of this is that I managed to hit a two-year-old USB mass storage bug.

In my attempt to get around the bug, I tried upgrading the server from Ubuntu 7.10 to 8.10. The upgrade went well, though it didn’t solve my USB problem.

As a side effect of the upgrade, my laptop wasn’t able to mount the backups any longer. I was getting a -6602 error from Finder. Googling for the error message pointed towards Samba connection problems.

The actual problem turned out to be a change in the Berkeley DB library after the upgrade:

Jan  8 19:23:24 mythtv afpd[28202]: CNID DB initialized using Berkeley DB 4.6.21: (September 27, 2007)
Jan  8 19:23:24 mythtv afpd[28202]: cnid_open: dbenv->open (rw) of /backups/gary/.AppleDB failed: DB_VERSION_MISMATCH: Database environment version mismatch
Jan  8 19:23:24 mythtv afpd[28202]: cnid_open: dbenv->open of /backups/gary/.AppleDB failed: DB_VERSION_MISMATCH: Database environment version mismatch
Jan  8 19:23:24 mythtv afpd[28202]: Cannot open CNID db at [/backups/gary].

The final solution was to remove the old .AppleDB directories and restart netatalk.
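
Roughly, the cleanup looks like the sketch below. The helper name list_appledb is made up for illustration, the /backups path comes from the log above, and the init script location is an assumption about how netatalk is installed on your server. The function only lists the directories so you can review them before deleting anything.

```shell
# list_appledb: hypothetical helper. Prints every .AppleDB directory
# under the given share root, without descending into them.
list_appledb() {
    find "$1" -type d -name .AppleDB -prune -print
}

# Usage (review the list first; the restart command is an assumption):
#   list_appledb /backups
#   list_appledb /backups | while IFS= read -r d; do rm -rf "$d"; done
#   /etc/init.d/netatalk restart
```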

02 Jan

Ack! An EC2 Instance has Died!

And it was the one my blog was running on.

It was the damnedest thing too. I was able to reboot it from the API and look at the console output. As far as I can tell, the network adapter wasn’t able to get an IP address over DHCP:


Welcome to CentOS release 5 (Final)
Press 'I' to enter interactive startup.
Setting clock : Mon Dec 22 20:06:28 EST 2008 [ OK ]
Starting udev: [ OK ]
Setting hostname localhost.localdomain: [ OK ]
No devices found
Setting up Logical Volume Management: No volume groups found
[ OK ]
Checking filesystems
Checking all file systems.
[/sbin/fsck.ext3 (1) -- /] fsck.ext3 -a /dev/sda1
/dev/sda1: clean, 96119/1313280 files, 616984/2621440 blocks
[/sbin/fsck.ext3 (1) -- /mnt] fsck.ext3 -a /dev/sda2
/dev/sda2: clean, 5853/19546112 files, 892921/39092224 blocks
[ OK ]
Remounting root filesystem in read-write mode: [ OK ]
Mounting local filesystems: [ OK ]
Enabling local filesystem quotas: [ OK ]
Enabling /etc/fstab swaps: [ OK ]
INIT: Entering runlevel: 4
Entering non-interactive startup
Starting background readahead: [ OK ]
Checking for hardware changes [ OK ]
Bringing up loopback interface: [ OK ]
Bringing up interface eth0:
Determining IP information for eth0... failed.
[FAILED]
Starting auditd: [FAILED]
curl: (7) Failed to connect to 169.254.169.254: Network is unreachable
Starting system logger: [ OK ]
Starting kernel logger: [ OK ]
Starting syslog-ng: [ OK ]
Starting irqbalance: [ OK ]
Starting system message bus: [ OK ]
Mounting other filesystems: [ OK ]
Starting sshd: [ OK ]
Starting cups: [ OK ]
Starting MySQL: [ OK ]
Starting postfix: [ OK ]
curl: (7) Failed to connect to 169.254.169.254: Network is unreachable
Starting httpd: Warning: DocumentRoot [/dev/null] does not exist
Warning: DocumentRoot [/dev/null] does not exist
Warning: DocumentRoot [/dev/null] does not exist
httpd: Could not reliably determine the server's fully qualified domain name, using localhost.localdomain for ServerName
[ OK ]
Starting crond: [ OK ]
Starting process accounting: [ OK ]
Starting atd: [ OK ]
Starting jexec: Starting jexec services[ OK ]
Starting HAL daemon: [ OK ]
+ Updating ec2-ami-tools
curl: (6) Couldn't resolve host 's3.amazonaws.com'
c
CentOS release 5 (Final)
Kernel 2.6.16-xenU on an i686
localhost l

Luckily I had backups, so I booted a new instance and restored it. It looks like I posted one article since my last backup. I reckon I’ll be able to get that back from a Google cache.

It took me about 30 minutes to start the instance, restore the backup and test everything out. This time around, I set up an elastic IP. If this happens again, I won’t have to update DNS.
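
If I wanted to catch this failure automatically next time, a check along these lines might work. It is only a sketch: dhcp_failed is a name I made up, it just greps for the failure line seen in the console output above, and the instance ID in the usage comment is a placeholder.

```shell
# dhcp_failed: hypothetical helper. Reads console output on stdin and
# exits 0 if an interface failed to get an address at boot, matching
# the "Determining IP information for eth0" failure line shown above.
dhcp_failed() {
    grep -q 'Determining IP information for .*failed'
}

# Usage with the EC2 API tools (i-12345678 is a placeholder instance ID):
#   ec2-get-console-output i-12345678 | dhcp_failed && echo "no DHCP lease"
```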

13 Nov

Confluence and Google Code Prettify

EDIT - this post was lost due to a server crash. I’m hoping to put it back in the same spot, but I’m not sure if WordPress will allow me to do that without ugliness.

I mostly like Confluence. I don’t like the {code} blocks. The highlighting and formatting suck. There’s a limited number of languages supported and I don’t really code in any of them. The list from the documentation says:

Makes a preformatted block of code with syntax highlighting. All the optional parameters of {panel} macro are valid for {code} too. The default language is Java but you can specify JavaScript, ActionScript, XML, HTML and SQL too.

I’d really like a Confluence plugin that used the Google Code Prettify JavaScript code. These seem like obvious things to mate together, since you wouldn’t have to worry about implementing new languages. Quite often, I write pseudocode in the wiki and the JS code does a decent job of figuring it out. I searched Google for confluence google-code-prettify, since someone must have built a plugin already. The results were not promising.

Dear Lazy Web, please see what you can do.

07 Nov

Atlassian Bamboo and Perl Test Harness

At my current gig, we use Atlassian Bamboo as a Continuous Integration server. It plugs into the rest of our Atlassian tools, which is nice.

It took me a bit to get my Perl test cases to work with it, but with the help of TAP::Harness::JUnit, I finally got it to work. It’s really easy too:

  1. Install TAP::Harness::JUnit
  2. Put the following script into your code repository. Put it somewhere that you can call it easily when doing a build.
    #!/usr/bin/perl
    use strict;
    use warnings;
    use TAP::Harness::JUnit;
    # The first argument is the JUnit XML report; the rest are test files.
    my $outputfile = shift;
    my $harness = TAP::Harness::JUnit->new({
        xmlfile => $outputfile,
        # make the freshly built modules visible to the test scripts
        lib     => ['blib/lib'],
    });
    $harness->runtests(@ARGV);
  3. Instead of calling make test for your Perl modules, run:
    perl $SCRIPTFROMABOVE $OUTPUTFILE t/*.t

$OUTPUTFILE should be an XML file in whatever directory you have configured Bamboo to look in for test reports.

This should work with any CI server that can read JUnit XML output.

22 Oct

Excluding Experts Exchange from Search Results

Experts Exchange drives me crazy. When I’m trying to solve a problem and I click an Experts Exchange link, I feel like I’ve been duped and I’ll never get those precious seconds back. I’ve been complaining loudly about this to my friends off and on for a while. One of them suggested a solution the other day.

I once found an article about setting up a custom search engine that would allow you to permanently exclude sites. I’d have to remember to use my custom search engine every time, so I never tried it out.

My friend suggested that I use the Firefox plugin CustomizeGoogle. It allows you to filter sites in your results. I added /experts-exchange/ to the filter list and now Experts Exchange links are greyed out:

Experts Exchange Excluded!

Hooray! No more bait and switch for me!

16 Oct

EC2 and Ganglia

I’ve been playing around with Ganglia for the past couple of days, trying to make it work with EC2. It was a bit of an adventure. There are two keys for running Ganglia on EC2: use unicast and set send_metadata_interval.

Amazon doesn’t support multicast on their network, so the default configs for Ganglia don’t work. You need to pick your head gmond server and set your udp_send_channel to something like:

udp_send_channel {
host = $headserver
port = 8649
ttl = 1
}

In your globals section, you also need to make sure that send_metadata_interval is set to something other than 0. From the mailing list:

Yep, you ran across the same dilemma I had when I wrote it :/ The problem is that in unicast mode, there is no requirement for any of the agents to be listening (ie. deaf = yes) since the individual nodes don’t need to do anything more than send their own current metric to the host node. So in this instance, attempting to send a request for metadata back to the node wouldn’t work. That is why I ended up just implementing the send_metadata_interval directive. The downside as you pointed out, is that more data is being passed needlessly on the wire however the amount of data should be less than with the older scheme. The reason why I say that is because under the old scheme, all meta data for any gmetric or modular metric was sent with every value packet rather than being sent independently. The intent was that the send_metadata_interval would be set to something on the order of minutes rather than seconds. This would mean that you might lose a few minutes of data if the host gmond were restarted, but the amount of useless metadata packets would be much less.

That last bit took some digging around to figure out. Without it, gmond knew about other hosts, but ignored the actual stats. Gmond wouldn’t record any info and graphs and data were missing.
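
Since the default of 0 is so easy to miss, here is a sketch of a quick check you could run on each node. The function name metadata_interval is made up, and the /etc/gmond.conf path in the usage comment is an assumption; use wherever your gmond config actually lives.

```shell
# metadata_interval: hypothetical helper. Prints the numeric value of
# send_metadata_interval from a gmond.conf, or nothing if it is not set.
metadata_interval() {
    sed -n 's/^[[:space:]]*send_metadata_interval[[:space:]]*=[[:space:]]*\([0-9][0-9]*\).*/\1/p' "$1" | head -n 1
}

# Usage (the config path is an assumption):
#   interval=$(metadata_interval /etc/gmond.conf)
#   [ "${interval:-0}" -gt 0 ] || echo "send_metadata_interval is unset or 0"
```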

© 2009 The Cult of Gary
