Showing posts with label ROCKS 5.4.3. Show all posts

19 March 2012

111. Ecce (nwchem) on Debian, and ROCKS/Centos

If you're using nwchem chances are that you've considered using ECCE to parse the output:
http://ecce.emsl.pnl.gov/

First of all you'll need to register at https://eus.emsl.pnl.gov/Portal/ -- and you can only do that if you're faculty. Postdocs and PhD students need not apply. Other than that, it's free, but you'll have to wait a couple of days to get your registration approved.

As much as I like nwchem owing to the clear syntax, I feel less warmly about ecce. Don't get me wrong -- it's pretty. It just feels archaic and cobbled together. Even worse is that it's not open source and that its workings feel a bit opaque at times. Still, there's no better program for visually parsing nwchem output at this point. Anyway...

--start here --
Debian:
Download the install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh file to ~/tmp/ecce

There's no md5sum supplied but here's what I got:
2ee70cc817dee9f80b11be5eac6e53e5
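
If you want to compare your download against the checksum above, something along these lines should do. It's only a sketch: verify_md5 is a made-up helper name, and the checksum is simply the one quoted above, not an official one.

```shell
# verify_md5 FILE EXPECTED: succeed only if FILE's md5sum equals EXPECTED
# (verify_md5 is a hypothetical helper name, not part of ECCE)
verify_md5() {
    got=$(md5sum "$1" | awk '{ print $1 }')
    [ "$got" = "$2" ]
}

installer=install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
if [ -f "$installer" ]; then
    if verify_md5 "$installer" 2ee70cc817dee9f80b11be5eac6e53e5; then
        echo "md5sum matches"
    else
        echo "md5sum differs -- re-download before running it" >&2
    fi
fi
```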

If you haven't already
sudo apt-get install csh 

OK, moving on...
cd ~/tmp/ecce
chmod +x  install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
./install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh


Main ECCE installation menu
===========================
0) Help on main menu options
1) Full install
2) Full upgrade
3) Application software install
4) Application software upgrade
5) Server install
6) Server upgrade

Pick 1 if you're installing on your desktop and there's no server that you know of. 

Once the installation is over you get:
***************************************************************
!! You MUST perform the following steps in order to use ECCE !!
-- Unless only the user 'me' will be running ECCE,
   start the ECCE server as 'me' with:
     /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
-- To register machines to run computational codes, please see
   the installation and compute resource registration manuals
   at http://ecce.pnl.gov/using/installguide.shtml
-- To run ECCE each user must source either the runtime_setup
   (csh/tcsh) or runtime_setup.sh (sh/bash/ksh) script in the
   directory /home/me/tmp/ecce/ecce-v6.2/apps/scripts
   from their shell environment setup script.  For example,
   with csh or tcsh, add the following to ~/.cshrc:
     if (-e /home/me/tmp/ecce/ecce-v6.2/apps/scripts/runtime_setup) then
       source /home/me/tmp/ecce/ecce-v6.2/apps/scripts/runtime_setup
     endif
***************************************************************
Which translates to:
1. sh  /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
2. Sourcing that file makes no sense. Instead, add the following to your ~/.bashrc
export ECCE_HOME=/home/me/tmp/ecce/ecce-v6.2/apps
export PATH=${ECCE_HOME}/scripts:${PATH}
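
If you re-run these steps a few times your ~/.bashrc ends up with duplicate lines. A small helper like the one below avoids that. It's a sketch only; add_line_once is a hypothetical name, and it matches whole lines literally.

```shell
# add_line_once FILE LINE: append LINE to FILE unless an identical line exists
# (add_line_once is a hypothetical helper, not something ECCE ships)
add_line_once() {
    file=$1
    line=$2
    touch "$file"
    # -x: whole-line match, -F: fixed string (no regex), -q: quiet
    grep -qxF -- "$line" "$file" || printf '%s\n' "$line" >> "$file"
}

# Usage (single quotes keep ${ECCE_HOME} literal in the file):
#   add_line_once ~/.bashrc 'export ECCE_HOME=/home/me/tmp/ecce/ecce-v6.2/apps'
#   add_line_once ~/.bashrc 'export PATH=${ECCE_HOME}/scripts:${PATH}'
```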

Assuming you've sourced your ~/.bashrc, start ecce by typing
ecce

...which takes an unreasonably long time (ca 1 min), after which you're greeted by
[screenshot: Press Any Key]
Type in a password -- any password -- which will be your password from now on.
You're then taken to
[screenshot: Click on Viewer (assuming you've got something to look at)]
[screenshot: Pay attention to the fine print]
Have a look at the text box in the bottom right corner...and pay attention. In my particular case I have 6 cores and an MPI-aware nwchem 6.0 version compiled. I bet that's better than whatever comes bundled with ecce. Also, the

To change this, go to the machine browser (see screen shot #2), click on set up remote access, and make sure that everything is working by clicking on e.g. processes:

Then click on the Machine menu (top left), select Register Machine while your machine is selected.
You can now change your options.

Running:
So, before using ecce you always need to
sh  /home/me/tmp/ecce/ecce-v6.2/server/ecce-utils/start_ecce_server
first. The server will run until you stop it or reboot.
Next, start ecce
ecce

Integration with nwchem
Most people would probably set up their nwchem jobs by hand, because it's so simple. All you need to do is to include the statement
ecce_print ecce.out
in the beginning, and you'll get an ecce.out file which you can then IMPORT (not open regularly, but import) into ecce.
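
For illustration, here's a sketch that writes a minimal input file with the ecce_print line near the top. The molecule (water), the 6-31g basis, the file name and the write_nw_input helper are all just assumptions picked for the example; only the ecce_print line comes from the text above.

```shell
# write_nw_input FILE: write a small nwchem input that also produces ecce.out
# (hypothetical helper; geometry and basis are placeholders for illustration)
write_nw_input() {
    cat > "$1" <<'EOF'
start water
title "water scf"
ecce_print ecce.out

geometry units angstrom
 O  0.000  0.000  0.000
 H  0.000  0.757  0.587
 H  0.000 -0.757  0.587
end

basis
 * library 6-31g
end

task scf energy
EOF
}

# Usage:
#   write_nw_input water.nw && nwchem water.nw
```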

Click on Viewer, Import Calculation From Output File, select your ecce.out and voilà:
ECCE: homo (benzene)
If you're running debian, you're done now.



ROCKS 5.4.3/Centos 5.6:
This isn't a fix as much as a rant. The problem with ROCKS 5.4.3 is that csh is so broken that it's a struggle just to install ecce. I mean, I do show how to get ecce running in the end, but ROCKS feels like an unfinished piece of work compared to a normal debian install.

--Demonstration only -- don't do --
First back up ssh-key.sh and ssh-key.csh in /etc/profile.d

So...you start by
chmod +x install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
./install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh
...and nothing's happening.

You then try just typing in
csh

/etc/profile.d/ssh-key.sh: line 211: return: can only `return' from a function or sourced script
It appears that you have not set up your ssh key.
This process will make the files:
     /export/home/me/.ssh/id_rsa.pub
     /export/home/me/.ssh/id_rsa
     /export/home/me/.ssh/authorized_keys
Generating public/private rsa key pair.
/export/home/me/.ssh/id_rsa already exists.
Overwrite (y/n)? 

Turns out there's a bug in ROCKS 5.4.3.  You can fix that by:
rpm -Uvh ftp://www.rocksclusters.org/pub/rocks/updates/5.4.3/x86_64/RPMS/rocks-config-server-5.4.3-1.x86_64.rpm

So far so good.
csh
...and nothing. It just exits. Or so you think. But the problem is bigger than that --  try opening a new terminal in e.g. gnome (gnome-terminal or xterm) -- it exits immediately. No error message or anything.

You can get csh to start by moving /etc/csh.cshrc out of the way, but you're still screwed as to opening a new terminal. The only way to get back a working system is to restore ssh-key.sh and ssh-key.csh.

--- Demonstration over ---

--Start here --
 You could also get around all this by running
csh -f
But then you don't have any env. variables loading and it can lead to problems of its own.

Anyway:
csh -f install_ecce.v6.2.rhel5-gcc3.2.3-m32.csh

The install starts. Just follow the instructions.

After installation, start the server:
csh -f ecce-v6.2/server/ecce-utils/start_ecce_server

Hit enter until you get a workable prompt back...
Edit your ~/.bashrc and add

export ECCE_HOME=/home/me/tmp/ecce/ecce-v6.2/apps
export PATH=${ECCE_HOME}/scripts:${PATH}

Don't bother sourcing your ~/.bashrc. It's easier to just open a new terminal.
Type
ecce
and you should be up and running...sort of. Under ROCKS I had problems importing ecce.out files since I had problems actually connecting to the server. Don't know why, but it came down to not being able to open a remote shell on the host.

NOTE:
This worked fine on one box, but not on another one which I was setting up remotely. On that one I had to edit

ecce/apps/siteconfig/Dataservers
and
ecce/apps/siteconfig/jndi.properties 

In particular, I had to change references to eccetera.emsl.pnl.gov.

15 March 2012

108. Building local version of sinfo without root/sudo on ROCKS/CentOS

Edit 04/04/2012: there were several errors and omissions. These have been fixed now.

Because I don't want to mess up a cluster which is on a different continent I'm trying to use my superuser powers as little as possible.

Here's how to make a local version of sinfo -- you'll still need to make sure sinfod runs as a service on all the nodes.

There's no reason the instructions here shouldn't work on most linux distros, including Debian.

boost:
cd ~/tmp
wget -O boost_1_49_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download
tar -xvf boost_1_49_0.tar.gz
cd boost_1_49_0/
./bootstrap.sh --prefix=/export/home/me/.libboost


Edit tools/build/user-config.jam and add
using mpi ;
The space between mpi and ; is needed.

Start installation:
./b2 install

cd /export/home/me/.libboost/lib
ln -s libboost_signals.so libboost_signals-mt.so
ln -s libboost_serialization.so libboost_serialization-mt.so
ln -s libboost_date_time.so libboost_date_time-mt.so
ln -s libboost_wserialization.so libboost_wserialization-mt.so
ln -s libboost_regex.so libboost_regex-mt.so
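
The five ln -s lines above can also be written as a loop that skips anything missing or already linked. This is just a sketch; link_mt_libs is a made-up helper name, and the library list is the same one used above.

```shell
# link_mt_libs LIBDIR NAME...: create libboost_NAME-mt.so -> libboost_NAME.so
# inside LIBDIR, only if the target exists and the link doesn't yet
# (link_mt_libs is a hypothetical helper name)
link_mt_libs() {
    libdir=$1
    shift
    for name in "$@"; do
        target="$libdir/libboost_${name}.so"
        link="$libdir/libboost_${name}-mt.so"
        if [ -e "$target" ] && [ ! -e "$link" ]; then
            # relative link, same as doing "cd lib; ln -s a.so a-mt.so"
            ln -s "libboost_${name}.so" "$link"
        fi
    done
}

if [ -d /export/home/me/.libboost/lib ]; then
    link_mt_libs /export/home/me/.libboost/lib \
        signals serialization date_time wserialization regex
fi
```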


asio:
cd ~/tmp
wget -O asio-1.5.3.tar.bz2 "http://downloads.sourceforge.net/project/asio/asio/1.5.3%20%28Development%29/asio-1.5.3.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fasio%2F&ts=1331441086&use_mirror=aarnet"
tar -xvf asio-1.5.3.tar.bz2
cd asio-1.5.3/

./configure --prefix=/export/home/me/.asio --with-boost=/export/home/me/.libboost/include
make
make install

sinfo/d:
wget http://www.ant.uni-bremen.de/whomes/rinas/sinfo/download/sinfo-0.0.45.tar.gz
tar -xvf sinfo-0.0.45.tar.gz
cd sinfo-0.0.45/

export LIBS=-L/export/home/me/.libboost/lib
export LDFLAGS=$LIBS
export CPPFLAGS="-I/export/home/me/.libboost/include -I/export/home/me/.asio/include/"
./configure --prefix=/export/home/me/.sinfo --disable-IPv6
make

make install 

Getting started:
In order to make something happen at boot you need sudo/root access. However, HPC clusters are rarely rebooted, so even if you launch something as a user it will persist for a long time. If you're lucky the right ports are open -- and they should be open between nodes.

You also need to add this to your ~/.bashrc:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/export/home/me/.libboost/lib
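
One thing to watch with lines like the one above: if LD_LIBRARY_PATH is empty to begin with, you end up with a leading ':', and the dynamic loader treats an empty entry as the current directory. A sketch of a helper that avoids both that and duplicate entries follows; append_path is a hypothetical name, and it trusts you to pass a sane variable name since it uses eval.

```shell
# append_path VAR DIR: append DIR to colon-separated VAR unless already there
# (append_path is a hypothetical helper; VAR must be a plain variable name)
append_path() {
    var=$1
    dir=$2
    eval "cur=\${$var:-}"
    case ":$cur:" in
        *":$dir:"*) ;;   # already present, nothing to do
        *) eval "export $var=\"\${cur:+\$cur:}\$dir\"" ;;
    esac
}

append_path LD_LIBRARY_PATH /export/home/me/.libboost/lib
```

The same helper works for PATH and LIBRARY_PATH.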

Start sinfod (the daemon) using:
~/.sinfo/sbin/./sinfod --quiet

ps aux |grep sinfod 
will show whether it's running

And check that everything is ok using
~/.sinfo/bin/./sinfo



13 March 2012

106. htop 1.0.1 and sinfo-0.0.45 on ROCKS 5.4.3/CentOS 5.6

There are a number of performance monitor tools in the debian repos. ROCKS 5.4.3/Centos doesn't seem quite as well-equipped.

First out, htop:

htop:
wget http://downloads.sourceforge.net/project/htop/htop/1.0.1/htop-1.0.1.tar.gz
tar -xvf htop-1.0.1.tar.gz
cd htop-1.0.1/
./configure --prefix=/home/me/.htop
make
make install

It's as simple as that.
Add e.g.
alias htop='/home/me/.htop/bin/htop'
to your ~/.bashrc
Note: this works on Scientific Linux (boron) 5.4 as well.

sinfo:
Update 13/03/2012:
Sinfo <=0.0.44 has IPv6 enabled by default.
On sinfo >=0.0.45 you can disable IPv6 using ./configure --disable-IPv6

Sinfo is probably the snazziest cluster monitoring tool that I know of. Sure, ganglia etc. are nice too, but they run as web services. Sinfo is a 'simple' curses program, but building it on CentOS was a bit of a challenge.

Be aware that sinfo versions prior to 0.0.45 expect ipv6 to work -- by default ROCKS disables IPv6, so use sinfo 0.0.45 and above.





First boost:
(yum install boost-devel didn't do anything for me)
cd ~/tmp
wget -O boost_1_49_0.tar.gz http://sourceforge.net/projects/boost/files/boost/1.49.0/boost_1_49_0.tar.gz/download
tar -xvf boost_1_49_0.tar.gz
cd boost_1_49_0/
./bootstrap.sh --prefix=/usr

Edit Jamroot and add
using mpi ;
The space between mpi and ; is needed.

Symlink to your mpic++, e.g. if your mpic++ is in /opt/openmpi:
sudo ln -s /opt/openmpi/bin/mpic++ /usr/bin/mpic++

The following step takes a long time:
sudo ./b2 -a install --layout=versioned --build-type=complete

These days all the libboost libs are multithread aware (or so I hear), and in debian it turns out that the -mt.so libs are just symbolic links to the 'regular' libs.
sudo ln -s /usr/lib/libboost_signals.so /usr/lib/libboost_signals-mt.so
sudo ln -s /usr/lib/libboost_date_time.so /usr/lib/libboost_date_time-mt.so
sudo ln -s /usr/lib/libboost_serialization.so /usr/lib/libboost_serialization-mt.so
sudo ln -s /usr/lib/libboost_wserialization.so /usr/lib/libboost_wserialization-mt.so
sudo ln -s /usr/lib/libboost_regex.so /usr/lib/libboost_regex-mt.so

sudo ln -s /usr/lib/libboost_signals.so.1.49.0 /usr/lib64/libboost_signals.so.1.49.0

Then asio
cd ~/tmp
wget -O asio-1.5.3.tar.bz2 "http://downloads.sourceforge.net/project/asio/asio/1.5.3%20%28Development%29/asio-1.5.3.tar.bz2?r=http%3A%2F%2Fsourceforge.net%2Fprojects%2Fasio%2F&ts=1331441086&use_mirror=aarnet"
tar -xvf asio-1.5.3.tar.bz2
cd asio-1.5.3/
./configure
make
sudo make install

Then sinfo
cd ~/tmp
wget http://www.ant.uni-bremen.de/whomes/rinas/sinfo/download/sinfo-0.0.45.tar.gz
tar -xvf sinfo-0.0.45.tar.gz
cd sinfo-0.0.45/
./configure --disable-IPv6
make
sudo make install

The build should be fine.

Configuration:
you'll end up with
/usr/local/sbin/sinfod
/usr/local/bin/sinfo
You may want to make sure there are paths to them by adding the following to your ~/.bashrc:
export PATH=$PATH:/usr/local/bin:/usr/local/sbin
The changes take effect next time you log in to a shell, or just run
source ~/.bashrc
for immediate effect.

Also, create a file called /etc/default/sinfo with the following in it:
OPTS="--quiet --bcastaddress=192.168.1.255"

Start sinfod with
sinfod --quiet --bcastaddress=192.168.1.255
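
The --bcastaddress value is just the broadcast address of the interface. If you're unsure what it is on your network, it can be worked out from the IP and netmask: each octet is the IP octet OR'ed with the inverted netmask octet. A sketch (bcast is a made-up helper name, IPv4 only):

```shell
# bcast IP NETMASK: print the IPv4 broadcast address
# (hypothetical helper; runs in a subshell so the IFS change doesn't leak)
bcast() (
    IFS=.
    set -- $1 $2   # splits into 8 octets: $1-$4 = IP, $5-$8 = netmask
    echo "$(( $1 | (255 ^ $5) )).$(( $2 | (255 ^ $6) )).$(( $3 | (255 ^ $7) )).$(( $4 | (255 ^ $8) ))"
)

bcast 192.168.1.111 255.255.255.0   # prints 192.168.1.255
```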

then check that it's running
ps aux | grep sinfod

If it's not running, then try
sinfod -F

If it gives something along the lines of
exception:open:address family not supported
you most likely
1) haven't enabled ipv6 for your interface and
2) didn't disable IPv6 during compilation and/or
3) used version <0.0.45

Check by doing ifconfig -- does it return both an ipv4 and an ipv6 address?
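
A quick way to check this without reading ifconfig output by eye is to look at /proc/net/if_inet6, which is normally only there when the kernel has IPv6 support loaded, and lists the configured IPv6 addresses. This is a sketch under that assumption; ipv6_status is a hypothetical helper name.

```shell
# ipv6_status [FILE]: report IPv6 state based on /proc/net/if_inet6
# (hypothetical helper; FILE can be overridden for testing)
ipv6_status() {
    f=${1:-/proc/net/if_inet6}
    if [ ! -e "$f" ]; then
        echo "no-kernel-support"
    elif ! grep -q . "$f"; then
        # /proc files report size 0, so read the content instead of using -s
        echo "no-address"
    else
        echo "ok"
    fi
}

ipv6_status
```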

Enabling ipv6
Unless you know what you're doing, don't fiddle with the network interfaces on a production cluster -- network interfaces on a multinode cluster are typically highly tuned to minimise latency, so don't mess it up.

Anyway. First check your /etc/modules.conf and - if present - comment out
alias ipv6 off
options ipv6 disable=1
Edit your /etc/sysconfig/network-scripts/ifcfg-eth1
DEVICE=eth1:0
IPADDR=192.168.1.111
NETMASK=255.255.255.0
BOOTPROTO=none
MTU=1500
TYPE=Ethernet
GATEWAY=192.168.1.1
USERCTL=no
IPV6INIT=yes
PEERDNS=yes
ONPARENT=yes
IPV6ADDR=fe80::2f0:4dff:f383:b44/64
IPV6_DEFAULTGW=fe80::2f0:4dff:fe83:a48/64
I just made up the IPV6ADDR, and took the IPV6_DEFAULTGW from my gateway machine (running debian, so ipv6 enabled by default)

Assuming that your firewall is allowing traffic at port 60003 and free traffic in and out on 192.168.1.255 things should work fine.



Errors


Error (boost):
MPI auto-detection failed: unknown wrapper compiler mpic++
Please report this error to the Boost mailing list: http://www.boost.org
You will need to manually configure MPI support.
Solution:
make sure you've symlinked to your mpic++ instance in /usr/bin
e.g. if your mpic++ is in /opt/openmpi/bin/mpic++
sudo ln -s /opt/openmpi/bin/mpic++ /usr/bin/mpic++


Error (sinfo):
message.cc: In member function 'void Message::popFrontMemory(void*, size_t)':
message.cc:183: error: 'memory' was not declared in this scope
message.cc:193: error: 'boost' has not been declared
message.cc:193: error: expected primary-expression before 'char'
message.cc:193: error: expected `;' before 'char'
message.cc:196: error: 'newMemory' was not declared in this scope
message.cc:196: error: 'memory' was not declared in this scope
make[2]: *** [message.lo] Error 1
make[2]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessage'
make[1]: *** [all-recursive] Error 1
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessage'
make: *** [all-recursive] Error 1
Solution:
You need to make sure that the libs are found -- either symlink manually between your build directory and /usr/lib, or use bootstrap.sh --prefix=/usr. See above for how to do it.

Error (sinfo):
udpmessagereceiver.h:14: error: 'asio' has not been declared
udpmessagereceiver.h:14: error: ISO C++ forbids declaration of 'endpoint' with no type
udpmessagereceiver.h:14: error: expected ';' before 'sender_endpoint'
udpmessagereceiver.h:16: error: 'asio' has not been declared
udpmessagereceiver.h:16: error: ISO C++ forbids declaration of 'io_service' with no type
udpmessagereceiver.h:16: error: expected ';' before '&' token
udpmessagereceiver.h:17: error: 'asio' has not been declared
udpmessagereceiver.h:17: error: ISO C++ forbids declaration of 'socket' with no type
udpmessagereceiver.h:17: error: expected ';' before 'sock'
udpmessagereceiver.h:20: error: expected ',' or '...' before '::' token
udpmessagereceiver.h:20: error: ISO C++ forbids declaration of 'asio' with no type
udpmessagereceiver.h:23: error: 'asio' has not been declared
udpmessagereceiver.h:23: error: expected `)' before '&' token
udpmessagereceiver.cc:5: error: 'asio' has not been declared
udpmessagereceiver.cc:5: error: expected `)' before '&' token
make[1]: *** [udpmessagereceiver.lo] Error 1
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/libmessageio'
make: *** [all-recursive] Error 1

Solution: you've only got boost::asio installed, not the independent asio. See above for how to compile and install asio.

Error (sinfo):

/usr/bin/ld: cannot find -lboost_signals-mt
collect2: ld returned 1 exit status
make[2]: *** [sinfod] Error 1
make[2]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/sinfod'
make[1]: *** [all] Error 2
make[1]: Leaving directory `/state/partition1/home/me/tmp/sinfo-0.0.44/sinfod'
make: *** [all-recursive] Error 1
Solution:
You need a symlink pointing from /usr/lib/libboost_signals-mt.so to /usr/lib/libboost_signals.so
ln -s /usr/lib/libboost_signals.so /usr/lib/libboost_signals-mt.so 

Error (sinfod):
sinfod --quiet --bcastaddress=192.168.1.255 gives nothing and sinfod exits silently immediately
sinfod -F gives
exception:open:address family not supported
Here's the relevant strace output:
[..]
 socket(PF_INET, SOCK_DGRAM, IPPROTO_UDP) = 6
[..]
 socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDP) = -1 EAFNOSUPPORT (Address family not supported by protocol)
futex(0x333a40d350, FUTEX_WAKE_PRIVATE, 2147483647) = 0
close(6)                                = 0
close(3)                                = 0
close(4)                                = 0
close(5)                                = 0
write(2, "Exception: ", 11)             = 11
write(2, "open: Address family not support"..., 46) = 46
write(2, "\n", 1)                       = 1
exit_group(0)                           = ?

Solution: enable ipv6 (see above)

105. Nwchem 6.1 with openmpi on ROCKS 5.4.3/CentOS 5.6


EDIT 18 May 2012: 
Compiling nwchem 6.1 with internal libs on debian: http://verahill.blogspot.com.au/2012/05/compiling-nwchem-61-with-internal-libs.html
Compiling nwchem 6.1 with openblas on debian: http://verahill.blogspot.com.au/2012/05/building-nwchem-61-on-debian.html


I can build and use nwchem on ROCKS 5.4.3 -- see instructions below.

EDIT: the gfortran version is GNU Fortran (GCC) 4.1.2 20080704 (Red Hat 4.1.2-50)
On debian, which yields a segfaulting binary, the version is GNU Fortran (Debian 4.6.3-1) 4.6.3

I'm still having no luck building binaries which don't segfault on execution on debian though. The openmpi versions are the same for both ROCKS and debian: 1.4.3.

--START HERE --

ROCKS 5.4.3/CentOS
The build is essentially the same as for nwchem-6.0 (http://verahill.blogspot.com.au/2012/03/building-nwchem-60-on-rocks-543centos.html) - the single difference is that you need to define USE_MPIF4 or you get errors

To build:

wget http://www.nwchem-sw.org/images/Nwchem-6.1-2012-Feb-10.tar.gz
tar -xvf Nwchem-6.1-2012-Feb-10.tar.gz
cd nwchem-6.1/
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/export/home/me/tmp/nwchem-6.1
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
export USE_MPIF=y
export USE_MPIF4=y
export MPI_LOC=/opt/openmpi
export MPI_INCLUDE=/opt/openmpi/include
export LIBRARY_PATH=$LIBRARY_PATH:/opt/openmpi/lib
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make  nwchem_config
make  FC=gfortran

Building takes a little while.
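
Since the build takes a while, it can be worth checking that the compilers and MPI wrappers are actually on your PATH before kicking it off. A sketch; require_cmds is a made-up helper, and the list of commands is my guess at what this build needs.

```shell
# require_cmds CMD...: report every CMD that isn't on PATH; fail if any is missing
# (hypothetical helper name)
require_cmds() {
    missing=0
    for c in "$@"; do
        if ! command -v "$c" >/dev/null 2>&1; then
            echo "missing: $c" >&2
            missing=1
        fi
    done
    return $missing
}

require_cmds make gfortran mpicc mpif77 || echo "fix your PATH before building" >&2
```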

Running:
Make sure that you make the reference to your openmpi libs permanent and make life easier by putting the following in your ~/.bashrc or /etc/profile:

export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib

export NWCHEM_EXECUTABLE=/export/home/me/tmp/nwchem-6.1/bin/LINUX64/nwchem
export NWCHEM_BASIS_LIBRARY=/export/home/me/tmp/nwchem-6.1/src/basis/libraries/
export PATH=$PATH:/export/home/me/tmp/nwchem-6.1/bin/LINUX64



To run on multiple procs do
mpirun -n 3 nwchem input.nw
where 3 is the number of cores
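
If you don't know the core count offhand, /proc/cpuinfo will tell you. Note that this counts logical processors, so hyperthreads are included. A sketch, with count_cores as a made-up helper name:

```shell
# count_cores [FILE]: count "processor" entries in /proc/cpuinfo
# (hypothetical helper; FILE can be overridden for testing)
count_cores() {
    grep -c '^processor' "${1:-/proc/cpuinfo}"
}

echo "logical processors: $(count_cores)"
# Usage:
#   mpirun -n "$(count_cores)" nwchem input.nw
```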

104. Building gromacs with fftw3 and openmpi on ROCKS 5.4.3/CentOS

This guide was heavily modified on 13/03/2012 to remove the need for sudo/root privileges.

Not all flavours of linux are equal. I've always been a Debian man, but have recently become a user of a ROCKS based HPC cluster on a different continent. To make sure that I don't screw things up I'm currently trying to work out how to reliably compile common computational packages under ROCKS 5.4.3, which is CentOS based.

If you installed the bio roll from the beginning you'll have openmpi in /opt/openmpi (rocks_openmpi.x86_64 package), and fftw in /opt/rocks/lib and /opt/rocks/include (fftw.x86_64 package)

If you only installed the basic rolls, you won't have either. Now, you can either download the bio roll and install from there, or you can install the regular openmpi package and compile fftw yourself. In fact, you'll need to do the latter if you want double-precision gromacs anyway.

My goal is to avoid having to use sudo or root at all. I've rewritten this guide a couple of times, so there may be weird annoying errors that I've missed.

Installing openmpi:
If you don't have openmpi in /opt, then you can install it from the base roll
sudo yum install openmpi


fftw3:
You can skip this step IF
1. you have fftw files in /opt/rocks/lib and /opt/rocks/include
AND
2. you only want single precision

Otherwise:

wget http://www.fftw.org/fftw-3.3.1.tar.gz
tar -xvf fftw-3.3.1.tar.gz
cd fftw-3.3.1


Then use --prefix to tell make where to install the files:

Single precision fftw3 libraries:
make distclean
./configure --enable-float --enable-mpi --enable-threads --with-pic --prefix=/export/home/me/.fftwsingle
make
make install

Double-precision fftw3 libraries:
make distclean
./configure --disable-float --enable-mpi --enable-threads --with-pic --prefix=/export/home/me/.fftwdouble
make 
make install
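
The two builds only differ in the float switch and the prefix, so they can be wrapped in a loop. This is a sketch: fftw_configure_args and build_fftw_both are made-up helper names, and the prefixes are the ones used above. Run build_fftw_both from inside the fftw-3.3.1 directory.

```shell
# fftw_configure_args PRECISION PREFIX: print the configure switches used above
# (hypothetical helper; PRECISION is "single" or "double")
fftw_configure_args() {
    case $1 in
        single) prec=--enable-float ;;
        double) prec=--disable-float ;;
        *) echo "unknown precision: $1" >&2; return 1 ;;
    esac
    echo "$prec --enable-mpi --enable-threads --with-pic --prefix=$2"
}

# build_fftw_both: configure, build and install both precisions in turn
build_fftw_both() {
    for p in single double; do
        # word splitting of the argument string is intentional here
        ./configure $(fftw_configure_args "$p" "/export/home/me/.fftw$p") \
            && make && make install && make distclean || return 1
    done
}
```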

gromacs:

First download and extract:

cd ~/tmp
wget ftp://ftp.gromacs.org/pub/gromacs/gromacs-4.5.5.tar.gz

tar -xvf gromacs-4.5.5.tar.gz
cd gromacs-4.5.5/

Before building you need to define where the openmpi libs are i.e.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib
OR
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib64/openmpi/1.4-gcc/lib

We now have three permutations of possible builds:
1. We use the single precision fftw libs in /opt/rocks/lib and /opt/rocks/include
export LDFLAGS=-L/opt/rocks/lib
export CPPFLAGS=-I/opt/rocks/include
./configure --enable-mpi --enable-float --with-fft=fftw3 --program-suffix=_spmpi --prefix=/export/home/me/gromacs
make
make install

2. We use the single precision fftw libs in /export/home/me/.fftwsingle

export LDFLAGS=-L/export/home/me/.fftwsingle/lib
export CPPFLAGS=-I/export/home/me/.fftwsingle/include
./configure --enable-mpi --enable-float --with-fft=fftw3 --program-suffix=_spmpi --prefix=/export/home/me/gromacs
make
make install

3. We use the double precision fftw libs in /export/home/me/.fftwdouble

export LDFLAGS=-L/export/home/me/.fftwdouble/lib
export CPPFLAGS=-I/export/home/me/.fftwdouble/include
./configure --enable-mpi --disable-float --with-fft=fftw3 --program-suffix=_ddmpi --prefix=/export/home/me/gromacs
make
make install


Running

You will now have single and double-precision binaries, e.g.
grompp_spmpi and grompp_ddmpi

Make sure that you define/have defined LD_LIBRARY_PATH in /etc/profile or ~/.bashrc and included the paths to your mpi libs and your fftw libs, e.g.:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib:/export/home/me/.fftwsingle/lib:/export/home/me/.fftwdouble/lib

Actually, it doesn't seem necessary to include the fftw path.

You may also want to include your gromacs bins in your path:
export PATH=$PATH:/export/home/me/gromacs/bin

Dynamic load-balancing seems to be disabled by default, so to use multiple cores run using e.g.
mpirun -n 4 mdrun_spmpi -s inp.tpr -o out.trr etc.

DONE


Troubleshooting

Error:
checking size of off_t... configure: error: in `/export/home/me/tmp/gromacs-4.5.5':
configure: error: cannot compute sizeof (off_t)
See `config.log' for more details
config.log:
./conftest: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory
Solution:
Set LD_LIBRARY_PATH to your openmpi libs e.g.
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib

Error:
/usr/local/lib/libfftw3f.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
make[3]: *** [libmd.la] Error 1
Solution:
Compile fftw3 using the --with-pic switch:
./configure --enable-float --enable-mpi --enable-threads --with-pic 




103. Building nwchem 6.0 on Rocks 5.4.3/CentOS

I've always been a Debian man, but for various reasons I need to be able to compile various scientific packages on an HPC running ROCKS. ROCKS 5.4.3 is based on CentOS 5.6, and it turns out that debian is wonderfully easy, accommodating and robust in comparison. Well, since it's not my HPC, CentOS is what I'm stuck with.

Here's how to build nwchem on a rocks 5.4.3 (viper) cluster based on CentOS 5.6 and its ancient kernel.
(Linux  2.6.18-238.19.1.el5 #1 SMP Fri Jul 15 07:31:24 EDT 2011 x86_64 x86_64 x86_64 GNU/Linux )

There are three different approaches:




CASE 1. Using LD_LIBRARY_PATH
This method requires no root access.
Check to see whether you've installed the rocks-openmpi package from the bio roll - it should be in /opt/openmpi. Otherwise use yum to install the base-roll openmpi package, which will end up in /usr/lib64/openmpi/1.4-gcc/lib -- you'll need root or sudo to do anything with yum.

For compilation, do
export LIBRARY_PATH=$LIBRARY_PATH:/opt/openmpi/lib
or
export LIBRARY_PATH=$LIBRARY_PATH:/usr/lib64/openmpi/1.4-gcc/lib/
depending on whether there is an openmpi directory in /opt or not.

You can also put the export line in your buildconf.sh below
For execution:
in either your ~/.bashrc (user basis) or /etc/profile (global) put
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib






CASE 2. /opt/openmpi is present; using symlinked libs

mpicc and mpif77 are probably already symlinked, but if not:

sudo ln -s /opt/openmpi/bin/mpicc /usr/bin/mpicc
sudo ln -s /opt/openmpi/bin/mpif77 /usr/bin/mpif77


The following allows for building and running:
sudo ln -s /opt/openmpi/lib/libmpi.so /usr/lib/libmpi.so
sudo ln -s /opt/openmpi/lib/libopen-rte.so /usr/lib/libopen-rte.so
sudo ln -s /opt/openmpi/lib/libopen-pal.so /usr/lib/libopen-pal.so
sudo ln -s /opt/openmpi/lib/libmpi_f77.so /usr/lib/libmpi_f77.so
sudo ln -s /opt/openmpi/lib/libmpi.so /usr/lib64/libmpi.so.0
sudo ln -s /opt/openmpi/lib/libopen-rte.so /usr/lib64/libopen-rte.so.0
sudo ln -s /opt/openmpi/lib/libopen-pal.so /usr/lib64/libopen-pal.so.0
sudo ln -s /opt/openmpi/lib/libmpi_f77.so /usr/lib64/libmpi_f77.so.0


the /usr/lib64 symlinks are necessary for execution, or you'll get
./nwchem: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory
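
To see all unresolved libraries at once rather than hitting them one error at a time, you can filter the output of ldd. A sketch; missing_libs is a made-up helper name that just picks out the 'not found' lines.

```shell
# missing_libs: read ldd output on stdin, print the names of unresolved libs
# (hypothetical helper name)
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

if [ -x ./nwchem ]; then
    ldd ./nwchem | missing_libs
fi
```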



CASE 3. /opt/openmpi is NOT present; using symlinked libs

yum install openmpi openmpi-devel
And then put in all the symlinks...dunno why this isn't done on install, but there you go.

sudo ln -s /usr/lib64/openmpi/1.4-gcc/bin/mpicc  /usr/bin/mpicc
sudo ln -s /usr/lib64/openmpi/1.4-gcc/bin/mpif77 /usr/bin/mpif77
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libmpi.so /usr/lib/libmpi.so
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libopen-rte.so /usr/lib/libopen-rte.so
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libopen-pal.so /usr/lib/libopen-pal.so
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libmpi_f77.so /usr/lib/libmpi_f77.so

Using the above symlinks compilation will work just fine.
However, in order to actually run nwchem you need
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libmpi.so /usr/lib64/libmpi.so.0
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libopen-rte.so /usr/lib64/libopen-rte.so.0
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libopen-pal.so /usr/lib64/libopen-pal.so.0
sudo ln -s /usr/lib64/openmpi/1.4-gcc/lib/libmpi_f77.so /usr/lib64/libmpi_f77.so.0

or you'll get
./nwchem: error while loading shared libraries: libmpi.so.0: cannot open shared object file: No such file or directory
Finally, make sure we can find our mpirun:
sudo ln -s /usr/lib64/openmpi/1.4-gcc/bin/mpirun /usr/bin/mpirun


ALL CASES
Continue here:
We'll be working in /export/home/me/tmp
wget http://www.nwchem-sw.org/images/Nwchem-6.0.tar.gz
tar -xvf Nwchem-6.0.tar.gz
cd nwchem-6.0

create a file called buildconf.sh and stuff it with the following:
export LARGE_FILES=TRUE
export TCGRSH=/usr/bin/ssh
export NWCHEM_TOP=/export/home/me/tmp/nwchem-6.0
export NWCHEM_TARGET=LINUX64
export NWCHEM_MODULES=all
export USE_MPI=y
export USE_MPIF=y
export MPI_LOC=/usr/lib64/openmpi/1.4-gcc/lib
export MPI_INCLUDE=/usr/lib64/openmpi/1.4-gcc/include
export LIBMPI="-lmpi -lopen-rte -lopen-pal -ldl -lmpi_f77 -lpthread"
cd $NWCHEM_TOP/src
make clean
make nwchem_config
make FC=gfortran
NOTE: the above buildconf.sh works when openmpi was installed with yum and lives in /usr/lib64/openmpi/1.4-gcc (CASE 3, or CASE 1 without the bio roll). If it got installed with ROCKS on setup and is present in /opt/openmpi (CASE 2, or CASE 1 with the bio roll), change the following:

export MPI_LOC=/opt/openmpi/lib
export MPI_INCLUDE=/opt/openmpi/include
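
If you want one buildconf.sh that works on both kinds of machine, the two variables can be set depending on whether /opt/openmpi exists. A sketch; pick_mpi_prefix is a made-up helper name, and the two paths are the ones used in this post.

```shell
# pick_mpi_prefix [ROOT]: print the openmpi prefix to use
# (hypothetical helper; ROOT is only there so it can be tested against a fake tree)
pick_mpi_prefix() {
    root=${1:-}
    if [ -d "$root/opt/openmpi" ]; then
        echo /opt/openmpi
    else
        echo /usr/lib64/openmpi/1.4-gcc
    fi
}

MPI_PREFIX=$(pick_mpi_prefix)
export MPI_LOC=$MPI_PREFIX/lib
export MPI_INCLUDE=$MPI_PREFIX/include
```
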
Launch the build

sh buildconf.sh

You'll end up with a binary called nwchem in nwchem-6.0/bin/LINUX64 -- you can put a PATH to it in your ~/.bashrc


CASE 1
For execution you will need to make sure nwchem can find the openmpi libs --
echo $LD_LIBRARY_PATH
will tell you whether the path is included by default.
Otherwise, in either your ~/.bashrc (user basis) or /etc/profile (global) put
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/openmpi/lib


Running
If you move nwchem out of the compilation directory (to say /usr/local/nwchem-6.0) you may also want to define e.g.

export NWCHEM_TOP=/usr/local/nwchem-6.0
export NWCHEM_TARGET=LINUX64
export NWCHEM_BASIS_LIBRARY=${NWCHEM_TOP}/libraries/

Again, this goes into your .bashrc or /etc/profile, depending on scope.

To use multiple cores, do
mpirun -n 4 nwchem jobname.nw
where the number of cores is 4.


Errors and troubleshooting:
If you get errors about libraries missing or mpicc-related errors you should make sure that you've symlinked everything you need into the /usr/lib folder or set the LIBRARY_PATH (see above). You could probably edit /etc/ld.so.conf too, but it will get messy with time.

I also tried building using mpich2-1.2 as well as 1.4, but got error messages about undefined references left and right.

12 March 2012

102. Gnu Debugger (gdb) on CentOS/ROCKS 5.4.3

For a distro dedicated to HPC ROCKS seems to lack every single debugging tool that I'm familiar with. Here's another one: gdb

 --START HERE --


First compile texinfo:
cd ~/tmp
wget http://ftp.gnu.org/gnu/texinfo/texinfo-4.12.tar.gz
tar -xvf texinfo-4.12.tar.gz
cd texinfo-4.12/
./configure
make
sudo make install


sudo ln -s /usr/local/bin/makeinfo /usr/bin/makeinfo

Then gdb:
cd ~/tmp
wget http://ftp.gnu.org/gnu/gdb/gdb-7.4.tar.gz
tar -xvf gdb-7.4.tar.gz
cd gdb-7.4/
./configure
make
sudo make install

If you haven't symlinked makeinfo above you'll get errors.

Usage:
gdb programme
(gdb) run arg1 arg2



101. First adventures in ROCKS 5.4.3

I've recently been given access to a 40 core cluster running ROCKS 5.4.3, which is a customised version of CentOS 5.6. The notes below are older than the build instructions that I've recently posted.

They say that if you want to learn Debian, use Debian; if you want to learn CentOS, use CentOS, and if you want to learn linux, build LFS. I can't vouch for the last item (yet...), but I'd definitely agree with the first two statements. CentOS and Debian are just different enough that it takes a while before you find your way around CentOS if you're used to Debian.

Anyway, with the hope that this might be useful to someone in a similar situation:

Installation
The ROCKS installer is crap. There's no way around it.
Anyway, the first time you boot up from the CD or DVD you get this splash screen (this is from an earlier vbox installation):


You'd better type
build
quickly or you'll end up in a dead window.

There's an annoying question about the fully qualified domain name -- and it won't accept invalid FQDNs  -- which will screw things up later if you want to change it. I'll leave that one as a challenge.

Assuming you typed build, and everything worked ok up to this point (how about a 'back' button?), you get to choose whether to partition manually or automatically -- with debian I always do it manually, because why not?

Well, with ROCKS it took me a number of tries before I got it right -- and if you get it wrong it crashes and YOU HAVE TO START OVER AGAIN. How about having a 'back' button and clearly displaying the minimum requirements in the gparted screen? To be fair, it's mentioned if you read the instructions on the rocksclusters.org website, but who'd do that?

Anyway, it seems that you need, at a minimum:
16 GB: /
3.6 GB: /var
The rest of the disk (> 4 GB): /state/partition1 OR /export/home
Either seems to work.

Keep that in mind if you're making a virtual machine image -- you'll need a pretty darn big one.
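
Adding up the minimums above gives a lower bound for the size of a virtual disk. A quick sketch of the arithmetic, using the figures from this post rather than any official requirement; min_disk_gb is a hypothetical helper name:

```shell
# Sum the minimum partition sizes in GB: / + /var + /state/partition1.
min_disk_gb() {
    awk 'BEGIN { print 16 + 3.6 + 4 }'
}

min_disk_gb   # prints 23.6
```

Since the last partition has to be larger than 4 GB, the image needs to be somewhat bigger than that figure.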

Anyway, presuming that everything works out you'll finish the installation and you'll get to your first boot.

First boot:
There are a few things that I don't like about the default setup

Create your locate database
As root:
updatedb

Create a user
By default there's only root -- apart from preferring to gain superuser powers via sudo, we most definitely need to have normal users present too.
adduser verahill
passwd verahill
To log in immediately
su verahill

First time you log in it will create an RSA keypair -- you're asked to set passwords for the keys -- don't confuse that password with your user password (although it can be the same).

Oh, and change those ugly b/w terminal colours to e.g. fg #FCF2F2 and bg #0E0C56 (this is more a hint for my future self)

Give your user superuser powers:
As root
visudo
and add
verahill ALL=(ALL:ALL) ALL

That'll do the trick
 
/etc/fstab
fstab uses labels by default to keep track of partitions. I don't like labels, and I don't like relative device paths, when you can use UUIDs. Take the default entries
LABEL=/                 /                       ext3    defaults        1 1
LABEL=statepartition1       /state/partition1       ext3    defaults        1 2
LABEL=var       /var                    ext3    defaults        1 2

LABEL=SWAP-sda3         swap                    swap    defaults        0 0
and change to
UUID=779c8a5f-db6a-4433-a3e0-eaf4519e14b1                 /                       ext3    defaults        1 1
UUID=82835cfc-8b86-40b3-9412-f908908714be       /state/partition1       ext3    defaults        1 2
UUID=e286acd2-49cd-437b-bb1d-682faacb0628       /var                    ext3    defaults        1 2



To find out the UUIDs, do
ls /dev/disk/by-uuid/ -lah
and to map the device paths to the labels, do
ls /dev/disk/by-label/ -lah
I couldn't find the swap uuid, but I'm not too bothered by that. The example in the screenshot is more complex because I'm dualbooting using two physical harddrives.
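
blkid is another way to list UUIDs, and it should report swap partitions too if they carry a UUID (an assumption: swap created by older tools may not have one). A minimal sketch that reduces blkid-style output to device/UUID pairs; uuid_table is a hypothetical helper name:

```shell
# Hypothetical helper: read blkid-style lines on stdin and print
# "device UUID" for every line that carries a UUID field.
uuid_table() {
    sed -n 's/^\([^:]*\):.* UUID="\([^"]*\)".*/\1 \2/p'
}

# Usage, as root:
#   /sbin/blkid | uuid_table
```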




/boot/grub/menu.lst
Again, a label tells grub where to find the root partition. No good. Change to UUID instead. Also, comment out hiddenmenu and change quiet to splash. It's grub '1', so you don't need to do update-grub or anything like that to make the changes take effect.

screen
sudo yum install screen
Just do the usual -- add the following to /etc/screenrc
multiuser on
acladd verahill
and
sudo chmod +s /usr/bin/screen
sudo chmod 755 /var/run/screen

/etc/network/interfaces etc.
Well, they don't exist. Instead, you should go to /etc/sysconfig/network-scripts/
Each interface is configured by creating a file called ifcfg-ethX.
You can set device specific routing using a file called route-ethX -- the route in the screen grab was to make sure that all traffic went via my gateway server.

Just look at the screen grab:

Oh, and it's not sudo service networking restart, it's sudo service network restart.
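
A minimal sketch of what a static /etc/sysconfig/network-scripts/ifcfg-eth0 could look like. The file is read as plain shell variable assignments. The address is the one used for roxy in this post; the netmask and gateway are assumptions for illustration only:

```shell
# /etc/sysconfig/network-scripts/ifcfg-eth0 (sketch)
DEVICE=eth0
BOOTPROTO=static
ONBOOT=yes
# roxy's address in this post
IPADDR=192.168.1.111
# Assumption: a /24 network
NETMASK=255.255.255.0
# Assumption: hypothetical gateway server address
GATEWAY=192.168.1.1

# A matching route-eth0 is a separate file with one route per line, e.g.
#   default via 192.168.1.1 dev eth0
```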

There's no /etc/hostname, instead it seems that you edit /etc/sysconfig/network
NETWORKING=yes
HOSTNAME=roxy

Also, you edit /etc/hosts.local, not /etc/hosts
192.168.1.111   roxy
192.168.2.111   foxy
(not easy coming up with names when you have 9 wired ifs in the same office)


I edited /etc/resolv.conf and added my DNS hosts directly -- so far, so good.

Enable ipv6
At the moment it seems that sinfo/d requires ipv6 to be enabled, and by default it isn't. Change your /etc/modprobe.conf to this (i.e. comment out anything about ipv6):

alias eth0 r8169
alias scsi_hostadapter sata_nv
alias net-pf-10 off
#alias ipv6 off
#options ipv6 disable=1
alias eth1 forcedeth

Yum
Compared to apt it's more yuck than yum, but each to their own.
It's pretty straightforward:
yum check-update
yum install screen
yum erase screen
yum provides '*/screen'
etc.

The repos seem to be defined in /etc/yum.conf (yum also reads any .repo files in /etc/yum.repos.d/).


chkconfig
There's no rcconf or sysv-rc-conf, but there's chkconfig:
Be aware that run levels are not the same in CentOS as in Debian: http://www.centos.org/docs/5/html/Installation_Guide-en-US/s1-boot-init-shutdown-sysv.html

Typically you'd be in 3 or 5.
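
chkconfig --list prints one line per service with its on/off state for each run level. A minimal sketch for filtering that list down to a single run level; enabled_in_runlevel is a hypothetical helper name:

```shell
# Hypothetical helper: read `chkconfig --list` output on stdin and print
# the services that are switched on in the given run level.
enabled_in_runlevel() {
    awk -v lvl="$1" '{ for (i = 2; i <= NF; i++) if ($i == lvl ":on") print $1 }'
}

# Usage:
#   /sbin/chkconfig --list | enabled_in_runlevel 3
#   sudo /sbin/chkconfig sshd on    # enable sshd at boot
```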

/opt
A lot of what you'll need for scientific endeavours is found in /opt IF YOU INSTALLED EVERYTHING FROM THE BEGINNING.
If you didn't, and e.g. installed openmpi by yourself, then it'll be in a completely different place. You'll be using locate a lot...

For some reason nothing's symlinked into /usr/lib and /usr/lib64, so be prepared to do a lot of that by hand (see my posts on building nwchem, sinfo and gromacs on centos/rocks).

/etc/profile
You might want to add
export PATH=$PATH:/sbin:/usr/sbin:/usr/local/bin:/usr/local/sbin
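
If the file ends up being read more than once (e.g. by nested login shells), the same directories get appended again each time. A minimal sketch that guards against duplicates; path_append is a hypothetical helper name:

```shell
# Hypothetical helper: append a directory to PATH unless it is already there.
path_append() {
    case ":$PATH:" in
        *":$1:"*) ;;                 # already present, do nothing
        *) PATH="$PATH:$1" ;;
    esac
}

for dir in /sbin /usr/sbin /usr/local/bin /usr/local/sbin; do
    path_append "$dir"
done
export PATH
```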



A bit of a restart and you might have a usable system.

100. Compile strace on ROCKS 5.4.3

Maybe I've set things up wrong, but I can't find any strace package in the yum repos on my ROCKS 5.4.3 installation.

The compilation is very easy, but I'll show it here for those who feel nervous about compiling their own programmes:

mkdir ~/tmp
cd ~/tmp

The wget takes a while to figure out where to download from -- be patient:
wget 'http://sourceforge.net/projects/strace/files/latest/download?source=files'
unxz strace-4.6.tar.xz
tar -xvf strace-4.6.tar
cd strace-4.6/
./configure
make
sudo make install


How to use:
While I've spent a couple of years with Debian I'm a CentOS newbie, and I keep being confused about the location of the libs -- for my compiles I need to put libs in /usr/lib, but to execute I seem to need to put symlinks in /usr/lib64. strace can help you track where a program is looking for its libs.

E.g. to see what the program sinfo is up to:
strace -o sinfo.log sinfo

Here is a snippet from sinfo.log:

open("/lib64/libc.so.6", O_RDONLY)      = 3
open("/usr/local/lib/sinfo/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/openmpi/lib/librt.so.1", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/librt.so.1", O_RDONLY)     = 3
open("/usr/local/lib/sinfo/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/opt/openmpi/lib/libdl.so.2", O_RDONLY) = -1 ENOENT (No such file or directory)
open("/lib64/libdl.so.2", O_RDONLY)     = 3

You can see that it e.g. looks for libdl.so.2 first in /usr/local/lib/sinfo, then in /opt/openmpi/lib/, and finally finds it in /lib64.
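
For a long log it helps to filter out the failed lookups. A minimal sketch that prints only the files that were opened successfully, assuming the log format shown above; found_files is a hypothetical helper name:

```shell
# Hypothetical helper: list the paths of successful open() calls in an
# strace log, i.e. skip the lines whose return value is -1.
found_files() {
    grep '^open(' "$1" | grep -v ' = -1 ' | sed 's/^open("\([^"]*\)".*/\1/'
}

# Usage:
#   strace -e trace=open -o sinfo.log sinfo
#   found_files sinfo.log
```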