Fully Utilizing Your X-Core CPU - Page 2

3. Logging

While ppss.sh is running, it creates a log directory directly under the working directory, named ppss by default. If something does not work as expected, have a close look at this directory: it contains a log of ppss.sh itself as well as logs of the commands it runs. If everything went well, the ppss directory can be deleted after execution.
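As a starting point for such an inspection, the following sketch prints the names of all files in the log directory that mention "error" or "fail". The function name scan_ppss_logs is made up for this example, and the search terms are only a guess at what a failing command writes; adjust them to the tools you run.

```shell
# Hypothetical helper, not part of ppss.sh: list log files that look suspicious.
# Usage: scan_ppss_logs [LOGDIR]   (LOGDIR defaults to ./ppss)
scan_ppss_logs() {
    logdir="${1:-ppss}"
    if [ ! -d "$logdir" ]; then
        echo "no log directory at $logdir" >&2
        return 1
    fi
    # -r: recurse, -i: ignore case, -l: print file names only, -E: extended regex.
    # The patterns are an assumption about what failures look like in the logs.
    grep -rilE 'error|fail' "$logdir"
}
```

Calling scan_ppss_logs without an argument in the working directory checks the default ppss directory; the function returns a non-zero status when nothing matches.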

4. Distributed ppss.sh

One feature of ppss.sh has been completely ignored until now. Since version 2.0 it can parallelize this kind of job not only over the cores of one machine, but also across the cores of several machines. So we have an easy way to do some HPC computing! If the nodes of the HPC cluster are equipped with graphics cards capable of OpenCL or DirectCompute, a simple way to build a kind of number-crunching HPC cluster might also be possible!

The grid is organized in a master-slave structure, with one master and x slaves. The master and the nodes in the grid communicate via ssh. The files to be processed have to be in one place that is reachable from all nodes, for instance an NFS or SMB share on the master (sshfs may also work); scp can be used as well. The nodes may run different operating systems and may have different CPUs.

The requirements are a bit higher than for a standalone installation: sshd and screen are needed in addition. There also has to be an unprivileged account on all machines, with passwordless ssh access from the nodes to the server.
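Whether the additional tools are present on a machine can be checked quickly. The sketch below uses a made-up helper named check_tools; it only looks the commands up in the current PATH, so on systems where sshd lives in /usr/sbin it may be reported as missing for an unprivileged user even though it is installed.

```shell
# Hypothetical helper: report which of the given commands are found in PATH.
# Usage: check_tools TOOL...
check_tools() {
    missing=0
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Non-zero status if at least one tool was not found.
    return "$missing"
}

# Example call for the requirements named above:
# check_tools ssh sshd screen
```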

Setting up such a grid cluster is described in detail in the wiki on the website, which has detailed instructions for every aspect of ppss.sh.

It's done in 4 steps:

Set up an account and SSH access on the server and all nodes.
Create a list of all nodes.
Create a configfile for ppss on the server, which will be distributed to the nodes.
Deploy ppss to the nodes and run it.

I have built a server/2-node configuration (the server itself also being a node, which is not recommended for larger setups) for demonstration purposes, outlined below. The server runs Ubuntu 9.04, the node Debian 5. I created an account named ppss on all systems, so the home directory is /home/ppss. On the server I created a subdirectory /home/ppss/mp to hold the files to be processed. On the node I also created a subdirectory /home/ppss/mp to be used as a mount point, and manually mounted the server's /home/ppss/mp there by running

sshfs ppss@srvr:/home/ppss/mp ~/mp

on the node. This way the paths are the same on both systems.

The accounts are created by

useradd -m ppss

If you have a lot of nodes, clusterssh can simplify this step. You would also be well advised either to keep all UIDs/GIDs in sync on all systems or to use LDAP-based authentication, as outlined for instance in https://www.howtoforge.com/linux_openldap_setup_server_client

Set up passwordless ssh access as outlined for instance in http://www.debian-administration.org/articles/152 or https://www.howtoforge.com/set-up-ssh-with-public-key-authentication-debian-etch and log in once (to accept the host key fingerprint).

Then use your favourite editor to create a file named nodes.txt on the server, with one IP address or hostname per line.
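For the small demonstration setup this file could be written and checked as follows. This is only a sketch: srvr is the server name used in the commands on this page, node1 is a placeholder for the Debian node, and check_nodes is a made-up helper, not part of ppss.sh. It relies on the OpenSSH option BatchMode=yes, which makes ssh fail instead of asking for a password, so a host without working key-based login shows up immediately.

```shell
# One hostname or IP address per line; node1 is a placeholder name.
cat > nodes.txt <<'EOF'
srvr
node1
EOF

# Hypothetical helper: try a key-based login as user ppss from the current
# machine to every host in the list. The ssh binary can be overridden via SSH.
# Usage: check_nodes FILE
check_nodes() {
    failed=0
    while read -r node; do
        [ -n "$node" ] || continue    # skip empty lines
        # -n keeps ssh from swallowing the rest of the node list on stdin.
        if "${SSH:-ssh}" -n -o BatchMode=yes -o ConnectTimeout=5 "ppss@$node" true; then
            echo "ok: $node"
        else
            echo "FAILED: $node"
            failed=1
        fi
    done < "$1"
    return "$failed"
}
```

Run check_nodes nodes.txt on the server to test the connections to the nodes. Since the nodes also need passwordless access to the server, the same check can be repeated on a node with a file that lists only the server.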

The most complicated step is creating the configfile. I only wanted to perform the simple task of downmixing some mp3 files using lame, as we did in one of the examples above.

Here is an example; run it in the home directory of the ppss user:

ppss.sh config -C config.cfg -c 'lame $ITEM -V 4 -B 160 "${ITEM%.mp3}.downmp3"' -d /home/ppss/mp -m srvr -u ppss -K ~/.ssh/known_hosts -n /home/ppss/nodes.txt -k /home/ppss/ppss-private.key
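The output file name in this command is built with shell parameter expansion: ${ITEM%.mp3} removes a trailing .mp3 from the value of ITEM, and .downmp3 is appended to the result. The effect can be tried out in any shell; the file name song.mp3 is just a made-up example, since ppss.sh supplies the real value for each file.

```shell
# "song.mp3" is an example value only.
ITEM="song.mp3"
# Strip the shortest trailing match of ".mp3", then append the new suffix.
OUTPUT="${ITEM%.mp3}.downmp3"
echo "$ITEM -> $OUTPUT"    # prints: song.mp3 -> song.downmp3
```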

The config command should produce the configfile config.cfg. You can then deploy it to the nodes (listed in the configfile) by doing a

ppss.sh deploy -C config.cfg

Afterwards you should be able to start the whole process by doing a

ppss.sh start -C config.cfg
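The two calls can be chained so that the job is only started when the deployment went through. The wrapper below is a minimal sketch: run_grid and the PPSS variable are names made up for this example, and it assumes that ppss.sh returns a non-zero exit status when deployment fails, which you should verify for your version.

```shell
# Hypothetical wrapper around the deploy and start commands shown above.
# PPSS may point to another location of the script; it defaults to ppss.sh.
# Usage: run_grid [CONFIGFILE]   (CONFIGFILE defaults to config.cfg)
run_grid() {
    config="${1:-config.cfg}"
    if [ ! -f "$config" ]; then
        echo "config file $config not found" >&2
        return 1
    fi
    # Start only if deploy reported success (assumption: non-zero status on failure).
    "${PPSS:-ppss.sh}" deploy -C "$config" && "${PPSS:-ppss.sh}" start -C "$config"
}
```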

5. Bottleneck

If you build a large grid cluster for HPC number crunching, the server holding the files may become a bottleneck, because it has to do a lot of disk I/O and a lot of network traffic converges on this system. The method used to share the filesystem with the files to be processed may also be critical. scp/sshfs may be slower than NFS, but NFS also has a reputation for suffering under heavy load. The ppss.sh roadmap includes using netcat to transfer the data over the network; netcat has very little overhead, so this should help. An alternative might be pNFS, which comes with NFS 4.1 (arriving with Linux kernel 2.6.30).

In any case the whole system, especially the I/O subsystem and the network interface, should be as fast as possible. It may be worth thinking about RAID and NIC bonding.

6. Alternatives

A simple alternative to ppss.sh for parallelized processing, which should not go unmentioned here, is xargs fed through a pipe by find. This way you can also do parallelized processing in the above style, but only on the local system, and you have to decide yourself how many concurrent processes to run.

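A simple example, which runs two oggenc processes at a time and uses -print0 and -0 so that file names containing spaces are passed on intact, looks like

find . -name '*.wav' -print0 | xargs -0 -n1 -P2 oggenc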
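If you do not want to hard-code the number of processes, it can be derived from the number of cores. The following sketch wraps the pipeline in a function; parallel_each is a made-up name, nproc comes from GNU coreutils, and the -print0 and -0 options (supported by GNU and BSD find and xargs) keep file names with spaces in one piece.

```shell
# Hypothetical helper: run a command once per matching file, one process per core.
# Usage: parallel_each DIR PATTERN COMMAND [ARGS...]
parallel_each() {
    dir="$1"
    pattern="$2"
    shift 2
    # nproc prints the number of available cores; fall back to 2 if it is missing.
    njobs="$(nproc 2>/dev/null || echo 2)"
    find "$dir" -name "$pattern" -print0 | xargs -0 -n1 -P"$njobs" "$@"
}

# Same job as the example above, but with one oggenc process per core:
# parallel_each . '*.wav' oggenc
```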

Other ways to run processes in parallel on more than one host, but interactively, are pdsh and clusterssh. With this kind of application you log in to a number of hosts simultaneously and can issue the same commands on all of them, but batch operations like those possible with ppss.sh may be difficult to perform. clusterssh is, however, a good means of simplifying the administration of all nodes in the grid outlined above.

7. HPC

There is a lot of material on the net regarding HPC on Linux. The most famous is the Beowulf project, launched in the mid-1990s.

A good entry point into the HPC topic is LinuxHPC.

A distribution that may be well suited for building a single-user ad-hoc HPC cluster is PelicanHPC. PelicanHPC is a live Linux system that can be started from a CD/DVD and has an integrated PXE server from which all nodes can easily boot. So a single-user HPC cluster can be set up without installing a single byte on any of the involved systems.

Another lightweight distribution for building HPC clusters is CAOS. CAOS integrates Perceus, a cluster management system. Warewulf, which is also integrated, handles monitoring.