
Linux Lab Parallel How-To


Docs and support for our various parallel libraries can be found here:
  • Open MPI - The MPI implementation currently supported in the linux lab.

    Other useful MPI references:
  • MPICH2
  • MPICH2 wiki FAQ
  • PVM
  • LAM-MPI


  • Because the linux lab provides /tmp space on each node, as well as shared storage between all the nodes in the form of your home directory, many programs can be run in parallel simply by launching a small script (a sketch of such a script follows this list). Examples of such scripts can be found in the /home/lab.apps/ex_jobs/ directory on any of the linux lab nodes.

  • In order for this to work, you must have passwordless ssh keys set up (this is a security concern, so only have these in place while using these parallel computing scripts or similar). Also make sure you've already logged into each of the machines on which you want to run jobs so you won't be prompted to accept host fingerprints; the sketch after this list includes the setup commands.

  • The example scripts involving writing to files require the user to have an output directory in their home directory, i.e. ~/output.
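
    Below is a minimal sketch of that setup and of a launch script. The key type,
    node names, job command, and script name are only examples, and the real
    scripts in /home/lab.apps/ex_jobs/ may differ in detail. Because your home
    directory is shared by all nodes, appending your public key to your own
    authorized_keys file enables logins to every node.

    % ssh-keygen -t rsa
    (accept the defaults; leave the passphrase empty for passwordless logins)
    % cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    % chmod 600 ~/.ssh/authorized_keys
    % ssh linux4 hostname
    % ssh linux5 hostname
    (run one command on each node so its host fingerprint gets accepted)

    A hypothetical ~/run_jobs.sh that starts one job per node and collects the
    results in ~/output might look like:

    #!/bin/sh
    # run_jobs.sh - start one copy of the (hypothetical) program ~/bin/myjob on
    # each node, writing per-node output into ~/output in the shared home directory
    mkdir -p ~/output
    for node in linux4 linux5; do
        ssh $node "cd /tmp && ~/bin/myjob > ~/output/$node.out 2>&1" &
    done
    wait

    % sh ~/run_jobs.sh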

    LAM-MPI


    For further documentation/examples go to http://www.lam-mpi.org/tutorials
    LAM is a simple yet powerful environment for running and monitoring
    MPI applications on clusters. The few essential steps in LAM
    operations are covered below.

    LAM Quick Start Guide

    1. You will need a lam-bhosts.def or similar file; it should contain one node shortname per line, e.g.:
      linux4
      linux5
    2. Make sure you have passwordless ssh keys set up (this is a security concern, so only have these in place while using LAM or similar). Also make sure you've already logged into each of the machines in your lam-bhosts.def so you won't be prompted to accept host fingerprints.
    3. Run lamboot lam-bhosts.def from a machine listed in your lam-bhosts.def to start your cluster environment.
    4. Run lamnodes to make sure everything is up and running as you expect.

    Booting LAM


    The user creates a file listing the participating machines in the cluster. There's a default file listing all the machines on the cluster: /usr/local/lam-mpi/etc/lam-bhost.def
    % cat lamhosts
    # a 2-node LAM
    linux0
    linux1
    

    Each machine will be given a node identifier (nodeid) starting with 0 for the first listed machine, 1 for the second, etc.

    The recon tool verifies that the cluster is bootable:

    % recon -v lamhosts
    recon: -- testing n0 (linux0)
    recon: -- testing n1 (linux1)
    

    The lamboot tool actually starts LAM on the specified cluster.
    % lamboot -v lamhosts
    
    LAM 6.5.6 - University of Notre Dame
    
    Executing hboot on n0 (linux0 - 1 CPU)...
    Executing hboot on n1 (linux1 - 1 CPU)...
    

    lamboot returns to the UNIX shell prompt. LAM does not force a canned environment or a "LAM shell". The tping command builds user confidence that the cluster and LAM are running.
    % tping -c1 N
      1 byte from 1 remote node and 1 local node: 0.008 secs
    
    1 message, 1 byte (0.001K), 0.008 secs (0.246K/sec)
    roundtrip min/avg/max: 0.008/0.008/0.008
    

    Compiling MPI Programs

    You can take a look at MPI examples in /usr/local/lam-mpi/examples. Refer to MPI: It's Easy to Get Started to see a simple MPI program. mpicc (and mpiCC and mpif77) is a wrapper for the C (C++, and F77) compiler that includes all the necessary command line switches for the underlying compiler to find the LAM include files, the relevant LAM libraries, etc.
    % mpicc -o foo foo.c
    % mpif77 -o foo foo.f
    
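    As a quick end-to-end check you can copy one of the bundled examples, build
    it with the wrapper, and run it (hello.c is a stand-in here; the exact file
    names and sub-directories under /usr/local/lam-mpi/examples may differ):

    % cp /usr/local/lam-mpi/examples/hello/hello.c ~/
    % cd ~
    % mpicc -o hello hello.c
    % mpirun -np 2 hello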

    Executing MPI Programs

    An MPI application is started by one invocation of the mpirun command. An SPMD application can be started directly on the mpirun command line.
    % mpirun -v -np 2 foo
    2445 foo running on n0 (o)
    361 foo running on n1
    

    An application with multiple programs must be described in an application schema, a file that lists each program and its target node(s).
    % cat appfile
    # 1 master, 2 slaves
    n0 master
    n0-1 slave
    
    % mpirun -v appfile
    3292 master running on n0 (o)
    3296 slave running on n0 (o)
    412 slave running on n1
    

    Monitoring MPI Applications

    The full MPI synchronization status of all processes and messages can be displayed at any time. This includes the source and destination ranks, the message tag, count and datatype, the communicator, and the function invoked.
    % mpitask
    TASK (G/L)    FUNCTION      PEER|ROOT  TAG    COMM   COUNT   DATATYPE
    0/0 master    Recv          ANY        ANY    WORLD  1       INT
    1 slave       <running>
    2 slave       <running>
    

    Process rank 0 is blocked receiving a message consisting of a single integer from any source rank and any message tag, using the MPI_COMM_WORLD communicator. The other processes are running.
    % mpimsg
    SRC (G/L)   DEST (G/L)   TAG   COMM    COUNT   DATATYPE    MSG
    0/0         1/1          7     WORLD   4       INT         n0,#0
    

    Later, we see that a message sent by process rank 0 to process rank 1 is buffered and waiting to be received. It was sent with tag 7 using the MPI_COMM_WORLD communicator and contains 4 integers.

    Cleaning LAM

    All user processes and messages can be removed, without rebooting.
    % lamclean -v
    killing processes, done
    sweeping messages, done
    closing files, done
    sweeping traces, done
    

    It is typical for users to mpirun a program, lamclean when it finishes, and then mpirun another program. It is not necessary to re-run lamboot for each MPI program.
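
    For example (foo and bar stand in for any two MPI programs):

    % mpirun -np 2 foo
    % lamclean
    % mpirun -np 2 bar
    % lamclean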

    Terminating LAM

    The lamhalt tool removes all traces of the LAM session on the network. This is only performed when LAM/MPI is no longer needed (i.e., no more mpirun/lamclean commands will be issued).
    % lamhalt
    

    In the case of a catastrophic failure (e.g., one or more LAM nodes crash), the lamhalt utility will hang. In this case, the wipe tool is necessary. The same boot schema that was used with lamboot is necessary to list each node where the LAM run-time environment is running:

    % wipe -v lamhosts
    Executing tkill on n0 (linux0)...
    Executing tkill on n1 (linux1)...
    

    PVM: http://www.csm.ornl.gov/pvm/
    LAM-MPI: http://www.lam-mpi.org


    PVM


    PVM examples can be found in /usr/local/pvm/examples/misc.
    Here are the steps for running PVM:

    1. Copy the test files to /condor/<username>:
         cp $PVM_ROOT/examples/* /condor/<username>
    2. Modify the path for hello_other to the full path where hello_other will be located, e.g.:
         /condor/<username>/hello_other
    3. Run the command to start pvm on all the machines:
         $PVM_ROOT/pvm $PVM_ROOT/hostfile
       In the interactive pvm console you can add additional pvm hosts with 'add hostname'.
       Type 'quit' in the interactive pvm console - pvm will keep running in the background.
    4. Compile the test files in /condor/<username>:
         gcc -o hello hello.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3
         gcc -o hello_other hello_other.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3
    5. Execute the 'hello' program.
    6. When done using pvm, run 'pvm' again and enter 'halt'.
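
    Put together, a complete session might look like the sketch below. linux5 is
    only an example hostname, and the conf console command just lists the hosts
    currently in the virtual machine.

    % cp $PVM_ROOT/examples/* /condor/<username>
    % cd /condor/<username>
    % $PVM_ROOT/pvm $PVM_ROOT/hostfile
    pvm> add linux5
    pvm> conf
    pvm> quit
    % gcc -o hello hello.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3
    % gcc -o hello_other hello_other.c -I$PVM_ROOT/include -L$PVM_ROOT/lib/$PVM_ARCH -lpvm3
    % ./hello
    % pvm
    pvm> halt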
    

    More information about pvm can be found at the PVM site.