General Overview of SGE and Condor
These resources can be used with general purpose clusters like prism, tempest, and phoenix.
You can look at the job restrictions for each cluster at the following websites.
Prism
Phoenix
Tempest
SGE (Sun Grid Engine) is the batch system used on many of the clusters in SCS. It runs dedicated parallel programs for users. Users can run on a general purpose cluster (phoenix or gp) or, if they work for the owner of a particular cluster, on that cluster.
Condor is used as a scavenger cluster. A Condor pool has three main parts: a central manager, one or more submit nodes, and the compute nodes.
Central Manager
The central manager is the heart and brain of the Condor system. It provides both the collector and the negotiator, which play major roles in matching jobs to compute nodes and in collecting usage information from nodes.
Submit Node
The submit nodes are typically the head nodes of a cluster (e.g. phoenix). From these nodes you can look at what jobs are currently running, compile a program against the Condor libraries, checkpoint or submit a job, or remove a job from the queue.
Compute Nodes
Almost any machine can be a compute node. The compute node is what ends up doing all the computation for the job that was submitted. Compute nodes can be cluster nodes, desktop computers, or even classroom computers.
SGE Usage
As was stated before, SGE is primarily used for running parallel programs on a cluster.
There is a tar ball with examples
here.
You can compile the examples with a command like this:
mpiCC mpiHello1.cc -o H1
You can then submit the job by passing its script to qsub:
qsub Hello1.sh
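The contents of Hello1.sh are not reproduced on this page. As a rough sketch only, an SGE job script for an MPI program often looks something like the one written out below. The file name hello_sge_sketch.sh, the job name, the parallel environment name mpi, and the slot count 4 are all assumptions made for illustration; the parallel environments that actually exist differ from cluster to cluster.

```shell
# Write a hypothetical SGE job script. This is NOT the Hello1.sh from the tar ball.
cat > hello_sge_sketch.sh <<'EOF'
#!/bin/bash
# Lines starting with "#$" are read by qsub as options.
# Job name (hypothetical):
#$ -N hello_sketch
# Run the job from the directory it was submitted from:
#$ -cwd
# Merge stderr into stdout:
#$ -j y
# Request 4 slots in a parallel environment assumed to be called "mpi":
#$ -pe mpi 4
# SGE sets NSLOTS to the number of slots granted to the job.
mpirun -np "$NSLOTS" ./H1
EOF

# Check the script's syntax without running it.
sh -n hello_sge_sketch.sh && echo "hello_sge_sketch.sh written"
```

A script like this would be submitted the same way as Hello1.sh, by passing its name to qsub.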
After that, you can check the status of your job with:
qstat
More information about SGE can be found
here.
Condor Usage
Condor is a scavenger cluster, which means it hunts for unused cycles on compute nodes and then tries to use them to run jobs that users have submitted. Condor can be used to run parallel jobs, but we have not set that up here. Condor also has a nice feature that lets users restrict which systems their jobs run on.
You can find a tar ball with some examples
here.
This tar ball contains 4 different scripts, two for each example program. They can be submitted in the two universes that are supported by the condor system that we have set up. These universes are:
Standard, and
Vanilla
The major difference between these is that the standard universe allows the job to be checkpointed and to migrate. This is very useful: if a job is running on someone's desktop computer and the user comes back to the computer, the job gets evicted. When this happens to a job in the vanilla universe, you can lose everything. A job in the standard universe can instead be migrated to a new compute node, keeping all the cycles it has already used. Checkpointing also allows a job to start back up later if, for instance, the power goes out.
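To make the idea of resuming from a checkpoint concrete, here is a small sketch. It is not Condor's mechanism: the standard universe checkpoints a job transparently, whereas this script saves its own progress by hand. The file name sum_sketch.ckpt and the amount of work are hypothetical.

```shell
# Illustration only: a job that saves its own progress so a restart can resume.
STATE_FILE=./sum_sketch.ckpt   # hypothetical checkpoint file
LIMIT=10                       # hypothetical amount of work

# Resume from the checkpoint if one exists; otherwise start from scratch.
if [ -f "$STATE_FILE" ]; then
    read -r i total < "$STATE_FILE"
else
    i=1
    total=0
fi

while [ "$i" -le "$LIMIT" ]; do
    total=$((total + i))
    i=$((i + 1))
    # Record the next step to run and the running total after every step.
    printf '%s %s\n' "$i" "$total" > "$STATE_FILE"
done

echo "sum of 1..$LIMIT = $total"
```

If the script is killed partway through and started again, it picks up at the saved step instead of at 1. Without the checkpoint file it would redo all the work, which is the situation an evicted vanilla job is in.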
In the scripts, you need to fill in the path to the executables. After this, you need to compile the program that you want to run.
If the program is going to be run in the standard universe, you have to compile using condor_compile.
condor_compile should work with whichever compiler you choose, and even with make. You can use it like this:
condor_compile gcc test.c -o test
If you are going to run your job in the vanilla universe then you can simply compile like you normally would to run the program on your computer.
After the job is compiled, you can submit the job with the script like this:
condor_submit script_name
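The script passed to condor_submit is a submit description file. The scripts in the tar ball are not reproduced on this page, so the following is only a minimal sketch of what such a file might look like for the vanilla universe. The file names hello_sketch.sub, hello.out, hello.err, and hello.log are assumptions, and /path/to/test is a placeholder for the path to your own executable.

```shell
# Write a hypothetical submit description file. This is NOT one of the tar ball scripts.
cat > hello_sketch.sub <<'EOF'
# Universe to run in; a program built with condor_compile would use "standard".
universe   = vanilla
# Placeholder path: replace with the path to your executable.
executable = /path/to/test
# Where the job's stdout and stderr are written (names are assumptions).
output     = hello.out
error      = hello.err
# Condor's own record of what happened to the job.
log        = hello.log
# Queue one instance of the job.
queue
EOF

echo "submit with: condor_submit hello_sketch.sub"
```

The log file is usually the first place to look when a job has been evicted or has not started.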
To check the status of your job in the queue, you can use this command:
condor_q
To see the state of the machines in the pool instead, use condor_status.
If your job is not running as you would like, or you want to kill it for some reason, you can use the
condor_rm
command. It has two usages: you can give it a job ID (as listed by condor_q), or you can give it a username.
If you provide a username, all of that user's jobs will be removed.
There is an example script showing how to use restrictions. You can restrict on things like operating system, architecture, and machine group. These let you make sure your code runs only on the computers you want it to. If you want to restrict your code to a single cluster, you can do that, or will be able to in the near future.
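Restrictions like these are written as a requirements expression in the submit description file. The sketch below is not the example script mentioned above; it only shows the general shape, using Condor's OpSys and Arch machine attributes. The file name restrict_sketch.sub, the output file names, and the particular values "LINUX" and "X86_64" are assumptions. The attribute used for machine group is site-specific, so it is left out.

```shell
# Write a hypothetical submit description file that restricts where the job may run.
cat > restrict_sketch.sub <<'EOF'
universe     = vanilla
# Placeholder path: replace with the path to your executable.
executable   = /path/to/test
# Only match machines whose operating system and architecture have these values
# (the values are assumptions; check what the machines in your pool advertise).
requirements = (OpSys == "LINUX") && (Arch == "X86_64")
output       = restrict.out
error        = restrict.err
log          = restrict.log
queue
EOF

echo "submit with: condor_submit restrict_sketch.sub"
```

Running condor_status lists the operating system and architecture that each machine in the pool advertises, which helps when choosing the values to match against.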
More information about condor can be found
here.