Condor at SCS
About Condor
Condor is a batch system for queuing, scheduling, and prioritizing compute-intensive jobs. It is developed by the Condor team at the University of Wisconsin. In a nutshell, Condor matches user-submitted jobs to available compute resources. These resources can be desktop machines or owner-based clusters. As long as a machine has the resources a job requires, the job will be sent to that machine and run. The Condor documentation is full of useful information about the software and how the system works. This page summarizes the parts of the Condor documentation that are relevant to using Condor at SCS.
Prepare your program
A job run under Condor must be able to run as a background batch job. Condor can redirect console output and keyboard input to and from files for you. Create any files needed to supply the keystrokes the program expects as input. Before submitting your job, make certain the program runs correctly with the input files you have created.
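One way to make this pre-submission check is to run the program locally with its input and output redirected to files, much as Condor will do. The following is a minimal sketch in Python; the helper name dry_run and the file names in the usage note are hypothetical and are not part of Condor.

```python
import subprocess


def dry_run(command, input_path, output_path, error_path):
    """Run `command` locally with stdin, stdout, and stderr tied to files.

    This imitates the redirection Condor performs for a batch job, so
    input problems show up before the job is queued.
    Returns the program's exit code.
    """
    with open(input_path, "rb") as stdin, \
         open(output_path, "wb") as stdout, \
         open(error_path, "wb") as stderr:
        result = subprocess.run(command, stdin=stdin, stdout=stdout, stderr=stderr)
    return result.returncode
```

For example, dry_run(["./foo"], "test.data", "loop.out", "loop.error") would run a local program foo against test.data. A non-zero return value, or unexpected text in the error file, suggests the input file needs fixing before you submit.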
Choose a Condor universe
Condor has several run-time environments, which are referred to as universes. SCS supports the Standard and the Vanilla universes. The Standard universe allows remote system calls, and its jobs can checkpoint and migrate. This is useful when a job is running on a desktop system and the user returns: instead of being killed, the job saves an image of itself and moves to another compatible node that is not being used. The Standard universe requires that the program be linked with the Condor libraries; therefore, if you do not have the object code, you may be restricted to the Vanilla universe. The Vanilla universe does not require that you have the program's object code, but as a consequence jobs run in the Vanilla universe do not checkpoint or migrate. Vanilla jobs will be evicted from a machine if a user returns.
Compiling code for Condor (Standard universe only)
If you choose to use the Standard universe, your program has to be linked with the Condor libraries, so you need either the object code or the source code for the program. To compile your program, log on to a submit node (e.g., Phoenix) and compile using the following command:
condor_compile cc | CC | gcc | f77 | g++ | ld | ...
That is, prefix the normal command you would use to compile your program with condor_compile.
Example 1
If your source code is called test.c you would compile using this command:
condor_compile gcc test.c -o test
and this will create an executable called test that can now be run in the standard universe for condor.
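If you drive your builds from a script, the rule above amounts to placing condor_compile in front of the usual compiler invocation. The snippet below is only an illustrative sketch; the function name condor_compile_command is hypothetical.

```python
def condor_compile_command(normal_command):
    """Return the argument list with condor_compile placed in front.

    `normal_command` is the compiler invocation you would normally use,
    given as a list of arguments.
    """
    return ["condor_compile"] + list(normal_command)


# Rebuild the command line from Example 1.
print(" ".join(condor_compile_command(["gcc", "test.c", "-o", "test"])))
# prints: condor_compile gcc test.c -o test
```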
Create a submit description file
A "Submit description file" contains commands and keywords to direct the queuing of jobs. In this file, condor finds everything it needs to know about the job(s). Items such as the name of the executable to run, the initial working directory, and command-line arguments to the programs all go into the description file.
Example 1
A very simple submit description file may take the following form.
####################
#
# Example 1
# Simple condor job description file
#
####################
Executable = foo
Log = foo.log
Queue
Example 2
Here's a slightly more complicated submit description file:
####################
#
# Example 2: demonstrate use of multiple
# directories for data organization.
#
####################
Executable = mathematica
Universe = vanilla
input = test.data
output = loop.out
error = loop.error
Log = loop.log
Initialdir = run_1
Queue
Initialdir = run_2
Queue
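Example 2 queues two jobs that share every setting except the initial directory, so each job reads and writes its relative file names inside its own run directory. When there are many run directories, a file like this can be generated rather than typed. The sketch below shows one possible way in Python; the function name render_submit_file is hypothetical, and the output is plain text that you would still save and pass to condor_submit yourself.

```python
def render_submit_file(common, run_dirs):
    """Build submit-description text.

    The shared commands come first, followed by one Initialdir/Queue
    pair per run directory, mirroring the layout of Example 2.
    """
    lines = [f"{key} = {value}" for key, value in common.items()]
    for run_dir in run_dirs:
        lines.append(f"Initialdir = {run_dir}")
        lines.append("Queue")
    return "\n".join(lines) + "\n"


# Settings copied from Example 2.
common = {
    "Executable": "mathematica",
    "Universe": "vanilla",
    "input": "test.data",
    "output": "loop.out",
    "error": "loop.error",
    "Log": "loop.log",
}
text = render_submit_file(common, ["run_1", "run_2"])
print(text)
```

Passing a longer list of directories produces one Queue statement per directory without repeating the shared settings.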
Example 3
The vanilla environment allows jobs to be run on heterogeneous architectures. Instead of specifying the executable explicitly, a macro is included that will be expanded when a machine is available and matched with your job.
####################
#
# Example 3: demonstrate heterogeneous submit
# file.
#
####################
initialdir = /home/u5/users/jwilgenb/condor/RepFiles
Rank = kflops
Executable = /usr/common/i686-linux/bin/paup.$$(OpSys).$$(Arch)
Universe = vanilla
requirements = (OpSys == "OSX" && Arch == "PPC") || \
               (OpSys == "WINNT51" && Arch == "INTEL") || \
               (OpSys == "LINUX" && Arch == "INTEL") || \
               (OpSys == "LINUX" && Arch == "ALPHA")
should_transfer_files = YES
when_to_transfer_output = ON_EXIT_OR_EVICT
transfer_input_files = anolis.nex, rep.$(Process)
notification = NEVER
arguments = rep.$(Process) -n -f
output = rep_out.$(Process)
error = rep_error.$(Process)
log = rep.log
Queue 5
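In Example 3, "Queue 5" creates five jobs, and $(Process) takes the values 0 through 4, one per job. To see which file names each job will use, the substitution can be imitated in a few lines of Python. This is only an illustration of the naming pattern; Condor performs the real substitution itself, and the function name expand_process_macro is hypothetical.

```python
def expand_process_macro(template, queue_count):
    """Return `template` with $(Process) replaced by 0 .. queue_count - 1."""
    return [template.replace("$(Process)", str(n)) for n in range(queue_count)]


# Templates copied from Example 3, which ends with "Queue 5".
arguments = expand_process_macro("rep.$(Process) -n -f", 5)
outputs = expand_process_macro("rep_out.$(Process)", 5)
for args, out in zip(arguments, outputs):
    print(f"arguments = {args}    output = {out}")
```

The first job therefore reads rep.0 and writes rep_out.0, and the last reads rep.4 and writes rep_out.4, so the input files rep.0 through rep.4 must exist in the initial directory before you submit.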
Submitting your job
The program condor_submit is used to submit jobs to the SCS Condor cluster. At SCS, you can submit jobs from submit.scs.fsu.edu. In addition, most of the owner-based cluster head nodes can be used to submit jobs to the Condor "flock."
The condor_submit application requires a submit description file, which contains the commands needed to match jobs with an execute node.
To submit a job, log in to a submit node (e.g., submit) and type the following:
condor_submit "submit description file"
If the job is submitted successfully, you will see something like this:
Submitting job(s)..
Logging submit event(s)..
2 job(s) submitted to cluster 56836.
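The cluster number in the last line is what you pass to the management commands later on, so scripts that submit jobs often capture it. The following sketch assumes the output has the same shape as the sample above; the function name parse_submit_output is hypothetical, and other Condor versions may word the message differently.

```python
import re

# Matches a line such as "2 job(s) submitted to cluster 56836."
SUBMIT_RE = re.compile(r"(\d+) job\(s\) submitted to cluster (\d+)\.")


def parse_submit_output(text):
    """Return (job_count, cluster_id) from condor_submit output, or None."""
    match = SUBMIT_RE.search(text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """Submitting job(s)..
Logging submit event(s)..
2 job(s) submitted to cluster 56836.
"""
print(parse_submit_output(sample))  # (2, 56836)
```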
Manage your job (frequently used commands)
After you submit a job, Condor provides a number of commands for managing it and monitoring its status. For example:
- condor_q display information about jobs in the queue
condor_q
Display information about jobs in the queue. By default, it queries only the local job queue.
condor_q -global
Query all job queues in the pool.
"condor_q -g" has the same function.
If the listing is too long, you can use "condor_q | less" to page through it.
condor_q -name "schedd name"
Query the queue of the named schedd.
example "condor_q -name petal"
condor_q -submitter "submitter name"
List the jobs of a specific submitter from all the queues in the pool.
example "condor_q -submitter yanfeng" (yanfeng is my user name) will show you all
the jobs submitted by yanfeng
"condor_q yanfeng" lists the same user's jobs, but only from the local queue.
condor_q -run
Get information about running jobs.
example "condor_q -run yanfeng" will show you all the running jobs submitted by yanfeng
"condor_q -r" has the same function.
condor_q -help
get a brief description of the supported options
- condor_status is a versatile tool that may be used to monitor and query the Condor pool.
condor_status
Display the status of the Condor pool
condor_status -avail
Identify resources which are available
condor_status -schedd
Query condor_schedd ads and display attributes
condor_status -help
get a brief description of the supported options
- condor_rm remove jobs from the Condor queue
condor_rm username
Remove all jobs belonging to the specified user
example "condor_rm yanfeng" removes all the jobs submitted by yanfeng
condor_rm cluster
Remove all jobs in the specified cluster
condor_rm cluster.process
Remove the specific job in the cluster
condor_rm -help
get a brief description of the supported options
- condor_hold hold your job
condor_hold cluster
Hold all jobs in the specified cluster
condor_hold cluster.process
Hold the specific job in the cluster
condor_hold user
Hold all jobs belonging to specified user
condor_hold -help
get a brief description of the supported options
- condor_release release held jobs in the Condor queue
condor_release cluster
Release all jobs in the specified cluster
condor_release cluster.process
Release the specific job in the cluster
condor_release user
Release jobs belonging to specified user
condor_release -help
get a brief description of the supported options
- condor_prio change the priority of jobs in the Condor queue
condor_prio [{+|-}priority ] cluster
Change the priority of all processes belonging to the specified cluster.
The user can adjust the priority by supplying a + or - immediately followed
by a number. The priority of a job can be any integer, with higher numbers corresponding
to greater priority. Only the owner of a job or the super user can change its
priority.
example "condor_prio +2 56639" raises the priority of the jobs in cluster 56639 by 2. You can check the result with "condor_q".
condor_prio [{+|-}priority ] cluster.process
Change the priority of the specified process.
condor_prio [{+|-}priority ] user
Change the priority of all jobs belonging to that user.
condor_prio -help
get a brief description of the supported options
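Several of the commands above accept the same three kinds of argument: a cluster number, a cluster.process pair, or a user name. The sketch below shows how a wrapper script might tell them apart before calling condor_rm, condor_hold, condor_release, or condor_prio. It illustrates the naming convention only and is not a description of how Condor parses its own arguments; the function name classify_job_argument is hypothetical.

```python
import re

# A cluster number, optionally followed by ".process".
JOB_ID_RE = re.compile(r"([0-9]+)(?:\.([0-9]+))?")


def classify_job_argument(arg):
    """Classify `arg` as 'cluster', 'cluster.process', or 'user'."""
    match = JOB_ID_RE.fullmatch(arg)
    if match is None:
        return ("user", arg)
    cluster = int(match.group(1))
    if match.group(2) is None:
        return ("cluster", cluster)
    return ("cluster.process", (cluster, int(match.group(2))))


for arg in ["56836", "56836.1", "yanfeng"]:
    print(arg, "->", classify_job_argument(arg))
```

A check like this can help a script avoid acting on a whole cluster when only a single process was intended.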
For a more complete list of Condor submit file examples, try visiting the Condor Project Website.
Examples (applications used at SCS)
Migrate
MrBayes
PAUP
CHARMM
Adding your machine to the SCS Pool
If you would like to add your machine to the SCS Condor Pool, please contact TSG.
Condor help
If you need help with Condor, there are several resources you can consult. The University of Wisconsin-Madison maintains a mailing list for Condor users, which is regularly monitored by the Condor development team. You can also subscribe to the SCS Condor mailing list. Please read through the documentation on the SCS TSG twiki site, the University of Wisconsin Condor homepage, and the mailing list archives before posting a question.
Known Issues
If you submit a queue of jobs from an NFS mount and write a single shared log file, you will most likely find that some of your jobs stall. We have seen this with queues of roughly 20 to 200 jobs, but it may happen with any number of jobs. The problem essentially takes down the submit node you are working on, and the node stays down until we are able to fix it. To avoid this, either do not write a log file at all or replace your log line with something like the following:
log = log.$(Process)
This produces a separate log file for each job in the queue.
If you don't need the log files, though, we recommend simply removing the line.
Another way to bypass the issue is to write the log file to a local disk on the submission machine. You can do this with the following line:
log = /tmp/JOBNAME.Log
After your job is complete, you can find the job log in /tmp, and you can remove it when you are done.
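A quick check before submitting can catch the risky pattern described here. The sketch below is a rough heuristic, not an SCS or Condor tool: it flags any submit description file that queues more than one job while naming a single log file that neither contains $(Process) nor lives under /tmp. It cannot tell whether the directory is really on NFS, and the function name shared_log_warning is hypothetical.

```python
import re

LOG_RE = re.compile(r"log\s*=\s*(.+)$", re.IGNORECASE)
QUEUE_RE = re.compile(r"queue(?:\s+(\d+))?\s*$", re.IGNORECASE)


def shared_log_warning(submit_text):
    """Return a warning if several jobs would share one log file, else None."""
    log_value = None
    queued = 0
    for raw_line in submit_text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        log_match = LOG_RE.match(line)
        if log_match:
            log_value = log_match.group(1).strip()  # the last log line wins
            continue
        queue_match = QUEUE_RE.match(line)
        if queue_match:
            queued += int(queue_match.group(1) or 1)  # bare "Queue" means 1 job
    if log_value is None or queued <= 1:
        return None
    if "$(process)" in log_value.lower() or log_value.startswith("/tmp/"):
        return None
    return f"{queued} jobs share log file {log_value!r}"


print(shared_log_warning("Executable = foo\nLog = foo.log\nQueue 20\n"))
```

A file that uses log = log.$(Process) or log = /tmp/JOBNAME.Log passes the check, as does one that queues a single job or has no log line at all.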