LONI Pipeline

Server Preferences Configuration

NOTE: this document was last updated for Pipeline version 5.1.0 and is no longer maintained. It is kept for reference only. For the latest documentation on this subject, please refer to our guide on the graphical server configuration tool.

  1. Hostname
  2. Temp file location
  3. Secure Temp File Location
  4. Port number
  5. Maximum Simultaneous Jobs
  6. Using privilege escalation
  7. Server Library
  8. Days to persist status & Clear old temp files enabled
  9. Log file location
  10. Persistence URL
  11. HTTP server port
  12. Failover
    1. Enabled
    2. Check Interval
    3. Retries
    4. Alias Interface
    5. Alias Sub Interface Num
  13. Directory-based executable access control
    1. Mode
    2. Users
    3. Paths
    4. Examples
  14. Grid
    1. Plugin JAR Files
    2. Plugin Class
    3. Complex Resource Attributes
    4. Maximum Submit Threads
    5. Engine native specifications
    6. Total slots & Total slots command
    7. Job Accounting
      1. URL
      2. Username
      3. Password
    8. Array Jobs
      1. Minimum Cardinality For Chunks
      2. Chunk Size
      3. Dynamic Increase
      4. Maximum Array Size
  15. User Management
    1. Limit Percent
We need to set up our preferences file. When you run the server for the first time, it creates a directory where all your preferences and logs will be stored; its location depends on your operating system. Open up your favorite text editor and paste in the following sample preferences file:
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
</preferences>
Save the file as "preferences.xml". When you launch the server, it will have the host name "cranium.loni.ucla.edu" and all temporary files will be stored in the /ifs/tmp directory. Now let's look at all the options supported by Pipeline.

3.1 Hostname

The <Hostname> element specifies the hostname of the computer that the server runs on. Note that despite its name, this element requires the fully qualified domain name of the computer, not just the hostname. For example, "mycomputername" is a hostname, whereas "mycomputer.labname.university.edu" is a fully qualified domain name.

3.2 Temp file location

The <TempFileLocation> element specifies where the intermediate files for all executed programs are stored. This directory must be accessible from the Pipeline server as well as the compute nodes. The Pipeline server creates a directory structure under it, and the compute nodes read from and write to that structure. For example, if you specify <TempFileLocation>/ifs/tmp</TempFileLocation>, Pipeline will create a directory /ifs/tmp/username/timestamp and put all the working files there, where username is the user running the server and timestamp is the time at which each workflow is translated before execution. Inside each of those timestamp folders are all the intermediate files produced by executables from submitted workflows. Depending on the number of users on your server and the kind of work they do, this directory can balloon very quickly. This property is discontinued in versions above 4.5; the SecureTempFileLocation property should be used instead.

3.3 Secure Temp File Location

This property is supported starting from version 4.5. It is the same as Temp file location, with the difference that files are stored in a secure way. To use a secure temp file location, you must specify a non-existing directory (which will be created only by Pipeline). For example, with <SecureTempFileLocation>/ifs/tmp/SecureTmpDir</SecureTempFileLocation>, Pipeline will create the directory /ifs/tmp/SecureTmpDir with special permission bits, and upon workflow submission it will create /ifs/tmp/SecureTmpDir/username/timestamp and put all the working files there. This directory will have special permissions that make its files accessible only to the user who started the workflow. IMPORTANT: Starting from version 5.0, it is mandatory to use this option instead of TempFileLocation when UsePrivilegeEscalation is true.

3.4 Port number

If no port number is specified in the preferences, the server will attempt to listen on port 8001. If you want to change the port number, use the <ServerPort> element in your preferences.xml file:
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
</preferences>

3.5 Maximum Simultaneous Jobs

As your server becomes busier, at times users will submit more jobs at once than your server has the capacity to handle. To prevent your system or cluster from grinding to a halt, you can set the maximum number of simultaneous jobs in the preferences. By default, the Pipeline server sets this value equal to the number of cores/CPUs available on your computer; for example, a computer with two quad-core processors will have a maximum of 8 simultaneous jobs. If you want to change this (because you have a grid available), you can set this preference to any value you want.
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
<MaximumThreadPoolSize>620</MaximumThreadPoolSize>
</preferences>
Take note that this will not reject jobs submitted by users after the limit has been reached; it will just queue them up until an execution slot becomes available. For grid setups, you should probably set the limit a little higher than the number of compute nodes available to you, because submitting to the grid takes a non-negligible amount of time, and it's best to keep your compute nodes crunching at all times.

3.6 Using privilege escalation

When you have different users connecting to your Pipeline server, you might want to enforce different access restrictions on each user. If you're running your Pipeline server on a Linux/Unix based system (including OS X), you can enable privilege escalation, which makes the Pipeline server issue commands as the user who submits a workflow for execution. For example, if user 'jdoe' connects to a Pipeline server with privilege escalation enabled, any command issued on behalf of that user will be prefixed with 'sudo -u jdoe'. This way, all files accessed and written by the user on the Pipeline server are accessed and written as 'jdoe'.

There is no harm in leaving privilege escalation disabled on your Pipeline server. All files will simply be created and read as the Pipeline server user, giving all users uniform access to your system. Additionally, it makes it easy to lock down the access of all Pipeline users, because you only have to lock down one actual user on your system: the Pipeline user.

To enable this feature in Pipeline, you need to do two things: 1) add the <UsePrivilegeEscalation> preference to your preferences file with a value of "true", and 2) modify your system's sudoers file to allow the user that runs the Pipeline server to sudo as any user that will be allowed to connect to the system. How to modify the sudoers file is outside the scope of this guide, but if you want/need this feature you probably already know how to do it. Now your preferences should look something like this:
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
<MaximumThreadPoolSize>620</MaximumThreadPoolSize>
<UsePrivilegeEscalation>true</UsePrivilegeEscalation>
</preferences>
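For reference, a minimal sudoers sketch follows. This is an illustration only, not a recommended policy: the system user name 'pipeline' is an assumption, and your site's security requirements may call for much narrower rules.
# /etc/sudoers entry (edit with visudo) letting the assumed 'pipeline'
# system user run commands as any user without a password prompt
pipeline ALL=(ALL) NOPASSWD: ALL
A safer variant restricts the run-as user list to the accounts that are actually allowed to connect to the Pipeline server.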

3.7 Server Library

When Pipeline client users connect to a server, the client syncs up the library of module definitions available on that server. The location of that library on the server is specified by the <ServerLibraryLocation> element in the preferences. The default location is OS-dependent, so you don't need to specify this preference if you're happy with the default. When the server starts up, it reads in all the .pipe files in the ServerLibraryLocation directory (and all its subdirectories) and monitors it for changes/additions in any of the files while it runs.

Starting from Pipeline v5.0 there is a new preference, ServerLibrarySameDirMonitor, which allows you to specify a monitoring file/directory other than the library directory. ServerLibrarySameDirMonitor is a boolean preference which defaults to true. When it is set to false, Pipeline will look for an alternate monitoring file path, which must be declared in the ServerLibraryMonitorFile preference. For example, consider the following preferences:
<ServerLibraryLocation>/ifs/lib</ServerLibraryLocation>
<ServerLibrarySameDirMonitor>false</ServerLibrarySameDirMonitor>
<ServerLibraryMonitorFile>/ifs/monitorFile</ServerLibraryMonitorFile>
Because ServerLibrarySameDirMonitor is false, the Pipeline server will not update its library when something changes in ServerLibraryLocation (/ifs/lib); it will only update when the ServerLibraryMonitorFile (/ifs/monitorFile) file is modified.

Put all the module definitions that you want to make available to users into the ServerLibraryLocation directory, and when clients connect they will obtain a copy of the library on their local system. If you add/delete/change any of the definitions in this directory, the server will automatically see the change (no restart required) and synchronize clients again when they reconnect. Even clients connected during the change will get the new version of the server library without reconnecting.

Remember that changes must be reflected on the root directory, otherwise the server will not notice the change and the server library files will not be updated. For example, if you have a pipe file at ServerLib->LONI->Modules->example.pipe and you change only example.pipe, then although the Modules directory and example.pipe will have a new modified time, the ServerLib directory (the root in our case) will not change its modification time, so you have to change the ServerLib modification time manually. After updating server library files, check the output stream of the server; you should see a log like this:
Loading server library..........................DONE [1100ms]
If this log appears, the server captured the change in the library; otherwise the library has not been updated.
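For example, assuming your ServerLibraryLocation is /ifs/lib as in the example above, a simple way to update the root directory's modification time is the touch command:
$ touch /ifs/lib
After running it, check the server's output stream for the "Loading server library" log line shown above.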

3.8 Days to persist status & Clear old temp files enabled

The <DaysToPersistStatus> specifies the number of days a workflow may keep running. Every 24 hours, the Pipeline server checks for and cleans up workflow sessions older than the specified number of days. The default value is 30 days. When a session is cleared, all its temporary files under the temporary directory are removed. If <ClearOldTempFilesEnabled> is set to true, any temporary session directories older than two times <DaysToPersistStatus> will also be removed. This should not happen under normal circumstances, because the persistence database keeps track of all sessions and no temporary directories older than <DaysToPersistStatus> should exist; it only applies when the Pipeline server restarts with its persistence database manually deleted. The default is false.
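As a sketch, the corresponding preferences entries might look like this (the values are illustrative; 30 is the default for <DaysToPersistStatus>):
<DaysToPersistStatus>30</DaysToPersistStatus>
<ClearOldTempFilesEnabled>true</ClearOldTempFilesEnabled>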

3.9 Log file location

If you want to explicitly set the directory your log files are written to, specify the path with the <LogFileLocation> preference. To define the prefix with which the log files are named, simply add it to the end of the directory path. A unique number denoting each log file will be appended to the file name.
<LogFileLocation>/nethome/users/pipelnv4/server/events.log</LogFileLocation>
In the above example, log files will be created in the /nethome/users/pipelnv4/server/ directory, and will be named events.log.0, events.log.1, and so forth.

3.10 Persistence URL

The Pipeline server uses HSQLDB to store information, including workflow status and module status. By default, this data is stored in the Pipeline server's memory and is removed when the Pipeline server stops. Alternatively, you can start an HSQLDB server and make it save to an external file. You can download the jar file from the HSQLDB website. To start an HSQLDB process, run something like this:
java -cp ./lib/hsqldb.jar org.hsqldb.Server -database.0 file:/user/foo/mydb -dbname.0 xdb
After successfully starting HSQLDB, add <PersistenceURL> to the Pipeline server's preferences file, something like the following:
<PersistenceURL>jdbc:hsqldb:hsql://localhost/xdb</PersistenceURL>

3.11 HTTP server port

The <HTTPServerPort> specifies the port number on which the Pipeline server provides an API for querying workflow data, including the session list, session status, and output files. It is helpful when you (or your program) want to query workflows on the Pipeline server without a Pipeline client. Please note that once enabled, it does not require any login authorization to see any workflows on the server. By default, this feature is not enabled on the Pipeline server. For example, we have a preference file like this:
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cerebro-rsn2.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<HTTPServerPort>8021</HTTPServerPort>
</preferences>
When the server is running, you can go to http://cerebro-rsn2.loni.ucla.edu:8021/ and it shows an XML file listing all the APIs. Currently there are five functions.

getSessionsList returns all the active sessions on this Pipeline server. It does not take any arguments, and the query URL looks like this:
http://cerebro-rsn2.loni.ucla.edu:8021/getSessionsList
The Pipeline server returns an XML file listing all the active sessions, with their session IDs:
<sessions count="1">
<session>
cerebro-rsn2.loni.ucla.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492
</session>
</sessions>
getSessionWorkflow returns the workflow file (.pipe file). It takes a session ID as an argument. The query URL looks like this:
http://cerebro-rsn2.loni.ucla.edu:8021/getSessionWorkflow?sessionID=cerebro-rsn2.loni.ucla.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492

getSessionStatus returns the status of the workflow execution: when it started, whether it has finished, what time it finished, which nodes and instances are in the workflow, and, for each node, whether it finished successfully. The query URL looks like this:
http://cerebro-rsn2.loni.ucla.edu:8021/getSessionStatus?sessionID=cerebro-rsn2.loni.ucla.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492

getInstanceCommand returns the command of the execution. It takes a session ID, a node name (which can be found by calling getSessionStatus), and an instance number (which can also be found by calling getSessionStatus). The query URL looks like this:
http://cerebro-rsn2.loni.ucla.edu:8021/getInstanceCommand?sessionID=cerebro-rsn2.loni.ucla.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492&nodeName=BET_0&instanceNumber=0

getOutputFiles returns the paths of the output files generated by a node. It takes a session ID, a node name, an instance number, and a parameter ID. The query URL looks like this:
http://cerebro-rsn2.loni.ucla.edu:8021/getOutputFiles?sessionID=cerebro-rsn2.loni.ucla.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492&nodeName=BET_0&instanceNumber=0&parameterID=BET.OutputFile_0
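Since these are plain HTTP GET requests, any HTTP client can call them. A minimal sketch using curl (the URL is quoted because some queries contain & characters):
$ curl "http://cerebro-rsn2.loni.ucla.edu:8021/getSessionsList"
$ curl "http://cerebro-rsn2.loni.ucla.edu:8021/getSessionStatus?sessionID=cerebro-rsn2.loni.ucla.edu:8020-453da129-c81b-4473-9fc0-8fe03481e492"
Each call returns the XML document described above.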

3.12 Failover

Starting from version 4.2, Pipeline has a failover feature, supported only on UNIX/Linux machines. Failover improves robustness and minimizes service disruptions in the case of a single Pipeline server failure. This is achieved by using two actual servers, a primary and a secondary, a virtual Pipeline server name, and by de-coupling the persistence database and running it on a separate system. The two servers monitor each other's state. In the event that the primary server holding the virtual Pipeline server name has a catastrophic failure, the secondary server will assume the virtual name, establish a connection to the persistence database, and take ownership of all current Pipeline jobs dynamically.

Requirements:
- A minimum of 3 separate hosts.
- A virtual IP address for the server.
- The user who runs Pipeline must have full access to execute the ifconfig command.

To configure failover, copy the Pipeline server files to two different hosts, say Host A and Host B. The database must run on a third host (Host C). Let's say the address of the server will be server1.loni.ucla.edu and the address of the database will be database.loni.ucla.edu:9002 with the name xdb.

Steps to configure failover (a sample preferences file is shown after this list):
1. Open the preferences.xml file on both hosts (Host A and Host B).
2. Add the Hostname preference (for example, virtualName.loni.ucla.edu).
3. Add the Failover Enabled option and set it to true.
4. Optional: Add the Failover Check Interval preference.
5. Optional: Add the Failover Retries preference.
6. Optional: Add the Failover Alias Interface preference.
7. Optional: Add the Failover Alias Sub Interface Num preference.
8. Add the Persistence URL preference and point it to the database server, Host C (for our example, jdbc:hsqldb:hsql://database.loni.ucla.edu:9002/xdb).
9. Save the files and close them.
10. Before starting the servers, go to Host C and start the database.
11. Start the server on Host A. On startup it will check whether a Pipeline server is already running with the specified hostname address. In our case it is the first server started, so it will switch to Master mode and be fully functional.
12. Check the output stream of Host A's server and ensure that the server started successfully.
13. Go to Host B and start the server. This server will check the specified hostname address and, since it is already in use (in our case it should be), it will switch to Slave mode and wait until Host A crashes. When Host A goes down, this server will wake up and continue Host A's work.

How it works: the server on Host B pings the server on Host A every <FailoverCheckInterval> milliseconds. When there is no response, it retries the ping <FailoverRetries> times, and if all retries are unsuccessful, Host B creates an IP alias on the network interface specified by <FailoverAliasInterface> and <FailoverAliasSubInterfaceNum> and switches to Master mode.
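Putting the steps together, a sample preferences.xml for Host A and Host B might look like this (a sketch using the example addresses above; the interval and retry values shown are the defaults):
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>virtualName.loni.ucla.edu</Hostname>
<FailoverEnabled>true</FailoverEnabled>
<FailoverCheckInterval>5000</FailoverCheckInterval>
<FailoverRetries>3</FailoverRetries>
<PersistenceURL>jdbc:hsqldb:hsql://database.loni.ucla.edu:9002/xdb</PersistenceURL>
</preferences>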

3.12.1 Failover Enabled

The <FailoverEnabled> element indicates whether the failover feature is enabled. It accepts the boolean values true or false. By default, if this preference does not exist, Pipeline sets it to false.

3.12.2 Failover Check Interval

The <FailoverCheckInterval> specifies the interval in milliseconds at which the secondary server pings the master server. If nothing is specified, Pipeline uses the default value of 5000.

3.12.3 Failover Retries

The <FailoverRetries> specifies the number of ping retries before the secondary server starts as master when a ping fails. If nothing is specified, Pipeline uses the default value of 3.

3.12.4 Failover Alias Interface

The <FailoverAliasInterface> specifies the name of the interface on which Pipeline will create a sub-interface for IP aliasing. If nothing is specified, Pipeline will automatically find the primary network interface and the first available sub-interface number and add the IP alias to it. For example, if your primary interface is eth0 and eth0:0 and eth0:1 are busy with other IP addresses, Pipeline will use eth0:2. WARNING: If one of the sub-interfaces already holds the IP address of the specified Hostname, Pipeline will give an error and exit.

3.12.5 Failover Alias Sub Interface Num

The <FailoverAliasSubInterfaceNum> specifies the sub-interface number on which Pipeline should create the alias IP address. If nothing is specified, Pipeline will automatically find the first available sub-interface number and add the IP alias to it. For example, if your primary interface is eth0 and eth0:0 and eth0:1 are busy with other IP addresses, Pipeline will use eth0:2. WARNING: If one of the sub-interfaces already holds the IP address of the specified Hostname, Pipeline will give an error and exit.

3.13 Directory-based executable access control

To improve security, directory-based Boolean access control for permitted executables was implemented. This is an extra layer on top of the operating system's authentication and access control. Restricted users are not allowed to run executables outside the specified directories and/or not allowed to browse the file system using the remote file browser.

3.13.1 Directory Access Control Mode

The <DirAccessControlMode> is an integer which indicates the access control configuration for running executables and the remote file browser. Below is a matrix of the modes and their meanings.
Mode | Remote File Browser Access Control | Executables Access Control
0    | Never                              | Never
1    | Never                              | No with exceptions
2    | Never                              | Yes with exceptions
3    | No with exceptions                 | No with exceptions
4    | Yes with exceptions                | Yes with exceptions
5    | Same as Shell permissions          | No with exceptions
6    | Same as Shell permissions          | Yes with exceptions
7*   | Same as Shell permissions          | Same as Shell permissions

* Available starting from Pipeline version 4.2.2

Never means the Pipeline server will not apply any access control restrictions for any user. Note this does not affect the operating system's authentication and access control; in other words, the credentials required to connect to the Pipeline server and the rights required to execute programs are not affected by the settings here. No with exceptions means access control is disabled for all users except those listed in Directory Access Control Users, who will be restricted. Yes with exceptions means all users will be restricted except those listed in Directory Access Control Users, who will be allowed. Same as Shell permissions means the remote file browser will act as if the user had logged in to the server using a shell.

3.13.2 Directory Access Control Users

The <DirAccessControlUsers> is a list of users separated by commas (e.g. john,bob,mike) which indicates the conditional users. Depending on the Directory Access Control Mode, these users will be restricted or allowed.

3.13.3 Directory Access Control Paths

The <DirAccessControlPaths> is a list of directories separated by commas (e.g. /usr/local,/usr/bin); these are the only directories allowed for restricted users.

3.13.4 Examples

For example, if we want to restrict users john, bob, and mike to execute programs only in /usr/local and /usr/bin, and let every user browse with the remote file browser as a shell would, we would have these configurations:
<DirAccessControlMode>5</DirAccessControlMode>
<DirAccessControlUsers>john,bob,mike</DirAccessControlUsers>
<DirAccessControlPaths>/usr/local,/usr/bin</DirAccessControlPaths>
For another example, if we want to restrict all users to execute programs only in /usr/local and /usr/bin, but allow users john, bob, and mike to run without restrictions, and let every user browse with the remote file browser as a shell would, we would have these configurations:
<DirAccessControlMode>6</DirAccessControlMode>
<DirAccessControlUsers>john,bob,mike</DirAccessControlUsers>
<DirAccessControlPaths>/usr/local,/usr/bin</DirAccessControlPaths>

3.14 Grid

3.14.1 Grid Plugin JAR Files

Starting from version 4.4, developers have the opportunity to create their own plugins for Pipeline to communicate with various grid managers (see also the Pipeline Grid Plugin API Developers Guide). The Pipeline package contains two built-in plugins for Oracle Grid Engine (previously known as Sun Grid Engine): JGDIPlugin and DRMAAPlugin. In the installed Pipeline package, under the lib directory, there is a directory called plugins where you can find these two plugins. IMPORTANT: Starting from version 4.4, the grid plugin options must be set to make the Pipeline server work with grid resource managers; otherwise the Pipeline server will start all jobs on the same host where the server is located. To configure Pipeline to use one of the default plugins, you need to add special tags to the preferences.xml file. The first tag is <GridPluginJARFiles>, which should contain the paths to the plugin JAR file and the libraries it uses, separated by commas. For example, if you want to use the built-in JGDI or DRMAA plugins, your preferences file will look like the following:
JGDI Plugin
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
<MaximumThreadPoolSize>620</MaximumThreadPoolSize>
<GridPluginJARFiles>/usr/pipeline/dist/lib/plugins/JGDIPlugin.jar,
/usr/pipeline/dist/lib/plugins/jgdi.jar</GridPluginJARFiles>
</preferences>
DRMAA Plugin
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
<MaximumThreadPoolSize>620</MaximumThreadPoolSize>
<GridPluginJARFiles>/usr/pipeline/dist/lib/plugins/DRMAAPlugin.jar,
 /usr/pipeline/dist/lib/plugins/drmaa.jar</GridPluginJARFiles>
</preferences>
IMPORTANT: Some plugins need to be on the class path. For example, the DRMAA plugin requires you to put the path of drmaa.jar on the class path when starting the server. So to start the server with the DRMAA plugin you need something like:
$ java -cp .:/usr/pipeline/dist/lib/plugins/drmaa.jar:Pipeline.jar server.Main
This tag alone is not enough to have plugins enabled and ready to use; you also need to set the Grid Plugin Class tag.

3.14.2 Grid Plugin Class

This tag should contain the class name of the plugin used by Pipeline. The following are the class names for the built-in plugins.
JGDI Plugin
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
<MaximumThreadPoolSize>620</MaximumThreadPoolSize>
<GridPluginJARFiles>/usr/pipeline/dist/lib/plugins/JGDIPlugin.jar,
/usr/pipeline/dist/lib/plugins/jgdi.jar</GridPluginJARFiles>
<GridPluginClass>jgdiplugin.JGDIPlugin</GridPluginClass>
</preferences>
DRMAA Plugin
<?xml version="1.0" encoding="UTF-8"?>
<preferences>
<Hostname>cranium.loni.ucla.edu</Hostname>
<ServerPort>8020</ServerPort>
<TempFileLocation>/ifs/tmp/</TempFileLocation>
<MaximumThreadPoolSize>620</MaximumThreadPoolSize>
<GridPluginJARFiles>/usr/pipeline/dist/lib/plugins/DRMAAPlugin.jar,
 /usr/pipeline/dist/lib/plugins/drmaa.jar</GridPluginJARFiles>
<GridPluginClass>drmaaplugin.DRMAAPlugin</GridPluginClass>
</preferences>

3.14.3 Grid Complex Resource Attributes

Pipeline 4.4 has a new feature that checks for jobs which were submitted by Pipeline but are no longer monitored by it. This happens when a job is in the submission process and the server shuts down: if the job submission completes while Pipeline is down, the job ID is never written to the Pipeline database, which means the job will occupy a slot but Pipeline will not "remember" its ID. When the server restarts, it gets the list of running jobs on the cluster and compares it with its database. To determine which jobs were submitted by the current server, Pipeline uses Grid Complex Resource Attributes; when Pipeline finds jobs which were submitted by the current Pipeline but are out of its control, it deletes them to free up the slots. This tag lets you assign custom complex attributes to all jobs submitted by the server, which makes the jobs identifiable. You can have multiple values in the tag, separated by commas. For example:
<GridComplexResourceAttributes>pipeline,
serverId=server1</GridComplexResourceAttributes>
The above defines two attributes: 1) pipeline, which is equal to TRUE, and 2) serverId, which is equal to server1. This tag is just a definition of complex attributes; in order to use them you have to include _pcomplex in the Grid engine native specifications. In our case, _pcomplex will be replaced with -l pipeline -l serverId=server1 when submitting the job to the grid. Note that the grid manager has to be configured properly to accept jobs with the given resource attributes.

3.14.4 Grid Maximum Submit Threads

Starting from version 4.4 it is possible to configure the number of parallel job submissions. Set this parameter to 1 to submit jobs one by one, or to any other number to allow that many parallel submissions.
<GridMaxSubmitThreads>10</GridMaxSubmitThreads>
The example will allow a maximum of 10 parallel submissions at a time.

3.14.5 Grid engine native specifications

If you have a grid at your disposal, you'll probably want to take advantage of it for your processing. The LONI Pipeline server can do this if you have a plugin attached to it. Once you've set up your grid engine, you might need to specify a native specification string that goes along with your job submissions (if none of that makes any sense, just skip this preference, because you don't need to use it on your server). To set the native specification string, make sure that you have set your server to use plugins and place the string inside <GridEngineNativeSpecification>. On the LONI Pipeline server we use the following native spec preference:
<GridEngineNativeSpecification>-shell y -S /bin/csh -q pipeline.q -l pipeline -N _pjob </GridEngineNativeSpecification>
By default, grid plugins are disabled; you must set Grid Plugin JAR Files and Grid Plugin Class if you want the Pipeline server to use your grid engine. The native spec you should use for your installation will vary, but if you're using an Oracle Grid Engine (previously known as Sun Grid Engine) installation and you want to use the same string, you'll want to change -q pipeline.q to reflect the submission queue (if any) that you will be using. Optionally, you can add _pmem and _pstack to the GridEngineNativeSpecification tag: _pmem lets the user define the maximum memory per module, and _pstack lets the user define the stack size. Both can be configured by the user with the latest Pipeline client, and both fall back to the grid engine's defaults unless the user specifies them. Starting from version 4.4, if you want to use Grid Complex Resource Attributes you can also add _pcomplex, which refers to the Grid Complex Resource Attributes tag.
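As a sketch, a native specification that also enables the memory, stack, and complex-attribute placeholders might look like this (the exact placement of the placeholders within the string is an assumption; adapt the queue name to your installation):
<GridEngineNativeSpecification>-shell y -S /bin/csh -q pipeline.q -l pipeline -N _pjob _pmem _pstack _pcomplex</GridEngineNativeSpecification>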

3.14.6 Grid total slots and Grid total slots command

The <GridTotalSlots> specifies the total number of grid slots for the cluster. This lets connected users see how busy the server is in terms of the number of jobs running on the grid versus the total number of slots available. Alternatively, you can use the <GridTotalSlotsCmd> tag, which contains a command-line query that returns the total number of available slots for the queue; refer to your cluster management documentation for the appropriate query. With this tag, the server periodically queries the grid engine for the latest number of available slots, updates the number automatically, and broadcasts the new number to clients.
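As a sketch, either of the following could be used (the slot count is illustrative, and the command shown is one possible Oracle Grid Engine query, where the TOTAL column of qstat -g c is extracted for the pipeline.q queue; verify it against your own cluster before relying on it):
<GridTotalSlots>400</GridTotalSlots>
or
<GridTotalSlotsCmd>qstat -g c | grep pipeline.q | awk '{print $6}'</GridTotalSlotsCmd>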

3.14.7.1 Grid Job Accounting URL

After the Pipeline server restarts, some jobs may have already finished or changed their status. These events were not caught because the Pipeline server was not running at the time. To get the status of these "missed" events, Pipeline reads from a configured Sun Accounting and Reporting Console (ARCo) database. Note this feature has only been tested with Oracle Grid Engine (previously known as Sun Grid Engine) using the JGDI and DRMAA plugins; if you are using another grid manager and it does not work, please report it on our Pipeline forum. Assuming the ARCo database is configured and running (refer to Sun's website and your system administrator for help), you configure it in Pipeline by putting the database URL, username, and password in the preferences.xml file.
<GridJobAccountingURL>jdbc:mysql://hostname/db_name</GridJobAccountingURL>
hostname is the address of the host where the ARCo database is running (e.g. arco.loni.ucla.edu) and db_name is the name of the database (e.g. cranium_db).

3.14.7.2 Grid Job Accounting Username

This tag should contain the username to connect to the ARCo database. The preferences.xml should contain the following line:
<GridJobAccountingUsername>username</GridJobAccountingUsername>

3.14.7.3 Grid Job Accounting Password

This tag should contain the password of the username declared in <GridJobAccountingUsername>.
<GridJobAccountingPassword>password</GridJobAccountingPassword>
Note that this password is stored as clear text in preferences.xml, which is not secure. It is recommended to restrict other users' access to the preferences file.
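Put together, the job accounting block in preferences.xml might look like this (a sketch using the illustrative values above; remember that the password is stored in clear text):
<GridJobAccountingURL>jdbc:mysql://arco.loni.ucla.edu/cranium_db</GridJobAccountingURL>
<GridJobAccountingUsername>username</GridJobAccountingUsername>
<GridJobAccountingPassword>password</GridJobAccountingPassword>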

3.14.8.1 Grid Use Array Jobs

Job submission can take a very long time when each instance of a module is submitted as an individual job. Array jobs solve this problem and dramatically improve job submission speed. Starting from version 5.1, it is possible to configure the server to submit an array of jobs instead of individual jobs. This feature improves submission speed, especially for modules with hundreds or thousands of instances: depending on the module cardinality, there is a 10%-65% speed improvement using array jobs versus individual jobs. To enable array jobs, the GridUseArrayJobs preference has to be set to true. NOTE: As of now, Pipeline's JGDI plugin is the only plugin known to support array jobs; in order to use array job functionality, you'll need to have the JGDI plugin configured for Pipeline.
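To turn the feature on, add the preference to preferences.xml:
<GridUseArrayJobs>true</GridUseArrayJobs>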

3.14.8.2 Grid Array Jobs Minimum Cardinality For Chunks

When a workflow containing a module with a large number of instances is started, submitting the array job as a small number of tasks at the beginning is time efficient. Before submitting a job array, Pipeline has to create a special script and configure it for each instance; this procedure takes time, and during the preparation the grid engine would be idle. For example, say we have a module with 1000 instances: it takes less time to prepare 50 jobs for submission than 1000 jobs. This preference tells Pipeline to split into chunks the instances of modules whose cardinality is X or more, where X is the value of this (GridArrayJobsMinCardinalityForChunks) parameter. It is a positive integer indicating the minimum cardinality a module must have in order to be split into chunks. If we set this parameter to 200, Pipeline will first submit a smaller chunk of the job array (the size of the chunk is configurable, see GridArrayJobsChunkSize) and then continue submitting the others. This time, while the next instances are being prepared, the grid engine will not be idle; it will already have the first chunk to process.

3.14.8.3 Grid Array Jobs Chunk Size

This preference sets the size of the array job chunks (see GridArrayJobsMinCardinalityForChunks for more info).
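For example, with the values used in the example above (split any module with 200 or more instances, starting with chunks of 50):
<GridArrayJobsMinCardinalityForChunks>200</GridArrayJobsMinCardinalityForChunks>
<GridArrayJobsChunkSize>50</GridArrayJobsChunkSize>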

3.14.8.4 Grid Array Jobs Dynamic Increase

This preference applies when GridArrayJobsMinCardinalityForChunks is set and a module has to be divided into chunks. After the first chunk, it can be better to increase the chunk size dynamically: once the first chunk has been submitted, the grid engine is no longer idle and we can afford to submit larger arrays. This preference expects a boolean value (true or false) that tells Pipeline whether the chunk size should be dynamically increased. If set to true, then after the first submission Pipeline will multiply the initial GridArrayJobsChunkSize value by 2 on each iteration. For example, if we have a module with 1000 instances, the chunk size is 50, and the minimum cardinality for chunks is 200, Pipeline will submit array jobs with the following sizes:
Array Job # | Number of instances | Total submitted so far
1           | 50                  | 50
2           | 100                 | 150
3           | 200                 | 350
4           | 400                 | 750
5           | 250                 | 1000
When this preference is set to true, it is recommended to also use the GridArrayJobsMaxArraySize preference.

3.14.8.5 Grid Array Jobs Maximum Array Size

This preference expects a positive integer indicating the maximum array job size when the job size is dynamically increased. If not set, the default is 400. If set, Pipeline will not submit array jobs bigger than the specified size, with one exception: during submission Pipeline checks how many instances remain to be submitted, and when the remainder exceeds the limit by less than 10% of the total, the remaining instances are carried in the last job array. Here is an example with a total of 768 instances:
Array Job # | Number of instances | Total submitted so far / Remaining
1           | 50                  | 50 / 718
2           | 100                 | 150 / 618
3           | 200                 | 350 / 418
4           | 418                 | 768 / 0
The last array, 418, is more than the specified 400, but 418-400=18, which is less than 10% of the 768 total instances (76.8).
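As a sketch, the two preferences might be set together like this (400 is the default maximum; the <GridArrayJobsDynamicIncrease> tag name is assumed from the section title above and should be verified against your Pipeline version):
<!-- dynamic-increase tag name assumed from the section title; verify before use -->
<GridArrayJobsDynamicIncrease>true</GridArrayJobsDynamicIncrease>
<GridArrayJobsMaxArraySize>400</GridArrayJobsMaxArraySize>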

3.15 User Management

Pipeline version 5.1 introduces a new user management feature. It uses a special algorithm which tries to share the available resources fairly among all users. To enable user management, the EnableUserManagement and UserManagementLimitPercent preferences have to be set. EnableUserManagement expects a boolean value and turns user management on or off. UserManagementLimitPercent is an integer in the 0-100 range. How it works: when enabled, each user can occupy only X percent of the free slots at any moment, where X is the number specified by the UserManagementLimitPercent preference. For example, say Pipeline has 150 total slots and UserManagementLimitPercent is set to 50. The first user, user A, can use 50% of 150, so 75 slots. Then user B can use 50% of the free slots, which is 50% of (total 150) - (user A's 75) = 50% of (free 75) = 37, and so on. The Pipeline server constantly monitors user usage and adjusts each user's limit.
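For example, to enable user management with the 50% limit used in the example above:
<EnableUserManagement>true</EnableUserManagement>
<UserManagementLimitPercent>50</UserManagementLimitPercent>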

3.15.1 User Management Limit Percent

Please check User Management for more details.