AutoFlow
Description
Components
AutoFlow has three main modules: Stack, Queue Manager and Workflow Smith (Provider). The Stack module takes a plain-text file with a workflow description and identifies the tasks to execute, building Batches (with one task or many, depending on the presence or absence of iterative marks). Once the Stack builds the Batches and generates the atomic Tasks (with their dependencies and hardware/software resources), they are transferred to the Queue Manager, which either builds a shell script to execute them or communicates with queue-system software (on a supercomputer/HPC cluster) to submit the entire workflow, delegating its execution to that software. The Workflow Smith (wf_smith or wf_provider) is in charge of supplying and managing the different virtualizations (venv, Anaconda or containers) needed by each task.
How does AutoFlow work?
AutoFlow takes a plain-text template file as input (1), which describes each task to be executed and how the tasks relate to each other so that they run in the correct order. AutoFlow then executes the workflow template within the HPC/supercomputer (2): it identifies each task and creates the folder structure on the storage media. Next, it sends all the tasks to the queue system (3). Finally, the queue system executes each task, taking the dependencies into account to obtain the results successfully.
First, AutoFlow identifies all the tasks and obtains the dependencies between them (1). Then, AutoFlow creates an execution folder with a dedicated subfolder for each task. Each subfolder holds an sh script that contains the code of the task to be executed (2). All the scripts are sent to the queue system to be executed (3). The script execution produces the results for each task (4). The script code assumes that temporary files and result files must be placed in the task subfolder. The input data is taken from another task's subfolder or from an external data source given by the user.
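The invocation described above can be sketched as a single command. This is a hedged sketch: the template file name template.txt is a hypothetical example, and only the -w flag documented later in this guide is used.

```shell
# Parse template.txt, create the execution folder tree with one
# subfolder and sh script per task, and submit everything for execution.
AutoFlow -w template.txt
```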
Basic template syntax
List_dir){
#Initialize: define the environment and run small preparation tasks
? #Separator between initialize and main command
ls > out
}
Show_list){
#Initialize
? #Separator between initialize and main command
cat List_dir)/out
}
An AutoFlow template (orange box) describes each of the tasks to be executed (yellow boxes). Each task must begin with its unique task-name identifier (red text) followed by the ) parenthesis character. Then, the body of the task, with the code to be executed, must be declared between { } characters. This body is marked as grey and purple text: the former is comment text for user orientation and the latter is the code to be executed. The task body is split into two sections by the ? character on its own line. The initialize section, before the ? character, is meant for minor operations that prepare the execution of the task's main software. The section after the ? character is the main command, where the code that executes the task's main software must be written. This separation has NO impact on execution; it is only used to name the task folder and in the parsed code summaries shown by AutoFlow. The initialize section may be empty, but the ? line must ALWAYS be present to declare the main command section.
Finally, the dependencies between tasks are specified implicitly using the task names. As seen in the example template, the Show_list task has in its main command a cat instruction that needs an input path. Instead of specifying an absolute or relative path, we use the task name of the List_dir task with the ) character (blue text) to build the input path. AutoFlow automatically replaces this with the absolute path of the List_dir task folder, which contains the full execution of that task. In this way, the user specifies how the workflow tasks are interconnected and does not need to worry about where each task is executed.
Workflow Execution
Executing basic example
We will execute the previous example using the -v flag to obtain a dry run that describes the full execution.
The dry run parses the workflow template and generates the whole folder tree, with an sh script per task, but the scripts are not executed.
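Assuming the template above is saved as template.txt (a hypothetical file name), the dry run can be sketched as:

```shell
# -v: dry run; the folder tree and the task scripts are generated,
# but no script is actually executed.
AutoFlow -w template.txt -v
```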
Click to see results
List_dir >
ls > out
exec/ls_0000 False
Show_list >
cat exec/ls_0000/out
exec/cat_0000 False
List_dir
AutoFlow parses the workflow template, creates all the folders and task scripts, and shows the resulting task list as it will be executed (including the final absolute paths, represented here by the / character). The red lines show the original task name and the main command that represents each task (a task may contain several operations, but conceptually only one generates the results of interest); they mark the identification of a task, whose attributes follow on the indented lines. The yellow lines show the main command fully parsed, with the paths to the other tasks where needed and other elements such as resources, AutoFlow variables, etc. Note that the initialize command is NOT shown, to keep the task view summarized. The blue lines indicate the folder in which the task will be executed, together with a boolean that indicates whether the task is marked as commented (if it shows True, the task will not be executed, but its results will still be taken into account by other tasks). The green lines (when listed) show the dependencies of the task: the task will not be executed until the tasks it depends on are done, and there is one line with a task name per dependency.
Handling iterative tasks
When we work with workflows, we usually need to repeat one task several times with different parameters, samples, algorithms, etc. In that case, we would need a template like the following to repeat the task:
List_home){
#Initialize
?
ls /home > out
}
List_etc){
#Initialize
?
ls /etc > out
}
List_var){
#Initialize
?
ls /var > out
}
We would write one task for each of the items we are using: home, etc and var (these are the iterable items; as a set, we call them the iterator).
But AutoFlow has specific syntax to handle this situation, avoiding code redundancy (and adapting dynamically if the iterator set changes):
List_[home;etc;var]){
#Initialize
?
ls /(*) > out
}
The iterable items are enumerated in the list structure [home;etc;var] and AutoFlow iterates over them, generating one task per item. The task code (initialize or main command) must contain the (*) expression, which will be replaced with each item. In this way we get a batch of tasks that differ in only one parameter, but each one has its own folder and script. This can be observed when the workflow is executed:
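Conceptually, the expansion that AutoFlow performs on the [home;etc;var] iterator is equivalent to this plain POSIX shell loop (a runnable sketch for illustration, not AutoFlow code):

```shell
# Each item of the iterator produces one task whose (*) mark is
# replaced by the item, e.g. 'ls /home > out' for the 'home' item.
for item in home etc var; do
  echo "List_${item}: ls /${item} > out"
done
```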
Click to see results
List_home >
ls /home > out
exec/ls_0000 False
List_etc >
ls /etc > out
exec/ls_0001 False
List_var >
ls /var > out
exec/ls_0002 False
Working with iterative tasks and dependencies
As we have seen, we can use specific syntax to handle repetitive tasks and perform an iterative task. But what about connecting other tasks to this iterative task? First, it is worth mentioning that there are two possibilities:
In the first case (left), we need to perform a new task for each of the generated iterative tasks; basically, we need to connect a new iterative task to a previous iterative task. In the second case (right), we need a single task that collects the results from all the tasks of the iterative task. Each case has a specific syntax. For iterative-to-iterative tasks:
Show_[home;etc;var]){
#Initialize
?
cat !List_*!/out
}
Here we have the classic iterative task, but it uses !List_*! (the !ITERATIVE_TASK_NAME*! expression) to indicate that we want to connect each previous task to each new one. To connect a set of iterative tasks to a single task, we have the following:
Show){
#Initialize
?
cat !List_!/out
}
Here we have a single task node that uses !List_! (the !ITERATIVE_TASK_NAME! expression) to collect the paths to the specified file in each previous task and insert them into the current task. The following shows the execution of the iterative dependency case:
Click to see results
List_home >
ls /home > out
exec/ls_0000 False
List_etc >
ls /etc > out
exec/ls_0001 False
List_var >
ls /var > out
exec/ls_0002 False
Show_home >
cat exec/ls_0000/out
exec/cat_0000 False
List_home
Show_etc >
cat exec/ls_0001/out
exec/cat_0001 False
List_etc
Show_var >
cat exec/ls_0002/out
exec/cat_0002 False
List_var
Now, we apply this to the iterative-to-single dependency case:
Click to see results
List_home >
ls /home > out
exec/ls_0000 False
List_etc >
ls /etc > out
exec/ls_0001 False
List_var >
ls /var > out
exec/ls_0002 False
Show >
cat exec/ls_0000/out exec/ls_0001/out exec/ls_0002/out
exec/cat_0000 False
List_home
List_etc
List_var
Using static variables in workflow templates
The workflow syntax described previously only allows building workflow templates with hardcoded input. To allow dynamic assignment of input paths or parameters, we can use AutoFlow static variables:
$folder=/var
List_dir){
#Initialize
?
ls $folder > out
}
We can define workflow static variables as $VARIABLE_NAME=value (in red in the example template) and use them anywhere in the template. They are string variables, so they can contain simple parameters, paths, iterators, full task nodes... anything. With the given example, one could think that the problem remains the same, because the variable declaration is hardcoded in the template. This is true, but we can use the -V flag to override the template declaration (or simply to supply the declaration when the template lacks it). The syntax of the -V flag is '$VAR1=value1,$VAR2=value2,..':
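The override shown in the results below could be sketched as (the template file name is a hypothetical example):

```shell
# Override the $folder declaration in the template so the listing
# is performed on /home instead of /var.
AutoFlow -w template.txt -v -V '$folder=/home'
```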
Click to see results
List_dir >
ls /home > out
exec/ls_0000 False
As shown, the ls command is executed on the /home folder instead of the originally declared /var folder.
Special attributes to modify task execution behaviour
There is a set of special characters that, when placed before the task name in its definition, change the task's execution behaviour. These characters are % to mark a task as commented/not executable, ! to avoid creating a subfolder for the task, and & to aggregate several tasks into a single one.
%Show_list){
?
ls /sys > out
}
!listing){
?
ls /etc > out
}
&stats){
?
wc -l listing)/out
}
char_stats){
?
wc listing)/out
}
Click to see results
show_list >
ls /sys > out
exec/ls_0000 True
listing >
ls /etc > out
exec False
stats >
wc -l exec/out
exec/wc_0000 False
listing
char_stats >
wc exec/out
exec/wc_0001 False
listing
In this way, the task show_list has its 'commented/not executable' attribute set to True whereas the rest have it as False. For the listing task, its path is exec instead of exec/ls_0001 because ! was applied. The stats task seems unaffected, but if we check the exec/wc_0000 folder it will be empty, and if we read the exec/wc_0001/char_stats.sh file it will contain the commands of both the stats and char_stats tasks merged together.
Using nested tasks
In some cases, we need to repeat a set of tasks with different parameters. A first approach would be to convert each task of the set into an iterative task with the same iterator. To avoid this redundancy we can use nested tasks, as seen in the following template:
ls_[temp;sys]){
?
scan){
?
pwd > file
}
show_[file;folder]){
?
echo `cat scan)/file` ls_(+) (*)
}
}
In red we find a classical iterative task (ls_), but its body does not contain typical commands. There are two nested nodes: scan and show_ (the latter an iterative task itself). In this case, AutoFlow will create one copy of these tasks for the temp item and another copy for the sys item, both within the ls_ iterative task. To reference this iterator in the desired node we use the task name plus the (+) expression, as in ls_(+). This behaves exactly like the (*) expression, but this form allows several levels of nesting and lets us reference the desired iterator at each location. We can see the template interpretation as follows:
Click to see results
scan_temp >
pwd > file
exec/pwd_0000 False
scan_sys >
pwd > file
exec/pwd_0001 False
show_file_temp >
echo `cat exec/pwd_0000/file` temp file
exec/echo_0000 False
scan_temp
show_folder_temp >
echo `cat exec/pwd_0000/file` temp folder
exec/echo_0001 False
scan_temp
show_file_sys >
echo `cat exec/pwd_0001/file` sys file
exec/echo_0002 False
scan_sys
show_folder_sys >
echo `cat exec/pwd_0001/file` sys folder
exec/echo_0003 False
scan_sys
Regular expressions applied to task dependencies
When we work with nested tasks or in a complex workflow (in which case you will use nested tasks), a problem arises: the iterations and the applied permutations generate a large number of tasks, and we need to select a subset of them to follow the steps of our workflow. In this case, we need to capture specific tasks by name in order to apply the desired operation. For this purpose, regular expressions are very powerful and useful, giving great versatility to our workflow. Remember that with a set of tasks there are two cases: 1) we need to create a new task for each one, or 2) we need to create a single task that takes data from the whole task set. The first case is solved as follows:
Show_list){
?
ls /sys > out
}
listing){
?
ls -lsa /etc > out
}
get_content_[JobRegExp:list:-]){
?
wc -l (*)/out
}
Click to see results
Show_list >
ls /sys > out
exec/ls_0000 False
listing >
ls -lsa /etc > out
exec/ls_0001 False
get_content_Show_list >
wc -l exec/ls_0000/out
exec/wc_0000 False
Show_list
get_content_listing >
wc -l exec/ls_0001/out
exec/wc_0001 False
Show_list
listing
The second case (a set of tasks feeding one single task) is solved as follows:
Show_list){
?
ls /sys > out
}
listing){
?
ls -lsa /etc > out
}
get_content){
?
wc -l !JobRegExp:list:-!/out
}
Click to see results
Show_list >
ls /sys > out
exec/ls_0000 False
listing >
ls -lsa /etc > out
exec/ls_0001 False
get_content >
wc -l exec/ls_0000/out exec/ls_0001/out
exec/wc_0000 False
Show_list
listing
In both cases, the JobRegExp expression defines two fields: one with the string 'list' to be searched in the task names, and a second field set to '-'. The second field is a RegExp to be applied to iterators: if the first field matches an iterative task and the second field is set to a RegExp, that RegExp is applied to select only the tasks whose iterator items match. The purpose is that the main RegExp may match several iterative tasks when you are only interested in one iteration. Imagine that you execute 10 AI models with different values of one parameter (0, 1 and 2) and you need the executions with value 0 to build the ground truth. Then you can use JobRegExp:launchAImodel:0 to capture the desired executions and only get the results for that case.
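The name-matching part of JobRegExp behaves like an ordinary regular-expression search over the task names. This runnable shell sketch (for illustration, not AutoFlow code) shows which of the three task names above the pattern 'list' selects:

```shell
# Both Show_list and listing contain 'list'; get_content does not.
printf '%s\n' Show_list listing get_content | grep list
```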
Advanced capabilities
Merging templates
Another powerful feature of AutoFlow is the possibility of merging several templates, either to reuse them or to split a complex workflow into submodules. In this case we only have to pass several template file paths to the -w flag, separated by commas. AutoFlow parses them in the specified order, but the defined tasks are put together in one workflow. Here we show two templates that could be merged:
List template
List_dir){
# Initialize
?
ls > out
}
Show template
Show){
# Initialize
?
cat $file
}
Of course, we need to specify how to connect these templates, which means setting the dependencies between tasks of different templates. To do so, we use the -V flag to define a variable that acts as the input of the task in the Show template, and we can use it to specify a dependency on a node of the List template.
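A merged execution could be sketched as follows. Note that the file names and the exact value given to $file are assumptions based on the two templates above, not a confirmed AutoFlow invocation:

```shell
# Parse both templates in order and connect them by defining $file,
# used by the Show task, as a path inside the List_dir task.
AutoFlow -w list_template.txt,show_template.txt -v -V '$file=List_dir)/out'
```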
Click to see results
List_dir >
ls > out
exec/ls_0000 False
Show >
cat exec/ls_0000/out
exec/cat_0000 False
List_dir
Handling workflow resources
When we work with supercomputing resources, we need to request the specific resources for our work: a number of CPUs, an amount of memory, a time limit, and the type of computing node that fits our tasks. To set workflow resources, AutoFlow has the -c, -m, -t and -n flags, respectively. Their definitions, as well as other complementary ones, are the following:
AutoFlow -w template_name #Mandatory arguments
#Optional arguments
-c: Number of CPUs needed for each task
-t: Time needed for each task. Format: days-hours:minutes:seconds
-m: RAM memory needed for each task. Format: a number plus standard memory units: 5GB, 4000MB, etc.
-n: Name of a specific system queue (often, computing nodes with specific hardware)
-s: If set, the required CPUs may be allocated across several computing nodes
-u: Maximum number of computing nodes across which to allocate the requested CPUs (per task)
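Putting these flags together, a resource-aware submission could be sketched as (the values and template file name are illustrative):

```shell
# Request 4 CPUs, 8GB of RAM and 2 hours per task on the 'bigmem' queue.
AutoFlow -w template.txt -c 4 -m 8GB -t 0-02:00:00 -n bigmem
```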
This way of defining resources has a limitation: all tasks get the same resources. To overcome this, there is a syntax to specify resources per task:
List_dir_[sys;etc]){
resources: -n bigmem -c 1 -t 7-00:00:00 -m 100gb
?
ls /(*) > out
}
This way, the List_dir_ tasks will have the specified resources, overriding the general resources configured for the workflow. To observe the resource changes, we can inspect the generated script in each task folder and read the commented section in the sh header.
Task execution control
In some cases, we need to execute only a subset of the tasks in the workflow, because we need to update results or because minor errors must be fixed. To do so, in addition to using the % character in the task name within the template, we can use the --white_list and --black_list flags. These flags take string patterns, separated by commas, that are matched against the task names. When the white list is used, the tasks that do NOT match the patterns are marked NOT to be executed. When the black list is used, the tasks that match the patterns are marked NOT to be executed.
algo){
?
echo 'OK'
}
result){
?
echo algo)/file
}
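The two runs whose results appear below could be sketched as (the template file name is a hypothetical example):

```shell
# Keep only tasks matching 'algo'; 'result' is marked NOT to be executed.
AutoFlow -w template.txt -v --white_list algo

# Exclude tasks matching 'algo'; only 'result' remains executable.
AutoFlow -w template.txt -v --black_list algo
```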
In this example, the task 'result' is marked NOT to be executed because the white_list pattern does NOT match 'result'.
Click to see results
algo >
echo 'OK'
exec/echo_0000 False
result >
echo exec/echo_0000/file
exec/echo_0001 True
algo
In this example, the task result is marked to be executed because the black_list pattern matches the 'algo' task and therefore marks 'algo' NOT to be executed. We get the opposite behaviour to the previous case.
Click to see results
algo >
echo 'OK'
exec/echo_0000 True
result >
echo exec/echo_0000/file
exec/echo_0001 False
algo
Advanced AutoFlow variable configuration
In complex workflows we have to deal with a large number of variables that are hard to set on the command line, or that we need to change to different values depending on the analysis. For this reason, the -V flag that sets AutoFlow variables can also take a path to a variable text file. In the following example, the template uses the file_name, attribute1 and attribute2 variables, the last two of which are defined in a file:
result){
?
echo -e "$attribute1\t$attribute2" > $file_name
}
Text file with variable definitions, basic_with_vars.var:
attribute1=exec
attribute2=login
Using var file with a workflow template:
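The combined invocation could be sketched as follows. The exact way of mixing inline values and a var-file path in -V is an assumption; the guide only states that -V accepts both:

```shell
# Define file_name inline and take attribute1/attribute2 from the var file.
AutoFlow -w template.txt -v -V '$file_name=test,basic_with_vars.var'
```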
Click to see results
result >
echo -e "exec\tlogin" > test
exec/echo_0000 False
Handling resource configuration files
When multiple tasks share computational resources, or we need to execute them with different resources within the same workflow, managing resources inline or on the command line can be difficult. To deal with this situation, AutoFlow can use resource files in JSON format (wf_with_res_prof.json) in which we can define named task resource profiles:
{
"resources": {
"test": {
"cpu" : 2,
"mem" : "300GB",
"time" : "7-00:00:00",
"node" : "bigmem"
}
}
}
Then, in the workflow template, the resources line uses the -r flag followed by the name of the resource profile needed by the task:
algo){
resources: -r test
?
echo -e "OK\t[cpu]" > log
}
And finally, when we execute the template with the described resource file, we can observe how the [cpu] placeholder is replaced with the number of CPUs specified in the test profile.
Click to see results
algo >
echo "OK\t2" > log
exec/echo_0000 False
{'cpu': 2, 'mem': '300GB', 'time': '7-00:00:00', 'node': 'bigmem', 'multinode': 0, 'ntask': False, 'additional_job_options': None, 'done': False, 'folder': True, 'buffer': False, 'exec_folder': '/mnt/home/users/pab_001_uma/pedro/dev_py/py_autoflow/tests/cli_examples/exec/echo_0000', 'cpu_asign': 'number', 'virt': 'test_virt', 'virt_type': 'env'}
Handling external workflow dependencies
A workflow executes different software and libraries from programming languages, which must be installed in the operating system. If we execute our workflow on another computer, this software may not be available. For this reason, the resources file can include a virt section that describes the desired method to install the dependencies (mostly through virtualization strategies):
{
"resources": {
"test": {
"cpu" : 2,
"mem" : "300GB",
"time" : "7-00:00:00",
"node" : "bigmem",
"virt" : "test_virt",
"virt_type" : "env"
}
},
"virt": {
"test_virt" : {
"virt_type" : "env",
"venv_opts" : ["--system-site-packages"],
"requirements": ["cowsay"],
"pip_opts": []
}
}
}
Using the previous workflow template, we get the following:
Click to see results
algo >
echo "OK\t2" > log
exec/echo_0000 False
The dry execution does not show changes, but if the generated sh script is inspected, we can see how the loading of the Python virtual environment is included. This environment will contain all the Python libraries specified in the test_virt profile. This virtualization profile is referenced by the test resource profile through the virt key, and the virt_type key specifies whether it is a venv, an Anaconda environment or a Singularity image. The test_virt virtualization profile needs the requirements key, a string vector/list of library names (as pip specifications). We can also pass options for the creation of the venv using the venv_opts key with a vector/list; we use --system-site-packages to install only the libraries that are not already in the system, avoiding redundancy. Finally, when the libraries are installed in the venv with pip, we can pass additional options using the pip_opts key.
flow_logger
Description
Main purpose
The flow_logger function has three main purposes: 1) add a tracking system to the task execution, 2) show the workflow status and 3) manage the re-execution of failed tasks. To do this, flow_logger works as a logging system: each executed task invokes the program at the start and at the end of its execution (the user does not have to take care of this, because AutoFlow adds these commands to the sh script). Each task keeps a log file in which a status signal with its time record is written. The logging system has three different signals: 1) set: the task has been selected for execution and sent to the execution engine, 2) start: the sh script has started to execute, as flow_logger is the first command in the script, and 3) stop: the sh script has finished, as flow_logger is invoked in the last line of the script. Thus, a successful execution of the task must show all three signals. If a task has been executed several times, the last record of each signal is the one selected.
Note that AutoFlow interprets the workflow and launches all the tasks at once, either to the shell or to the queue system. It therefore has no way of knowing the workflow status, and the workflow is managed by the operating system or the queue system. The user must check whether the tasks are executing or not (with a top command in the shell, or by querying the queue system). If at least one task remains in execution or waiting to be executed, we consider the workflow in execution; if no tasks are in execution or waiting, we consider the workflow finished. The lack of certain signals then marks the status of each task as follows: 1) SUCCESSFUL (SUCC): all task signals are detected; this status does not depend on the workflow status, 2) RUNNING/ABORTED (RUN/ABORT): the stop signal is not detected; if the workflow is in execution the task is RUNNING, but if the workflow is finished the task had some kind of error and aborted, and 3) PENDING/NOT EXECUTED (PEND/NOT): neither the start nor the stop signal is detected, so the task was only marked for execution; if the workflow is in execution the task has not been executed yet, but if the workflow is finished the task was never executed. This can be due to the failure of a task it depends on (very likely) or to a hardware problem that made the computer fail the execution.
Execution modes
Workflow logging
First, we will execute a previous template (using -v to generate workflow structure only).
Click to see results
algo >
echo 'OK'
exec/echo_0000 False
result >
echo exec/echo_0000/file
exec/echo_0001 False
algo
It is not shown here, but we have executed flow_logger to set the signals corresponding to the success of both tasks in the workflow. Then we execute the flow_logger command to report the workflow task status. The -e flag is the path to the AutoFlow execution and -w tells flow_logger that the workflow has finished. The -r flag with the argument ALL produces the report for all the workflow tasks; if instead of ALL we use one of the task statuses listed previously, only the tasks with that status are shown. The --raw flag is for debugging and allows this guide to capture the flow_logger output; ignore it for normal flow_logger use.
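The report command could be sketched as (the exec path is illustrative):

```shell
# -e: path to the AutoFlow execution folder
# -w: treat the workflow as finished
# -r ALL: report every task regardless of its status
flow_logger -e exec -w -r ALL
```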
Click to see results
| Status | Folder | Time | Size | Job Name |
|---|---|---|---|---|
| SUCC | echo_0000 | 1 s | 1,5K | algo |
| SUCC | echo_0001 | 1 s | 1,5K | result |
The workflow report is a table with the following columns: 1) Status, which corresponds to the task statuses described in the first section of flow_logger, 2) Folder, the workflow subfolder of the task, 3) Time, the elapsed time of the task execution, 4) Size, how much storage space the task execution occupies, and 5) Job Name, the task name defined in the workflow template. Now, we will configure and run an execution that has not finished:
Click to see results
algo >
echo 'OK'
exec/echo_0000 False
result >
echo exec/echo_0000/file
exec/echo_0001 False
algo
Now, we will execute flow_logger without the -w flag, and we see a running task and a pending task:
Click to see results
| Status | Folder | Time | Size | Job Name |
|---|---|---|---|---|
| RUN | echo_0000 | - | 1,5K | algo |
| PEND | echo_0001 | - | 1,5K | result |
But if we add the -w flag, which tells flow_logger that the workflow has finished, the task statuses change to aborted and not executed.
Click to see results
| Status | Folder | Time | Size | Job Name |
|---|---|---|---|---|
| ABORT | echo_0000 | - | 1,5K | algo |
| NOT | echo_0001 | - | 1,5K | result |
Executing failed tasks
When we execute a workflow, we may find that some tasks have failed or were never launched due to hardware problems. If we replace the -r flag with the boolean -l flag, flow_logger will analyse the workflow execution: all aborted tasks will be re-executed, along with all tasks that depend on them. If there are not-executed tasks (NOT) that do not depend on failed tasks (for instance, because of a system failure), we need to add the -p flag to execute them as well:
This command will execute the aborted and not-executed tasks, and if we then run a flow_logger report command (with -r, without -l or -p), we will obtain the status table shown at the beginning of the previous section.
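The relaunch command could be sketched as (the exec path is illustrative):

```shell
# -l: relaunch aborted tasks and everything that depends on them
# -p: also relaunch NOT tasks that were never executed for other reasons
flow_logger -e exec -w -l -p
```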
How to make controlled task errors
Use bash to execute an exit command if a user condition is not met. TODO
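A minimal sketch of such a controlled error, assuming the check is placed in the main command of a task (the file name results.txt is hypothetical): exiting with a non-zero code would stop the script before the final flow_logger 'stop' call, so the task would later be reported as ABORT.

```shell
# check_condition returns non-zero when the user condition is not met.
check_condition() {
  # Condition: the (hypothetical) result file exists and is non-empty.
  [ -s "$1" ]
}

if ! check_condition results.txt; then
  # In a real task script, you would run 'exit 1' here to abort the task.
  echo "results.txt is missing or empty: task would abort here (exit 1)"
else
  echo "condition met: continuing task"
fi
```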