May 11, 2009
The SCHEDULER tool can perform the partitioning/mapping/scheduling operation automatically, or you can assist by specifying some or all of the mappings; the tool then does the tedious work of generating the individual job-sequences (or PE-programs), calculating the indices for moving job-tokens (or for sending and receiving data items), and so on.
As initially configured, the SCHEDULER utility produces time-line plots and pseudo-code that is directly interpretable by CSIM PE models. Once a satisfactory mapping has been determined, the pseudo-code can be conveniently expanded to compilable C-source code for specific processor families.
The input to the SCHEDULER is an XML DFG text file, which is typically drawn with CSIM's GUI or produced by other tools. The format of the DFG input file is published in Appendix A.
This graph syntax, and this overall method for developing parallel algorithm software and systems, is based on well-developed and documented principles formally called computational flow graphs, which are closely related to Petri-net theory. See the reference at the end of this document for the formal background.
DFGs can be hierarchical. A hierarchical DFG contains nodes that are subgraphs. The subgraphs are drawn on separate diagrams. The top-level graph must have one START node, and one EXIT node.
Nodes with instance names beginning with GEN are special in that they produce periodic data at a specific interval (also called a monotonic interval). You may have multiple GEN nodes in a DFG, but each name must be unique. Any node whose first three letters are GEN is considered a GEN node; for instance, GEN1, GENa, and GEN2 are all GEN nodes.
Because the START, GEN, and EXIT nodes are often considered special control nodes, they are often not mapped to any physical processor element. You can mark a node this way by specifying its type as NO_PE. In fact, any node can be set as a NO_PE-type node. Such nodes do not have to wait for any PE to become available, and their input and output arcs neither imply nor generate data transfers; since their existence is only logical (conceptual), there is no place to transfer data to or from. However, their effect on the DFG is still taken into account for triggering their dependent nodes.
Three special type options provide additional control as to how a task is mapped to processor elements. These options are:
By convention, the Compute Time is usually specified in micro-seconds (uSec). It indicates how long a processor is busy computing the task. It also determines the delay between when the task begins executing and when the output data is produced onto the node's output arcs. Time is a floating-point quantity.
Some tasks do not produce their output data in one batch at the completion of a firing. Rather, they produce data in several batches during their execution. To describe this behavior (and many others), nodes have an Iterations parameter, which determines how many times a node executes and produces data for one firing. Here a firing means the satisfaction of the node's input thresholds and the consumption of one batch of input data. Normally the iteration parameter is one (1); its value is an integer. The time between successive output iterations is the compute time of the node.
For example, if a task executes for 100-mS and produces 10 batches of data in this time at equally spaced intervals, then the task would be described as a DFG node having 10-iterations, and a compute-time of 10-mS. Once a node fires, it is assigned to a specific processor element and all iterations of the task occur on that processor.
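The timing of iterated outputs can be sketched as follows. This is an illustrative Python sketch, not CSIM code, and the function name is hypothetical:

```python
def iteration_output_times(start_time, compute_time, iterations):
    """Times at which each output batch appears: one batch per
    iteration, spaced by the node's compute time."""
    return [start_time + compute_time * (i + 1) for i in range(iterations)]

# The example above: 10 iterations with a 10-mS compute time each.
times = iteration_output_times(0.0, 10.0, 10)
# First batch at 10 mS, last at 100 mS.
```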
The final optional parameter of a DFG node is the Mapping Assignment, which designates the specific processor element a task will execute on. A list of processor assignments can be specified; a list means that successive instances of the task execute on the listed processor elements in order. For example, the first instance of the task executes on the first listed processor, the second on the second, and so on. When the list is exhausted, the next instance is assigned to the first processor on the list, in round-robin fashion.
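The round-robin rule above can be sketched in a few lines of illustrative Python (the function name is hypothetical):

```python
def assign_pe(mapping_list, instance_index):
    """Round-robin assignment: instance i executes on the
    (i mod len(mapping_list))-th processor in the list."""
    return mapping_list[instance_index % len(mapping_list)]

pes = ["PE0", "PE1", "PE2"]
assignments = [assign_pe(pes, i) for i in range(5)]
# -> ['PE0', 'PE1', 'PE2', 'PE0', 'PE1']
```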
An arc implies a potential buffer, queue, or temporary storage of data between task-operations. We speak of the amount of data held by an arc as the data in the arc's associated buffer. An arc is characterized by its produce, threshold, and consume amounts. These are parameters that influence how a graph executes. The amount of data held by the arc's buffer varies dynamically as the graph executes.
Arcs are characterized by 5 attributes.
An arc's initial-amount is normally zero. This suffices for feed-through DFGs, but it causes graphs containing feed-back to deadlock, because the node receiving a feed-back arc can never fire the first time. The initial-amount can be set to a non-zero integer quantity; this is sometimes called priming the pump. Non-zero initial-amounts are rarely used; most users can ignore the parameter because it defaults to zero. However, it is available when needed.
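The produce/threshold/consume semantics, including initial-amount priming, can be sketched as follows. This is an illustrative Python model, not part of CSIM; the class and method names are hypothetical:

```python
class Arc:
    """Token buffer between two DFG nodes."""
    def __init__(self, produce, threshold, consume, initial=0):
        self.produce = produce      # amount added when the source node outputs
        self.threshold = threshold  # amount required before the sink may fire
        self.consume = consume      # amount removed when the sink fires
        self.amount = initial       # initial-amount ("priming the pump")

    def sink_ready(self):
        return self.amount >= self.threshold

    def fire_sink(self):
        assert self.sink_ready()
        self.amount -= self.consume

    def fire_source(self):
        self.amount += self.produce

# A feed-back arc with initial-amount 0 deadlocks: the sink can never fire.
fb = Arc(produce=1, threshold=1, consume=1, initial=0)
# With initial=1 the pump is primed and the sink can fire once.
primed = Arc(produce=1, threshold=1, consume=1, initial=1)
```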
There can be several arcs connected to a given node. To distinguish the data carried by each arc coming into -and going out of- a node, a name can be given to the destination and source of each arc, respectively. The source and destination names are not relevant on a flat graph for performance modeling because no data actually flows. However they become relevant for functional flow-graph modeling, target code-generation, and for connection of internal-to-external arcs on any kind of hierarchical graphs. For target code-generation, the source and destination names become the subroutine parameter names used in the generated code.
This setting causes the Scheduler to generate postmessg / readmessg command-pairs in the .prog files instead of the normal sendmessg/recvmessg pairs, respectively. The readmessg will be sequenced after any recvmessg(s) prior to a given compute task.
    mSec = * 1.0e3
    Sec = * 1.0e6
    Minutes = * 60.0 Sec
    FFT_Delay = 0.68 mSec
    FFT_Size = 4096
    FIR_Node = FFT_Delay
    MATMUL = FFT_DELAY * 40.7 + 2.3
    PE2 = Board1Mcm0Pe0

These macros can then be used anywhere in the DFG. Their values will be substituted prior to evaluating the DFG. Arbitrary arithmetic expressions will be evaluated for macros.
Several basic functions are available for use within expressions.
Note that macros are not data variables in the usual program-code sense. Although similar in many respects, macros are expanded and re-evaluated at each instance of use. (Deferred evaluation.)
    macro a = 5
    macro b = 2 * a
    print a, b
      a = 5, b = 2*a = 10
    macro a = 7
    print a, b
      a = 7, b = 2*a = 14

This could result in infinite recursion for self-inclusive macros.
A related class, called variable, assigns the current evaluated value to the object name. (Immediate evaluation.)
    macro a = 5
    variable b = 2 * a
    print a, b
      a = 5, b = 10
    macro a = 7
    print a, b
      a = 7, b = 10

The variable type is especially valuable when using random numbers. For example, often several values depend on one random quantity. This cannot be expressed with macros alone, because each reference would be a new call to the random generator. Instead, one random number can be generated and saved as a variable, and the same value can be referenced multiple times.
Using a variable to store the RANDOM value:

    variable message_size = ROUND( 1000 * RANDOM )
    macro Bandwidth = 100.0
    macro xfer_delay = message_size / Bandwidth
    print message_size, xfer_delay
      message_size = 375  xfer_delay = 3.75
    print message_size, xfer_delay
      message_size = 375  xfer_delay = 3.75

But with only macros:

    macro message_size = ROUND( 1000 * RANDOM )
    macro Bandwidth = 100.0
    macro xfer_delay = message_size / Bandwidth
    print message_size, xfer_delay
      message_size = 375  xfer_delay = 5.21
    print message_size, xfer_delay
      message_size = 863  xfer_delay = 1.09

(RANDOM was re-evaluated on each reference of the macro.)
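The deferred-versus-immediate distinction can be sketched in Python. This is an illustrative model of the semantics, not the SCHEDULER's actual parser; all names here are hypothetical:

```python
import random
import re

env = {}

def define_macro(env, name, expr):
    """Deferred: store the expression text; re-evaluated at each reference."""
    env[name] = ("macro", expr)

def define_variable(env, name, expr):
    """Immediate: evaluate once now and store the resulting value."""
    env[name] = ("variable", evaluate(env, expr))

def evaluate(env, expr):
    # RANDOM is drawn fresh for each evaluation, as in the macro example.
    scope = {"ROUND": round, "RANDOM": random.random()}
    for name in set(re.findall(r"[A-Za-z_]\w*", expr)):
        if name in env:
            kind, val = env[name]
            # A self-inclusive macro recurses forever here -- the same
            # infinite-recursion hazard the text warns about.
            scope[name] = evaluate(env, val) if kind == "macro" else val
    return eval(expr, {"__builtins__": {}}, scope)

define_macro(env, "a", "5")
define_macro(env, "b", "2 * a")
print(evaluate(env, "b"))        # 10
define_macro(env, "a", "7")
print(evaluate(env, "b"))        # 14 -- macro b tracks the new a
define_variable(env, "c", "2 * a")
define_macro(env, "a", "9")
print(evaluate(env, "c"))        # 14 -- variable c keeps its stored value
```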
The SCHEDULER accepts, as an optional second command-line argument, the netinfo file produced by the net_extract command of the CSIM simulator. This ensures consistency with the name-to-logical-ID numbering used by the ROUTER and the simulation models themselves.
Example:

    scheduler radar.dfg netinfo

If no netinfo file is given on the command-line, the SCHEDULER knows nothing about the architecture it is mapping to, nor what processors are available. In this situation, the SCHEDULER will ask for the number of processors the DFG should be mapped to.
In either case, the SCHEDULER then performs the mapping, either using the user-supplied assignments or using a set of heuristics that attempt to balance the processing load equally among processors, minimize the processing latency, and minimize inter-PE communication traffic by exploiting high data locality (i.e., automatic mapping).
In general, the automatic mode is less optimal than manually optimized mappings. However, the automatic mode is useful for quick first-cut attempts and it will be improved over time.
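A minimal sketch of one such load-balancing heuristic, in illustrative Python. The SCHEDULER's actual heuristics are internal and also weigh latency and communication locality; this greedy least-loaded rule is only an assumed approximation:

```python
def auto_map(tasks, num_pes):
    """Greedy load-balancing sketch: assign each task to the
    currently least-loaded PE (illustrative only)."""
    loads = [0.0] * num_pes
    mapping = {}
    for name, compute_time in tasks:
        pe = min(range(num_pes), key=lambda i: loads[i])
        mapping[name] = pe
        loads[pe] += compute_time
    return mapping, loads

mapping, loads = auto_map(
    [("fft1", 10.0), ("fir1", 4.0), ("sum1", 2.0), ("mat1", 6.0)], 2)
# fft1 -> PE0; the three shorter tasks fill PE1.
```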
During scheduling, you will see the list of assignments being made as the DFG(s) executes.
To use Rate Monotonic Scheduling (RMS) with the CSIM Scheduler, use the -stim scheduler option with properly prepared DFG files. The files may be generated automatically with the dfg_gen utility from *.txt files produced by spread-sheets, or created manually with the CSIM DFG GUI. The scheduler generates *.prog files intended for use with the multi_priority_pe.
Each task block that runs at an independent monotonic rate is to be graphed as a separate data flow graph. A 'stim_file' lists all the graphs to be used.
Each graph that represents a separate task block must have START and EXIT
nodes specified. If the graph is to be scheduled at a monotonic rate, a graph
attribute or macro is to define it as RMS = 1 or true. The rate at which
the graph is to be repeated is defined by a GEN node which follows the START
node. The loop may be terminated by either a time limit or a loop count.
The graph attributes:
looptime = "time limit"
loopcount = "number"
may be used.
The scheduler generates the *.prog files. If a looping attribute is used, the task blocks are unrolled in the *.prog files until the looptime or loopcount is reached. A unique message ID is used for each iteration.
If no loop attribute is used, the graph is executed only once.
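The unrolling behavior can be sketched as follows. This is an illustrative Python model; the function name and entry format are hypothetical, not the actual .prog syntax:

```python
def unroll(tasks, looptime=None, loopcount=None, period=1.0):
    """Sketch of unrolling a looping graph into a linear .prog-style
    task sequence, giving each emitted entry a unique message ID."""
    entries, t, iteration, msg_id = [], 0.0, 0, 0
    while True:
        if loopcount is not None and iteration >= loopcount:
            break
        if looptime is not None and t >= looptime:
            break
        if loopcount is None and looptime is None and iteration >= 1:
            break  # no loop attribute: execute the graph only once
        for task in tasks:
            entries.append((task, iteration, msg_id))
            msg_id += 1
        iteration += 1
        t += period
    return entries

prog = unroll(["fft", "fir"], loopcount=3)
# 6 entries, message IDs 0..5, all unique.
```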
To schedule the files for RMS scheduling, invoke:
sched -stim stim_file netinfo
Note the format of the *.prog files generated. Each task begins with TASK and ends with END_OF_TASK. The first statement after TASK is a monotonic statement which specifies the loop time. The last statement in each *.prog file is a 'progmdone' line.
Global Attributes for RMS:
The time-line plots are considered ideal because the static SCHEDULER bases its estimates of task completion and communication times on a simplistic ideal model. The ideal time-lines are useful to compare with the actual time-lines produced by detailed architecture simulations. The comparison highlights the effects of architectural issues on the given mapping.
The ideal time-line can be viewed with XGRAPH
by selecting Tools / Plot Ideal Timeline from the GUI menu.
Or by typing:
    xgraph IdealTline.dat

The resulting graph shows the projected activities of the processors versus time.
It is usually helpful to overlay the inter-processor communication dependencies underneath the activity time-line. This produces a very complete picture of the data-flow-graph's execution history. You can do this by placing both plot-files on the XGRAPH command-line, as in:
    xgraph IdealSpiderweb.dat IdealTline.dat

(Placing the activity time-line second ensures that the horizontal activity-bars are drawn over top of the diagonal communication traces. This gives the activity-bars priority, which prevents them from being obscured.)
Each horizontal processor activity bar-segment on the time-line plot corresponds to the execution of a DFG-node-task on the given processor. The SCHEDULER uses a colorization function to assist in distinguishing the task-segments.
Tasks are color-coded by the last digit (or last letter) of their function names:

    0 = red         (b,l,v,D,N,X)
    1 = blue        (c,m,w,E,O,Y)
    2 = green       (d,n,x,F,P,Z)
    3 = violet      (e,o,y,G,Q)
    4 = orange      (f,p,z,H,R)
    5 = yellow      (g,q,I,S)
    6 = pink        (h,r,J,T)
    7 = cyan        (i,s,A,K,U)
    8 = light-gray  (j,t,B,L,V)
    9 = dark-gray   (a,k,u,C,M,W)

Being somewhat arbitrary, this color table usually produces a rather random colorization, which helps distinguish tasks. Knowing the color map also lets you select or modify task names to control the desired color visualization.
Basing colorization on the last character of task names is often helpful, since many mappings place multiple iterations of a similar task on a given node. This distinguishes otherwise similarly named tasks, such as xyz_1, xyz_2, xyz_3, etc.
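The color table can be expressed compactly in code. This illustrative Python sketch reproduces the table above (the function name is hypothetical; the letter bins follow from the observation that letters simply cycle through the ten color slots):

```python
COLORS = ["red", "blue", "green", "violet", "orange",
          "yellow", "pink", "cyan", "light-gray", "dark-gray"]

def task_color(name):
    """Color bin from the last character of a task name, matching the
    table above (digits map directly; letters cycle through the bins)."""
    c = name[-1]
    if c.isdigit():
        return COLORS[int(c)]
    if c.islower():                              # a->dark-gray, b->red, ...
        return COLORS[(ord(c) - ord("a") + 9) % 10]
    return COLORS[(ord(c) - ord("A") + 7) % 10]  # A->cyan, D->red, ...

print(task_color("xyz_1"))   # blue
print(task_color("xyz_2"))   # green
```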
    DEFINE_DEVICE_TYPE: Sharc
       DEVICE_CLASS=(programmable);
       PORT_LIST( p0, p1, p2, p3, p4, p5, p6 );

       /* Local Variables */
       int my_id, ii, pc, done;
       struct header_struct *message;
       ....
The -n option is useful for controlling the number of processors to map tasks to when:
Alternatively, you can also direct the .prog files to another subdirectory by setting the environment variable PMOD_PROG to the directory where you want the program files to be stored. The PE models also access the PMOD_PROG environment variable, so they automatically know where to find them. (Note that setting environment variables affects only the current session, so placing the setting in a script file is convenient for future sessions.)
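The lookup convention can be illustrated with a short Python sketch (the helper function is hypothetical; only the PMOD_PROG variable name comes from the text above):

```python
import os

def prog_file_dir():
    """Where .prog files are written and read: the PMOD_PROG
    directory if set, otherwise the current directory."""
    return os.environ.get("PMOD_PROG", ".")

os.environ["PMOD_PROG"] = "/tmp/my_progs"
print(prog_file_dir())   # /tmp/my_progs
```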
Verbosity levels > 0 - show basic compute and produce events, the individual STIM commands (if processing in STIM-driven mode), each major processing step, and each time-step during scheduling.
Verbosity levels > 1 - show macro and variable definitions as they are registered.
Verbosity levels > 2 - show macro substitutions as they occur.
Verbosity levels > 5 - show the defined nodes and their parameters as read from the DFG file.
Verbosity levels > 6 - show the defined arcs and their parameters as read from the DFG file.
Verbosity levels > 20 - show a listing of the final defined nodes and arcs, with their parameters, in the flattened DFG structure.
Verbosity levels > 50 - show the intermediate steps in expression evaluations.
Verbosity levels > 100 - show each line as it is read for parsing.
Verbosity levels > 500 - show geometry info during file-read.
Verbosity levels > 1000 - show individual status changes to all nodes and arcs during DFG execution/scheduling.
Higher verbosity settings produce more output to the screen. It is helpful during debugging to capture the output to a file for convenient viewing. Use the Unix redirect operator ">".
Instead of a DFG file on the command-line, the STIM-Scheduler mode requires a STIM file on the command-line. The STIM file is a list of DFG files, together with a DFG invocation sequence. See STIM-Scheduler.
Appendix B - STIM_SCHEDULER
Appendix C - Dynamic Scheduler
Appendix D - Special Attribute Options - Reset_After_Firing, Multicast Node, Multiple Timers, Hard EXIT, etc..
Appendix E - Mapping Hints - Useful things to know about mapping tasks to processors.
Appendix F - DFG Animation - How to view DFG execution.
 Siewiorek, Bell, and Newell, Computer Structures: Principles and Examples, McGraw-Hill Computer Science Series, New York, 1982.
(Questions, Comments, & Suggestions: firstname.lastname@example.org)