Home > Datastage-Best Practices > Sorting In Datastage

Sorting In Datastage

In datastage we have two types of sorting methods available in parallel jobs.

1)Link Sort

2)Sort Stage

By default both metods use the same sort package (datastage tsort operator)

The Link sort offers fewer options but is easy to maintain in a datastage job as there are only few stages in the datastage job canvas.

The stand alone sort offers more options.

Use the Link sort unless a specific option is needed on the stand-alone Stage. Most often, the standalone Sort stage is used to specify the Sort Key mode for partial sorts.

1)Link Sort

Sorting on a link option is specified on the input/partitioning stage options,when specifying a keyed partition method and this option is not available if you use auto partition method:

Additional options can be specified by right clicking on the key column as shown in below figure.

Key Column Usage:

1)Sorting only

2)Partitioning only

3)Sorting,Partitioning

If you want to remove duplicates you can check Unique option instead of using Remove duplicates stage.

2)Sort Stage:

Standalone sort stage offers more options which are not available in Link Sort.

the following properties are not available when sorting on a link:

Sort Key Mode:Which is very important performance improvement option

Create Cluster Key Change Column:True creates the column clusterKeyChange. This column is set to 1 for the first row in each group where groups are defined by using a Sort Key Mode of Don’t Sort. Subsequent records in the group have the column set to 0.

Create Key Change Column:

Output Statistics:

Sort Utility:(Never change this)

Restrict Memory Usage:Causes the stage to restrict itself to the specified number of megabytes of virtual memory on a processing node.

We use Sort Key Mode option most frequently.

Stable Sort
Stable sorts preserve the order of non-key columns within each sort group. This requires some
additional overhead in the sort algorithm, and thus a stable sort is generally slower than a non-stable
sort for the same input Data Set and sort keys. For this reason, disable Stable sort unless needed.

It is important to note that by default the Stable sort option is disabled for sorts on a link and Enabled
with the standalone Sort stage.

Sub-Sorts
Within the standalone Sort stage, the key column property “Sort Key Mode” is a particularly powerful
feature and a significant performance optimization. It is used when resorting a sub-grouping of a
previously sorted input Data Set, instead of performing a complete Sort. This “subsort” uses
significantly less disk space and CPU resource, and can often be performed in memory (depending on
the size of the new subsort groups).
To resort based on a sub-grouping, all key columns must still be defined in the Sort stage. Re-used sort
keys are specified with the “Don’t Sort (Previously Sorted)” property, while new sort keys are
specified with the “Sort” key mode property, as shown in the following example:

Automatically inserted sorts:

By default data datastage parallel jobs inserts sort operators as necessary to ensure correct results.We can identify this in the datastage director by adding APT_DUMP_SCORE env variable in the job.

op2[4p] {(parallel inserted tsort operator {key={value=CUSTOMER_ID, subArgs={asc}}})
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
node4[op2,p3]
)}

Typically datastage inserts sort operator before any stage that requires sorted input if we don’t sort the data before passing to the stage.

examples:

Join, Merge, Remove Duplicates, Sort Aggregator

There are two ways to prevent Enterprise Edition from inserting an un-necessary sort:
1) Insert an upstream Sort stage on each link, define all sort key columns with the Sort Mode
key property “Don’t Sort (Previously Sorted)”

2) Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This will verify sort
order but not actually perform a sort, aborting the job if data is not in the required sort order

Tuning Sort:

Sort is very expensive task in datastage as it requires Cpu,Memory and disk resources. Datastage uses memory to sort input data if all fits into memory otherwise it uses scratch space defined the configuration file.

By default each sort uses 20MB memory per partition and this value can be changed in each stand alone sort stage by adding “Restrict Memory Usage” option.

We can use APT_TSORT_STRESS_BLOCKSIZE  to specify the size of the memory buffer, in MB, for all sort operators (link and standalone), overriding any per-sort specifications.

The file system configuration and number of scratch disks defined in parallel configuration file can
greatly impact the I/O performance of a parallel sort. Having a greater number of scratch disks for each
node allows the sort to spread I/O across multiple file systems

  1. August 16, 2011 at 8:24 pm

    Hi,

    Thank You Uma maheshwar garu,
    This blog is very useful for me.
    If i have any doubts pls give me reply………..

    Thank you.
    Ramesh

  1. No trackbacks yet.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google photo

You are commenting using your Google account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: