Archive for April, 2011

Pivot Enterprise stage

April 29, 2011

Pivot Enterprise stage:

The Pivot Enterprise stage is a processing stage that pivots data horizontally or vertically.

Pivot Types:

1)Horizontal
2)Vertical

Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows.
The output data of the horizontal pivot action typically has fewer columns, but more rows than the input
data.

Input Data:

CUSTOMER_ID CUSTOMER_NAME JAN_EXPENSE FEB_EXPENSE MAR_EXPENSE
1 UMA 100 200 300
2 POOJITHA 200 300 400

Output Data:

CUSTOMER_ID CUSTOMER_NAME Q1EXPENSE
1 UMA 100
1 UMA 200
1 UMA 300
2 POOJITHA 200
2 POOJITHA 300
2 POOJITHA 400
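
To make the mapping concrete, here is a minimal command-line sketch of the same horizontal pivot applied to the sample data above, written in plain ksh/awk rather than the Pivot Enterprise stage itself; the file name customer_expenses.csv and the assumption of a comma-delimited file with a header row are illustrative.

# Hedged sketch: emit one output row per month column (columns 3 to 5)
awk -F',' 'NR > 1 {
    for (i = 3; i <= 5; i++)
        print $1 "," $2 "," $i
}' customer_expenses.csv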

Specifying a horizontal pivot operation in the Pivot Enterprise stage:

Create a basic DataStage job using a Sequential File stage as the input to read the input data, a Pivot Enterprise stage, and a Sequential File stage as the output to create the output data.

Basic Job Design:

Step 1) Configure the Sequential File stage as shown below to read the input data (comma-delimited input file).

Step 2) Configure the Pivot Enterprise stage. Double-click on the Pivot Enterprise stage and configure it as shown below.

* Go to the Properties page and select Pivot Type = Horizontal.

Click on the Pivot Properties tab on the stage Properties page and follow these steps to configure the horizontal pivot:

You can generate a pivot index that will assign an index number to each row within sets of horizontally pivoted data.

To enable it, click the option highlighted with the red circle in the screenshot above.

Now click on the Mapping tab of the Output page:

Now configure the output Sequential File stage to create a delimited output file, as shown below.

Compile the job:

Run the job:

Output:

Vertical pivot operation:

Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array
size determines the number of rows in the output data. The output data of the vertical pivot action
typically has more columns, but fewer rows than the input data.

Input Data:

CUSTOMER_ID CUSTOMER_NAME Q1EXPENSE
1 UMA 100
1 UMA 200
1 UMA 300
2 POOJITHA 200
2 POOJITHA 300
2 POOJITHA 400

Output Data:

CUSTOMER_ID CUSTOMER_NAME JAN_EXPENSE FEB_EXPENSE MAR_EXPENSE Q1AVERAGE
1 UMA 100 200 300 200
2 POOJITHA 200 300 400 300
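
As with the horizontal case, here is a minimal ksh/awk sketch of the same vertical pivot; it is not the stage itself, the file name q1_expenses.csv is illustrative, the input is assumed to be comma-delimited with a header row and sorted on CUSTOMER_ID with exactly three rows per customer, and Q1AVERAGE is computed as the mean of the three values.

# Hedged sketch: collect the three Q1EXPENSE rows per customer into one row
# and append their average (output row order is unspecified)
awk -F',' 'NR > 1 {
    key[$1] = $1 "," $2            # grouping columns
    vals[$1] = vals[$1] "," $3     # the three monthly expense values
    sum[$1] += $3
    cnt[$1]++
}
END {
    for (k in key)
        print key[k] vals[k] "," sum[k] / cnt[k]
}' q1_expenses.csv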

Now open the Pivot Enterprise stage, go to the Properties tab on the Stage page, and select Pivot Type = Vertical:

Go to the Pivot Properties tab on the Stage page and click the Load button to load the metadata:

Now specify the array size as 3, because there are three months of expenses in the quarterly expense column of the input data and we need to convert them into one row.

Configure the Mapping tab on the Output page of the stage.

Compile and run the job:

Categories: Datastage-Stages

Designing Datastage Jobs for Good Performance

April 27, 2011

Performance Points:

1) Avoid unnecessary type conversions: set the OSH_PRINT_SCHEMAS environment variable to verify that run time schemas match the job design column definitions. If you are using stage variables on a Transformer stage, ensure that their data types match the expected result types.

2) Use Transformer stages sparingly and wisely. Transformer stages can slow down your job. Do not use multiple stages where the functionality could be incorporated into a single stage, and use other stage types to perform simple transformation operations.

3) Increase sort performance where possible. Careful job design can improve the performance of sort operations, both in standalone Sort stages and in on-link sorts specified in the Inputs page Partitioning tab of other stage types.

4) Remove Unneeded Columns. Remove unneeded columns as early as possible within the job flow. Every additional unused column requires additional buffer memory, which can impact performance and make each row transfer from one stage to the next more expensive. If possible, when reading from databases, use a select list to read just the columns required, rather than the entire table.

5) Avoid reading from sequential files using the Same partitioning method. Unless you have specified more than one source file, this will result in the entire file being read into a single partition, making the entire downstream flow run sequentially unless you explicitly repartition.

The environment variable APT_STRING_PADCHAR can be used to change the default pad character from an ASCII NULL (0x0) to another character; for example, an ASCII space (0x20).
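
As a hedged example of how the OSH_PRINT_SCHEMAS check from point 1 and the pad-character setting above can be applied, these variables can be exported in the shell that starts the job or defined as job/project environment variables in the Administrator; the values shown are illustrative.

# Print runtime record schemas to the job log so they can be compared with the design
export OSH_PRINT_SCHEMAS=1
# Pad strings with an ASCII space (0x20) instead of the default ASCII NULL (0x0)
export APT_STRING_PADCHAR=0x20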

6)It is important to note that the individual SQL statements required by a Sparse Lookup are an expensive operation from a performance perspective. In most cases, it is faster to use a DataStage Join stage between the input and DB2 reference data than it is to perform a Sparse Lookup.

7) For scenarios where the number of input rows is significantly smaller (1:100 or more) than the number of reference rows in a DB2 or Oracle table, a Sparse Lookup may be appropriate. CPU-intensive applications, which typically perform multiple CPU-demanding operations on each record, benefit from the greatest possible parallelism, up to the capacity supported by your system.

8)Parallel jobs with large memory requirements can benefit from parallelism if they act on data that has been partitioned and if the required memory is also divided among partitions.

9)Applications that are disk- or I/O-intensive, such as those that extract data from and load data into RDBMSs, benefit from configurations in which the number of logical nodes equals the number of disk spindles being accessed. For example, if a table is fragmented 16 ways inside a database or if a data set is spread across 16 disk drives, set up a node pool consisting of 16 processing nodes.

10)For some jobs, especially those that are disk-intensive, you must sometimes configure your system to prevent the RDBMS from having either to redistribute load data or to re-partition the data from an extract operation.

11) The speed of communication among stages should be optimized by your configuration. For example, jobs whose stages exchange large amounts of data should be assigned to nodes where stages communicate by either shared memory (in an SMP environment) or a high-speed link (in an MPP environment). The relative placement of jobs whose stages share small amounts of data is less important.

12)Turn off Runtime Column propagation wherever it’s not required.

13) If you are using a Copy or a Filter stage either immediately after or immediately before a Transformer stage, you are reducing efficiency by using more stages, because a Transformer does the job of both a Copy stage and a Filter stage.

14) If you have a single input and a single output when you are using a Copy stage, then set the Force option to True so that DataStage does not try to optimize the job by removing the Copy operation.


Script to check datastage job is multi instance or not

April 27, 2011

The following script can be used to check whether a given DataStage job is multi-instance or not.

Arguments to the script:

Arg1: DataStage project name
Arg2: Job name

#!/bin/ksh
ScriptName=$(\basename $0)

if [[ $# != 2 ]]; then
    printf "Error: $ScriptName - required parameter(s) were not supplied!\n\n"
    printf "Usage: $ScriptName <Project> <JobName>\n\n"
    printf "       <Project> : is the unique identifier for the datastage project, required\n"
    printf "       <JobName> : is the Datastage JobName, required\n"
    exit 1
fi

Project=$1
JobName=$2

echo "Step: checking whether the job is multi-instance or not"
DSHOME=`cat /.dshome`
tmpFile=/tmp/CheckJobType.tmp
cd $DSHOME

# Query the project repository through uvsh: field 59 of the job's ROOT record
# in DS_JOBOBJECTS holds the multi-instance flag.
bin/uvsh <<! > $tmpFile
LOGTO $Project
SELECT 'JobType:'||EVAL "TRANS('DS_JOBOBJECTS','J\':@RECORD<5>:'\ROOT',59,'X')" FROM DS_JOBS WHERE NAME = '$JobName';
!

if [[ $? != 0 ]]; then
    echo "CheckJobType - unable to determine jobtype for job $JobName in $Project datastage project!"
    exit 1
fi

# Extract the value after "JobType:" from the uvsh output
JobType=$(\grep "^JobType:" $tmpFile | \cut -f 2 -d :)
if [[ -z $JobType ]]; then
    echo "CheckJobType - Job $JobName not found in $Project datastage project. JobType cannot be determined!"
    \rm -f $tmpFile
    exit 1
fi

if [[ $JobType -eq 0 ]]; then
    echo "*****$JobName is not a multi instance job*****"
else
    echo "*****$JobName is a multi instance job*****"
fi

\rm -f $tmpFile
exit 0
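
A usage example follows; the script file name, project name, and job name are illustrative, and the script prints one of the two ***** messages defined above.

./CheckJobType.ksh DSPROJECT_DEV Load_Customer_Dim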

The Aggregator Stage—Datastage

April 27, 2011

The Aggregator Stage:

The Aggregator stage is a processing stage in DataStage used for grouping and summary operations. By default the Aggregator stage will execute in parallel mode in parallel jobs.

Note: In a parallel environment, the way that we partition data before grouping and summarizing will affect the results. If you partition the data using the round-robin method, records with the same key values will be distributed across different partitions, and that will give incorrect results.

Aggregation Method:

Aggregator stage has two different aggregation Methods.

1) Hash: Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory.

2) Sort: Sort mode requires the input data set to have been partition-sorted, with all of the grouping keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort aggregator requires presorted data, but only maintains the calculations for the current group in memory.
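
To illustrate why the sort method needs so little memory, here is a minimal awk sketch, not the Aggregator stage itself, of a streaming sum over input that is already sorted on the grouping key; only the running total for the current group is held in memory. The file name and column layout (key in column 1, value in column 2, no header row) are assumptions.

# Hedged sketch: group-wise sum over presorted, comma-delimited input
awk -F',' '
    NR > 1 && $1 != prev { print prev "," sum; sum = 0 }   # group changed: emit the finished group
    { sum += $2; prev = $1 }                               # accumulate the current group
    END { print prev "," sum }                             # flush the last group
' sorted_input.csv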

Aggregation Data Type:

By default the Aggregator stage calculation output column is of the double data type; if you want decimal output, add the following property as shown in the figure below.

If you are using a single key column for the grouping keys, then there is no need to sort or hash-partition the incoming data.

Categories: Datastage-Stages

Datastage Common Errors-Warnings and resolution

April 27, 2011

1) When we use Same partitioning in the DataStage Transformer stage, we get the following warning in version 7.5.2.

TFCP000043      2       3       input_tfm: Input dataset 0 has a partitioning method other than entire specified; disabling memory sharing.

This is a known issue, and you can safely demote that warning to informational by adding it to a project-specific message handler.

2) Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0

Resolution: Clear the preserve partition flag before Sequential file stages.

3)DataStage parallel job fails with fork() failed, Resource temporarily unavailable

On AIX, execute the following command to check the maxuproc setting, and increase it (a sketch of the increase command follows the output below) if you plan to run multiple jobs at the same time.

lsattr -E -l sys0 | grep maxuproc
maxuproc        1024               Maximum number of PROCESSES allowed per user      True
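
A hedged sketch of the corresponding increase (the value 2048 is illustrative; pick one that suits the number of concurrent jobs, and note the command requires root privileges):

chdev -l sys0 -a maxuproc=2048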

4)TFIP000000              3       Agg_stg: When checking operator: When binding input interface field “CUST_ACT_NBR” to field “CUST_ACT_NBR”: Implicit conversion from source type “string[5]” to result type “dfloat”: Converting string to number.

Resolution: use the Modify stage to explicitly convert the data type before sending it to the Aggregator stage.

5)Warning: A user defined sort operator does not satisfy the requirements.

Resolution: check the order of the sorting columns and make sure the same order is used when a Join stage follows the sort to join the two inputs.

6)TFTM000000      2       3      Stg_tfm_header,1: Conversion error calling conversion routine timestamp_from_string data may have been lost

TFTM000000              1       xfmJournals,1: Conversion error calling conversion routine decimal_from_string data may have been lost

Resolution: check for the correct date or decimal format, and also for null values in the date or decimal fields, before passing them to the DataStage StringToDate, DateToString, DecimalToString, or StringToDecimal functions.

7)TOSO000119      2       3      Join_sort: When checking operator: Data claims to already be sorted on the specified keys the ‘sorted’ option can be used to confirm this. Data will be resorted as necessary. Performance may improve if this sort is removed from the flow

Resolution: sort the data before sending it to the Join stage, and check the order of the sorting keys and join keys to make sure both are in the same order.

8)TFOR000000      2       1       Join_Outer: When checking operator: Dropping component “CUST_NBR” because of a prior component with the same name.

Resolution: if you are using Join, Diff, Merge, or Compare stages, make sure both links have different column names other than the key columns.

9)TFIP000022              1       oci_oracle_source: When checking operator: When binding output interface field “MEMBER_NAME” to field “MEMBER_NAME”: Converting a nullable source to a non-nullable result;

Resolution: if you are reading from an Oracle database, or using any processing stage where the incoming column is defined as nullable, and you define the metadata in DataStage as non-nullable, then you will get the above issue. If you want to convert a nullable field to non-nullable, make sure you apply the available null-handling functions in DataStage or in the extract query.

Null functions in Oracle: NVL, NVL2

DataStage: IsNull, NullToEmpty, NullToZero

Datastage Schema File usage

April 27, 2011

In DataStage, schemas are an alternative way for you to specify column definitions for the data used by parallel jobs.

Schema format:

The following schema format is used to read a fixed width file:

//Schema file is used to read input data without specifying metadata in the Sequential File stage
//Created On : 11/17/2010
//Created By : Uma M Katru
record
{final_delim=end, delim=none}
(
CUSTOMER_SSN:STRING[11];
CUSTOMER_NAME:STRING[30];
CUSTOMER_CITY:STRING[40];
CUSTOMER_ZIPCODE:STRING[10];
)

Now we use the Sequential File stage to read data from the data file using the schema file.

In order to use the schema file concept, we need to enable runtime column propagation in the job properties.

In the Sequential File stage, add the Schema File option and give the schema file name including the path, as shown below.
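
The same schema can be adapted for a comma-delimited file by changing the record-level properties. The sketch below is a hedged example rather than part of the original job: the output path is illustrative, and the bounded-length form STRING[max=n] is used because delimited fields are variable length.

# Hedged sketch: write a comma-delimited variant of the schema from ksh
cat > /tmp/customer_delimited.schema <<'EOF'
record
{final_delim=end, delim=','}
(
CUSTOMER_SSN:STRING[max=11];
CUSTOMER_NAME:STRING[max=30];
CUSTOMER_CITY:STRING[max=40];
CUSTOMER_ZIPCODE:STRING[max=10];
)
EOF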

Sorting In Datastage

April 26, 2011

In datastage we have two types of sorting methods available in parallel jobs.

1)Link Sort

2)Sort Stage

By default both methods use the same sort package (the DataStage tsort operator).

The link sort offers fewer options, but is easier to maintain in a DataStage job as there are fewer stages on the job canvas.

The standalone Sort stage offers more options.

Use the link sort unless a specific option is needed on the standalone stage. Most often, the standalone Sort stage is used to specify the Sort Key Mode for partial sorts.

1)Link Sort

Sorting on a link is specified in the input/partitioning stage options when a keyed partition method is chosen; this option is not available if you use the Auto partition method:

Additional options can be specified by right-clicking on the key column, as shown in the figure below.

Key Column Usage:

1)Sorting only

2)Partitioning only

3)Sorting,Partitioning

If you want to remove duplicates, you can check the Unique option instead of using a Remove Duplicates stage.

2)Sort Stage:

The standalone Sort stage offers more options which are not available in the link sort.

The following properties are not available when sorting on a link:

Sort Key Mode: a very important performance-improvement option.

Create Cluster Key Change Column:True creates the column clusterKeyChange. This column is set to 1 for the first row in each group where groups are defined by using a Sort Key Mode of Don’t Sort. Subsequent records in the group have the column set to 0.

Create Key Change Column:

Output Statistics:

Sort Utility:(Never change this)

Restrict Memory Usage:Causes the stage to restrict itself to the specified number of megabytes of virtual memory on a processing node.

We use Sort Key Mode option most frequently.

Stable Sort
Stable sorts preserve the order of non-key columns within each sort group. This requires some
additional overhead in the sort algorithm, and thus a stable sort is generally slower than a non-stable
sort for the same input Data Set and sort keys. For this reason, disable Stable sort unless needed.

It is important to note that, by default, the Stable sort option is disabled for sorts on a link and enabled with the standalone Sort stage.

Sub-Sorts
Within the standalone Sort stage, the key column property “Sort Key Mode” is a particularly powerful
feature and a significant performance optimization. It is used when resorting a sub-grouping of a
previously sorted input Data Set, instead of performing a complete Sort. This “subsort” uses
significantly less disk space and CPU resource, and can often be performed in memory (depending on
the size of the new subsort groups).
To resort based on a sub-grouping, all key columns must still be defined in the Sort stage. Re-used sort
keys are specified with the “Don’t Sort (Previously Sorted)” property, while new sort keys are
specified with the “Sort” key mode property, as shown in the following example:

Automatically inserted sorts:

By default, DataStage parallel jobs insert sort operators as necessary to ensure correct results. We can identify this in the DataStage Director log by adding the APT_DUMP_SCORE environment variable to the job.

op2[4p] {(parallel inserted tsort operator {key={value=CUSTOMER_ID, subArgs={asc}}})
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
node4[op2,p3]
)}

Typically, DataStage inserts a sort operator before any stage that requires sorted input if we do not sort the data before passing it to the stage.

examples:

Join, Merge, Remove Duplicates, Sort Aggregator

There are two ways to prevent Enterprise Edition from inserting an unnecessary sort:
1) Insert an upstream Sort stage on each link, and define all sort key columns with the Sort Key Mode property “Don’t Sort (Previously Sorted)”.

2) Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This will verify sort order but not actually perform a sort, aborting the job if the data is not in the required sort order.

Tuning Sort:

Sort is a very expensive task in DataStage as it requires CPU, memory, and disk resources. DataStage sorts the input data in memory if it all fits; otherwise it uses the scratch space defined in the configuration file.

By default each sort uses 20 MB of memory per partition, and this value can be changed in each standalone Sort stage by adding the “Restrict Memory Usage” option.

We can use APT_TSORT_STRESS_BLOCKSIZE  to specify the size of the memory buffer, in MB, for all sort operators (link and standalone), overriding any per-sort specifications.
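
A hedged example of the sort-related environment variables mentioned in this post, exported in the shell that invokes the job or defined as job/project environment variables (the values shown are illustrative):

# Dump the runtime score to the log so automatically inserted tsort operators are visible
export APT_DUMP_SCORE=1
# Verify sort order instead of inserting an actual sort; the job aborts if data is not sorted
export APT_SORT_INSERTION_CHECK_ONLY=1
# Per-partition sort memory in MB for all sorts, overriding the 20 MB default
export APT_TSORT_STRESS_BLOCKSIZE=60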

The file system configuration and the number of scratch disks defined in the parallel configuration file can greatly impact the I/O performance of a parallel sort. Having a greater number of scratch disks for each node allows the sort to spread I/O across multiple file systems.
