Archive for the ‘Datastage-Stages’ Category

Pivot Enterprise stage

April 29, 2011

Pivot Enterprise stage:

The Pivot Enterprise stage is a processing stage that pivots data horizontally and vertically.

Pivot Types:

1) Horizontal
2) Vertical

Horizontal pivoting maps a set of columns in an input row to a single column in multiple output rows.
The output data of the horizontal pivot action typically has fewer columns, but more rows than the input
data.

Input Data:

CUSTOMER_ID  CUSTOMER_NAME  JAN_EXPENSE  FEB_EXPENSE  MAR_EXPENSE
1            UMA            100          200          300
2            POOJITHA       200          300          400

Output Data:

CUSTOMER_ID  CUSTOMER_NAME  Q1EXPENSE
1            UMA            100
1            UMA            200
1            UMA            300
2            POOJITHA       200
2            POOJITHA       300
2            POOJITHA       400
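
To make the transformation concrete, here is a minimal sketch in Python/pandas (not DataStage itself; column names are taken from the tables above) that produces the same horizontal pivot, mapping the three month columns into multiple rows of a single Q1EXPENSE column:

import pandas as pd

# Sample input matching the input table above
df = pd.DataFrame({
    "CUSTOMER_ID":   [1, 2],
    "CUSTOMER_NAME": ["UMA", "POOJITHA"],
    "JAN_EXPENSE":   [100, 200],
    "FEB_EXPENSE":   [200, 300],
    "MAR_EXPENSE":   [300, 400],
})

# Horizontal pivot: a set of columns becomes multiple rows of one column
long_df = (df.melt(id_vars=["CUSTOMER_ID", "CUSTOMER_NAME"],
                   value_vars=["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"],
                   value_name="Q1EXPENSE")
             .drop(columns="variable")
             .sort_values("CUSTOMER_ID"))

print(long_df)   # one row per customer per month, as in the output table above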

To perform this transformation, specify a horizontal pivot operation in the Pivot Enterprise stage.

Create a basic DataStage job using a Sequential File stage to read the input data, a Pivot Enterprise stage to pivot it, and another Sequential File stage to write the output data.

Basic Job Design:

Step 1) Configure the source Sequential File stage as shown below to read the input data (a comma-delimited input file).

Step 2) Configure the Pivot Enterprise stage. Double-click the Pivot Enterprise stage and configure it as shown below.

* Go to the Properties tab on the Stage page and select Pivot Type = Horizontal.

Click the Pivot Properties tab on the Stage page and follow these steps to configure the horizontal pivot:

You can generate a pivot index that will assign an index number to each row within sets of horizontally
pivoted data:

Click the option highlighted with the red circle in the screenshot above.

Now click the Mapping tab on the Output page:

Now configure the output Sequential File stage to create a delimited output file as shown below.

Compile the job:

Run the job:

Output:

Vertical pivot operation:

Vertical pivoting maps a set of rows in the input data to single or multiple output columns. The array
size determines the number of rows in the output data. The output data of the vertical pivot action
typically has more columns, but fewer rows than the input data.

Input Data:

CUSTOMER_ID  CUSTOMER_NAME  Q1EXPENSE
1            UMA            100
1            UMA            200
1            UMA            300
2            POOJITHA       200
2            POOJITHA       300
2            POOJITHA       400

Output Data:

CUSTOMER_ID  CUSTOMER_NAME  JAN_EXPENSE  FEB_EXPENSE  MAR_EXPENSE  Q1AVERAGE
1            UMA            100          200          300          200
2            POOJITHA       200          300          400          300
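
As a rough illustration only (again Python/pandas rather than the stage itself), the vertical pivot collects the three Q1EXPENSE rows per customer back into three columns; the group-wise row number below plays the role of the array size of 3, and the average is computed across the pivoted columns:

import pandas as pd

long_df = pd.DataFrame({
    "CUSTOMER_ID":   [1, 1, 1, 2, 2, 2],
    "CUSTOMER_NAME": ["UMA"] * 3 + ["POOJITHA"] * 3,
    "Q1EXPENSE":     [100, 200, 300, 200, 300, 400],
})

# Number the rows within each customer group: 0, 1, 2 (array size = 3)
long_df["idx"] = long_df.groupby("CUSTOMER_ID").cumcount()

# Vertical pivot: rows become columns
wide = (long_df
        .set_index(["CUSTOMER_ID", "CUSTOMER_NAME", "idx"])["Q1EXPENSE"]
        .unstack("idx"))
wide.columns = ["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"]
wide["Q1AVERAGE"] = wide[["JAN_EXPENSE", "FEB_EXPENSE", "MAR_EXPENSE"]].mean(axis=1)

print(wide.reset_index())   # matches the output table above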

Now open the Pivot Enterprise stage and go to the Properties tab on the Stage page:

Go to the Pivot Properties tab on the Stage page and click the Load button to load the metadata:

Now specify the array size as 3 because there are three monthly expense values per customer in the input data and we need to convert them into one output row.

Configure the Mapping tab on the Output page of the stage.

Compile and run the job:


The Aggregator Stage—Datastage

April 27, 2011

The Aggregator Stage:

The Aggregator stage is a processing stage in DataStage used for grouping and summary operations. By default, the Aggregator stage executes in parallel mode in parallel jobs.

Note: In a parallel environment, the way we partition data before grouping and summarizing affects the results. If you partition the data using the round-robin method, records with the same key values will be distributed across different partitions, which will give incorrect results.
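
Here is a small plain-Python sketch (not DataStage) of the point above: round-robin scatters rows with the same key across partitions, while hashing on the grouping key keeps each key together so partition-local group totals are correct.

rows = [("UMA", 100), ("POOJITHA", 200), ("UMA", 200), ("POOJITHA", 300), ("UMA", 300)]
num_partitions = 2

# Round-robin: rows for the same key end up in different partitions,
# so aggregating within each partition produces partial (incorrect) groups.
round_robin = [[] for _ in range(num_partitions)]
for i, row in enumerate(rows):
    round_robin[i % num_partitions].append(row)

# Hash on the grouping key: every row for a given key lands in the same partition.
hashed = [[] for _ in range(num_partitions)]
for key, value in rows:
    hashed[hash(key) % num_partitions].append((key, value))

print(round_robin)
print(hashed)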

Aggregation Method:

The Aggregator stage has two different aggregation methods.

1) Hash: Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory.

2) Sort: Sort mode requires the input data set to have been partitioned and sorted, with all of the grouping keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort aggregator requires presorted data but only maintains the calculations for the current group in memory (see the sketch below).
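
To illustrate the memory behaviour of sort mode, here is a hedged sketch in plain Python: because the input already arrives sorted on the grouping key, only the running totals for the current group are ever held in memory.

def sort_aggregate(sorted_rows):
    """Sum and count per key, assuming rows arrive sorted by the grouping key."""
    current_key, total, count = None, 0, 0
    for key, value in sorted_rows:
        if key != current_key:
            if current_key is not None:
                yield current_key, total, count   # emit the finished group
            current_key, total, count = key, 0, 0
        total += value
        count += 1
    if current_key is not None:
        yield current_key, total, count           # emit the last group

rows = [("POOJITHA", 200), ("POOJITHA", 300), ("UMA", 100), ("UMA", 200)]
print(list(sort_aggregate(rows)))   # [('POOJITHA', 500, 2), ('UMA', 300, 2)]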

Aggregation Data Type:

By default, the Aggregator stage's calculation output column is of the double data type; if you want decimal output, add the following property as shown in the figure below.
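
As a quick illustration of why the decimal option can matter (plain Python, not DataStage), double arithmetic carries binary rounding error that decimal arithmetic avoids:

from decimal import Decimal

print(0.1 + 0.2)                          # 0.30000000000000004 (double)
print(Decimal("0.1") + Decimal("0.2"))    # 0.3 (decimal)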

If you are using a single key column as the grouping key, then there is no need to sort or hash partition the incoming data.


Reading File using Sequential File Stage

April 26, 2011

Sequential File:

The Sequential File stage is a file stage. It allows you to read data from, or write data to, one or more flat files, as shown in the figure below:

The stage executes in parallel mode by default if reading multiple files but executes sequentially if it is only reading one file.

In order to read a sequential file, DataStage needs to know the format of the file.

If you are reading a delimited file, you need to specify the delimiter on the Format tab.

Reading a Fixed-Width File:

Double-click the Sequential File stage and go to the Properties tab.

Source:

File: Give the file name, including the path.

Read Method: Whether to specify file names explicitly or use a file pattern.

Important Options:

First Line is Column Names: If set to True, the first line of a file contains column names on writing and is ignored on reading.

Keep File Partitions: Set to True to partition the read data set according to the organization of the input file(s).

Reject Mode: Continue simply discards any rejected rows; Fail stops the job if any row is rejected; Output sends rejected rows down a reject link.

For fixed-width files, however, you can configure the stage to behave differently:
*  You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems.
*  You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read.

These two options are mutually exclusive.
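
As a rough sketch of why the multi-node option needs fixed-length records (the file path and record length below are made up for illustration): with a known record length, each reader can seek straight to its own byte range, which is impossible for variable-length delimited records.

import os

RECORD_LEN = 40    # hypothetical fixed record length in bytes

def read_slice(path, reader_id, num_readers):
    """Read only this reader's share of a fixed-length-record file."""
    total_records = os.path.getsize(path) // RECORD_LEN
    start = reader_id * total_records // num_readers
    end = (reader_id + 1) * total_records // num_readers
    with open(path, "rb") as f:
        f.seek(start * RECORD_LEN)            # jump directly to this reader's records
        for _ in range(end - start):
            yield f.read(RECORD_LEN)

# e.g. reader 1 of 4 on a hypothetical file:
# for record in read_slice("/data/customers.dat", 1, 4): ...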

Scenario 1:

Reading file sequentially.

Scenario 2:

Read From Multiple Nodes = Yes

Once we set Read From Multiple Nodes = Yes, the stage executes in parallel mode by default.

If you run the job with the above configuration, it will abort with the following fatal error:

sff_SourceFile: The multinode option requires fixed length records. (This means you can use this option only to read fixed-width files.)

To fix the above issue, go to the Format tab and add additional parameters as shown below.

Now the job finishes successfully; see the DataStage monitor below for the performance improvement compared with reading from a single node.

Scenario 3: Read a delimited file by adding the Number Of Readers Per Node option instead of the multinode option to improve read performance. Once we add this option, the Sequential File stage executes in parallel mode by default.

If we are reading from or writing to fixed-width files, it is good practice to add the APT_STRING_PADCHAR DataStage environment variable and assign it a value of 0x20; fields will then be padded with spaces, otherwise DataStage will pad with the null character (the DataStage default padding character).
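
A small Python sketch of that padding difference (illustrative only): padding with 0x20 keeps the fixed-width field readable, while null-byte padding is easy to mishandle downstream.

name = "UMA"
field_width = 10

padded_with_spaces = name.ljust(field_width, " ")      # 'UMA' plus seven 0x20 spaces
padded_with_nulls  = name.ljust(field_width, "\x00")   # 'UMA' plus seven null bytes

print(repr(padded_with_spaces))   # 'UMA       '
print(repr(padded_with_nulls))    # 'UMA\x00\x00\x00\x00\x00\x00\x00'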

Always keep Reject Mode = Fail to make sure the DataStage job fails if we get badly formatted data from the source systems.


Parallel Checksum stage in Datastage 8.5

April 23, 2011

CheckSum Stage:

Use the Checksum stage to generate a checksum value from the specified columns in a row and add the
checksum to the row.

Adding a checksum column to your data

You can add a checksum column to your data by adding a Checksum stage in your data flow.

Procedure
To add a checksum column to your data:
1. Place a Checksum stage on your job design canvas and position it so that it receives data from your
data source, and writes it to your data target.

2. Open the checksum stage editor and optionally supply values for the following properties:

Properties for Checksum Stage:

Computation Mode:

Use this property to specify whether the checksum value is generated from all available columns
(the default), all columns except the ones specified, or only the columns specified.

If you select Use all columns except those specified, the following property is displayed:

Exclude Column: Select a column from the Exclude columns list. You can repeat this property to exclude multiple columns.

If you select Use only columns specified, the following property is displayed:

Compute Column: Select a column from the Compute columns list. You can repeat this property to include multiple columns in the computation of the checksum.

Optional properties

Option 2: Buffer Output Column Name

Type: Output Column
Specifies the output column name that contains the buffer the checksum algorithm was run with. When not specified, this column is not produced.

Option 3: Checksum Output Column Name

Type: Output Column

Specifies the output column name containing the checksum value. When not specified, the column name defaults to ‘checksum’.
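
The following is a minimal sketch of the idea in Python, not the stage's internal implementation (the use of MD5 and the '|' separator here are assumptions): the selected column values are concatenated into a buffer, the buffer is hashed, and the result is written to a new column named checksum.

import hashlib

def add_checksum(row, compute_columns):
    """Append a checksum computed from the chosen columns of one row."""
    buffer = "|".join(str(row[c]) for c in compute_columns)   # the buffer the hash runs on
    row["checksum"] = hashlib.md5(buffer.encode("utf-8")).hexdigest()
    return row

row = {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "JAN_EXPENSE": 100}
print(add_checksum(row, ["CUSTOMER_ID", "CUSTOMER_NAME"]))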

Job Diagram

Mapping

Output
