Archive

Archive for April, 2011

The Aggregator Stage—Datastage

April 27, 2011 Leave a comment

The Aggregator Stage:

The Aggregator stage is a processing stage in DataStage used for grouping and summary operations. By default, the Aggregator stage executes in parallel mode in parallel jobs.

Note: In a parallel environment, the way we partition data before grouping and summarizing affects the results. If you partition the data using the round-robin method, records with the same key values will be distributed across different partitions, which gives incorrect results.
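To see why, here is a minimal Python sketch (illustrative only, not DataStage code) of per-partition grouping under round-robin versus key-based partitioning; the sample rows and two-partition setup are assumptions for the example:

from collections import Counter

rows = [("A", 10), ("A", 20), ("B", 5), ("B", 15), ("A", 30), ("B", 25)]

def count_groups_per_partition(partitions):
    # Each partition aggregates independently, just as each parallel node would.
    return [Counter(key for key, _ in part) for part in partitions]

# Round-robin: rows with the same key end up in different partitions,
# so each partition reports its own partial group and groups are duplicated.
round_robin = [rows[0::2], rows[1::2]]
print(count_groups_per_partition(round_robin))

# Key-based (hash) partitioning: all rows for a key land in one partition,
# so every group is counted exactly once.
hashed = [[r for r in rows if hash(r[0]) % 2 == p] for p in range(2)]
print(count_groups_per_partition(hashed))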

Aggregation Method:

The Aggregator stage has two different aggregation methods.

1)Hash:Use hash mode for a relatively small number of groups; generally, fewer than about 1000 groups per megabyte of memory.

2)Sort: Sort mode requires the input data set to have been partitioned and sorted, with all of the grouping keys specified as hashing and sorting keys. Unlike the hash aggregator, the sort aggregator requires presorted data, but it only maintains the calculations for the current group in memory.
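A rough Python sketch of the difference between the two methods (illustrative only; the sample rows are assumptions): hash mode keeps one running total per group in memory, while sort mode walks presorted data and keeps only the current group.

from itertools import groupby
from operator import itemgetter

rows = [("B", 5), ("A", 10), ("A", 20), ("B", 15)]

def hash_aggregate(rows):
    # Hash mode: one accumulator per group, so memory grows with the
    # number of distinct groups.
    totals = {}
    for key, value in rows:
        totals[key] = totals.get(key, 0) + value
    return totals

def sort_aggregate(sorted_rows):
    # Sort mode: the input must already be sorted on the grouping key;
    # only the current group's total is held at any time.
    for key, group in groupby(sorted_rows, key=itemgetter(0)):
        yield key, sum(v for _, v in group)

print(hash_aggregate(rows))
print(list(sort_aggregate(sorted(rows, key=itemgetter(0)))))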

Aggregation Data Type:

By default, the output column of an Aggregator stage calculation has the double data type; if you want decimal output, add the decimal output property to the calculation.

If you are using a single key column as the grouping key, there is no need to sort or hash partition the incoming data.

Categories: Datastage-Stages

Datastage Common Errors-Warnings and resolution

April 27, 2011 3 comments

1)When we use Same partitioning in the DataStage Transformer stage, we get the following warning in version 7.5.2.

TFCP000043      2       3       input_tfm: Input dataset 0 has a partitioning method other than entire specified; disabling memory sharing.

This is a known issue, and you can safely demote the warning to informational by adding it to a project-specific message handler.

2) Warning: A sequential operator cannot preserve the partitioning of input data set on input port 0

Resolution: Clear the preserve partitioning flag before Sequential File stages.

3)DataStage parallel job fails with fork() failed, Resource temporarily unavailable

On AIX, execute the following command to check the maxuproc setting, and increase it if you plan to run multiple jobs at the same time.

lsattr -E -l sys0 | grep maxuproc
maxuproc        1024               Maximum number of PROCESSES allowed per user      True

4)TFIP000000              3       Agg_stg: When checking operator: When binding input interface field “CUST_ACT_NBR” to field “CUST_ACT_NBR”: Implicit conversion from source type “string[5]” to result type “dfloat”: Converting string to number.

Resolution: Use a Modify stage to explicitly convert the data type before sending the data to the Aggregator stage.

5)Warning: A user defined sort operator does not satisfy the requirements.

Resolution: Check the order of the sort columns, and make sure the Join stage that follows the sort uses its keys in the same order when joining the two inputs.

6)TFTM000000      2       3      Stg_tfm_header,1: Conversion error calling conversion routine timestamp_from_string data may have been lost

TFTM000000              1       xfmJournals,1: Conversion error calling conversion routine decimal_from_string data may have been lost

Resolution: Check for the correct date or decimal format, and for null values in the date or decimal fields, before passing them to the DataStage StringToDate, DateToString, DecimalToString, or StringToDecimal functions (see the sketch after this list).

7)TOSO000119      2       3      Join_sort: When checking operator: Data claims to already be sorted on the specified keys the ‘sorted’ option can be used to confirm this. Data will be resorted as necessary. Performance may improve if this sort is removed from the flow

Resolution: Sort the data before sending it to the Join stage, and check that the sort keys and the join keys are in the same order.

8)TFOR000000      2       1       Join_Outer: When checking operator: Dropping component “CUST_NBR” because of a prior component with the same name.

Resolution: If you are using Join, Difference, Merge, or Compare stages, make sure both input links have different column names apart from the key columns.

9)TFIP000022              1       oci_oracle_source: When checking operator: When binding output interface field “MEMBER_NAME” to field “MEMBER_NAME”: Converting a nullable source to a non-nullable result;

Resolution: If you are reading from an Oracle database, or from any processing stage, where the incoming column is defined as nullable, and you define the metadata in DataStage as non-nullable, you will get the above issue. If you want to convert a nullable field to non-nullable, make sure you apply the available null-handling functions in DataStage or in the extract query.

Null functions in Oracle: NVL, NVL2

DataStage: IsNull, NullToEmpty, NullToZero
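For warnings 6 and 9, the common pattern is to check for nulls and bad formats before converting. Outside DataStage, the same defensive idea looks roughly like this Python sketch (the helper names and the date format are assumptions for illustration, not DataStage functions):

from datetime import datetime
from decimal import Decimal, InvalidOperation

def safe_string_to_date(value, fmt="%Y-%m-%d"):
    # Reject NULL/empty or badly formatted strings up front instead of
    # letting the conversion routine fail with "data may have been lost".
    if value is None or not value.strip():
        return None
    try:
        return datetime.strptime(value.strip(), fmt).date()
    except ValueError:
        return None

def safe_string_to_decimal(value):
    if value is None or not value.strip():
        return None
    try:
        return Decimal(value.strip())
    except InvalidOperation:
        return None

print(safe_string_to_date("2011-04-27"))   # 2011-04-27
print(safe_string_to_date(""))             # None
print(safe_string_to_decimal("12.50"))     # Decimal('12.50')
print(safe_string_to_decimal("abc"))       # None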

Datastage Schema File usage

April 27, 2011 2 comments

In DataStage, schemas are an alternative way to specify column definitions for the data used by parallel jobs.

Schema format:

The following schema format is used to read a fixed-width file:

//Schema file is used to read input data without specifying metadata in the Sequential File stage
//Created On : 11/17/2010
//Created By : Uma M Katru
record
{final_delim=end,delim=none}
(
CUSTOMER_SSN:STRING[11];
CUSTOMER_NAME:STRING[30];
CUSTOMER_CITY:STRING[40];
CUSTOMER_ZIPCODE:STRING[10];
)

Now we use the Sequential File stage to read data from the data file using the schema file.

In order to use the schema file, we need to enable runtime column propagation in the job properties.

In the Sequential File stage, add the Schema File option and give the schema file name, including its path.
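For comparison, here is a short Python sketch of how the fixed-width layout described by this schema would be parsed outside DataStage (the data file name customers.dat is a hypothetical example):

# Field widths taken from the schema above: SSN 11, NAME 30, CITY 40, ZIP 10.
FIELDS = [("CUSTOMER_SSN", 11), ("CUSTOMER_NAME", 30),
          ("CUSTOMER_CITY", 40), ("CUSTOMER_ZIPCODE", 10)]

def parse_fixed_width(line):
    record, pos = {}, 0
    for name, width in FIELDS:
        record[name] = line[pos:pos + width].rstrip()
        pos += width
    return record

with open("customers.dat") as f:   # hypothetical data file
    for line in f:
        print(parse_fixed_width(line.rstrip("\n")))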

Sorting In Datastage

April 26, 2011 1 comment

In DataStage, we have two types of sorting methods available in parallel jobs.

1)Link Sort

2)Sort Stage

By default, both methods use the same sort package (the DataStage tsort operator).

Link sort offers fewer options but is easier to maintain in a DataStage job, since it adds fewer stages to the job canvas.

The standalone Sort stage offers more options.

Use link sort unless a specific option is needed on the standalone stage. Most often, the standalone Sort stage is used to specify the Sort Key Mode for partial sorts.

1)Link Sort

Sorting on a link is specified in the input/partitioning options of a stage when a keyed partitioning method is chosen; this option is not available with the Auto partitioning method.

Additional options can be specified by right-clicking the key column.

Key Column Usage:

1)Sorting only

2)Partitioning only

3)Sorting,Partitioning

If you want to remove duplicates, you can check the Unique option instead of using a Remove Duplicates stage.

2)Sort Stage:

The standalone Sort stage offers more options, which are not available in link sort.

The following properties are not available when sorting on a link:

Sort Key Mode: A very important performance improvement option.

Create Cluster Key Change Column:True creates the column clusterKeyChange. This column is set to 1 for the first row in each group where groups are defined by using a Sort Key Mode of Don’t Sort. Subsequent records in the group have the column set to 0.

Create Key Change Column:

Output Statistics:

Sort Utility:(Never change this)

Restrict Memory Usage:Causes the stage to restrict itself to the specified number of megabytes of virtual memory on a processing node.

We use Sort Key Mode option most frequently.

Stable Sort
Stable sorts preserve the order of non-key columns within each sort group. This requires some
additional overhead in the sort algorithm, and thus a stable sort is generally slower than a non-stable
sort for the same input Data Set and sort keys. For this reason, disable Stable sort unless needed.

It is important to note that by default the Stable sort option is disabled for sorts on a link and Enabled
with the standalone Sort stage.
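As a quick illustration of what "stable" means (plain Python, not DataStage; sorted() happens to be a stable sort):

# Rows are in arrival order; we sort on the key column only.
rows = [("B", "first"), ("A", "first"), ("B", "second"), ("A", "second")]

# A stable sort keeps rows with equal keys in their original relative
# order, so "first" stays ahead of "second" within each key group.
print(sorted(rows, key=lambda r: r[0]))
# [('A', 'first'), ('A', 'second'), ('B', 'first'), ('B', 'second')]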

Sub-Sorts
Within the standalone Sort stage, the key column property “Sort Key Mode” is a particularly powerful
feature and a significant performance optimization. It is used when resorting a sub-grouping of a
previously sorted input Data Set, instead of performing a complete Sort. This “subsort” uses
significantly less disk space and CPU resource, and can often be performed in memory (depending on
the size of the new subsort groups).
To resort based on a sub-grouping, all key columns must still be defined in the Sort stage. Re-used sort
keys are specified with the “Don’t Sort (Previously Sorted)” property, while new sort keys are
specified with the “Sort” key mode property, as shown in the following example:
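Here is a small Python sketch of the subsort idea (illustrative only, not the Sort stage settings themselves; the sample customer rows are assumptions): the existing CUSTOMER_ID order is kept, and only the new key is sorted within each group.

from itertools import groupby
from operator import itemgetter

# Already sorted on CUSTOMER_ID by an earlier stage.
rows = [(1, "2011-03-01"), (1, "2011-01-15"), (2, "2011-02-02"), (2, "2011-01-01")]

# Subsort: keep the existing CUSTOMER_ID grouping ("Don't Sort (Previously
# Sorted)") and sort only within each group on the new key, the date.
resorted = []
for _, group in groupby(rows, key=itemgetter(0)):
    resorted.extend(sorted(group, key=itemgetter(1)))
print(resorted)
# [(1, '2011-01-15'), (1, '2011-03-01'), (2, '2011-01-01'), (2, '2011-02-02')]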

Automatically inserted sorts:

By default, DataStage parallel jobs insert sort operators as necessary to ensure correct results. We can identify this in the DataStage Director log by adding the APT_DUMP_SCORE environment variable to the job.

op2[4p] {(parallel inserted tsort operator {key={value=CUSTOMER_ID, subArgs={asc}}})
on nodes (
node1[op2,p0]
node2[op2,p1]
node3[op2,p2]
node4[op2,p3]
)}

Typically, DataStage inserts a sort operator before any stage that requires sorted input if we have not sorted the data before passing it to that stage.

Examples:

Join, Merge, Remove Duplicates, Sort Aggregator

There are two ways to prevent Enterprise Edition from inserting an unnecessary sort:
1) Insert an upstream Sort stage on each link, define all sort key columns with the Sort Mode
key property “Don’t Sort (Previously Sorted)”

2) Set the environment variable APT_SORT_INSERTION_CHECK_ONLY. This will verify sort
order but not actually perform a sort, aborting the job if data is not in the required sort order

Tuning Sort:

Sort is a very expensive task in DataStage, as it requires CPU, memory, and disk resources. DataStage sorts the input data in memory if it all fits; otherwise it uses the scratch space defined in the configuration file.

By default, each sort uses 20 MB of memory per partition; this value can be changed in each standalone Sort stage by adding the “Restrict Memory Usage” option.

We can use APT_TSORT_STRESS_BLOCKSIZE  to specify the size of the memory buffer, in MB, for all sort operators (link and standalone), overriding any per-sort specifications.

The file system configuration and number of scratch disks defined in parallel configuration file can
greatly impact the I/O performance of a parallel sort. Having a greater number of scratch disks for each
node allows the sort to spread I/O across multiple file systems.

Generate Surrogate Key without Surrogate Key Stage

April 26, 2011 2 comments

Use the following formula in a Transformer stage to generate a surrogate key.

@PARTITIONNUM + (@NUMPARTITIONS * (@INROWNUM - 1)) + 1
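A quick Python check of why the formula yields unique keys (partition numbers are 0-based, row numbers within a partition are 1-based; the two-partition, three-row setup is just an example):

def surrogate_key(partition_num, num_partitions, in_row_num):
    # Same arithmetic as the Transformer expression above.
    return partition_num + (num_partitions * (in_row_num - 1)) + 1

# Two partitions, three rows each: the keys interleave but never collide.
for p in range(2):
    print([surrogate_key(p, 2, r) for r in range(1, 4)])
# [1, 3, 5]
# [2, 4, 6]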

Reading File using Sequential File Stage

April 26, 2011 6 comments

Sequential File:

The Sequential File stage is a file stage. It allows you to read data from, or write data to, one or more flat files.

The stage executes in parallel mode by default if reading multiple files but executes sequentially if it is only reading one file.

In order to read a sequential file, DataStage needs to know the format of the file.

If you are reading a delimited file, you need to specify the delimiter in the Format tab.

Reading Fixed width File:

Double-click the Sequential File stage and go to the Properties tab.

Source:

File:Give the file name including path

Read Method:Whether to specify filenames explicitly or use a file pattern.

Important Options:

First Line is Column Names:If set true, the first line of a file contains column names on writing and is ignored on reading.

Keep File Partitions:Set True to partition the read data set according to the organization of the input file(s).

Reject Mode: Continue to simply discard any rejected rows; Fail to stop if any row is rejected; Output to send rejected rows down a reject link.

For fixed-width files, however, you can configure the stage to behave differently:
*  You can specify that single files can be read by multiple nodes. This can improve performance on cluster systems.
*  You can specify that a number of readers run on a single node. This means, for example, that a single file can be partitioned as it is read.

These two options are mutually exclusive.

Scenario 1:

Reading file sequentially.

Scenario 2:

Read From Multiple Nodes = Yes

Once we set Read From Multiple Nodes = Yes, the stage executes in parallel mode by default.

If you run the job with the above configuration, it will abort with the following fatal error.

sff_SourceFile: The multinode option requires fixed length records. (That means you can use this option only to read fixed-width files.)

To fix the above issue, go to the Format tab and add the additional parameters as shown below.

Now the job finishes successfully; see the DataStage monitor below for the performance improvement compared with reading from a single node.

Scenario 3: Read a delimited file by adding the Number of Readers Per Node option instead of the multinode option to improve read performance. Once we add this option, the Sequential File stage executes in parallel mode by default.

If we are reading from and writing to fixed-width files, it is good practice to add the APT_STRING_PADCHAR DataStage environment variable and assign 0x20 as its value so that fields are padded with spaces; otherwise DataStage pads with the null character (the DataStage default padding character).

Always keep Reject Mode = Fail to make sure the DataStage job fails if we get badly formatted data from source systems.

Categories: Datastage-Stages

Parallel Checksum stage in Datastage 8.5

April 23, 2011 1 comment

CheckSum Stage:

Use the Checksum stage to generate a checksum value from the specified columns in a row and add the
checksum to the row.

Adding a checksum column to your data

You can add a checksum column to your data by adding a Checksum stage in your data flow.

Procedure
To add a checksum column to your data:
1. Place a Checksum stage on your job design canvas and position it so that it receives data from your
data source, and writes it to your data target.

2. Open the Checksum stage editor and optionally supply values for the following properties:

Properties for Checksum Stage:

Computation Mode:

Use this property to specify whether the checksum value is generated from all available columns
(the default), all columns except the ones specified, or only the columns specified.

If you select Use all columns except those specified, the following property is displayed:

Exclude Column:Select a column from the Exclude columns list. You can repeat this property to exclude
multiple columns.

If you select Use only columns specified, the following property is displayed:

Compute Column
Select a column from the Compute columns list. You can repeat this property to include
multiple columns in the computation of the checksum.

Optional properties

Option 2: Buffer Output Column Name

Type: Output Column
Specifies the output column name that contains the buffer the checksum algorithm was run with. When not specified, this column is not produced.

Option 3: Checksum Output Column Name

Type: Output Column

Specifies the output column name containing the checksum value. When not specified, the column name defaults to ‘checksum’.
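Conceptually, the stage hashes the selected column values and appends the result as a new column. A rough Python sketch of that idea (illustrative only; the sample row, the separator, and the use of MD5 here are assumptions, not the stage's internal buffer format):

import hashlib

def add_checksum(row, compute_columns=None, exclude_columns=()):
    # Build a buffer from the chosen columns and hash it, mirroring the
    # Compute/Exclude column options described above.
    cols = compute_columns or [c for c in row if c not in exclude_columns]
    buffer = "|".join(str(row[c]) for c in cols)
    out = dict(row)
    out["checksum"] = hashlib.md5(buffer.encode()).hexdigest()
    return out

row = {"CUST_NBR": 101, "CUST_NAME": "SMITH", "LOAD_DT": "2011-04-23"}
print(add_checksum(row, exclude_columns=("LOAD_DT",)))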

Job Diagram

Mapping

Output

Categories: Datastage-Stages

IBM InfoSphere DataStage 8.5 New features

April 23, 2011 Leave a comment

New features and changes in IBM InfoSphere Information Server Datastage, Version 8.5

Enabling Operational Repository logging for a project:

You can now specify whether the jobs in a project write logging information to the operational repository when they run.

Figure 2

Missing columns in the Data Set stage:

You can now specify what action is taken for columns that are defined on the link but are not contained in the source data set.

Example: I have created a data set with only an ID column, as shown below.

In the next job, I read the same data set but specify one more column, called NAME, which is not present in the source data set created by the previous job.

We now have a new option available in the Data Set stage properties, called Missing Column Mode, as shown below.

Copying roles from an existing project:

You can now specify that a new project will inherit users, groups, and associated roles from an existing project.

Environment variables created during import:

I have a parallel job developed in 7.5 with environment variables such as TgtDB, TgtUsrID, TgtPswd, and TgtSchema. When I imported the same job into a DataStage 8.5 project, the environment variables that did not already exist in the project were automatically created.


Importing Table Definitions

April 15, 2011 2 comments

DataStage is a metadata driven data integration environment. The data that DataStage processes is
described by metadata objects called Table Definitions, which describe the columns, data types, and
other attributes associated with the source/target data structures

Capturing and defining Table Definitions should be the first step in creation of the data flow. Use
caution when selecting an import mechanism – DataStage/EE is a tightly data-typed environment; in
fact the parallel environment obtains some of its throughput efficiencies by relying on the fact that the
data types are strictly defined. The Table Definition is the principal mechanism that defines the
import/export process since it provides the external format the data exists in (Native type) as well as the
internal format that the data is processed in (SQL type) and the conversion requirements between the
two.

The following table describes the various types of source/target data structures, the mechanisms which
can be used to import them through the Manager or Designer client, and the recommended import
mechanism.

Figure1

In 8.5, we can use the Connector import wizard to import metadata from Oracle, DB2, and Teradata databases and from ODBC connections.

Figure2

Figure3
