Archive for the ‘Datastage-Stages’ Category

Surrogate Key Generator stage — InfoSphere Information Server 8.5

January 16, 2012

The Surrogate Key Generator stage is a processing stage that generates surrogate key columns and maintains the key source.

A surrogate key is a unique primary key that is not derived from the data that it represents, therefore changes to the data will not change the primary key. In a star schema database, surrogate keys are used to join a fact table to a dimension table.
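As a tiny illustration of the idea, here is a conceptual Python sketch (the table and column names are made up for the example and are not from the post): the fact row carries the surrogate key, so the natural key or any attribute can change without touching the fact table.

customer_dim = {                       # surrogate key -> dimension attributes
    1001: {"CUSTOMER_ID": "C-42", "CUSTOMER_NAME": "UMA"},
}
fact_sales = [
    {"CUSTOMER_SKEY": 1001, "AMOUNT": 250.0},
]

# The fact row joins to the dimension on the surrogate key, not on the natural key.
for f in fact_sales:
    print(f["AMOUNT"], customer_dim[f["CUSTOMER_SKEY"]]["CUSTOMER_NAME"])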

The Surrogate Key Generator stage can be used to:

  • Create or delete the key source before other jobs run
  • Update a state file with a range of key values
  • Generate surrogate key columns and pass them to the next stage in the job
  • View the contents of the state file

Generated keys are 64-bit integers, and the key source can be a flat file (state file) or a database sequence.
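Conceptually, a flat-file key source just records the highest key value handed out so far: the stage reads it, generates new keys from that point, and writes the new high-water mark back. The Python sketch below is only an illustration of that idea under my own assumptions (the file path is borrowed from the example further down; this is not DataStage's actual state-file format or code):

import os

STATE_FILE = "/tmp/skeycutomerdim.stat"  # path reused from the example below

def allocate_keys(count, state_file=STATE_FILE):
    # Read the last highest value used (an empty or missing file means start from 0).
    last = 0
    if os.path.exists(state_file) and os.path.getsize(state_file) > 0:
        with open(state_file) as f:
            last = int(f.read().strip())
    keys = list(range(last + 1, last + count + 1))
    # Persist the new highest value so the next run continues from here.
    with open(state_file, "w") as f:
        f.write(str(keys[-1]))
    return keys

print(allocate_keys(10))  # first run: [1, ..., 10]; if the file holds 100, the next run starts at 101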

Creating the key source

Drag the Surrogate Key Generator stage from the palette onto the parallel job canvas, with no input or output links.

Double-click the Surrogate Key Generator stage and go to the Properties tab.

Properties:

Key Source Action = Create

Source Type = Flat File or Database sequence (in this case we are using Flat File)

When you run the job, it creates an empty state file.

If you want to check the contents, set View State File = Yes and check the job log for details.

skey_genstage,0: State file /tmp/skeycutomerdim.stat is empty.

If you try to create the same file again, the job aborts with the following error.

skey_genstage,0: Unable to create state file /tmp/skeycutomerdim.stat: File exists.

Deleting the key source:

To delete the key source, set the Key Source Action property to Delete and run the job.

Updating the state file:

To update the state file, add a Surrogate Key Generator stage to the job with a single input link from another stage.

We use this process to update the state file if it is corrupted or deleted.

1) Open the Surrogate Key Generator stage editor and go to the Properties tab.

If the state file exists, we can update it; otherwise we can create and then update it.

We are using an SkeyValue job parameter to update the state file through a Transformer stage.

Generating Surrogate Keys:

Now that we have created the state file, we will generate keys using it.

Open the Surrogate Key Generator stage, go to the Properties tab, and type a name for the surrogate key column in the Generated Output Column Name property.

Go to the Output tab and define the mapping.

The Row Generator stage is set to produce 10 rows, so when we run the job we see 10 surrogate key values in the output.

I have updated the state file with 100, and the generated keys in the output continue from that value.

If you want to control where key generation starts, you can use the following properties in the Surrogate Key Generator stage:

  1. If the key source is a flat file, specify how keys are generated:
    • To generate keys in sequence from the highest value that was last used, set the Generate Key from Last Highest Value property to Yes. Any gaps in the key range are ignored.
    • To specify a value to initialize the key source, add the File Initial Value property to the Options group, and specify the start value for key generation.
    • To control the block size for key ranges, add the File Block Size property to the Options group, set this property to User specified, and specify a value for the block size.
  2. If there is no input link, add the Number of Records property to the Options group, and specify how many records to generate.

 


Filter Stage

April 30, 2011

Filter Stage:

The Filter stage is a processing stage used to filter records based on a filter condition.

The Filter stage is configured by creating an expression in the Where clause.

Scenario 1: Check for empty values in the customer name field. We are reading from a sequential file, so we should check for an empty value instead of null.

Scenario 2: Comparing incoming fields. Check whether the transaction date falls between STR_DT and END_DT and filter the records accordingly.

Input Data:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT STR_DT END_DT
1 UMA 1/1/2010 5/20/2010 12/20/2010
1 UMA 5/28/2011 5/20/2010 12/20/2010

Output:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT STR_DT END_DT
1 UMA 5/28/2011 5/20/2010 12/20/2010

Reject:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT STR_DT END_DT
1 UMA 1/1/2010 5/20/2010 12/20/2010


Partition the data on CUSTOMER_ID to make sure all rows with the same key values are processed on the same node.

Condition: where TRANSACTION_DT between STR_DT and END_DT
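For clarity, here is the same where-clause logic as a conceptual Python sketch (an illustration of an output link plus a reject link, not DataStage code; it assumes the dates in the sample tables are written M/D/YYYY):

from datetime import datetime

def parse(d):
    # Dates in the sample data are written as M/D/YYYY
    return datetime.strptime(d, "%m/%d/%Y").date()

def filter_stage(rows):
    # Rows satisfying the where clause go to the output link, the rest to the reject link
    output, reject = [], []
    for r in rows:
        if parse(r["STR_DT"]) <= parse(r["TRANSACTION_DT"]) <= parse(r["END_DT"]):
            output.append(r)
        else:
            reject.append(r)
    return output, reject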

Actual Output:

Actual Reject Data:

Scenario 3: Evaluating input column data.

Example: where CUSTOMER_NAME = 'UMA' AND CUSTOMER_ID = '1'

Output :

Reject :


This covers most of the Filter stage scenarios.


Funnel Stage

April 30, 2011

Funnel Stage:

The Funnel stage is used to combine multiple input data sets into a single output data set. This stage can have any number of input links and a single output link.

It operates in 3 modes:

Continuous Funnel combines records as they arrive (i.e. no particular order);

Sort Funnel combines the input records in the order defined by one or more key fields;

Sequence copies all records from the first input data set to the output data set, then all the records from the second input data set, etc.

Note: The metadata for all inputs must be identical.

Sort Funnel requires the data to be sorted and partitioned by the same key columns as those used by the funnel operation.

Hash partitioning guarantees that all records with the same key column values are located in the same partition and are processed on the same node.
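The difference between the three modes can be summarised with a small conceptual Python sketch (plain Python as an analogy, not DataStage internals):

import heapq
from itertools import chain

def sequence_funnel(*inputs):
    # All records from the first input, then all from the second, and so on
    return list(chain(*inputs))

def sort_funnel(*inputs, key):
    # Inputs must already be sorted on the key; the merge preserves overall key order
    return list(heapq.merge(*inputs, key=key))

def continuous_funnel(*inputs):
    # Records are taken as they arrive; no particular order is guaranteed
    # (a simple round-robin is used here purely for illustration)
    iterators = [iter(i) for i in inputs]
    out = []
    while iterators:
        for it in list(iterators):
            try:
                out.append(next(it))
            except StopIteration:
                iterators.remove(it)
    return out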

1) Continuous Funnel:

Go to the Properties tab of the Funnel stage and set Funnel Type to Continuous Funnel.

2) Sequence:


Note: To use the Sequence funnel, you need to specify the order in which the input links are processed, and also make sure the stage runs in sequential mode.

We usually use the Sequence funnel when we create a file with header, detail, and trailer records.

3) Sort Funnel:

Note: If you are running your Sort Funnel stage in parallel, you should be aware of the various considerations about sorting data and partitioning.

That's all about Funnel stage usage in DataStage.



DataStage External Filter Stage

April 30, 2011

External Filter stage:

The External Filter stage is a processing stage that allows you to specify a UNIX command that acts as a filter on the data being processed.

An example is to use a grep command to search for a certain string or pattern and discard records that do not contain a match.

Input Data:

CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 KATRU
3 POOJI

Now I want to use a grep command to keep only the input records that contain "UMA" in the CUSTOMER_NAME field.

Command : grep -w “UMA”
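The -w flag matches "UMA" only as a whole word. A rough Python equivalent of what this command does to each record (just an illustration of the filter, not how the stage invokes the command):

import re

records = ["1 UMA", "2 KATRU", "3 POOJI"]

# grep -w "UMA": keep only records containing UMA as a whole word
kept = [r for r in records if re.search(r"\bUMA\b", r)]
print(kept)  # ['1 UMA']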

Interestingly, I also found one issue with this stage. I added a parameter DBPasswd as Encrypted on the Parameters page and used a cat command in the filter stage, and it displayed the password.

In the log I found that it decrypts the password, as shown below.

stg_External_Filter,0: Wrapped Unix command ‘cat dsblog’ terminated with error, exit code 1.stg_External_Filter.

I don't know whether this is a DataStage bug or just the way the stage behaves.

Thanks


Merge Stage

April 30, 2011

Merge Stage:

The Merge stage is a processing stage. It can have any number of input links, a single output link, and the same number of reject links as there are update input links (according to the DataStage documentation).

The Merge stage combines a master data set with one or more update data sets based on the key columns. The output record contains all the columns from the master record, plus any additional columns from each update record that are required.

A master record and an update record are merged only if both have the same key column values.

The data sets input to the Merge stage must be key partitioned and sorted. This ensures that rows with the same key column values are located in the same partition and will be processed by the same node. It also minimizes memory requirements because fewer rows need to be in memory at any one time.

As part of preprocessing your data for the Merge stage, you should also remove duplicate records from the master data set. If you have more than one update data set, you must remove duplicate records from the update data sets as well.

Unlike Join stages and Lookup stages, the Merge stage allows you to specify several reject links. You can route update link rows that fail to match a master row down a reject link that is specific for that link. You must have the same number of reject links as you have update links. The Link Ordering tab on the Stage page lets you specify which update links send rejected rows to which reject links. You can also specify whether to drop unmatched master rows, or output them on the output data link.

Example:

Master dataset:

CUSTOMER_ID CUSTOMER_NAME
1 UMA
2 POOJITHA

Update dataset1:

CUSTOMER_ID CITY ZIP_CODE SEX
1 CYPRESS 90630 M
2 CYPRESS 90630 F

Output:

CUSTOMER_ID CUSTOMER_NAME CITY ZIP_CODE SEX
1 UMA CYPRESS 90630 M
2 POOJITHA CYPRESS 90630 F
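The following conceptual Python sketch reproduces the example above (illustrative only; it assumes the inputs are already de-duplicated and ignores the partitioning and sorting that the real stage relies on):

def merge_stage(master, updates, drop_unmatched_masters=False):
    # Index each update data set by the key column
    update_maps = [{row["CUSTOMER_ID"]: row for row in ds} for ds in updates]
    master_keys = {m["CUSTOMER_ID"] for m in master}
    output = []
    for m in master:
        merged = dict(m)
        matched = False
        for umap in update_maps:
            u = umap.get(m["CUSTOMER_ID"])
            if u:
                merged.update(u)   # add the update columns to the master record
                matched = True
        if matched or not drop_unmatched_masters:
            output.append(merged)
    # Update rows with no matching master go to that update link's reject link
    rejects = [[u for u in ds if u["CUSTOMER_ID"] not in master_keys] for ds in updates]
    return output, rejects

master = [{"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA"},
          {"CUSTOMER_ID": 2, "CUSTOMER_NAME": "POOJITHA"}]
update1 = [{"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "M"},
           {"CUSTOMER_ID": 2, "CITY": "CYPRESS", "ZIP_CODE": "90630", "SEX": "F"}]
out, rej = merge_stage(master, [update1])  # out matches the Output table above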

Merge stage configuration steps:

Options:

Unmatched Masters Mode: Keep means that unmatched rows (those without any updates) from the master link are output; Drop means that unmatched rows are dropped instead.

Warn On Reject Updates: Set to True to generate a warning when bad records from any update links are rejected.

Warn On Unmatched Masters: Set to True to generate a warning when there are unmatched rows from the master link.

Partitioning: Hash on both the master input and the update input.

Compile and run the job :

Scenario 2:

Remove a record from the updateds1 and check the output:

Check for the DataStage warning in the job log, as we have selected Warn On Unmatched Masters = True.

stg_merge,0: Master record (0) has no updates.

stg_merge,1: Update record (1) of data set 1 is dropped; no masters are left.

Scenario 3: Drop unmatched master records and capture reject records from updateds1.

Scenario 4: Insert a duplicate record with the same customer ID in the master data set and check the results.

Looking at the output, it is clear that the Merge stage automatically dropped the duplicate record from the master data set.

Scenario 5: Add a new update data set (updateds2) which contains the following data.

Update Dataset2
CUSTOMER_ID CITIZENSHIP
1 INDIAN
2 AMERICAN

We still have a duplicate row in the master data set. If you compile the job with the above design, you will get a compilation error like the one below.

If you look at the above figure, you can see 2 rows in the output because we have a matching row for customer_id = 2 in updateds2.

Scenario 6: Add a duplicate row for customer_id = 1 in the updateds1 data set.

Now we have a duplicate record in both the master data set and updateds1. Run the job and check the results and warnings in the job log.

No change in the results; the Merge stage automatically dropped the duplicate row.

Scenario 7: Modify the duplicate row for customer_id = 1 in the updateds1 data set with ZIP code 90630 instead of 90620.

Run the job and check output results.

I ran the same job multiple times and found that the Merge stage takes the first record coming in from updateds1 and drops subsequent records with the same customer ID.

This post covered most of the merge scenarios.


Lookup Stage

April 30, 2011

Lookup Stage:

The Lookup stage is most appropriate when the reference data for all lookup stages in a job is small enough to fit into available physical memory. Each lookup reference requires a contiguous block of shared memory. If the Data Sets are larger than available memory resources, the JOIN or MERGE stage should be used.

Lookup stages do not require data on the input link or reference links to be sorted. Be aware, though, that large in-memory lookup tables will degrade performance because of their paging requirements. Each record of the output data set contains columns from a source record plus columns from all the corresponding lookup records where corresponding source and lookup records have the same value for the lookup key columns. The lookup key columns do not have to have the same names in the primary and the reference links.

The optional reject link carries source records that do not have a corresponding entry in the input lookup tables.

You can also perform a range lookup, which compares the value of a source column to a range of values between two lookup table columns. If the source column value falls within the required range, a row is passed to the output link. Alternatively, you can compare the value of a lookup column to a range of values between two source columns. Range lookups must be based on column values, not constant values. Multiple ranges are supported.

There are some special partitioning considerations for Lookup stages. You need to ensure that the data being looked up in the lookup table is in the same partition as the input data referencing it. One way of doing this is to partition the lookup tables using the Entire method.

Lookup stage configuration: Equal lookup


You can specify what action to perform if the lookup fails.

Scenario 1: Continue

Choose Entire partitioning on the reference link.

Scenario 2: Fail

The job aborted with the following error:

stg_Lkp,0: Failed a key lookup for record 2 Key Values: CUSTOMER_ID: 3

Scenario 3: Drop

Scenario 4: Reject

If we select Reject as the lookup failure condition, we need to add a reject link; otherwise we get a compilation error.
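The four lookup failure actions can be summarised with a conceptual Python sketch (illustrative only; column handling and nulls are simplified):

class LookupFailure(Exception):
    pass

def lookup_stage(source_rows, reference_rows, key, on_failure="continue"):
    ref = {r[key]: r for r in reference_rows}     # reference data held in memory
    output, rejects = [], []
    for row in source_rows:
        match = ref.get(row[key])
        if match is not None:
            output.append({**row, **match})
        elif on_failure == "continue":
            output.append(dict(row))              # pass the source row through unchanged
        elif on_failure == "drop":
            continue                              # silently discard the row
        elif on_failure == "reject":
            rejects.append(row)                   # send the row down the reject link
        elif on_failure == "fail":
            raise LookupFailure(f"Failed a key lookup for key value {row[key]}")
    return output, rejects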

Range Lookup:

Business scenario: We have input data with customer ID, customer name, and transaction date. We have a customer dimension table with customer address information. A customer can have multiple records with different start and end dates, and we want to select the record where the incoming transaction date falls between the customer's start and end dates in the dimension table.

Example input data:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT
1 UMA 2011-03-01
1 UMA 2010-05-01

Example dimension data:

CUSTOMER_ID CITY ZIP_CODE START_DT END_DT
1 BUENA PARK 90620 2010-01-01 2010-12-31
1 CYPRESS 90630 2011-01-01 2011-04-30

Expected Output:

CUSTOMER_ID CUSTOMER_NAME TRANSACTION_DT CITY ZIP_CODE
1 UMA 2011-03-01 CYPRESS 90630
1 UMA 2010-05-01 BUENA PARK 90620
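The range lookup logic can be expressed in a small conceptual Python sketch that reproduces the expected output above (illustrative only, not DataStage code):

from datetime import date

inputs = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "TRANSACTION_DT": date(2011, 3, 1)},
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "TRANSACTION_DT": date(2010, 5, 1)},
]
dim = [
    {"CUSTOMER_ID": 1, "CITY": "BUENA PARK", "ZIP_CODE": "90620",
     "START_DT": date(2010, 1, 1), "END_DT": date(2010, 12, 31)},
    {"CUSTOMER_ID": 1, "CITY": "CYPRESS", "ZIP_CODE": "90630",
     "START_DT": date(2011, 1, 1), "END_DT": date(2011, 4, 30)},
]

output = []
for row in inputs:
    for ref in dim:
        # keep the dimension row whose date range contains the transaction date
        if (ref["CUSTOMER_ID"] == row["CUSTOMER_ID"]
                and ref["START_DT"] <= row["TRANSACTION_DT"] <= ref["END_DT"]):
            output.append({**row, "CITY": ref["CITY"], "ZIP_CODE": ref["ZIP_CODE"]})

# 2011-03-01 -> CYPRESS 90630, 2010-05-01 -> BUENA PARK 90620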

Configure the Lookup stage as shown below. Double-click the Lnk_input.TRANSACTION_DATE column (this specifies the condition on the input link).

You need to specify Return Multiple Rows on the reference link, otherwise you will get the following warning in the job log. Even though we have two distinct rows based on the customer_id, start_dt, and end_dt columns, DataStage considers them duplicate rows based on the customer_id key only.

stg_Lkp,0: Ignoring duplicate entry; no further warnings will be issued for this table

Compile and Run the job:

Scenario 2: Specify the range on the reference link.

This concludes lookup stage configuration for different scenarios.


Remove Duplicates Stage

April 29, 2011

Remove Duplicates Stage:

In this article I will explain how to use the Remove Duplicates stage to select unique records from the incoming data based on keys.

Input :

CUSTOMER_ID CUSTOMER_NAME CUSTOMER_CITY DATE
1 UMA BUENA PARK 20100101
1 UMA CYPRESS 20110101

If you look at the above input, we have two records with the same customer name.

Business logic: Select the most recent customer record.

We need to remove duplicates from the incoming data based on the CUSTOMER_NAME and CUSTOMER_ID keys and select the most recent record based on the date.

Expected output (most recent record based on DATE):

CUSTOMER_ID CUSTOMER_NAME CUSTOMER_CITY DATE
1 UMA CYPRESS 20110101

To implement the above business logic, we need to use the Remove Duplicates stage in our DataStage job design.

We are using a Sequential File stage to read the input data.

Step 1) Configure the Sequential File stage.

We have defined Date as the data type for the DATE field, and by default DataStage uses %yyyy-%mm-%dd as the date format string. We need to double-click row 4 in the above screen and configure the date format as %yyyy%mm%dd.

Now add the Remove Duplicates stage to the job canvas by dragging it from the palette. Double-click the Remove Duplicates stage and go to the Properties tab on the Stage page.

Select the keys that define the duplicates based on the given business logic. In our case, CUSTOMER_ID and CUSTOMER_NAME are the keys that define duplicates in the data.

We need to sort and partition the data on the duplicate keys to make sure rows with the same keys go to the same partition node. Go to the Partitioning tab on the Input page of the Remove Duplicates stage.

Partition type: Hash

Now we need to sort the data on the DATE column (no need to partition on it) in order to select the single record with the latest date.

Please make sure the order of the partition and sort key columns is the same as the keys that define the duplicate rows.

Now go to the Properties tab on the Stage page and select Duplicate To Retain = Last. We configured DATE sorting in ascending order, which is why Duplicate To Retain = Last is selected.
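Put together, the hash/sort setup and Duplicate To Retain = Last behave like this small conceptual Python sketch (illustrative only):

rows = [
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "CUSTOMER_CITY": "BUENA PARK", "DATE": "20100101"},
    {"CUSTOMER_ID": 1, "CUSTOMER_NAME": "UMA", "CUSTOMER_CITY": "CYPRESS", "DATE": "20110101"},
]

# Sort on the duplicate keys plus DATE ascending, mirroring the partition/sort configuration
rows.sort(key=lambda r: (r["CUSTOMER_ID"], r["CUSTOMER_NAME"], r["DATE"]))

# Duplicate To Retain = Last: the last row seen for each key group survives
latest = {}
for r in rows:
    latest[(r["CUSTOMER_ID"], r["CUSTOMER_NAME"])] = r

print(list(latest.values()))  # keeps the CYPRESS / 20110101 record, matching the expected output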

Now add a Sequential File stage as the output stage to create an output file, and that completes our job design.

Compile and run the job:

Now look at the Director log to see if there are any warnings.

I found the following warning in the job log:

stg_out: When checking operator: A sequential operator cannot preserve the partitioning of the parallel data set on input port 0

To remove this warning from the job log, go to the Advanced tab on the Stage page and clear the Preserve Partitioning flag.

Now compile and run the job.

The job finished successfully without any warnings.

Now I will change the partition and sort order of the columns in the Partitioning tab of the stage:

I ran the job with the above change and got the following warning in the job log:

stg_remDup: When checking operator: User inserted sort “stg_remDup.Lnk_Input_remdup_Sort” does not fulfill the sort requirements of the downstream operator “stg_remDup”

So we need to make sure the order of the partition and sort key columns matches the keys that define the duplicate rows, to avoid the above warning and incorrect results in the output.
