Data Sets
Draft in Progress
This document is a draft and is under development.
What are data sets?
- Data sets contain data to be used in an ETLUnit test. They are like data files, in that they contain data, but there are some differences.
- A data file and a data set each live in a text file in the ETLUnit project.
- A data file has a file extension of "
.delimited
" or ".fixed
." - A data set has a file extension of "
.dataset
." - A data file resides in folder "
data/
," a sibling of the ETLUnit test class. - A data set resides in folder "
dataset/
," also a sibling of the ETLUnit test class. - One data set file (one "data set") can contain multiple data set members.
- One data file stands alone. It cannot have multiple sets of data.
- Each data set member is the equivalent to (and could be replaced by, in the ETLUnit test) a data file.
- Properties that are normally written in a test method operation may instead be written as one of the properties of a data set member.
- ETLUnit will use a data set's properties when a test is run that uses the data set member, unless explicitly told not to by the use of the
ignore-data-set-properties
property.
- ETLUnit will use a data set's properties when a test is run that uses the data set member, unless explicitly told not to by the use of the
- TODO: Ressurrect the links below (next two points) when they become available.
- SEE stage() for Data Sets, on the stage page, for syntax.
- SEE assert() for Data Sets, on the assert page, for syntax.
Location of data set files
- Relative to the ETLUnit test class, data sets are placed in a sibling folder called "
dataset/
." - Contrast this location with that for data files. Data files are placed in a sibling folder called "
data/
."
Dataset Folder Location Sample
In the sample below, the folder "dataset" is a sibling of the ETLUnit tests.
Format of data set files
- Data set files may contain one data set member, or several.
- Data set members are delimited within the text file by certain characters.
- A data set member starts with a properties section, delimited by an open and close curly brace.
- Properties may be written between these curly braces.
- If multiple properties are present, they must be separated by commas.
- Multiple properties may exist on the same line or on separate lines.
- If no properties are present, empty curly braces are acceptable ("
{}
"). - A data set member is not legal if it lacks a properties section.
- The data set's
id
property is the only way to assign an ID to a data set member. An ETLUnit operation will refer to the id with propertydata-set-id
. - All the other properties in a data set's property section should be legal properties of the assert or the stage operation.
- It is illegal to have other properties that are not canonized.
- The data section of a data set member follows the properties section.
- A data set member ends with a data section, delimited by this series of angle brackets: "
<><><><><><>
".- The data itself is placed between two lines that each have a series of six pairs of complementary angle brackets.
- A data set member starts with a properties section, delimited by an open and close curly brace.
- There is no special delimiter, like a comma, between data set members.
- Line endings should be LF only (x'000A').
- The last line in the file should be a data section delimiter, terminated by a LF line ending character.
- TODO: Restore the link for sample code on the next line.
- See the sample code below.
Use of data sets
- Data sets may be used in the stage operation or the assert operation.
Data sets do not work in the assert
operation prior to ETLUnit version 3.9.6.
Usage Example
The data sets
- The first data set,
extract_data.dataset
, is to be used in the stage operation. - The data set member's properties section includes a property, '
target-table
.' - When the ETLUnit test runs, it will use this property.
- The data will be placed in a table named
PLANT_BUCKET_COMP
. - This data is tab delimited.
{ id: 'PLANT_BUCKET_COMP_EDW_TBL', target-table: 'PLANT_BUCKET_COMP' } <><><><><><> /*-- PLANT_ID SALES_DATE PLANT_SK THIS_YEAR_COMP THIS_YEAR_PRIOR_YEAR_COMP THIS_YEAR_WEEKLY_COMP THIS_YEAR_PRIOR_WEEKLY_COMP THIS_YEAR_COMP_WITH_ESTIMATES THIS_YEAR_PRIOR_YEAR_COMP_WITH_ESTIMATES THIS_YEAR_WEEKLY_COMP_WITH_ESTIMATES THIS_YEAR_PRIOR_YEAR_WEEKLY_COMP_WITH_ESTIMATES POPULATION_DATE --*/ /*-- INTEGER TIMESTAMP BIGINT CHAR CHAR CHAR CHAR CHAR CHAR CHAR CHAR TIMESTAMP --*/ 1003 20080202 557701000 Y N Y N Y N Y N 20111122 1003 20090831 557701000 Y N Y N Y N Y N 20111122 1003 20090901 557701000 Y N Y N Y N Y N 20111122 1003 20110501 557701000 Y N Y N Y N Y N 20111122 1003 99990909 557701000 Y N Y N Y N Y N 20111122 <><><><><><>
- The second data set,
assert.dataset
, will be used in the assert operation. - Note that the table headings are commented out, and are optional.
- Their presence or absence does not affect the test at runtime.
- This data is comma delimited.
{ id: 'PLANT_BUCKET_COMP_EDW' } <><><><><><> /*-- PLANT_ID,SALES_DATE,PLANT_SK,THIS_YEAR_COMP,THIS_YEAR_PRIOR_YEAR_COMP,THIS_YEAR_WEEKLY_COMP,THIS_YEAR_PRIOR_WEEKLY_COMP,THIS_YEAR_COMP_WITH_ESTIMATES,THIS_YEAR_PRIOR_YEAR_COMP_WITH_ESTIMATES,THIS_YEAR_WEEKLY_COMP_WITH_ESTIMATES,THIS_YEAR_PRIOR_YEAR_WEEKLY_COMP_WITH_ESTIMATES,POPULATION_DATE --*/ /*-- INTEGER,TIMESTAMP,BIGINT,CHAR,CHAR,CHAR,CHAR,CHAR,CHAR,CHAR,CHAR,TIMESTAMP --*/ 1003,2008-02-02 00:00:00,557701000,Y,N,Y,N,Y,N,Y,N,2011-11-22 00:00:00 1003,2009-08-31 00:00:00,557701000,Y,N,Y,N,Y,N,Y,N,2011-11-22 00:00:00 1003,2009-09-01 00:00:00,557701000,Y,N,Y,N,Y,N,Y,N,2011-11-22 00:00:00 1003,2011-05-01 00:00:00,557701000,Y,N,Y,N,Y,N,Y,N,2011-11-22 00:00:00 1003,9999-09-09 00:00:00,557701000,Y,N,Y,N,Y,N,Y,N,2011-11-22 00:00:00 <><><><><><>
Both of the above data set members could reside in a single data set file. In this example, they are each in their own file.
The ETLUnit test
Note that the stage operation below has no target table. The target table was specified in the first data set, above.
class plant_bucket_comp {
@Database(id: 'edw')
@Test
extractFromEdw(){
// ================================================================================================
// Stage data from a data set.
// To what table I am staging? I will have to look at the data set's properties to find out.
// I will find a target-table property there.
// Alternatively, I could put the target-table property here, in this stage operation.
// If a target-table property exists in both places, the one in the data set member will be used.
// ===============================================================================================
stage(
data-set-name: 'extract_data',
data-set-id: 'PLANT_BUCKET_COMP_EDW_TBL'
);
set(variable: 'population_date', value: '20090901');
execute(
workflow: 'wkf_EXTRACT_PLANT_BUCKET_COMP',
folder: 'EDW_EXTRACTS',
context:
{
edw-connection-src: 'informatica.connections.edw.default'
}
);
// ============================================================================================
// My "expected" data may be found in the file called "assert.dataset" in the "dataset" folder.
// It will reside in the data set member "PLANT_BUCKET_COMP_EDW."
// ============================================================================================
assert(
data-set-name: 'assert',
data-set-id: 'PLANT_BUCKET_COMP_EDW',
source-file: 'PLANT_BUCKET_COMP.txt',
reference-file-type: 'FF_PLANT_BUCKET_COMP'
);
}
}
Rules, Behaviors, Caveats
Data Sets and the assert Operation
- Given only a data set name without an ID, the assert is over each member in the set.
- If a
source-table
property is assigned within the assert operation, the assert operation will compare each member of the data set with the contents of the samesource-table
. - If the
source-table
property is instead assigned in the properties section of each data set member, the operation will compare each member of the data set with the contents of thesource-table
specified there.
- If a
- Given only a data-set name without an ID, if the data set file is not empty and no
source
orsource-table
is provided,ERR_RECURSIVE_ASSERT
occurs. - When asserting over a data set without using an ID, the data set file need have no ID's on any of its members.
In order to assert a source-file against a target data set, you must first use the
register
operation for the source-file. Available only in ETLUnit-3.9.6 and later.Specifying property
ignore-data-set-properties
is useful if you want to use a data set member, without its hard-coded properties.ignore-data-set-properties
does NOT ignore the id.- If the data set has only one member:
- The data set member does not have to have an id for
ignore-data-set-properties
to work. - The assert operation does not have to specify a
data-set-id
.
- The data set member does not have to have an id for
- If the data set has multiple members:
- The assert operation must specify a
data-set-id
. - The data set member whose properties are being ignored must have an
id
matching thedata-set-id
specified in the assert operation.
- The assert operation must specify a
When specifying a
source-table
value in the assert operation different from thesource-table
value in the data set, the value in the data set will prevail.- This applies to any valid property.
Data Sets and the stage Operation
ignore-data-set-properties
requires thatdata-set-id
is present or that data set has only one member.
Practices
- You may stage to multiple tables in one stage operation.
- Include a multi-member data set by
data-set-name
, but without specifying anydata-set-id
. - Every member in the data set will invoke a separate stage operation.
- Each member of the data set should set all the properties for its stage operation.
- Include a multi-member data set by
- You may run multiple asserts in one assert operation.
- Include a multi-member data set by
data-set-name
, but without specifying anydata-set-id
. - Every member in the data set will invoke a separate assert operation.
- Each member of the data set should set all the properties for its assert operation.
- Include a multi-member data set by
- You should not write a meaningless property in the properties section of the data set.
- Select from valid properties:
- '
id
' - Properties valid for the stage operation
- Properties valid for the assert operation
- '
- Select from valid properties:
Caveats
- @OperationDefault annotation does not currently work with data set properties in the stage operation.