ReferenceFileTypes

Concept

Etlunit uses a concept of reference file types which are core to how it handles data in almost every context. Some examples of when reference file types are used:

The database stage operation - putting data into a database table for testing.
The database assertion operation - reading data from a database table and comparing to an expected data set for comparison.
The file assertion operation - taking a file written by a process and comparing to an expected data set for comparison.

When discussing ETL processes - especially testing - the data inputs for the test and the assertions performed on the result of the data transformation are the key features of a testing platform. In etlunit, we take that seriously and the discussion that follows should help to fully understand the process so that your tests are well understood and stable - keys to agile unit testing and continuous integration success.

Determining effective reference file type

Whenever a reference file type is required, etlunit uses the following algorithm to determine which one to use.

If a reference-file-type is specified, that is used.
If the operation is operating on a named thing, such as a database table or assertion file, and a reference-file-type exists that has a matching name, then that is used.
If the thing being acted upon has a generic name, then a reference file type matching that name is used. E.G., a database table named Table, in the schema edw, and connection id db, will first match a reference file type named Table.fml, and secondly db-edw-Table.fml.
If there are two data sets involved, and the other data set has been identified using this same process, the same file type will be used for this one.
At this point no file type has been found and may cause an error.

Locating Reference File types

Reference file types are located in the src/main/reference/file/fml/name.fml folder in the project. Any named file type will be located in this folder first. The next step is to search the classpath for reference/file/fml/name.fml.

The Reference File Type

Reference file types are stored as json objects in files named with '.fml' extensions. This is a sample reference file type:

{
  "flat-file" : {
    "format-type" : "delimited",
    "row-delimiter" : "\n",
    "column-delimiter" : "\t",
    "null-token" : null,
    "columns" : [ {
      "id" : "SEQ_",
      "type" : "INTEGER",
      "length" : -1,
      "basic-type" : "integer"
    }, {
      "id" : "FLOAT_",
      "type" : "DOUBLE",
      "length" : -1,
      "basic-type" : "numeric"
    }, {
      "id" : "DECIMAL_",
      "type" : "DECIMAL",
      "length" : -1,
      "basic-type" : "numeric"
    }, {
      "id" : "DOUBLE_",
      "type" : "DOUBLE",
      "length" : -1,
      "basic-type" : "numeric"
    }, {
      "id" : "REAL_",
      "type" : "DOUBLE",
      "length" : -1,
      "basic-type" : "numeric",
	  "default-value": 1
    }, {
      "id" : "NUMERIC_",
      "type" : "NUMERIC",
      "length" : -1,
      "basic-type" : "numeric"
	  "read-only": true
    } ],
    "primaryKey" : [ "SEQ_" ],
    "orderBy" : [ "SEQ_", "FLOAT_", "DECIMAL_", "DOUBLE_", "REAL_", "NUMERIC_" ]
  }
}

The basic components are the file type, fixed or delimited, the delimiters, and the columns that make up the data set.

attribute	values	meaning
format-type	delimited, fixed	How columns are demarcated - either by length as in fixed or a delimiter
row-delimiter		How records are separated.
column-delimiter		How columns are delimited. If specified on a fixed length file, it is an error
null-token		The literal token to use to show a null column value. If not specified, the word 'null' is used.
primaryKey		A list of column names which make up the primary key for the table. This uniqueness is generally enforced, even for files, so it should be correct if specified.
orderBy		A list of columns to order the data set by. Ordering is critical when comparing two different data sources. The order by tells the comparator when two records match so that it can report that two rows match with one column difference, rather than reporting two non-matching columns. For a tutorial on this refer to this page.
columns		A list of the columns in the data set. Must not be missing or empty.
columns.id		The unique identifier for this column.
columns.type		The JDBC type for this column. See the Jdbc Types class documentation for a list of values.
columns.length		The length of this column. Required for fixed length files, helpful in delimited files for enforcing correctness.
columns.basic-type	string, integer, numeric	Used for ordering the data set when be handled as a file.
columns.read-only		An attribute which identifies the column as not writable. Primarily used for staging data into tables which don't accept values.
columns.format		A string identifying the format for this data point. Useful when a file contains a date, e.g., as a packed integer but it is being staged into a table which stores the date as a datetime.
columns.default-value		A value that can be used when none is present. This is distinct from null. If a column is explicitly set to null, null is used - if the data point isn't present at all, then the default value may be used, depending on the context.