
API Reference

Top-level package for ETLrules.

backends special

common special

aggregate

AggregateRule (UnaryOpBaseRule)

Performs a SQL-like groupby and aggregation.

It takes a list of columns to group by and the result will have one row for each unique combination of values in the group_by columns.
The rest of the columns (not in the group_by) can be aggregated using either pre-defined aggregations or using custom python expressions.

Parameters:

Name Type Description Default
group_by Iterable[str]

A list of columns to group the result by

required
aggregations Optional[Mapping[str, str]]

A mapping {column_name: aggregation_function} which specifies how to aggregate columns which are not in the group_by list.
The following aggregation functions are supported::

min: minimum of the values in the group
max: maximum of the values in the group
mean: The mathematical mean value in the group
count: How many values are in the group, including NA
countNoNA: How many values are in the group, excluding NA
sum: The sum of the values in the group
first: The first value in the group
last: The last value in the group
list: Produces a python list with all the values in the group, excluding NA
tuple: Like list above but produces a tuple
csv: Produces a comma-separated string of values, excluding NA
None
aggregation_expressions Optional[Mapping[str, str]]

A mapping {column_name: aggregation_expression} which specifies how to aggregate columns which are not in the group_by list.
The aggregation expression is a string representing a valid Python expression which gets evaluated.
The input will be in a variable values. isnull can be used to filter out NA.

Example::

{"C": "';'.join(str(v) for v in values if not isnull(v))"}

The above aggregates the column C by producing a ; separated string of values in the group, excluding NA.

The dask backend doesn't support aggregation_expressions.

None
aggregation_types Optional[Mapping[str, str]]

An optional mapping of {column_name: column_type} which converts the respective output column to the given type. The supported types are: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64, string, boolean, datetime and timedelta.

None
named_input Optional[str]

Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised if a column appears in multiple places in group_by/aggregations/aggregation_expressions.

ExpressionSyntaxError

raised if any aggregation expression (if any are passed in) has a Python syntax error.

MissingColumnError

raised in strict mode only if a column specified in aggregations or aggregation_expressions is missing from the input dataframe. If aggregation_types are specified, it is raised in strict mode if a column in the aggregation_types is missing from the input dataframe.

UnsupportedTypeError

raised if a type specified in aggregation_types is not supported.

ValueError

raised if a column in aggregations is aggregated using an unknown aggregation function

TypeError

raised if an operation is not supported between the types involved

NameError

raised if an unknown variable is used

Note

Other Python exceptions can be raised when custom aggregation expressions are used, depending on what the expression is doing.

Note

Any columns not in the group_by list and not present in either aggregations or aggregation_expressions will be dropped from the result.
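
A minimal usage sketch, assuming the pandas backend (rules must be imported from a concrete backend such as etlrules.backends.pandas, never from common) and a RuleData instance named data holding the "orders" dataframe::

    from etlrules.backends.pandas import AggregateRule

    rule = AggregateRule(
        group_by=["Customer"],
        # pre-defined aggregation functions for columns outside the group_by
        aggregations={"Amount": "sum", "Order": "countNoNA"},
        # a custom expression: a ';'-separated string of the non-NA values
        aggregation_expressions={
            "Items": "';'.join(str(v) for v in values if not isnull(v))"
        },
        # convert the aggregated Amount column to float64 in the output
        aggregation_types={"Amount": "float64"},
        named_input="orders",
        named_output="order_totals",
    )
    rule.apply(data)  # data is a RuleData instance holding the "orders" dataframe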

Source code in etlrules/backends/common/aggregate.py
class AggregateRule(UnaryOpBaseRule):
    """Performs a SQL-like groupby and aggregation.

    It takes a list of columns to group by and the result will have one row for each unique combination
    of values in the group_by columns.  
    The rest of the columns (not in the group_by) can be aggregated using either pre-defined aggregations
    or using custom python expressions.

    Args:
        group_by: A list of columns to group the result by
        aggregations: A mapping {column_name: aggregation_function} which specifies how to aggregate
            columns which are not in the group_by list.  
            The following aggregation functions are supported::

                min: minimum of the values in the group
                max: maximum of the values in the group
                mean: The mathematical mean value in the group
                count: How many values are in the group, including NA
                countNoNA: How many values are in the group, excluding NA
                sum: The sum of the values in the group
                first: The first value in the group
                last: The last value in the group
                list: Produces a python list with all the values in the group, excluding NA
                tuple: Like list above but produces a tuple
                csv: Produces a comma-separated string of values, excluding NA

        aggregation_expressions: A mapping {column_name: aggregation_expression} which specifies how to aggregate
            columns which are not in the group_by list.  
            The aggregation expression is a string representing a valid Python expression which gets evaluated.  
            The input will be in a variable `values`. `isnull` can be used to filter out NA.

            Example::

                {"C": "';'.join(str(v) for v in values if not isnull(v))"}

                The above aggregates the column C by producing a ; separated string of values in the group, excluding NA.

            The dask backend doesn't support aggregation_expressions.

        aggregation_types: An optional mapping of {column_name: column_type} which converts the respective output
            column to the given type. The supported types are: int8, int16, int32, int64, uint8, uint16,
            uint32, uint64, float32, float64, string, boolean, datetime and timedelta.

        named_input: Which dataframe to use as the input. Optional.  
            When not set, the input is taken from the main output.  
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.  
            When not set, the result of this rule will be available as the main output.  
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.  
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.  
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised if a column appears in multiple places in group_by/aggregations/aggregation_expressions.
        ExpressionSyntaxError: raised if any aggregation expression (if any are passed in) has a Python syntax error.
        MissingColumnError: raised in strict mode only if a column specified in aggregations or aggregation_expressions
            is missing from the input dataframe. If aggregation_types are specified, it is raised in strict mode if a column
            in the aggregation_types is missing from the input dataframe.
        UnsupportedTypeError: raised if a type specified in aggregation_types is not supported.
        ValueError: raised if a column in aggregations is aggregated using an unknown aggregation function
        TypeError: raised if an operation is not supported between the types involved
        NameError: raised if an unknown variable is used

    Note:
        Other Python exceptions can be raised when custom aggregation expressions are used, depending on what the expression is doing.

    Note:
        Any columns not in the group_by list and not present in either aggregations or aggregation_expressions will be dropped from the result.
    """

    AGGREGATIONS = {}

    EXCLUDE_FROM_COMPARE = ("_aggs",)

    def __init__(
        self,
        group_by: Iterable[str],
        aggregations: Optional[Mapping[str, str]] = None,
        aggregation_expressions: Optional[Mapping[str, str]] = None,
        aggregation_types: Optional[Mapping[str, str]] = None,
        named_input: Optional[str] = None,
        named_output: Optional[str] = None,
        name: Optional[str] = None,
        description: Optional[str] = None,
        strict: bool = True,
    ):
        super().__init__(
            named_input=named_input, named_output=named_output, name=name,
            description=description, strict=strict
        )
        self.group_by = [col for col in group_by]
        assert aggregations or aggregation_expressions, "One of aggregations or aggregation_expressions must be specified."
        if aggregations is not None:
            self.aggregations = {}
            for col, agg_func in aggregations.items():
                if col in self.group_by:
                    raise ColumnAlreadyExistsError(f"Column {col} appears in group_by and cannot be aggregated.")
                if agg_func not in self.AGGREGATIONS:
                    raise ValueError(f"'{agg_func}' is not a supported aggregation function.")
                self.aggregations[col] = agg_func
        else:
            self.aggregations = None
        self._aggs = {}
        if self.aggregations:
            # warn once per non-vectorized aggregation function (previously this
            # checked a leftover loop variable instead of each function)
            for agg_func in self.aggregations.values():
                if agg_func in ('list', 'csv'):
                    perf_logger.warning("Aggregation '%s' in AggregateRule is not vectorized and might hurt the overall performance", agg_func)
            self._aggs.update({
                key: self.AGGREGATIONS[agg_func]
                for key, agg_func in self.aggregations.items()
            })
        if aggregation_expressions is not None:
            self.aggregation_expressions = {}
            for col, agg_expr in aggregation_expressions.items():
                if col in self.group_by:
                    raise ColumnAlreadyExistsError(f"Column {col} appears in group_by and cannot be aggregated.")
                if col in self._aggs:
                    raise ColumnAlreadyExistsError(f"Column {col} is already being aggregated.")
                try:
                    perf_logger.warning("Aggregation expression '%s' in AggregateRule is not vectorized and might hurt the overall performance", agg_expr)
                    _ast_expr = ast.parse(agg_expr, filename=f"{col}_expression.py", mode="eval")
                    _compiled_expr = compile(_ast_expr, filename=f"{col}_expression.py", mode="eval")
                    self._aggs[col] = lambda values, bound_compiled_expr=_compiled_expr: eval(
                        bound_compiled_expr, {"isnull": isnull}, {"values": values}
                    )
                except SyntaxError as exc:
                    raise ExpressionSyntaxError(f"Error in aggregation expression for column '{col}': '{agg_expr}': {str(exc)}")
                self.aggregation_expressions[col] = agg_expr

        if aggregation_types is not None:
            self.aggregation_types = {}
            for col, col_type in aggregation_types.items():
                if col not in self._aggs and col not in self.group_by:
                    if self.strict:
                        raise MissingColumnError(f"Column {col} is neither in the group by columns nor in the aggregations.")
                    else:
                        continue
                if col_type not in SUPPORTED_TYPES:
                    raise UnsupportedTypeError(f"Unsupported type '{col_type}' for column '{col}'.")
                self.aggregation_types[col] = col_type
        else:
            self.aggregation_types = None

    def do_aggregate(self, df, aggs):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        df_columns_set = set(df.columns)
        if not set(self._aggs) <= df_columns_set:
            if self.strict:
                raise MissingColumnError(f"Missing columns to aggregate: {set(self._aggs) - df_columns_set}.")
            aggs = {
                col: agg for col, agg in self._aggs.items() if col in df_columns_set
            }
        else:
            aggs = self._aggs
        df = self.do_aggregate(df, aggs)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
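
A sketch of the override pattern (MyProjectRule is hypothetical; _get_input_df and _set_output_df are the helpers visible in the source listings in this reference)::

    class MyProjectRule(UnaryOpBaseRule):
        def apply(self, data):
            super().apply(data)            # asserts data is a RuleData instance
            df = self._get_input_df(data)  # main output or the named_input dataframe
            df = df[["A", "B"]]            # the rule's own transformation logic
            self._set_output_df(data, df)  # publish as main output or named_output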

Source code in etlrules/backends/common/aggregate.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    df_columns_set = set(df.columns)
    if not set(self._aggs) <= df_columns_set:
        if self.strict:
            raise MissingColumnError(f"Missing columns to aggregate: {set(self._aggs) - df_columns_set}.")
        aggs = {
            col: agg for col, agg in self._aggs.items() if col in df_columns_set
        }
    else:
        aggs = self._aggs
    df = self.do_aggregate(df, aggs)
    self._set_output_df(data, df)

base

BaseAssignColumnRule (UnaryOpBaseRule, ColumnsInOutMixin)
Source code in etlrules/backends/common/base.py
class BaseAssignColumnRule(UnaryOpBaseRule, ColumnsInOutMixin):

    def __init__(self, input_column: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert input_column and isinstance(input_column, str), "input_column must be a non-empty string."
        assert output_column is None or (output_column and isinstance(output_column, str)), "output_column must be None or a non-empty string."
        self.input_column = input_column
        self.output_column = output_column

    def do_apply(self, df, col):
        raise NotImplementedError()

    def apply(self, data: RuleData):
        df = self._get_input_df(data)
        input_column, output_column = self.validate_in_out_columns(df.columns, self.input_column, self.output_column, self.strict)
        df = self.assign_do_apply(df, input_column, output_column)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/base.py
def apply(self, data: RuleData):
    df = self._get_input_df(data)
    input_column, output_column = self.validate_in_out_columns(df.columns, self.input_column, self.output_column, self.strict)
    df = self.assign_do_apply(df, input_column, output_column)
    self._set_output_df(data, df)
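
BaseAssignColumnRule has no rendered docstring, but the source above shows the contract: subclasses validate an input/output column pair and implement a per-column transformation. A hypothetical subclass sketch (how do_apply's result reaches output_column is backend-specific, via the assign_do_apply hook seen in apply above)::

    class UpperCaseColumnRule(BaseAssignColumnRule):
        def do_apply(self, df, col):
            # col is the input column; in a pandas-like backend this returns
            # the upper-cased column to be assigned to the output column
            return col.str.upper()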

basic

DedupeRule (UnaryOpBaseRule)

De-duplicates the input dataframe by dropping duplicate rows, using a set of columns to identify the duplicates.

It can keep the first or the last row within each set of duplicates, or drop them all.

Parameters:

Name Type Description Default
columns Iterable[str]

A subset of columns in the data frame which are used to determine the set of duplicates. Any rows that have the same values in these columns are considered to be duplicates.

required
keep Literal['first', 'last', 'none']

What to keep in the de-duplication process. One of:
first: keeps the first row in the duplicate set
last: keeps the last row in the duplicate set
none: drops all the duplicates

'first'
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised when a column specified to deduplicate on doesn't exist in the input data frame.

Note

MissingColumnError is raised in both strict and non-strict modes. This is because the rule cannot operate reliably without a correct set of columns.
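
A minimal usage sketch (the backend import path is an assumption; data is a RuleData instance)::

    from etlrules.backends.pandas import DedupeRule

    # keep only the last row for each (FirstName, LastName) combination
    DedupeRule(columns=["FirstName", "LastName"], keep="last").apply(data)

    # drop every row that belongs to a set of duplicates
    DedupeRule(columns=["FirstName", "LastName"], keep="none").apply(data)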

Source code in etlrules/backends/common/basic.py
class DedupeRule(UnaryOpBaseRule):
    """ De-duplicates by dropping duplicates using a set of columns to determine the duplicates.

    It has logic to keep the first, last or none of the duplicate in a set of duplicates.

    Args:
        columns: A subset of columns in the data frame which are used to determine the set of duplicates.
            Any rows that have the same values in these columns are considered to be duplicates.
        keep: What to keep in the de-duplication process. One of:
            first: keeps the first row in the duplicate set
            last: keeps the last row in the duplicate set
            none: drops all the duplicates

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised when a column specified to deduplicate on doesn't exist in the input data frame.

    Note:
        MissingColumnError is raised in both strict and non-strict modes. This is because the rule cannot operate reliably without a correct set of columns.
    """

    KEEP_FIRST = 'first'
    KEEP_LAST = 'last'
    KEEP_NONE = 'none'

    ALL_KEEPS = (KEEP_FIRST, KEEP_LAST, KEEP_NONE)

    def __init__(self, columns: Iterable[str], keep: Literal[KEEP_FIRST, KEEP_LAST, KEEP_NONE]=KEEP_FIRST, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.columns = [col for col in columns]
        assert all(
            isinstance(col, str) for col in self.columns
        ), "DedupeRule: columns must be strings"
        assert keep in self.ALL_KEEPS, f"DedupeRule: keep must be one of: {self.ALL_KEEPS}"
        self.keep = keep

    def do_dedupe(self, df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        if not set(self.columns) <= set(df.columns):
            raise MissingColumnError(f"Missing column(s) to dedupe on: {set(self.columns) - set(df.columns)}")
        df = self.do_dedupe(df)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    if not set(self.columns) <= set(df.columns):
        raise MissingColumnError(f"Missing column(s) to dedupe on: {set(self.columns) - set(df.columns)}")
    df = self.do_dedupe(df)
    self._set_output_df(data, df)
ExplodeValuesRule (UnaryOpBaseRule, ColumnsInOutMixin)

Explodes a list of values into multiple rows, with each value on a separate row.

Example::

Given df:
| A         |
|-----------|
| [1, 2, 3] |
| [4, 5]    |
| [6]       |

ExplodeValuesRule("A").apply(df)

Result::

| A   |
|-----|
| 1   |
| 2   |
| 3   |
| 4   |
| 5   |
| 6   |

Parameters:

Name Type Description Default
input_column str

A column containing lists of values to be exploded, one row per value.

required
column_type Optional[str]

An optional string with the type of the resulting exploded column. When not specified, the column_type is backend implementation specific.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input column doesn't exist in the input dataframe.
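
A minimal usage sketch (the backend import path is an assumption; data is a RuleData instance)::

    from etlrules.backends.pandas import ExplodeValuesRule

    # explode the list-valued column A and coerce the exploded values to int64
    rule = ExplodeValuesRule("A", column_type="int64",
                             named_input="input_data", named_output="exploded")
    rule.apply(data)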

Source code in etlrules/backends/common/basic.py
class ExplodeValuesRule(UnaryOpBaseRule, ColumnsInOutMixin):
    """ Explode a list of values into multiple rows with each value on a separate row

    Example::

        Given df:
        | A         |
        |-----------|
        | [1, 2, 3] |
        | [4, 5]    |
        | [6]       |

    > ExplodeValuesRule("A").apply(df)

    Result::

        | A   |
        |-----|
        | 1   |
        | 2   |
        | 3   |
        | 4   |
        | 5   |
        | 6   |

    Args:
        input_column: A column containing lists of values to be exploded, one row per value.
        column_type: An optional string with the type of the resulting exploded column. When not specified, the
            column_type is backend implementation specific.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input column doesn't exist in the input dataframe.
    """

    def __init__(self, input_column: str, column_type: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.input_column = input_column
        self.column_type = column_type
        if self.column_type is not None and self.column_type not in SUPPORTED_TYPES:
            raise UnsupportedTypeError(f"Type '{self.column_type}' is not supported.")

    def _validate_input_column(self, df):
        if self.input_column not in df.columns:
            raise MissingColumnError(f"Column '{self.input_column}' is not present in the input dataframe.")
ProjectRule (UnaryOpBaseRule)

Reshapes the data frame to keep, eliminate or re-order the set of columns.

Parameters:

Name Type Description Default
columns Iterable[str]

The list of columns to keep or eliminate from the data frame. The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns.

required
exclude bool

When set to True, the columns in the columns arg will be excluded from the data frame. Boolean. Default: False. In strict mode, if any column specified in the columns arg doesn't exist in the input data frame, a MissingColumnError exception is raised. In non-strict mode, the missing columns are ignored.

False
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised in strict mode only, if any columns are missing from the input data frame.
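
A minimal usage sketch (the backend import path is an assumption; data is a RuleData instance)::

    from etlrules.backends.pandas import ProjectRule

    # keep only columns A and C, in this order (also re-orders the columns)
    ProjectRule(columns=["A", "C"]).apply(data)

    # keep every column except B
    ProjectRule(columns=["B"], exclude=True).apply(data)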

Source code in etlrules/backends/common/basic.py
class ProjectRule(UnaryOpBaseRule):
    """ Reshapes the data frame to keep, eliminate or re-order the set of columns.

    Args:
        columns (Iterable[str]): The list of columns to keep or eliminate from the data frame.
            The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns.
        exclude (bool): When set to True, the columns in the columns arg will be excluded from the data frame. Boolean. Default: False
            In strict mode, if any column specified in the columns arg doesn't exist in the input data frame, a MissingColumnError exception is raised.
            In non strict mode, the missing columns are ignored.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised in strict mode only, if any columns are missing from the input data frame.
    """

    def __init__(self, columns: Iterable[str], exclude=False, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.columns = [col for col in columns]
        assert all(
            isinstance(col, str) for col in self.columns
        ), "ProjectRule: columns must be strings"
        self.exclude = exclude

    def _get_remaining_columns(self, df_column_names):
        columns_set = set(self.columns)
        df_column_names_set = set(df_column_names)
        if self.strict:
            if not columns_set <= df_column_names_set:
                raise MissingColumnError(f"No such columns: {columns_set - df_column_names_set}. Available columns: {df_column_names_set}.")
        if self.exclude:
            remaining_columns = [
                col for col in df_column_names if col not in columns_set
            ]
        else:
            remaining_columns = [
                col for col in self.columns if col in df_column_names_set
            ]
        return remaining_columns

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        remaining_columns = self._get_remaining_columns(df.columns)
        df = df[remaining_columns]
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    remaining_columns = self._get_remaining_columns(df.columns)
    df = df[remaining_columns]
    self._set_output_df(data, df)
RenameRule (UnaryOpBaseRule)

Renames a set of columns in the data frame.

Parameters:

Name Type Description Default
mapper Mapping[str, str]

A dictionary of old names (keys) and new names (values) to be used for the rename operation. The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised in strict mode only, if any columns (keys) are missing from the input data frame.
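
A minimal usage sketch (the backend import path is an assumption; data is a RuleData instance)::

    from etlrules.backends.pandas import RenameRule

    # rename column A to AA and column B to BB
    RenameRule(mapper={"A": "AA", "B": "BB"}).apply(data)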

Source code in etlrules/backends/common/basic.py
class RenameRule(UnaryOpBaseRule):
    """ Renames a set of columns in the data frame.

    Args:
        mapper: A dictionary of old names (keys) and new names (values) to be used for the rename operation.
            The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised in strict mode only, if any columns (keys) are missing from the input data frame.
    """

    def __init__(self, mapper: Mapping[str, str], named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        assert isinstance(mapper, dict), "mapper needs to be a dict {old_name:new_name}"
        assert all(isinstance(key, str) and isinstance(val, str) for key, val in mapper.items()), "mapper needs to be a dict {old_name:new_name} where the names are str"
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.mapper = mapper

    def do_rename(self, df, mapper):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        mapper = self.mapper
        df_columns = set(df.columns)
        if not set(self.mapper.keys()) <= df_columns:
            if self.strict:
                raise MissingColumnError(f"Missing columns to rename: {set(self.mapper.keys()) - df_columns}")
            else:
                mapper = {k: v for k, v in self.mapper.items() if k in df_columns}
        df = self.do_rename(df, mapper)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    mapper = self.mapper
    df_columns = set(df.columns)
    if not set(self.mapper.keys()) <= df_columns:
        if self.strict:
            raise MissingColumnError(f"Missing columns to rename: {set(self.mapper.keys()) - df_columns}")
        else:
            mapper = {k: v for k, v in self.mapper.items() if k in df_columns}
    df = self.do_rename(df, mapper)
    self._set_output_df(data, df)
ReplaceRule (BaseAssignColumnRule)

Replaces a set of values (or regular expressions) with another set of values (or regular expressions).

Basic usage::

# replaces A with new_A and b with new_b in col_A
rule = ReplaceRule("col_A", values=["A", "b"], new_values=["new_A", "new_b"])
rule.apply(data)

# replaces 1 with 3 and 2 with 4 in the col_I column
rule = ReplaceRule("col_I", values=[1, 2], new_values=[3, 4])
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A column with the input values.

required
values Iterable[Union[int, float, str]]

A sequence of values to replace. Regular expressions can be used to match values more widely, in which case, the regex parameter must be set to True. Values can be any supported types but they should match the type of the columns.

required
new_values Iterable[Union[int, float, str]]

A sequence of the same length as values. Each value within new_values will replace the corresponding value in values (at the same index). New values can be any supported types but they should match the type of the columns.

required
regex

True if all the values and new_values are to be interpreted as regular expressions. Default: False. regex=True is only applicable to string columns.

False
output_column Optional[str]

An optional column to hold the result with the new values. If provided, the input column is left unchanged and the result is stored in the new output column. If not provided, the result is updated in place (the input column is overwritten).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, overwriting existing columns is ignored.
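
A regex usage sketch (the backend import path is an assumption, and backreference support in new_values follows the underlying backend's regex replace semantics)::

    from etlrules.backends.pandas import ReplaceRule

    # strip a "Mr. "/"Ms. " prefix from names, writing the result to a new column
    rule = ReplaceRule(
        "name",
        values=[r"^(?:Mr|Ms)\.\s+(.*)$"],
        new_values=[r"\1"],
        regex=True,
        output_column="name_clean",
    )
    rule.apply(data)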

Source code in etlrules/backends/common/basic.py
class ReplaceRule(BaseAssignColumnRule):
    """ Replaces some some values (or regular expressions) with another set of values (or regular expressions).

    Basic usage::

        # replaces A with new_A and b with new_b in col_A
        rule = ReplaceRule("col_A", values=["A", "b"], new_values=["new_A", "new_b"])
        rule.apply(data)

        # replaces 1 with 3 and 2 with 4 in the col_I column
        rule = ReplaceRule("col_I", values=[1, 2], new_values=[3, 4])
        rule.apply(data)

    Args:
        input_column (str): A column with the input values.
        values: A sequence of values to replace. Regular expressions can be used to match values more widely,
            in which case, the regex parameter must be set to True.
            Values can be any supported types but they should match the type of the columns.
        new_values: A sequence of the same length as values. Each value within new_values will replace the
            corresponding value in values (at the same index).
            New values can be any supported types but they should match the type of the columns.
        regex: True if all the values and new_values are to be interpreted as regular expressions. Default: False.
            regex=True is only applicable to string columns.
        output_column (Optional[str]): An optional column to hold the result with the new values.
            If provided, the input column is left unchanged and the result is stored in the new output column.
            If not provided, the result is updated in place (the input column is overwritten).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    def __init__(self, input_column: str, values: Iterable[Union[int,float,str]], new_values: Iterable[Union[int,float,str]], regex=False, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        self.values = [val for val in values]
        self.new_values = [val for val in new_values]
        assert len(self.values) == len(self.new_values), "values and new_values must be of the same length."
        assert self.values, "values must not be empty."
        self.regex = regex
        if self.regex:
            assert all(isinstance(val, str) for val in self.values)
            assert all(isinstance(val, str) for val in self.new_values)

    def do_apply(self, df, col):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
RulesBlock (UnaryOpBaseRule)

Groups rules into encapsulated blocks or units of rules that achieve one thing. Blocks are reusable and encapsulated to reduce complexity.

Parameters:

Name Type Description Default
rules Iterable[etlrules.rule.BaseRule]

An iterable of rules which are part of this block. The first rule in the block will take its input from the named_input of the RulesBlock (if any, if not from the main output of the previous rule). The last rule in the block will publish the output as the named_output of the RulesBlock (if any, or the main output of the block). Any named outputs in the block are not exposed to the rules outside of the block (proper encapsulation).

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True
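
A minimal usage sketch (the backend import path is an assumption; data is a RuleData instance)::

    from etlrules.backends.pandas import DedupeRule, ProjectRule, RulesBlock

    # encapsulate a dedupe-then-project unit; intermediate outputs stay private
    block = RulesBlock(
        rules=[
            DedupeRule(columns=["Id"], keep="first"),
            ProjectRule(columns=["Id", "Name"]),
        ],
        named_input="customers",
        named_output="clean_customers",
    )
    block.apply(data)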
Source code in etlrules/backends/common/basic.py
class RulesBlock(UnaryOpBaseRule):
    """ Groups rules into encapsulated blocks or units of rules that achieve one thing.
    Blocks are reusable and encapsulated to reduce complexity.

    Args:
        rules: An iterable of rules which are part of this block.
            The first rule in the block will take its input from the named_input of the RulesBlock (if any, if not from the main output of the previous rule).
            The last rule in the block will publish the output as the named_output of the RulesBlock (if any, or the main output of the block).
            Any named outputs in the block are not exposed to the rules outside of the block (proper encapsulation).

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True
    """

    def __init__(self, rules: Iterable[BaseRule], named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        self._rules = [rule for rule in rules]
        assert self._rules, "RulesBlock: Empty rules set provided."
        assert all(isinstance(rule, BaseRule) for rule in self._rules), [rule for rule in self._rules if not isinstance(rule, BaseRule)]
        assert self._rules[0].named_input is None, "First rule in a RulesBlock must consume the main input/output"
        assert self._rules[-1].named_output is None, "Last rule in a RulesBlock must produce the main output"
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)

    def apply(self, data):
        super().apply(data)
        data2 = RuleData(
            main_input=self._get_input_df(data),
            named_inputs={k: v for k, v in data.get_named_outputs()},
            strict=self.strict
        )
        for rule in self._rules:
            rule.apply(data2)
        self._set_output_df(data, data2.get_main_output())

    def to_dict(self) -> dict:
        dct = super().to_dict()
        dct[self.__class__.__name__]["rules"] = [rule.to_dict() for rule in self._rules]
        return dct

    @classmethod
    def from_dict(cls, dct, backend, additional_packages: Optional[Sequence[str]]=None) -> 'RulesBlock':
        dct = dct["RulesBlock"]
        rules = [BaseRule.from_dict(rule, backend, additional_packages) for rule in dct.get("rules", ())]
        kwargs = {k: v for k, v in dct.items() if k != "rules"}
        return cls(rules=rules, **kwargs)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    data2 = RuleData(
        main_input=self._get_input_df(data),
        named_inputs={k: v for k, v in data.get_named_outputs()},
        strict=self.strict
    )
    for rule in self._rules:
        rule.apply(data2)
    self._set_output_df(data, data2.get_main_output())
from_dict(dct, backend, additional_packages=None) classmethod

Creates a rule instance from a python dictionary.

Parameters:

Name Type Description Default
dct

A dictionary to create the rule from

required
backend

One of the supported backends (e.g. pandas)

required
additional_packages Optional[Sequence[str]]

Optional list of other packages to look for rules in

None

Returns:

Type Description
RulesBlock

A new instance of a RulesBlock.

Source code in etlrules/backends/common/basic.py
@classmethod
def from_dict(cls, dct, backend, additional_packages: Optional[Sequence[str]]=None) -> 'RulesBlock':
    dct = dct["RulesBlock"]
    rules = [BaseRule.from_dict(rule, backend, additional_packages) for rule in dct.get("rules", ())]
    kwargs = {k: v for k, v in dct.items() if k != "rules"}
    return cls(rules=rules, **kwargs)
to_dict(self)

Serializes this rule to a python dictionary.

This is a generic implementation that should work for all derived classes and therefore you shouldn't need to override, although you can do so.

Because it aims to be generic and work correctly for all the derived classes, a few assumptions are made and must be respected when you implement your own rules derived from BaseRule.

The class will serialize all the data attributes of a class which do not start with underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member of the class. As such, to exclude any of your internal data attributes, either name them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.

The serialization will look into a class's __dict__ and therefore the class must have a __dict__.

For the de-serialization to work generically, the names of the attributes must match the names of the arguments in __init__. This is an important and restrictive constraint, needed to avoid forcing every rule to implement its own serialize/deserialize.

Note

Use the same name for attributes on self as the respective arguments in __init__.
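
A round-trip sketch based on the to_dict/from_dict contract above (the "pandas" backend name follows the from_dict parameter docs; block is a RulesBlock instance)::

    dct = block.to_dict()  # serializes the RulesBlock and its nested rules
    same_block = RulesBlock.from_dict(dct, backend="pandas")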

Source code in etlrules/backends/common/basic.py
def to_dict(self) -> dict:
    dct = super().to_dict()
    dct[self.__class__.__name__]["rules"] = [rule.to_dict() for rule in self._rules]
    return dct
SortRule (UnaryOpBaseRule)

Sort the input dataframe by the given columns, either ascending or descending.

Parameters:

Name Type Description Default
sort_by Iterable[str]

Either a single column specified as a string, or a list or tuple of columns to sort by

required
ascending Union[bool, Iterable[bool]]

Whether to sort ascending or descending. Either a single boolean applied to all sort_by columns or a list of booleans, one per sort_by column. Default: True

True
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised when a column in the sort_by doesn't exist in the input dataframe.

Note

When multiple columns are specified, the first column decides the sort order. For any rows that have the same value in the first column, the second column is used to decide the sort order within that group and so on.
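
A minimal usage sketch (the backend import path is an assumption; data is a RuleData instance)::

    from etlrules.backends.pandas import SortRule

    # sort by A ascending, breaking ties with B descending
    SortRule(sort_by=["A", "B"], ascending=[True, False]).apply(data)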

Source code in etlrules/backends/common/basic.py
class SortRule(UnaryOpBaseRule):
    """ Sort the input dataframe by the given columns, either ascending or descending.

    Args:
        sort_by: Either a single column specified as a string, or a list or tuple of columns to sort by
        ascending: Whether to sort ascending or descending. Either a single boolean applied to all sort_by columns or a list of booleans, one per sort_by column. Default: True

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised when a column in the sort_by doesn't exist in the input dataframe.

    Note:
        When multiple columns are specified, the first column decides the sort order.
        For any rows that have the same value in the first column, the second column is used to decide the sort order within that group and so on.
    """

    def __init__(self, sort_by: Iterable[str], ascending: Union[bool,Iterable[bool]]=True, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        if isinstance(sort_by, str):
            self.sort_by = [sort_by]
        else:
            self.sort_by = [s for s in sort_by]
        assert isinstance(ascending, bool) or (isinstance(ascending, (list, tuple)) and all(isinstance(val, bool) for val in ascending) and len(ascending) == len(self.sort_by)), "ascending must be a bool or a list of bool of the same len as sort_by"
        self.ascending = ascending

    def do_sort(self, df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        if not set(self.sort_by) <= set(df.columns):
            raise MissingColumnError(f"Column(s) {set(self.sort_by) - set(df.columns)} are missing from the input dataframe.")
        df = self.do_sort(df)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    if not set(self.sort_by) <= set(df.columns):
        raise MissingColumnError(f"Column(s) {set(self.sort_by) - set(df.columns)} are missing from the input dataframe.")
    df = self.do_sort(df)
    self._set_output_df(data, df)

concat

HConcatRule (BinaryOpBaseRule)

Horizontally concatenates two dataframes with the result having the columns from the left dataframe followed by the columns from the right dataframe.

The columns from the left dataframe will be followed by the columns from the right dataframe in the result dataframe. The two dataframes must not have columns with the same name.

Example::

Left dataframe:
| A   | B  |
| a   | 1  |
| b   | 2  |
| c   | 3  |

Right dataframe:
| C   | D  |
| d   | 4  |
| e   | 5  |
| f   | 6  |

After a concat(left, right), the result will look like::

| A   | B  | C   | D  |
| a   | 1  | d   | 4  |
| b   | 2  | e   | 5  |
| c   | 3  | f   | 6  |

Parameters:

Name Type Description Default
named_input_left Optional[str]

Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_input_right Optional[str]

Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised if the two dataframes have columns with the same name.

SchemaError

raised in strict mode only if the two dataframes have a different number of rows.
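
For illustration, a minimal usage sketch (the named inputs "left" and "right" are hypothetical outputs of previous rules)::

# appends the columns of the "right" dataframe after the columns of the "left" dataframe
rule = HConcatRule(named_input_left="left", named_input_right="right", named_output="combined")
rule.apply(data)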

Source code in etlrules/backends/common/concat.py
class HConcatRule(BinaryOpBaseRule):
    """ Horizontally concatenates two dataframe with the result having the columns from the left dataframe followed by the columns from the right dataframe.

    The columns from the left dataframe will be followed by the columns from the right dataframe in the result dataframe.
    The two dataframes must not have columns with the same name.

    Example::

        Left dataframe:
        | A   | B  |
        | a   | 1  |
        | b   | 2  |
        | c   | 3  |

        Right dataframe:
        | C   | D  |
        | d   | 4  |
        | e   | 5  |
        | f   | 6  |  

    After a concat(left, right), the result will look like::

        | A   | B  | C   | D  |
        | a   | 1  | d   | 4  |
        | b   | 2  | e   | 5  |
        | c   | 3  | f   | 6  |

    Args:
        named_input_left: Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right: Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.

        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised if the two dataframes have columns with the same name.
        SchemaError: raised in strict mode only if the two dataframes have a different number of rows.
    """

    def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        # This __init__ is not really needed, but the type annotations are extracted from it
        super().__init__(named_input_left=named_input_left, named_input_right=named_input_right, named_output=named_output, name=name, description=description, strict=strict)

    def do_concat(self, left_df, right_df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        left_df = self._get_input_df_left(data)
        right_df = self._get_input_df_right(data)
        overlapping_names = set(left_df.columns) & set(right_df.columns)
        if overlapping_names:
            raise ColumnAlreadyExistsError(f"Column(s) {overlapping_names} exist in both dataframes.")
        if self.strict:
            if len(left_df) != len(right_df):
                raise SchemaError(f"HConcat needs the two dataframe to have the same number of rows. left df={len(left_df)} rows, right df={len(right_df)} rows.")
        df = self.do_concat(left_df, right_df)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/concat.py
def apply(self, data):
    super().apply(data)
    left_df = self._get_input_df_left(data)
    right_df = self._get_input_df_right(data)
    overlapping_names = set(left_df.columns) & set(right_df.columns)
    if overlapping_names:
        raise ColumnAlreadyExistsError(f"Column(s) {overlapping_names} exist in both dataframes.")
    if self.strict:
        if len(left_df) != len(right_df):
            raise SchemaError(f"HConcat needs the two dataframe to have the same number of rows. left df={len(left_df)} rows, right df={len(right_df)} rows.")
    df = self.do_concat(left_df, right_df)
    self._set_output_df(data, df)
VConcatRule (BinaryOpBaseRule)

Vertically concatenates two dataframes with the result having the rows from the left dataframe followed by the rows from the right dataframe.

The rows of the right dataframe are added at the bottom of the rows from the left dataframe in the result dataframe.

Example::

Left dataframe:
| A   | B  |
| a   | 1  |
| b   | 2  |
| c   | 3  |

Right dataframe:
| A   | B  |
| d   | 4  |
| e   | 5  |
| f   | 6  |

After a concat(left, right), the result will look like::

| A   | B  |
| a   | 1  |
| b   | 2  |
| c   | 3  |
| d   | 4  |
| e   | 5  |
| f   | 6  |

Parameters:

Name Type Description Default
named_input_left Optional[str]

Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_input_right Optional[str]

Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
subset_columns Optional[Iterable[str]]

A subset list of columns available in both dataframes. Only these columns will be concatenated. The effect is similar to doing a ProjectRule(subset_columns) on both dataframes before the concat.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if any of the subset columns specified are missing from either dataframe.

SchemaError

raised in strict mode only if the columns differ between the two dataframes and subset_columns is not specified.

Note

In strict mode, as described above, SchemaError is raised if the columns are not the same (names, types can be inferred). In non-strict mode, columns are not checked and values are filled with NA when missing.
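
For illustration, a minimal usage sketch (the named inputs "sales_2022" and "sales_2023" are hypothetical outputs of previous rules)::

# stacks the rows of the two dataframes, keeping only the A and B columns from each
rule = VConcatRule(named_input_left="sales_2022", named_input_right="sales_2023", subset_columns=["A", "B"], named_output="all_sales")
rule.apply(data)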

Source code in etlrules/backends/common/concat.py
class VConcatRule(BinaryOpBaseRule):
    """ Vertically concatenates two dataframe with the result having the rows from the left dataframe followed by the rows from the right dataframe.

    The rows of the right dataframe are added at the bottom of the rows from the left dataframe in the result dataframe.

    Example::

        Left dataframe:
        | A   | B  |
        | a   | 1  |
        | b   | 2  |
        | c   | 3  |

        Right dataframe:
        | A   | B  |
        | d   | 4  |
        | e   | 5  |
        | f   | 6  |  

    After a concat(left, right), the result will look like::

        | A   | B  |
        | a   | 1  |
        | b   | 2  |
        | c   | 3  |
        | d   | 4  |
        | e   | 5  |
        | f   | 6  |

    Args:
        named_input_left: Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right: Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        subset_columns: A subset list of columns available in both dataframes.
            Only these columns will be concatenated.
            The effect is similar to doing a ProjectRule(subset_columns) on both dataframes before the concat.

        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any of the subset columns specified are missing from either dataframe.
        SchemaError: raised in strict mode only if the columns differ between the two dataframes and subset_columns is not specified.

    Note:
        In strict mode, as described above, SchemaError is raised if the columns are not the same (names, types can be inferred).
        In non-strict mode, columns are not checked and values are filled with NA when missing.
    """

    def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], subset_columns: Optional[Iterable[str]]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input_left=named_input_left, named_input_right=named_input_right, named_output=named_output, name=name, description=description, strict=strict)
        self.subset_columns = [col for col in subset_columns] if subset_columns is not None else None

    def do_concat(self, left_df, right_df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        left_df = self._get_input_df_left(data)
        right_df = self._get_input_df_right(data)
        if self.subset_columns:
            if not set(self.subset_columns) <= set(left_df.columns):
                raise MissingColumnError(f"Missing columns in the left dataframe of the concat operation: {set(self.subset_columns) - set(left_df.columns)}")
            if not set(self.subset_columns) <= set(right_df.columns):
                raise MissingColumnError(f"Missing columns in the right dataframe of the concat operation: {set(self.subset_columns) - set(right_df.columns)}")
            left_df = left_df[self.subset_columns]
            right_df = right_df[self.subset_columns]
        if self.strict:
            if set(left_df.columns) != set(right_df.columns):
                raise SchemaError(f"VConcat needs both dataframe have the same schema. Missing columns in the right df: {set(right_df.columns) - set(left_df.columns)}. Missing columns in the left df: {set(left_df.columns) - set(right_df.columns)}")
        df = self.do_concat(left_df, right_df)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/concat.py
def apply(self, data):
    super().apply(data)
    left_df = self._get_input_df_left(data)
    right_df = self._get_input_df_right(data)
    if self.subset_columns:
        if not set(self.subset_columns) <= set(left_df.columns):
            raise MissingColumnError(f"Missing columns in the left dataframe of the concat operation: {set(self.subset_columns) - set(left_df.columns)}")
        if not set(self.subset_columns) <= set(right_df.columns):
            raise MissingColumnError(f"Missing columns in the right dataframe of the concat operation: {set(self.subset_columns) - set(right_df.columns)}")
        left_df = left_df[self.subset_columns]
        right_df = right_df[self.subset_columns]
    if self.strict:
        if set(left_df.columns) != set(right_df.columns):
            raise SchemaError(f"VConcat needs both dataframe have the same schema. Missing columns in the right df: {set(right_df.columns) - set(left_df.columns)}. Missing columns in the left df: {set(left_df.columns) - set(right_df.columns)}")
    df = self.do_concat(left_df, right_df)
    self._set_output_df(data, df)

conditions

FilterRule (UnaryOpBaseRule)

Exclude rows based on a condition.

Example::

Given df:
| A   | B  |
| 1   | 2  |
| 5   | 3  |
| 3   | 4  |

rule = FilterRule("df['A'] > df['B']")
rule.apply(df)

Result::

| A   | B  |
| 5   | 3  |

Same example using discard_matching_rows=True::

rule = FilterRule("df['A'] > df['B']", discard_matching_rows=True)
rule.apply(df)

Result::

| A   | B  |
| 1   | 2  |
| 3   | 4  |

Parameters:

Name Type Description Default
condition_expression str

An expression as a string. The expression must evaluate to a boolean scalar or a boolean series.

required
discard_matching_rows bool

By default the rows matching the condition (i.e. where the condition is True) are kept, while the rest of the rows are dropped from the result. Setting this parameter to True essentially inverts the condition, so the rows matching the condition are discarded and the rest of the rows are kept. Default: False.

False
named_output_discarded Optional[str]

A named output for the records being discarded if those need to be kept for further processing. Default: None, which doesn't keep track of discarded records.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ExpressionSyntaxError

raised if the column expression has a Python syntax error.

TypeError

raised if an operation is not supported between the types involved

NameError

raised if an unknown variable is used

KeyError

raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
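
For illustration, a minimal usage sketch which also keeps the discarded rows (the named output "discarded" is a hypothetical name)::

# keeps the rows where A > B; the dropped rows remain available as the named output "discarded"
rule = FilterRule("df['A'] > df['B']", named_output_discarded="discarded")
rule.apply(data)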

Source code in etlrules/backends/common/conditions.py
class FilterRule(UnaryOpBaseRule):
    """ Exclude rows based on a condition.

    Example::

        Given df:
        | A   | B  |
        | 1   | 2  |
        | 5   | 3  |
        | 3   | 4  |

        rule = FilterRule("df['A'] > df['B']")
        rule.apply(df)

    Result::

        | A   | B  |
        | 5   | 3  |

    Same example using discard_matching_rows=True::

        rule = FilterRule("df['A'] > df['B']", discard_matching_rows=True)
        rule.apply(df)

    Result::

        | A   | B  |
        | 1   | 2  |
        | 3   | 4  |

    Args:
        condition_expression: An expression as a string. The expression must evaluate to a boolean scalar or a boolean series.
        discard_matching_rows: By default the rows matching the condition (i.e. where the condition is True) are kept,
            while the rest of the rows are dropped from the result. Setting this parameter to True essentially inverts
            the condition, so the rows matching the condition are discarded and the rest of the rows are kept. Default: False.
        named_output_discarded: A named output for the records being discarded if those need to be kept for further processing.
            Default: None, which doesn't keep track of discarded records.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ExpressionSyntaxError: raised if the column expression has a Python syntax error.
        TypeError: raised if an operation is not supported between the types involved
        NameError: raised if an unknown variable is used
        KeyError: raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
    """

    EXCLUDE_FROM_COMPARE = ('_condition_expression', )

    def __init__(self, condition_expression: str, discard_matching_rows: bool=False, named_output_discarded: Optional[str]=None,
                 named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert condition_expression, "condition_expression cannot be empty"
        self.condition_expression = condition_expression
        self.discard_matching_rows = discard_matching_rows
        self.named_output_discarded = named_output_discarded
        self._condition_expression = self.get_condition_expression()
IfThenElseRule (UnaryOpBaseRule)

Calculates the output based on a condition (If Cond is true Then use then_value Else use else_value).

Example::

Given df:
| A   | B  |
| 1   | 2  |
| 5   | 3  |
| 3   | 4  |

rule = IfThenElseRule("df['A'] > df['B']", output_column="C", then_value="A is greater", else_value="B is greater")
rule.apply(df)

Result::

| A   | B  | C            |
| 1   | 2  | B is greater |
| 5   | 3  | A is greater |
| 3   | 4  | B is greater |

Parameters:

Name Type Description Default
condition_expression str

An expression as a string. The expression must evaluate to a boolean scalar or a boolean series.

required
then_value Union[int, float, bool, str]

The value to use if the condition is true.

None
then_column Optional[str]

Use the value from the then_column if the condition is true. One and only one of then_value and then_column can be used.

None
else_value Union[int, float, bool, str]

The value to use if the condition is false.

None
else_column Optional[str]

Use the value from the else_column if the condition is false. One and only one of else_value and else_column can be used.

None
output_column str

The column name of the result column which will be added to the dataframe.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised in strict mode only if a column with the same name already exists in the dataframe.

ExpressionSyntaxError

raised if the column expression has a Python syntax error.

MissingColumnError

raised when then_column or else_column are used but they are missing from the input dataframe.

TypeError

raised if an operation is not supported between the types involved

NameError

raised if an unknown variable is used

KeyError

raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
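
For illustration, a minimal sketch using then_column/else_column instead of literal values::

# C takes the value of A where A > B, otherwise the value of B (i.e. a row-wise maximum)
rule = IfThenElseRule("df['A'] > df['B']", output_column="C", then_column="A", else_column="B")
rule.apply(data)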

Source code in etlrules/backends/common/conditions.py
class IfThenElseRule(UnaryOpBaseRule):
    """ Calculates the ouput based on a condition (If Cond is true Then use then_value Else use else_value).

    Example::

        Given df:
        | A   | B  |
        | 1   | 2  |
        | 5   | 3  |
        | 3   | 4  |

        rule = IfThenElseRule("df['A'] > df['B']", output_column="C", then_value="A is greater", else_value="B is greater")
        rule.apply(df)

    Result::

        | A   | B  | C            |
        | 1   | 2  | B is greater |
        | 5   | 3  | A is greater |
        | 3   | 4  | B is greater |

    Args:
        condition_expression: An expression as a string. The expression must evaluate to a boolean scalar or a boolean series.
        then_value: The value to use if the condition is true.
        then_column: Use the value from the then_column if the condition is true.
            One and only one of then_value and then_column can be used.
        else_value: The value to use if the condition is false.
        else_column: Use the value from the else_column if the condition is false.
            One and only one of else_value and else_column can be used.
        output_column: The column name of the result column which will be added to the dataframe.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised in strict mode only if a column with the same name already exists in the dataframe.
        ExpressionSyntaxError: raised if the column expression has a Python syntax error.
        MissingColumnError: raised when then_column or else_column are used but they are missing from the input dataframe.
        TypeError: raised if an operation is not supported between the types involved
        NameError: raised if an unknown variable is used
        KeyError: raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
    """

    EXCLUDE_FROM_COMPARE = ('_condition_expression', )

    def __init__(self, condition_expression: str, output_column: str, then_value: Optional[Union[int,float,bool,str]]=None, then_column: Optional[str]=None,
                 else_value: Optional[Union[int,float,bool,str]]=None, else_column: Optional[str]=None, 
                 named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert bool(then_value is None) != bool(then_column is None), "One and only one of then_value and then_column can be specified."
        assert bool(else_value is None) != bool(else_column is None), "One and only one of else_value and else_column can be specified."
        assert condition_expression, "condition_expression cannot be empty"
        assert output_column, "output_column cannot be empty"
        self.condition_expression = condition_expression
        self.output_column = output_column
        self.then_value = then_value
        self.then_column = then_column
        self.else_value = else_value
        self.else_column = else_column
        self._condition_expression = self.get_condition_expression()

    def _validate_columns(self, df_columns):
        if self.strict and self.output_column in df_columns:
            raise ColumnAlreadyExistsError(f"Column {self.output_column} already exists in the input dataframe.")
        if self.then_column is not None and self.then_column not in df_columns:
            raise MissingColumnError(f"Column {self.then_column} is missing from the input dataframe.")
        if self.else_column is not None and self.else_column not in df_columns:
            raise MissingColumnError(f"Column {self.else_column} is missing from the input dataframe.")

datetime

DateTimeAddRule (BaseAssignColumnRule)

Adds a number of units (days, hours, minutes, etc.) to a datetime column.

Basic usage::

# adds 2 days to the A column
rule = DateTimeAddRule("A", 2, "days")
rule.apply(data)

# adds 2 hours to the A column
rule = DateTimeAddRule("A", 2, "hours")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The name of a datetime column to add to.

required
unit_value Union[int,float,str]

The number of units to add to the datetime column. The unit_value can be negative, in which case this rule performs a subtraction.

A name of an existing column can be passed into unit_value, in which case that column will be added to the input_column. If the column is a timedelta, it will be added as is; if it's a numeric column, it will be interpreted based on the unit parameter (e.g. years/days/hours/etc.). In this case, if the column specified in the unit_value doesn't exist, MissingColumnError is raised.

required
unit str

Specifies what unit the unit_value is in. Supported values are: years, months, weeks, weekdays, days, hours, minutes, seconds, microseconds, nanoseconds. weekdays skips weekends (i.e. Saturdays and Sundays).

required
output_column Optional[str]

The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column doesn't exist in the input dataframe.

MissingColumnError

raised if unit_value is a name of a column but it doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

ValueError

raised if unit_value is a column which is not a timedelta column and the unit parameter is not specified.

Note

In non-strict mode, missing columns or overwriting existing columns are ignored.
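
For illustration, a minimal sketch passing a column name as the unit_value (the Start, Offset and End column names are hypothetical)::

# adds the number of days held in the numeric Offset column to the Start column,
# writing the result into a new End column
rule = DateTimeAddRule("Start", "Offset", "days", output_column="End")
rule.apply(data)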

Source code in etlrules/backends/common/datetime.py
class DateTimeAddRule(BaseAssignColumnRule):
    """ Adds a number of units (days, hours, minutes, etc.) to a datetime column.

    Basic usage::

        # adds 2 days to the A column
        rule = DateTimeAddRule("A", 2, "days")
        rule.apply(data)

        # adds 2 hours to the A column
        rule = DateTimeAddRule("A", 2, "hours")
        rule.apply(data)

    Args:
        input_column (str): The name of a datetime column to add to.
        unit_value (Union[int,float,str]): The number of units to add to the datetime column.
            The unit_value can be negative, in which case this rule performs a subtraction.

            A name of an existing column can be passed into unit_value, in which case, that
            column will be added to the input_column.
            If the column is a timedelta, it will be added as is; if it's a numeric column,
            then it will be interpreted based on the unit parameter (e.g. years/days/hours/etc.).
            In this case, if the column specified in the unit_value doesn't exist,
            MissingColumnError is raised.

        unit (str): Specifies what unit the unit_value is in. Supported values are:
            years, months, weeks, weekdays, days, hours, minutes, seconds, microseconds, nanoseconds.
            weekdays skips weekends (i.e. Saturdays and Sundays).

        output_column (Optional[str]): The name of a new column with the result. Optional.
            If not provided, the result is updated in place.
            In strict mode, if provided, the output_column must not exist in the input dataframe.
            In non-strict mode, if provided, the output_column will overwrite a column with
            the same name in the input dataframe (if any).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
        MissingColumnError: raised if unit_value is a name of a column but it doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
        ValueError: raised if unit_value is a column which is not a timedelta column and the unit parameter is not specified.

    Note:
        In non-strict mode, missing columns or overwriting existing columns are ignored.
    """
    def __init__(self, input_column: str, unit_value: Union[int, float, str], 
                 unit: Optional[Literal["years", "months", "weeks", "weekdays", "days", "hours", "minutes", "seconds", "milliseconds", "microseconds", "nanoseconds"]],
                 output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.unit_value = unit_value
        if not isinstance(self.unit_value, str):
            assert unit in DT_ARITHMETIC_UNITS, f"Unsupported unit: '{unit}'. It must be one of {DT_ARITHMETIC_UNITS}"
        self.unit = unit
DateTimeDiffRule (BaseAssignColumnRule)

Calculates the difference between two datetime columns, optionally extracting it in the specified unit.

Basic usage::

# calculates the A - B in days
rule = DateTimeDiffRule("A", "B", unit="days")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The name of a datetime column.

required
input_column2 str

The name of the second datetime column. The result will be input_column - input_column2

required
unit Optional[str]

If specified, it will extract the given component of the difference: days, hours, minutes, seconds, microseconds, nanoseconds or total_seconds.

required
output_column Optional[str]

The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if either input_column or input_column2 don't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

For best results, round the datetime columns using one of the rounding rules before calculating the difference. Otherwise, this rule will tend to truncate/round down. For example: 2023-05-05 10:00:00 - 2023-05-04 10:00:01 will result in 0 days even though the difference is 23:59:59. In cases like this one, it might be preferable to round, in this case perhaps round to "day" using DateTimeRoundRule or DateTimeRoundDownRule. This will result in a 2023-05-05 00:00:00 - 2023-05-04 00:00:00 which results in 1 day.
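
For illustration, a minimal sketch following the advice in the note above::

# truncate both columns to days, then compute A - B in whole days
rule = DateTimeRoundDownRule("A", "day")
rule.apply(data)
rule = DateTimeRoundDownRule("B", "day")
rule.apply(data)
rule = DateTimeDiffRule("A", "B", unit="days")
rule.apply(data)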

Source code in etlrules/backends/common/datetime.py
class DateTimeDiffRule(BaseAssignColumnRule):
    """ Calculates the difference between two datetime columns, optionally extracting it in the specified unit.

    Basic usage::

        # calculates the A - B in days
        rule = DateTimeDiffRule("A", "B", unit="days")
        rule.apply(data)

    Args:
        input_column (str): The name of a datetime column.
        input_column2 (str): The name of the second datetime column.
            The result will be input_column - input_column2

        unit (Optional[str]): If specified, it will extract the given component of the difference:
            days, hours, minutes, seconds, microseconds, nanoseconds or total_seconds.

        output_column (Optional[str]): The name of a new column with the result. Optional.
            If not provided, the result is updated in place.
            In strict mode, if provided, the output_column must not exist in the input dataframe.
            In non-strict mode, if provided, the output_column will overwrite a column with
            the same name in the input dataframe (if any).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if either input_column or input_column2 don't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        For best results, round the datetime columns using one of the rounding rules before
        calculating the difference. Otherwise, this rule will tend to truncate/round down.
        For example: 2023-05-05 10:00:00 - 2023-05-04 10:00:01 will result in 0 days even though
        the difference is 23:59:59. In cases like this one, it might be preferable to round, in this
        case perhaps round to "day" using DateTimeRoundRule or DateTimeRoundDownRule. This will result
        in a 2023-05-05 00:00:00 - 2023-05-04 00:00:00 which results in 1 day.

    """

    SUPPORTED_COMPONENTS = {
        "days", "hours", "minutes", "seconds",
        "microseconds", "nanoseconds", "total_seconds",
    }

    SIGN = -1

    EXCLUDE_FROM_SERIALIZE = ('unit_value', )

    def __init__(self, input_column: str, input_column2: str, 
                 unit: Optional[Literal["days", "hours", "minutes", "seconds", "milliseconds", "microseconds", "nanoseconds", "total_seconds"]],
                 output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        assert input_column2 and isinstance(input_column2, str), "input_column2 must be a non-empty string representing the name of a column."
        assert unit is None or unit in self.SUPPORTED_COMPONENTS, f"unit must be None or one of: {self.SUPPORTED_COMPONENTS}"
        super().__init__(input_column=input_column, output_column=output_column,
                         named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.input_column2 = input_column2
        self.unit = unit
DateTimeExtractComponentRule (BaseAssignColumnRule)

Extract an individual component of a date/time (e.g. year, month, day, hour, etc.).

Basic usage::

# extracts the year component from col_A. E.g. 2023-05-05 10:00:00 will extract 2023
rule = DateTimeExtractComponentRule("col_A", component="year", locale=None)
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A datetime column to extract the given component from.

required
component str

The component to extract from the datetime. When the component is one of (year, month, day, hour, minute, second, microsecond) then the extracted component will be an integer with the respective component of the datetime.

When component is weekday, the component will be an integer with the values 0-6, with Monday being 0 and Sunday 6.

When the component is day_name or month_name, the result column will be a string column with the names of the weekdays (e.g. Monday, Tuesday, etc.) or month names respectively (e.g. January, February, etc.). The names will be printed in the language specified in the locale parameter (or English as the default).

required
locale Optional[str]

An optional locale string applicable to day_name and month_name. When specified, the names will use the given locale to print the names in the given language. Default: en_US.utf8 will print the names in English. Use the command locale -a on your terminal on Unix systems to find your locale language code. Trying to set the locale to a value that doesn't appear under the locale -a output will fail with ValueError: Unsupported locale.

required
output_column Optional[str]

An optional column name to contain the result. If provided, the input column is left unchanged and a new column is created with the extracted component. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

ValueError

raised if a locale is specified which is not supported or available on the machine running the scripts.

Note

In non-strict mode, overwriting existing columns is ignored.
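
For illustration, a minimal sketch extracting localized weekday names (whether the de_DE.utf8 locale is available depends on the machine running the scripts)::

# adds a Weekday column with German weekday names, e.g. Montag, Dienstag, etc.
rule = DateTimeExtractComponentRule("col_A", component="day_name", locale="de_DE.utf8", output_column="Weekday")
rule.apply(data)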

Source code in etlrules/backends/common/datetime.py
class DateTimeExtractComponentRule(BaseAssignColumnRule):
    """ Extract an individual component of a date/time (e.g. year, month, day, hour, etc.).

    Basic usage::

        # extracts the year component from col_A. E.g. 2023-05-05 10:00:00 will extract 2023
        rule = DateTimeExtractComponentRule("col_A", component="year", locale=None)
        rule.apply(data)

    Args:
        input_column (str): A datetime column to extract the given component from.
        component: The component to extract from the datetime.
            When the component is one of (year, month, day, hour, minute, second, microsecond) then
            the extracted component will be an integer with the respective component of the datetime.

            When component is weekday, the component will be an integer with the values 0-6, with
            Monday being 0 and Sunday 6.

            When the component is day_name or month_name, the result column will be a string
            column with the names of the weekdays (e.g. Monday, Tuesday, etc.) or month names
            respectively (e.g. January, February, etc.). The names will be printed in the language
            specified in the locale parameter (or English as the default).

        locale: An optional locale string applicable to day_name and month_name. When specified,
            the names will use the given locale to print the names in the given language.
            Default: en_US.utf8 will print the names in English.
            Use the command `locale -a` on your terminal on Unix systems to find your locale language code.
            Trying to set the locale to a value that doesn't appear under the `locale -a` output will fail
            with ValueError: Unsupported locale.
        output_column (Optional[str]): An optional column name to contain the result.
            If provided, the input column is left unchanged and a new column is created with the extracted component.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
        ValueError: raised if a locale is specified which is not supported or available on the machine running the scripts.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    SUPPORTED_COMPONENTS = {
        "year", "month", "day", "hour", "minute", "second",
        "microsecond", "nanosecond",
        "weekday", "day_name", "month_name",
    }

    def __init__(self, input_column: str, component: str, locale: Optional[str], output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        self.component = component
        assert self.component in self.SUPPORTED_COMPONENTS, f"Unsupported component={self.component}. Must be one of: {self.SUPPORTED_COMPONENTS}"
        self.locale = locale
        self._locale = self.locale
        if self.locale and self._cannot_set_locale(locale):
            if self.strict:
                raise ValueError(f"Unsupported locale: {locale}")
            self._locale = None

    def _cannot_set_locale(self, locale):
        return False
DateTimeLocalNowRule (UnaryOpBaseRule)

Adds a new column with the local date/time.

Basic usage::

rule = DateTimeLocalNowRule(output_column="LocalTimeNow")
rule.apply(data)

Parameters:

Name Type Description Default
output_column

The name of the column to be added to the dataframe. This column will be populated with the local date/time at the time of the call. The same value will be populated for all rows. The date/time populated is a "naive" datetime, i.e. it doesn't have timezone information.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the input dataframe.

Note

In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
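
For illustration, a minimal sketch chaining this rule with DateTimeDiffRule (the "events" input and the EventTime column are hypothetical)::

# snapshots the current local time, then computes how many days ago each event occurred
rule = DateTimeLocalNowRule(output_column="Now", named_input="events", named_output="events_with_now")
rule.apply(data)
rule = DateTimeDiffRule("Now", "EventTime", unit="days", output_column="DaysAgo", named_input="events_with_now")
rule.apply(data)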

Source code in etlrules/backends/common/datetime.py
class DateTimeLocalNowRule(UnaryOpBaseRule):
    """ Adds a new column with the local date/time.

    Basic usage::

        rule = DateTimeLocalNowRule(output_column="LocalTimeNow")
        rule.apply(data)

    Args:
        output_column: The name of the column to be added to the dataframe.
            This column will be populated with the local date/time at the time of the call.
            The same value will be populated for all rows.
            The date/time populated is a "naive" datetime, i.e. it doesn't have timezone information.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the input dataframe.

    Note:
        In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
    """

    def __init__(self, output_column, named_input:Optional[str]=None, named_output:Optional[str]=None, name:Optional[str]=None, description:Optional[str]=None, strict:bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert output_column and isinstance(output_column, str)
        self.output_column = output_column
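For orientation, below is a minimal sketch of chaining this rule through a named output. The import paths and the RuleData constructor are assumptions (pandas backend assumed), not confirmed by this reference; adapt them to your setup::

# a minimal sketch, assuming the pandas backend and that RuleData
# wraps the input dataframe (both assumptions)
import pandas as pd
from etlrules.backends.pandas import DateTimeLocalNowRule
from etlrules.data import RuleData

data = RuleData(pd.DataFrame({"id": [1, 2, 3]}))
rule = DateTimeLocalNowRule(output_column="LocalTimeNow", named_output="stamped")
rule.apply(data)
# subsequent rules can consume the result via named_input="stamped"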
DateTimeRoundDownRule (BaseAssignColumnRule)

Rounds down (truncates) a datetime column to the specified granularity (day, hour, minute, etc.).

Basic usage::

# rounds down the A column to seconds
rule = DateTimeRoundDownRule("A", "second")
rule.apply(data)

# rounds down the A column to days
rule = DateTimeRoundDownRule("A", "day")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The column name to round according to the unit specified.

required
unit str

Specifies the unit to round down (truncate) to: day, hour, minute, etc.

The supported units are:
day: removes the hours/minutes/etc.
hour: removes the minutes/seconds/etc.
minute: removes the seconds/etc.
second: removes the milliseconds/etc.
millisecond: removes the microseconds
microsecond: removes nanoseconds (if any)

required
output_column Optional[str]

The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, overwriting existing columns is ignored.

Source code in etlrules/backends/common/datetime.py
class DateTimeRoundDownRule(BaseAssignColumnRule):
    """ Rounds down (truncates) a set of datetime columns to the specified granularity (day, hour, minute, etc.).

    Basic usage::

        # rounds down the A column to seconds
        rule = DateTimeRoundDownRule("A", "second")
        rule.apply(data)

        # rounds down the A column to days
        rule = DateTimeRoundDownRule("A", "day")
        rule.apply(data)

    Args:
        input_column (str): The column name to round according to the unit specified.
        unit (str): Specifies the unit to round down (truncate) to.
            That is: truncating to the day, hour, minute, etc.

            The supported units are:
                day: removes the hours/minutes/etc.
                hour: removes the minutes/seconds etc.
                minute: removes the seconds/etc.
                second: removes the milliseconds/etc.
                millisecond: removes the microseconds
                microsecond: removes nanoseconds (if any)

        output_column (Optional[str]): The name of a new column with the result. Optional.
            If not provided, the result is updated in place.
            In strict mode, if provided, the output_column must not exist in the input dataframe.
            In non-strict mode, if provided, the output_column will overwrite a column with
            the same name in the input dataframe (if any).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    def __init__(self, input_column: str, unit: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert isinstance(unit, str) and unit in ROUND_TRUNC_UNITS, f"unit must be one of {ROUND_TRUNC_UNITS} and not '{unit}'"
        self.unit = unit
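To make the truncation semantics concrete, the sketch below uses plain pandas floor(); assuming (not confirmed by this reference) that the pandas backend maps this rule onto it, the results would be::

import pandas as pd

ts = pd.Timestamp("2023-05-19 14:45:59.123")
ts.floor("D")    # Timestamp('2023-05-19 00:00:00'), hours and below removed
ts.floor("min")  # Timestamp('2023-05-19 14:45:00'), seconds and below removed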
DateTimeRoundRule (BaseAssignColumnRule)

Rounds a datetime column to the specified granularity (day, hour, minute, etc.).

Basic usage::

# rounds the A column to the nearest second
rule = DateTimeRoundRule("A", "second")
rule.apply(data)

# rounds the A column to days
rule = DateTimeRoundRule("A", "day")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The column name to round according to the unit specified.

required
unit str

Specifies the unit of rounding. That is: rounding to the nearest day, hour, minute, etc.

The supported units are:
day: anything up to 12:00:00 rounds down to the current day, after that up to the next day
hour: anything up to the 30th minute rounds down to the current hour, after that up to the next hour
minute: anything up to the 30th second rounds down to the current minute, after that up to the next minute
second: rounds to the nearest second (if the column has milliseconds)
millisecond: rounds to the nearest millisecond (if the column has microseconds)
microsecond: rounds to the nearest microsecond
nanosecond: rounds to the nearest nanosecond

required
output_column Optional[str]

The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, overwriting existing columns is ignored.

Source code in etlrules/backends/common/datetime.py
class DateTimeRoundRule(BaseAssignColumnRule):
    """ Rounds a set of datetime columns to the specified granularity (day, hour, minute, etc.).

    Basic usage::

        # rounds the A column to the nearest second
        rule = DateTimeRoundRule("A", "second")
        rule.apply(data)

        # rounds the A column to days
        rule = DateTimeRoundRule("A", "day")
        rule.apply(data)

    Args:
        input_column (str): The column name to round according to the unit specified.
        unit (str): Specifies the unit of rounding.
            That is: rounding to the nearest day, hour, minute, etc.

            The supported units are:
                day: anything up to 12:00:00 rounds down to the current day, after that up to the next day
                hour: anything up to 30th minute rounds down to the current hour, after that up to the next hour
                minute: anything up to 30th second rounds down to the current minute, after that up to the next minute
                second: rounds to the nearest second (if the column has milliseconds)
                millisecond: rounds to the nearest millisecond (if the column has microseconds)
                microsecond: rounds to the nearest microsecond
                nanosecond: rounds to the nearest nanosecond

        output_column (Optional[str]): The name of a new column with the result. Optional.
            If not provided, the result is updated in place.
            In strict mode, if provided, the output_column must not exist in the input dataframe.
            In non-strict mode, if provided, the output_column will overwrite a column with
            the same name in the input dataframe (if any).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    def __init__(self, input_column: str, unit: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert isinstance(unit, str) and unit in ROUND_TRUNC_UNITS, f"unit must be one of {ROUND_TRUNC_UNITS} and not '{unit}'"
        self.unit = unit
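As a concrete illustration of the halfway behaviour described above, plain pandas round() is shown below, assuming the pandas backend behaves equivalently::

import pandas as pd

pd.Timestamp("2023-05-19 09:10:00").round("D")  # Timestamp('2023-05-19 00:00:00')
pd.Timestamp("2023-05-19 14:45:00").round("D")  # Timestamp('2023-05-20 00:00:00')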
DateTimeRoundUpRule (BaseAssignColumnRule)

Rounds up a datetime column to the specified granularity (day, hour, minute, etc.).

Basic usage::

# rounds up the A column to seconds
rule = DateTimeRoundUpRule("A", "second")
rule.apply(data)

# rounds up the A column to days
rule = DateTimeRoundUpRule("A", "day")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The column name to round according to the unit specified.

required
unit str

Specifies the unit to round up to: day, hour, minute, etc.

The supported units are:
day: rounds up to the next day if there are any hours/minutes/etc.
hour: rounds up to the next hour if there are any minutes/etc.
minute: rounds up to the next minute if there are any seconds/etc.
second: rounds up to the next second if there are any milliseconds/etc.
millisecond: rounds up to the next millisecond if there are any microseconds
microsecond: rounds up to the next microsecond if there are any nanoseconds

required
output_column Optional[str]

The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, overwriting existing columns is ignored.

Source code in etlrules/backends/common/datetime.py
class DateTimeRoundUpRule(BaseAssignColumnRule):
    """ Rounds up a set of datetime columns to the specified granularity (day, hour, minute, etc.).

    Basic usage::

        # rounds up the A column to seconds
        rule = DateTimeRoundUpRule("A", "second")
        rule.apply(data)

        # rounds up the A column to days
        rule = DateTimeRoundUpRule("A", "day")
        rule.apply(data)

    Args:
        input_column (str): The column name to round according to the unit specified.
        unit (str): Specifies the unit to round up to.
            That is: rounding up to the next day, hour, minute, etc.

            The supported units are:
                day: Rounds up to the next day if there are any hours/minutes/etc.
                hour: Rounds up to the next hour if there are any minutes/etc.
                minute: Rounds up to the next minute if there are any seconds/etc.
                second: Rounds up to the next second if there are any milliseconds/etc.
                millisecond: Rounds up to the next millisecond if there are any microseconds
                microsecond: Rounds up to the next microsecond if there are any nanoseconds

        output_column (Optional[str]): The name of a new column with the result. Optional.
            If not provided, the result is updated in place.
            In strict mode, if provided, the output_column must not exist in the input dataframe.
            In non-strict mode, if provided, the output_column will overwrite a column with
            the same name in the input dataframe (if any).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    def __init__(self, input_column: str, unit: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert isinstance(unit, str) and unit in ROUND_TRUNC_UNITS, f"unit must be one of {ROUND_TRUNC_UNITS} and not '{unit}'"
        self.unit = unit
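For contrast with DateTimeRoundDownRule: a value already on the boundary stays unchanged, while anything past it rounds up. Illustrated with plain pandas ceil(), an assumed equivalent of what the pandas backend does::

import pandas as pd

pd.Timestamp("2023-05-19 00:00:00").ceil("D")  # unchanged: already on a day boundary
pd.Timestamp("2023-05-19 00:00:01").ceil("D")  # Timestamp('2023-05-20 00:00:00')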
DateTimeSubstractRule (BaseAssignColumnRule)

Subtracts a number of units (days, hours, minutes, etc.) from a datetime column.

Basic usage::

# subtracts 2 days from the A column
rule = DateTimeSubstractRule("A", 2, "days")
rule.apply(data)

# subtracts 2 hours from the A column
rule = DateTimeSubstractRule("A", 2, "hours")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The name of a datetime column to subtract from.

required
unit_value Union[int,float,str]

The number of units to subtract from the datetime column. The unit_value can be negative, in which case this rule performs an addition.

A name of an existing column can be passed into unit_value, in which case that column will be subtracted from the input_column. If the column is a timedelta, it is subtracted as is; if it's a numeric column, it is interpreted based on the unit parameter (e.g. days/hours/etc.). In this case, if the column specified in unit_value doesn't exist, MissingColumnError is raised.

required
unit str

Specifies what unit the unit_value is in. Supported values are: days, hours, minutes, seconds, microseconds, nanoseconds.

required
output_column Optional[str]

The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column doesn't exist in the input dataframe.

MissingColumnError

raised if unit_value is a name of a column but it doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, missing columns or overwriting existing columns are ignored.

Source code in etlrules/backends/common/datetime.py
class DateTimeSubstractRule(BaseAssignColumnRule):
    """ Substracts a number of units (days, hours, minutes, etc.) from a datetime column.

    Basic usage::

        # subtracts 2 days from the A column
        rule = DateTimeSubstractRule("A", 2, "days")
        rule.apply(data)

        # subtracts 2 hours from the A column
        rule = DateTimeSubstractRule("A", 2, "hours")
        rule.apply(data)

    Args:
        input_column (str): The name of a datetime column to subtract from.
        unit_value (Union[int,float,str]): The number of units to subtract from the datetime column.
            The unit_value can be negative, in which case this rule performs an addition.

            A name of an existing column can be passed into unit_value, in which case that
            column will be subtracted from the input_column.
            If the column is a timedelta, it is subtracted as is; if it's a numeric column,
            it is interpreted based on the unit parameter (e.g. days/hours/etc.).
            In this case, if the column specified in the unit_value doesn't exist,
            MissingColumnError is raised.

        unit (str): Specifies what unit the unit_value is in. Supported values are:
            days, hours, minutes, seconds, microseconds, nanoseconds.

        output_column (Optional[str]): The name of a new column with the result. Optional.
            If not provided, the result is updated in place.
            In strict mode, if provided, the output_column must not exist in the input dataframe.
            In non-strict mode, if provided, the output_column will overwrite a column with
            the same name in the input dataframe (if any).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
        MissingColumnError: raised if unit_value is a name of a column but it doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, missing columns or overwriting existing columns are ignored.
    """
    def __init__(self, input_column: str, unit_value: Union[int, float, str], 
                 unit: Optional[Literal["years", "months", "weeks", "weekdays", "days", "hours", "minutes", "seconds", "milliseconds", "microseconds", "nanoseconds"]],
                 output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.unit_value = unit_value
        if not isinstance(self.unit_value, str):
            assert unit in DT_ARITHMETIC_UNITS, f"Unsupported unit: '{unit}'. It must be one of {DT_ARITHMETIC_UNITS}"
        self.unit = unit
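A short sketch of the column-based form of unit_value described above; the column names here are hypothetical::

# subtract a numeric column, interpreted as days because of the unit parameter
rule = DateTimeSubstractRule("start_date", "offset_days", "days")
rule.apply(data)

# subtract a timedelta column as-is, writing the result to a new column
rule = DateTimeSubstractRule("start_date", "elapsed", "days", output_column="origin")
rule.apply(data)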
DateTimeToStrFormatRule (BaseAssignColumnRule)

Formats a datetime column to a string representation according to a specified format.

Basic usage::

# displays the dates in column col_A in the %Y-%m-%d format, e.g. 2023-05-19
rule = DateTimeToStrFormatRule("col_A", format="%Y-%m-%d")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The datetime column with the values to format to string.

required
format str

The format used to display the date/time, e.g. %Y-%m-%d. For the directives accepted in the format, have a look at: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior

required
output_column Optional[str]

An optional column to hold the formatted results. If provided, the existing column is unchanged, and a new column with this name is created. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, overwriting existing columns is ignored.

Source code in etlrules/backends/common/datetime.py
class DateTimeToStrFormatRule(BaseAssignColumnRule):
    """ Formats a datetime column to a string representation according to a specified format.

    Basic usage::

        # displays the dates in column col_A in the %Y-%m-%d format, e.g. 2023-05-19
        rule = DateTimeToStrFormatRule("col_A", format="%Y-%m-%d")
        rule.apply(data)

    Args:
        input_column (str): The datetime column with the values to format to string.
        format: The format used to display the date/time.
            E.g. %Y-%m-%d
            For the directives accepted in the format, have a look at:
            https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
        output_column (Optional[str]): An optional column to hold the formatted results.
            If provided, the existing column is unchanged, and a new column with this name
            is created.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    def __init__(self, input_column: str, format: str, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        self.format = format
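A sketch of formatting into a separate column so the original datetime column is preserved; the column names are hypothetical::

# e.g. 2023-05-19 00:00:00 -> "19 May 2023"
rule = DateTimeToStrFormatRule("col_A", format="%d %b %Y", output_column="col_A_str")
rule.apply(data)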
DateTimeUTCNowRule (UnaryOpBaseRule)

Adds a new column with the UTC date/time.

Basic usage::

rule = DateTimeUTCNowRule(output_column="UTCTimeNow")
rule.apply(data)

Parameters:

Name Type Description Default
output_column

The name of the column to be added to the dataframe. This column will be populated with the UTC date/time at the time of the call. The same value will be populated for all rows. The date/time populated is a "naive" datetime, i.e. it doesn't have timezone information.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the input dataframe.

Note

In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.

Source code in etlrules/backends/common/datetime.py
class DateTimeUTCNowRule(UnaryOpBaseRule):
    """ Adds a new column with the UTC date/time.

    Basic usage::

        rule = DateTimeUTCNowRule(output_column="UTCTimeNow")
        rule.apply(data)

    Args:
        output_column: The name of the column to be added to the dataframe.
            This column will be populated with the UTC date/time at the time of the call.
            The same value will be populated for all rows.
            The date/time populated is a "naive" datetime, i.e. it doesn't have timezone information.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the input dataframe.

    Note:
        In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
    """

    def __init__(self, output_column, named_input:Optional[str]=None, named_output:Optional[str]=None, name:Optional[str]=None, description:Optional[str]=None, strict:bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert output_column and isinstance(output_column, str)
        self.output_column = output_column

fill

BackFillRule (BaseFillRule)

Replaces NAs/missing values with the next non-NA value, optionally sorting and grouping the data.

Example::

| A   | B  |
| a   | NA |
| b   | 2  |
| a   | NA |

After a back fill::

| A   | B  |
| a   | 2  |
| b   | 2  |
| a   | NA |

After a back fill with group_by=["A"]::

| A   | B  |
| a   | NA |
| b   | 2  |
| a   | NA |

The "a" group has no non-NA value, so it is not filled. The "b" group has a non-NA value of 2 but not other NA values, so nothing to fill.

Parameters:

Name Type Description Default
columns Iterable[str]

The list of columns to replace NAs in. The rest of the columns in the dataframe are not affected.

required
sort_by Optional[Iterable[str]]

The list of columns to sort by before the fill operation. Optional. Given that the next non-NA values are used, sorting can make a difference to the values used.

None
sort_ascending bool

When sort_by is specified, True means sort ascending, False sort descending.

True
group_by Optional[Iterable[str]]

The list of columns to group by before the fill operation. Optional. The fill values are only used within a group, other adjacent groups are not filled. Useful when you want to copy (fill) data at a certain group level.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe.

Source code in etlrules/backends/common/fill.py
class BackFillRule(BaseFillRule):
    """ Replaces NAs/missing values with the next non-NA value, optionally sorting and grouping the data.

    Example::

        | A   | B  |
        | a   | NA |
        | b   | 2  |
        | a   | NA |

    After a back fill::

        | A   | B  |
        | a   | 2  |
        | b   | 2  |
        | a   | NA |  

    After a back fill with group_by=["A"]::

        | A   | B  |
        | a   | NA |
        | b   | 2  |
        | a   | NA |

    The "a" group has no non-NA value, so it is not filled.
    The "b" group has a non-NA value of 2 but not other NA values, so nothing to fill.

    Args:
        columns (Iterable[str]): The list of columns to replace NAs in.
            The rest of the columns in the dataframe are not affected.
        sort_by (Optional[Iterable[str]]): The list of columns to sort by before the fill operation. Optional.
            Given that the next non-NA values are used, sorting can make a difference to the values used.
        sort_ascending (bool): When sort_by is specified, True means sort ascending, False sort descending.
        group_by (Optional[Iterable[str]]): The list of columns to group by before the fill operation. Optional.
            The fill values are only used within a group, other adjacent groups are not filled.
            Useful when you want to copy (fill) data at a certain group level.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe.
    """
BaseFillRule (UnaryOpBaseRule)
Source code in etlrules/backends/common/fill.py
class BaseFillRule(UnaryOpBaseRule):

    FILL_METHOD = None

    def __init__(self, columns: Iterable[str], sort_by: Optional[Iterable[str]]=None, sort_ascending: bool=True, group_by: Optional[Iterable[str]]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        assert self.FILL_METHOD is not None
        assert columns, "Columns need to be specified for fill rules."
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.columns = [col for col in columns]
        assert all(isinstance(col, str) for col in self.columns), "All columns must be strings in fill rules."
        self.sort_by = sort_by
        if self.sort_by is not None:
            self.sort_by = [col for col in self.sort_by]
            assert all(isinstance(col, str) for col in self.sort_by), "All sort_by columns must be strings in fill rules when specified."
        self.sort_ascending = sort_ascending
        self.group_by = group_by
        if self.group_by is not None:
            self.group_by = [col for col in self.group_by]
            assert all(isinstance(col, str) for col in self.group_by), "All group_by columns must be strings in fill rules when specified."

    def do_apply(self, df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        df_columns = [col for col in df.columns]
        if self.sort_by:
            if not set(self.sort_by) <= set(df_columns):
                raise MissingColumnError(f"Missing sort_by column(s) in fill operation: {set(self.sort_by) - set(df_columns)}")
        if self.group_by:
            if not set(self.group_by) <= set(df_columns):
                raise MissingColumnError(f"Missing group_by column(s) in fill operation: {set(self.group_by) - set(df_columns)}")
        df = self.do_apply(df)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/fill.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    df_columns = [col for col in df.columns]
    if self.sort_by:
        if not set(self.sort_by) <= set(df_columns):
            raise MissingColumnError(f"Missing sort_by column(s) in fill operation: {set(self.sort_by) - set(df_columns)}")
    if self.group_by:
        if not set(self.group_by) <= set(df_columns):
            raise MissingColumnError(f"Missing group_by column(s) in fill operation: {set(self.group_by) - set(df_columns)}")
    df = self.do_apply(df)
    self._set_output_df(data, df)
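The NotImplementedError in do_apply above hints that concrete fill rules must be imported from a specific backend rather than from common; for example, assuming the pandas backend is installed::

# the backend module supplies the working do_apply implementations
from etlrules.backends.pandas import BackFillRule, ForwardFillRule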
ForwardFillRule (BaseFillRule)

Replaces NAs/missing values with the previous non-NA value, optionally sorting and grouping the data.

Example::

| A   | B  |
| a   | 1  |
| b   | NA |
| a   | NA |

After a fill forward::

| A   | B  |
| a   | 1  |
| b   | 1  |
| a   | 1  |

After a fill forward with group_by=["A"]::

| A   | B  |
| a   | 1  |
| b   | NA |
| a   | 1  |

The "a" group has the first non-NA value as 1 and that is used "forward" to fill the 3rd row. The "b" group has no non-NA values, so nothing to fill.

Parameters:

Name Type Description Default
columns Iterable[str]

The list of columns to replace NAs in. The rest of the columns in the dataframe are not affected.

required
sort_by Optional[Iterable[str]]

The list of columns to sort by before the fill operation. Optional. Given that the previous non-NA values are used, sorting can make a difference to the values used.

None
sort_ascending bool

When sort_by is specified, True means sort ascending, False sort descending.

True
group_by Optional[Iterable[str]]

The list of columns to group by before the fill operation. Optional. The fill values are only used within a group, other adjacent groups are not filled. Useful when you want to copy (fill) data at a certain group level.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe.

Source code in etlrules/backends/common/fill.py
class ForwardFillRule(BaseFillRule):
    """ Replaces NAs/missing values with the next non-NA value, optionally sorting and grouping the data.

    Example::

        | A   | B  |
        | a   | 1  |
        | b   | NA |
        | a   | NA |

    After a fill forward::

        | A   | B  |
        | a   | 1  |
        | b   | 1  |
        | a   | 1  |  

    After a fill forward with group_by=["A"]::

        | A   | B  |
        | a   | 1  |
        | b   | NA |
        | a   | 1  |

    The "a" group has the first non-NA value as 1 and that is used "forward" to fill the 3rd row.
    The "b" group has no non-NA values, so nothing to fill.

    Args:
        columns (Iterable[str]): The list of columns to replace NAs in.
            The rest of the columns in the dataframe are not affected.
        sort_by (Optional[Iterable[str]]): The list of columns to sort by before the fill operation. Optional.
            Given that the previous non-NA values are used, sorting can make a difference to the values used.
        sort_ascending (bool): When sort_by is specified, True means sort ascending, False sort descending.
        group_by (Optional[Iterable[str]]): The list of columns to group by before the fill operation. Optional.
            The fill values are only used within a group, other adjacent groups are not filled.
            Useful when you want to copy (fill) data at a certain group level.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe.
    """

io special

db
ReadSQLQueryRule (BaseRule)

Runs a SQL query and reads the results back into a dataframe.

Basic usage::

# reads all the data from a sqlite db called mydb.db, from the table MyTable
# saves the dataframe as the main output of the rule which subsequent rules can use as their main input
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable")
rule.apply(data)

# reads all the data from a sqlite db called mydb.db, from the table MyTable
# saves the dataframe as a named output called MyData which subsequent rules can use by name
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", named_output="MyData")
rule.apply(data)

# same as the first example, but uses column types rather than relying on type inference
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", column_types={"ColA": "int64", "ColB": "string"})
rule.apply(data)

Parameters:

Name Type Description Default
sql_engine str

A sqlalchemy engine string. This is typically in the form: dialect+driver://username:password@host:port/database. For more information, please refer to the sqlalchemy documentation here: https://docs.sqlalchemy.org/en/20/core/engines.html

In order to support users and passwords in the sql_engine string, substitution of environment variables is supported using the {env.VARIABLE_NAME} form. For example, the USER and PASSWORD environment variables could be included in the engine string as: sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb". At run time, {env.USER}, {env.PASSWORD} and {env.DB_HOST} are replaced with the respective environment variables, so they don't have to be hardcoded in the plan, both for security reasons and for configurability. A similar substitution can be achieved from the plan context using the {context.property} form, e.g. sql_engine = "postgres://{context.USER}:{env.PASSWORD}@{context.DB_HOST}/mydb". It's not recommended to store passwords in plain text in the plan.

required
sql_query str

A SQL SELECT statement that will specify the columns, table and optionally any WHERE, GROUP BY, ORDER BY clauses. The SQL statement must be valid for the SQL engine specified in the sql_engine parameter.

The env and context substitutions work in the sql_query too, e.g.: SELECT * FROM {env.SCHEMA}.{context.TABLE_NAME} WHERE {context.FILTER}. This allows you to parameterize the plan at run time.

required
column_types Optional[Mapping[str, str]]

A mapping of column names and their types. Column types are inferred from the data when this parameter is not specified. For empty result sets, this inference is not possible, so specifying the column types allows the users to control the types in that scenario and not fall back onto backend defaults.

None
batch_size int

An optional batch size (number of rows) to use when reading the results. Default: 50000. Some backends ignore this option; others use it to partition the data.

50000
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
SQLError

raised if there's an error running the sql statement.

UnsupportedTypeError

raised if column_types are specified and any of them are not supported.

Note

The implementation uses sqlalchemy, which must be installed as an optional dependency of etlrules.

Source code in etlrules/backends/common/io/db.py
class ReadSQLQueryRule(BaseRule):
    """ Runs a SQL query and reads the results back into a dataframe.

    Basic usage::

        # reads all the data from a sqlite db called mydb.db, from the table MyTable
        # saves the dataframe as the main output of the rule which subsequent rules can use as their main input
        rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable")
        rule.apply(data)

        # reads all the data from a sqlite db called mydb.db, from the table MyTable
        # saves the dataframe as a named output called MyData which subsequent rules can use by name
        rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", named_output="MyData")
        rule.apply(data)

        # same as the first example, but uses column types rather than relying on type inference
        rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", column_types={"ColA": "int64", "ColB": "string"})
        rule.apply(data)

    Args:
        sql_engine: A sqlalchemy engine string. This is typically in the form:
            dialect+driver://username:password@host:port/database
            For more information, please refer to the sqlalchemy documentation here:
            https://docs.sqlalchemy.org/en/20/core/engines.html

            In order to support users and passwords in the sql_engine string, substitution of environment variables
            is supported using the {env.VARIABLE_NAME} form.
            For example, adding the USER and PASSWORD environment variables in the sql string could be done as:
                sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb"
            In this example, at run time, {env.USER}, {env.PASSWORD} and {env.DB_HOST} are replaced with the respective
            environment variables, allowing you to avoid hardcoding them in the plan, both for security reasons and for
            configurability.
            A similar substitution can be achieved from the plan context using the {context.property} form, e.g.:
                sql_engine = "postgres://{context.USER}:{env.PASSWORD}@{context.DB_HOST}/mydb"
            It's not recommended to store passwords in plain text in the plan.
        sql_query: A SQL SELECT statement that will specify the columns, table and optionally any WHERE, GROUP BY, ORDER BY clauses.
            The SQL statement must be valid for the SQL engine specified in the sql_engine parameter.

            The env and context substitutions work in the sql_query too. E.g.:
                SELECT * from {env.SCHEMA}.{context.TABLE_NAME} WHERE {context.FILTER}
            This allows you to parameterize the plan at run time.
        column_types: A mapping of column names and their types. Column types are inferred from the data when this parameter
            is not specified. For empty result sets, this inference is not possible, so specifying the column types allows
            the users to control the types in that scenario and not fall back onto backend defaults.
        batch_size: An optional batch size (number of rows) to use when reading the results. Default: 50000.
            Some backends ignore this option; others use it to partition the data.

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        SQLError: raised if there's an error running the sql statement.
        UnsupportedTypeError: raised if column_types are specified and any of them are not supported.

    Note:
        The implementation uses sqlalchemy, which must be installed as an optional dependency of etlrules.
    """

    def __init__(self, sql_engine: str, sql_query: str, column_types: Optional[Mapping[str, str]]=None, batch_size: int=50_000, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_output=named_output, name=name, description=description, strict=strict)
        self.sql_engine = sql_engine
        self.sql_query = sql_query
        if not self.sql_engine or not isinstance(self.sql_engine, str):
            raise ValueError("The sql_engine parameter must be a non-empty string.")
        if not self.sql_query or not isinstance(self.sql_query, str):
            raise ValueError("The sql_query parameter must be a non-empty string.")
        self.column_types = column_types
        self._validate_column_types()
        self.batch_size = batch_size

    def _validate_column_types(self):
        if self.column_types is not None:
            for column, column_type in self.column_types.items():
                if column_type not in SUPPORTED_TYPES:
                    raise UnsupportedTypeError(f"Type '{column_type}' for column '{column}' is not supported.")

    def has_input(self):
        return False

    def _do_apply(self, connection):
        raise NotImplementedError("Can't instantiate base class.")

    def _get_sql_engine(self) -> str:
        sql_engine = subst_string(self.sql_engine)
        if not sql_engine:
            raise ValueError("The sql_engine parameter must be a non-empty string.")
        return sql_engine

    def _get_sql_query(self) -> str:
        sql_query = subst_string(self.sql_query)
        if not sql_query:
            raise ValueError("The sql_query parameter must be a non-empty string.")
        return sql_query

    def apply(self, data):
        super().apply(data)
        sql_engine = self._get_sql_engine()
        engine = SQLAlchemyEngines.get_engine(sql_engine)
        with engine.connect() as connection:
            try:
                result = self._do_apply(connection)
            except sa.exc.SQLAlchemyError as exc:
                raise SQLError(str(exc))
        self._set_output_df(data, result)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/io/db.py
def apply(self, data):
    super().apply(data)
    sql_engine = self._get_sql_engine()
    engine = SQLAlchemyEngines.get_engine(sql_engine)
    with engine.connect() as connection:
        try:
            result = self._do_apply(connection)
        except sa.exc.SQLAlchemyError as exc:
            raise SQLError(str(exc))
    self._set_output_df(data, result)
has_input(self)

Returns True if the rule needs a dataframe input to operate on, False otherwise.

By default, it returns True. It should be overridden to return False for those rules which read data into the plan. For example, reading a csv file or reading a table from the DB. These are operations which do not need an input dataframe to operate on as they are sourcing data.

Source code in etlrules/backends/common/io/db.py
def has_input(self):
    return False
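For illustration, a minimal sketch of how the {env.VARIABLE_NAME} substitution described above could behave. This is not the etlrules implementation (the library uses its internal subst_string helper); it is only meant to make the substitution semantics concrete::

    import os
    import re

    # Illustrative re-implementation of the {env.VARIABLE_NAME} substitution;
    # the real etlrules helper (subst_string) may differ in details.
    def subst_env(template: str) -> str:
        def replace(match: re.Match) -> str:
            return os.environ[match.group(1)]  # raises KeyError if the variable is unset
        return re.sub(r"\{env\.([A-Za-z_][A-Za-z0-9_]*)\}", replace, template)

    os.environ["DB_HOST"] = "localhost"  # hypothetical variable, set here so the example runs
    print(subst_env("postgres://user:secret@{env.DB_HOST}/mydb"))
    # -> postgres://user:secret@localhost/mydb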
WriteSQLTableRule (UnaryOpBaseRule)

Writes the data from the input dataframe into a SQL table in a database.

The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to the DB. If the named_input is specified, the input with that name is written, otherwise, it takes the main output of the preceding rule.

Basic usage::

# writes the main input to a table called MyTable in a sqlite DB called mydb.db
# If the table already exists, it replaces it
rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="replace")
rule.apply(data)

# writes the dataframe input called 'input_data' to a table MyTable in a sqlite DB mydb.db
# If the table already exists, it appends the data to it
rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="append", named_input="input_data")
rule.apply(data)

Parameters:

Name Type Description Default
sql_engine str

A sqlalchemy engine string. This is typically in the form: dialect+driver://username:password@host:port/database For more information, please refer to the sqlalchemy documentation here: https://docs.sqlalchemy.org/en/20/core/engines.html

In order to support users and passwords in the sql_engine string, substitution of environment variables is supported using the {env.VARIABLE_NAME} form. For example, adding the USER and PASSWORD environment variables in the sql string could be done as: sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb". At run time, env.USER, env.PASSWORD and env.DB_HOST will be replaced with the respective environment variables, allowing you to avoid hardcoding them in the plan, both for security reasons and for configurability.

required
sql_table str

The name of the sql table to write to.

required
if_exists str

Specifies what to do in case the table already exists in the database. The options are:

- replace: drops all the existing data and inserts the data in the input dataframe
- append: adds the data in the input dataframe to the existing data in the table
- fail: raises a ValueError exception

Default: fail.

'fail'
named_input Optional[str]

Select by name the dataframe to write from the input data. Optional. When not specified, the main output of the previous rule will be written.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True.

True

Exceptions:

Type Description
ValueError

raised if the table already exists when if_exists is fail. ValueError is also raised if any of the arguments passed into the rule are not strings or are empty strings.

SQLError

raised if there's any problem writing the data into the database. For example: If the schema doesn't match the schema of the table written to (for existing tables).

Source code in etlrules/backends/common/io/db.py
class WriteSQLTableRule(UnaryOpBaseRule):
    """ Writes the data from the input dataframe into a SQL table in a database.

    The rule is a final rule, which means it produces no additional outputs, it takes any of the existing
    outputs and writes it to the DB. If the named_input is specified, the input with that name
    is written, otherwise, it takes the main output of the preceding rule.

    Basic usage::

        # writes the main input to a table called MyTable in a sqlite DB called mydb.db
        # If the table already exists, it replaces it
        rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="replace")
        rule.apply(data)

        # writes the dataframe input called 'input_data' to a table MyTable in a sqlite DB mydb.db
        # If the table already exists, it appends the data to it
        rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="append", named_input="input_data")
        rule.apply(data)

    Args:
        sql_engine: A sqlalchemy engine string. This is typically in the form:
            dialect+driver://username:password@host:port/database
            For more information, please refer to the sqlalchemy documentation here:
            https://docs.sqlalchemy.org/en/20/core/engines.html

            In order to support users and passwords in the sql_engine string, substitution of environment variables
            is supported using the {env.VARIABLE_NAME} form.
            For example, adding the USER and PASSWORD environment variables in the sql string could be done as:
                sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb"
            At run time, env.USER, env.PASSWORD and env.DB_HOST will be replaced with the respective
            environment variables, allowing you to avoid hardcoding them in the plan, both for security reasons
            and for configurability.
        sql_table: The name of the sql table to write to.
        if_exists: Specifies what to do in case the table already exists in the database.
            The options are:
                - replace: drops all the existing data and inserts the data in the input dataframe
                - append: adds the data in the input dataframe to the existing data in the table
                - fail: Raises a ValueError exception
            Default: fail.

        named_input (Optional[str]): Select by name the dataframe to write from the input data.
            Optional. When not specified, the main output of the previous rule will be written.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True.

    Raises:
        ValueError: raised if the table already exists when if_exists is fail.
            ValueError is also raised if any of the arguments passed into the rule are not strings or are empty strings.
        SQLError: raised if there's any problem writing the data into the database.
            For example: If the schema doesn't match the schema of the table written to (for existing tables).
    """

    class IF_EXISTS_OPTIONS:
        APPEND = 'append'
        REPLACE = 'replace'
        FAIL = 'fail'

    ALL_IF_EXISTS_OPTIONS = {IF_EXISTS_OPTIONS.APPEND, IF_EXISTS_OPTIONS.REPLACE, IF_EXISTS_OPTIONS.FAIL}

    EXCLUDE_FROM_SERIALIZE = ("named_output", )

    def __init__(self, sql_engine: str, sql_table: str, if_exists: str='fail', named_input=None, name=None, description=None, strict=True):
        super().__init__(named_input=named_input, named_output=None, name=name, description=description, strict=strict)
        self.sql_engine = sql_engine
        if not self.sql_engine or not isinstance(self.sql_engine, str):
            raise ValueError("The sql_engine parameter must be a non-empty string.")
        self.sql_table = sql_table
        if not self.sql_table or not isinstance(self.sql_table, str):
            raise ValueError("The sql_table parameter must be a non-empty string.")
        self.if_exists = if_exists
        if self.if_exists not in self.ALL_IF_EXISTS_OPTIONS:
            raise ValueError(f"'{if_exists}' is not a valid value for the if_exists parameter. It must be one of: '{self.ALL_IF_EXISTS_OPTIONS}'")

    def has_output(self):
        return False

    def _get_sql_engine(self) -> str:
        sql_engine = subst_string(self.sql_engine)
        if not sql_engine:
            raise ValueError("The sql_engine parameter must be a non-empty string.")
        return sql_engine

    def _get_sql_table(self) -> str:
        sql_table = subst_string(self.sql_table)
        if not sql_table:
            raise ValueError("The sql_table parameter must be a non-empty string.")
        return sql_table
has_output(self)

Returns True if the rule produces a dataframe, False otherwise.

By default, it returns True. It should be overridden to return False for those rules which write data out of the plan. For example, writing a file or data into a database. These are operations which do not produce an output dataframe into the plan as they are writing data outside the plan.

Source code in etlrules/backends/common/io/db.py
def has_output(self):
    return False
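Putting the environment substitution and the write rule together, a hedged usage sketch (SQLITE_DB is a hypothetical environment variable name, and data is assumed to be a RuleData instance populated by earlier rules, as in the Basic usage examples above)::

    import os

    os.environ["SQLITE_DB"] = "mydb.db"  # hypothetical variable, set here so the example runs

    # The engine string is resolved at run time via {env.VARIABLE_NAME} substitution,
    # so the database location is not hardcoded in the plan.
    rule = WriteSQLTableRule("sqlite:///{env.SQLITE_DB}", "MyTable", if_exists="replace")
    rule.apply(data)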
files
BaseReadFileRule (BaseRule)
Source code in etlrules/backends/common/io/files.py
class BaseReadFileRule(BaseRule):
    def __init__(self, file_name: str, file_dir: Optional[str]=None, regex: bool=False, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_output=named_output, name=name, description=description, strict=strict)
        self.file_name = file_name
        self.file_dir = file_dir
        self.regex = bool(regex)
        if self._is_uri() and self.regex:
            raise ValueError("Regex read not supported for URIs.")

    def _is_uri(self):
        file_name = self.file_name.lower()
        return file_name.startswith("http://") or file_name.startswith("https://")

    def has_input(self):
        return False

    def _get_full_file_paths(self):
        file_name = subst_string(self.file_name)
        file_dir = subst_string(self.file_dir or "")
        if self.regex:
            pattern = re.compile(file_name)
            for fn in os.listdir(file_dir):
                if pattern.match(fn):
                    yield os.path.join(file_dir, fn)
        else:
            if self._is_uri():
                yield file_name
            else:
                yield os.path.join(file_dir, file_name)

    def do_read(self, file_path: str):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def do_concat(self, left_df, right_df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)

        result = None
        for file_path in self._get_full_file_paths():
            df = self.do_read(file_path)
            if result is None:
                result = df
            else:
                result = self.do_concat(result, df)
        self._set_output_df(data, result)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/io/files.py
def apply(self, data):
    super().apply(data)

    result = None
    for file_path in self._get_full_file_paths():
        df = self.do_read(file_path)
        if result is None:
            result = df
        else:
            result = self.do_concat(result, df)
    self._set_output_df(data, result)
has_input(self)

Returns True if the rule needs a dataframe input to operate on, False otherwise.

By default, it returns True. It should be overridden to return False for those rules which read data into the plan. For example, reading a csv file or reading a table from the DB. These are operations which do not need an input dataframe to operate on as they are sourcing data.

Source code in etlrules/backends/common/io/files.py
def has_input(self):
    return False
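The do_read and do_concat methods are the hooks that concrete backends override. As a hedged sketch, a hypothetical pandas-based JSON-lines reader could be built on this base class as follows (the class name and pandas calls are illustrative; they are not part of etlrules)::

    import pandas as pd

    class ReadJSONLinesFileRule(BaseReadFileRule):
        # Hypothetical reader, shown only to illustrate the two hooks
        # a concrete backend must provide.

        def do_read(self, file_path: str):
            # Read a single file into a dataframe.
            return pd.read_json(file_path, lines=True)

        def do_concat(self, left_df, right_df):
            # Combine partial results when a regex matches multiple files.
            return pd.concat([left_df, right_df], ignore_index=True)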
BaseWriteFileRule (UnaryOpBaseRule)
Source code in etlrules/backends/common/io/files.py
class BaseWriteFileRule(UnaryOpBaseRule):

    EXCLUDE_FROM_SERIALIZE = ("named_output", )

    def __init__(self, file_name, file_dir=".", named_input=None, name=None, description=None, strict=True):
        super().__init__(named_input=named_input, named_output=None, name=name, description=description, strict=strict)
        self.file_name = file_name
        self.file_dir = file_dir

    def has_output(self):
        return False

    def do_write(self, file_name: str, file_dir: str, df) -> None:
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        self.do_write(subst_string(self.file_name), subst_string(self.file_dir), df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/io/files.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    self.do_write(subst_string(self.file_name), subst_string(self.file_dir), df)
has_output(self)

Returns True if the rule produces a dataframe, False otherwise.

By default, it returns True. It should be overridden to return False for those rules which write data out of the plan. For example, writing a file or data into a database. These are operations which do not produce an output dataframe into the plan as they are writing data outside the plan.

Source code in etlrules/backends/common/io/files.py
def has_output(self):
    return False
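Symmetrically, do_write is the single hook a concrete writer backend overrides. A hedged, hypothetical pandas-based sketch (not part of etlrules)::

    import os
    import pandas as pd

    class WriteJSONLinesFileRule(BaseWriteFileRule):
        # Hypothetical writer, shown only to illustrate the override point.

        def do_write(self, file_name: str, file_dir: str, df) -> None:
            # Persist the dataframe outside the plan; the rule produces no output.
            df.to_json(os.path.join(file_dir, file_name), orient="records", lines=True)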
ReadCSVFileRule (BaseReadFileRule)

Reads one or multiple csv files from a directory and persists it as a dataframe for subsequent rules to operate on.

Basic usage::

# reads a file data.csv and persists it as the main output of the rule
rule = ReadCSVFileRule("data.csv", "/home/myuser/")
rule.apply(data)

# reads a file test_data.csv and persists it as the input_data named output
# other rules can specify input_data as their named_input to operate on it
rule = ReadCSVFileRule("test_data.csv", "/home/myuser/", named_output="input_data")
rule.apply(data)

# extracts all files starting with data followed by 4 digits and concatenates them
# e.g. data1234.csv, data5678.csv, etc.
rule = ReadCSVFileRule("data[0-9]{4}.csv", "/home/myuser/", regex=True, named_output="input_data")
rule.apply(data)

Parameters:

Name Type Description Default
file_name str

The name of the csv file to load. The format will be inferred from the extension of the file. A simple text csv file will be inferred from the .csv extension. The extensions like .zip, .gz, .bz2, .xz will extract a single compressed csv file from the given input compressed file. file_name can also be a regular expression (specify regex=True in that case). The reader will find all the files in the file_dir directory that match the regular expression and extract all those csv files and concatenate them into a single dataframe. For example, file_name=".*\.csv", file_dir=".", regex=True will extract all the files with the .csv extension from the current directory. It can also be a URI (e.g. https://example.com/mycsv.csv)

required
file_dir Optional[str]

The file directory where the file_name is located. When file_name is a regular expression and the regex parameter is True, file_dir is the directory that is inspected for any files that match the regular expression. Optional. For files it defaults to . (i.e. the current directory). Ignored for URIs.

None
regex bool

When True, the file_name is interpreted as a regular expression. Defaults to False.

False
separator str

The single character to be used as separator in the csv file. Defaults to , (comma).

','
header bool

When True, the first line is interpreted as the header and the column names are extracted from it. When False, the first line is part of the data and the columns will have names like 0, 1, 2, etc. Defaults to True.

True
skip_header_rows Optional[int]

Optional number of rows to skip at the top of the file, before the header.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
IOError

raised when the file is not found.

Source code in etlrules/backends/common/io/files.py
class ReadCSVFileRule(BaseReadFileRule):
    r""" Reads one or multiple csv files from a directory and persists it as a dataframe for subsequent rules to operate on.

    Basic usage::

        # reads a file data.csv and persists it as the main output of the rule
        rule = ReadCSVFileRule("data.csv", "/home/myuser/")
        rule.apply(data)

        # reads a file test_data.csv and persists it as the input_data named output
        # other rules can specify input_data as their named_input to operate on it
        rule = ReadCSVFileRule("test_data.csv", "/home/myuser/", named_output="input_data")
        rule.apply(data)

        # extracts all files starting with data followed by 4 digits and concatenates them
        # e.g. data1234.csv, data5678.csv, etc.
        rule = ReadCSVFileRule("data[0-9]{4}.csv", "/home/myuser/", regex=True, named_output="input_data")
        rule.apply(data)

    Args:
        file_name: The name of the csv file to load. The format will be inferred from the extension of the file.
            A simple text csv file will be inferred from the .csv extension. The extensions like .zip, .gz, .bz2, .xz
            will extract a single compressed csv file from the given input compressed file.
            file_name can also be a regular expression (specify regex=True in that case).
            The reader will find all the files in the file_dir directory that match the regular expression and extract
            all those csv files and concatenate them into a single dataframe.
            For example, file_name=".*\.csv", file_dir=".", regex=True will extract all the files with the .csv extension
            from the current directory.
            It can also be a URI (e.g. https://example.com/mycsv.csv)
        file_dir: The file directory where the file_name is located. When file_name is a regular expression and
            the regex parameter is True, file_dir is the directory that is inspected for any files that match the
            regular expression. Optional.
            For files it defaults to . (i.e. the current directory). Ignored for URIs.
        regex: When True, the file_name is interpreted as a regular expression. Defaults to False.
        separator: The single character to be used as separator in the csv file. Defaults to , (comma).
        header: When True, the first line is interpreted as the header and the column names are extracted from it.
            When False, the first line is part of the data and the columns will have names like 0, 1, 2, etc.
            Defaults to True.
        skip_header_rows: Optional number of rows to skip at the top of the file, before the header.

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        IOError: raised when the file is not found.
    """

    def __init__(self, file_name: str, file_dir: Optional[str]=None, regex: bool=False, separator: str=",",
                 header: bool=True, skip_header_rows: Optional[int]=None,
                 named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(file_name=file_name, file_dir=file_dir, regex=regex, named_output=named_output, name=name, description=description, strict=strict)
        self.separator = separator
        self.header = header
        self.skip_header_rows = skip_header_rows
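A few more usage sketches combining the documented parameters (file names are hypothetical)::

    # reads a tab-separated file, skipping two banner rows above the header
    rule = ReadCSVFileRule("report.tsv", "/home/myuser/", separator="\t", skip_header_rows=2)
    rule.apply(data)

    # the .gz extension makes the reader extract a single gzipped csv file
    rule = ReadCSVFileRule("data.csv.gz", "/home/myuser/", named_output="input_data")
    rule.apply(data)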
ReadParquetFileRule (BaseReadFileRule)

Reads one or multiple parquet files from a directory and persists it as a dataframe for subsequent rules to operate on.

Basic usage::

# reads a file data.parquet and persists it as the main output of the rule
rule = ReadParquetFileRule("data.parquet", "/home/myuser/")
rule.apply(data)

# reads a file test_data.parquet and persists it as the input_data named output
# other rules can specify input_data as their named_input to operate on it
rule = ReadParquetFileRule("test_data.parquet", "/home/myuser/", named_output="input_data")
rule.apply(data)

# reads all the files with the .parquet extension from the home dir of myuser and
# concatenates them into a single dataframe
rule = ReadParquetFileRule(".*\.parquet", "/home/myuser/", regex=True, named_output="input_data")
rule.apply(data)

# reads only the A,B,C columns from the data.parquet file
rule = ReadParquetFileRule("data.parquet", "/home/myuser/", columns=["A", "B", "C"])
rule.apply(data)

# reads only those rows where column A is greater than or equal to 10 and column B is True
rule = ReadParquetFileRule("data.parquet", "/home/myuser/", filters=[["A", ">=", 10], ["B", "==", True]])
rule.apply(data)

Parameters:

Name Type Description Default
file_name str

The name of the parquet file to load. The format will be inferred from the extension of the file. file_name can also be a regular expression (specify regex=True in that case). The reader will find all the files in the file_dir directory that match the regular expression and extract all those parquet files and concatenate them into a single dataframe. For example, file_name=".*\.parquet", file_dir=".", regex=True will extract all the files with the .parquet extension from the current directory.

required
file_dir str

The file directory where the file_name is located. When file_name is a regular expression and the regex parameter is True, file_dir is the directory that is inspected for any files that match the regular expression. Defaults to . (i.e. the current directory).

'.'
regex bool

When True, the file_name is interpreted as a regular expression. Defaults to False.

False
columns Optional[Sequence[str]]

A subset of the columns in the parquet file to load.

None
filters Union[List[Tuple], List[List[Tuple]]]

A list of filters to apply to filter the rows returned. Rows which do not match the filter conditions will be removed from the scanned data. When passed as a List[List[Tuple]], the conditions in the inner lists are AND-ed together, with the top level conditions OR-ed together. E.g.: ((cond1 AND cond2...) OR (cond3 AND cond4...)...) When passed as a List[Tuple], the conditions are AND-ed together. E.g.: cond1 AND cond2 AND cond3... Each condition is specified as a tuple of 3 elements: (column, operation, value). Column is the name of a column in the input dataframe. Operation is one of: "==", "=", ">", ">=", "<", "<=", "!=", "in", "not in". Value is a scalar value, int, float, string, etc. When the operation is in or not in, the value must be a list, tuple or set of values.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
IOError

raised when the file is not found.

ValueError

raised if filters are specified but the format is incorrect.

MissingColumnError

raised if a column is specified in columns or filters but it doesn't exist in the input dataframe.

Note

The parquet file can be compressed in which case the compression will be inferred from the file. The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".

Source code in etlrules/backends/common/io/files.py
class ReadParquetFileRule(BaseReadFileRule):
    r""" Reads one or multiple parquet files from a directory and persists it as a dataframe for subsequent rules to operate on.

    Basic usage::

        # reads a file data.parquet and persists it as the main output of the rule
        rule = ReadParquetFileRule("data.parquet", "/home/myuser/")
        rule.apply(data)

        # reads a file test_data.parquet and persists it as the input_data named output
        # other rules can specify input_data as their named_input to operate on it
        rule = ReadParquetFileRule("test_data.parquet", "/home/myuser/", named_output="input_data")
        rule.apply(data)

        # reads all the files with the .parquet extension from the home dir of myuser and
        # concatenates them into a single dataframe
        rule = ReadParquetFileRule(".*\.parquet", "/home/myuser/", regex=True, named_output="input_data")
        rule.apply(data)

        # reads only the A,B,C columns from the data.parquet file
        rule = ReadParquetFileRule("data.parquet", "/home/myuser/", columns=["A", "B", "C"])
        rule.apply(data)

        # reads only those rows where column A is greater than or equal to 10 and column B is True
        rule = ReadParquetFileRule("data.parquet", "/home/myuser/", filters=[["A", ">=", 10], ["B", "==", True]])
        rule.apply(data)

    Args:
        file_name: The name of the parquet file to load. The format will be inferred from the extension of the file.
            file_name can also be a regular expression (specify regex=True in that case).
            The reader will find all the files in the file_dir directory that match the regular expression and extract
            all those parquet files and concatenate them into a single dataframe.
            For example, file_name=".*\.parquet", file_dir=".", regex=True will extract all the files with the .parquet extension
            from the current directory.
        file_dir: The file directory where the file_name is located. When file_name is a regular expression and
            the regex parameter is True, file_dir is the directory that is inspected for any files that match the
            regular expression.
            Defaults to . (i.e. the current directory).
        regex: When True, the file_name is interpreted as a regular expression. Defaults to False.
        columns: A subset of the columns in the parquet file to load.
        filters: A list of filters to apply to filter the rows returned. Rows which do not match the filter conditions
            will be removed from scanned data.
            When passed as a List[List[Tuple]], the conditions in the inner lists are AND-ed together, with the top level
            conditions OR-ed together. E.g.: ((cond1 AND cond2...) OR (cond3 AND cond4...)...)
            When passed as a List[Tuple], the conditions are AND-ed together. E.g.: cond1 AND cond2 AND cond3...
            Each condition is specified as a tuple of 3 elements: (column, operation, value).
            Column is the name of a column in the input dataframe.
            Operation is one of: "==", "=", ">", ">=", "<", "<=", "!=", "in", "not in".
            Value is a scalar value, int, float, string, etc. When the operation is in or not in, the value must be a list, tuple or set of values.

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        IOError: raised when the file is not found.
        ValueError: raised if filters are specified but the format is incorrect.
        MissingColumnError: raised if a column is specified in columns or filters but it doesn't exist in the input dataframe.

    Note:
        The parquet file can be compressed in which case the compression will be inferred from the file.
        The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".
    """

    SUPPORTED_FILTERS_OPS = {"==", "=", ">", ">=", "<", "<=", "!=", "in", "not in"}

    def __init__(self, file_name: str, file_dir: str=".", columns: Optional[Sequence[str]]=None, filters:Optional[Union[List[Tuple], List[List[Tuple]]]]=None, regex: bool=False, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(
            file_name=file_name, file_dir=file_dir, regex=regex, named_output=named_output,
            name=name, description=description, strict=strict)
        self.columns = columns
        self.filters = self._get_filters(filters) if filters is not None else None

    def _raise_filters_invalid(self, error: str) -> NoReturn:
        raise ValueError(f"Invalid filters. It must be a List[Tuple] or List[List[Tuple]] with each Tuple being (column, op, value): {error}")

    def _validate_tuple(self, tpl):
        if len(tpl) != 3 or not isinstance(tpl[0], str) or not isinstance(tpl[1], str):
            self._raise_filters_invalid(f"Third level expected a list/tuple (cond, op, value), got: {tpl}.")
        op = tpl[1]
        if op not in self.SUPPORTED_FILTERS_OPS:
            self._raise_filters_invalid(f"Invalid operator {op} in {tpl}. Must be one of: {self.SUPPORTED_FILTERS_OPS}.")
        value = tpl[2]
        if op in ("in", "not in"):
            if not isinstance(value, (list, tuple, set)):
                self._raise_filters_invalid(f"Invalid value type for {value} for {op} operand in {tpl}. Must be list/tuple/set.")
            else:
                value = list(value)
        return (tpl[0], op, value)

    def _get_filters(self, filters):
        lst = []
        if isinstance(filters, (list, tuple)):
            if not filters:
                return None
            for filter2 in filters:
                if isinstance(filter2, (list, tuple)) and filter2:
                    if len(filter2) == 3 and isinstance(filter2[0], str):
                        # List[Tuple] form
                        tpl = self._validate_tuple(filter2)
                        lst.append(tpl)
                    else:
                        lst2 = []
                        for filter3 in filter2:
                            if isinstance(filter3, (list, tuple)) and filter3:
                                tpl = self._validate_tuple(filter3)
                                lst2.append(tpl)
                            else:
                                self._raise_filters_invalid(f"Third level expected a list/tuple, got: {filter3}.")
                        lst.append(lst2)
                else:
                    self._raise_filters_invalid(f"Second level expected a list/tuple, got: {filter2}.")
        else:
            self._raise_filters_invalid(f"Top level expected a list/tuple, got: {filters}")
        return lst
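To make the two filter forms concrete, a hedged sketch of the OR-ed List[List[Tuple]] form described above (column names and values are hypothetical)::

    # ((A >= 10 AND B == True) OR (C in {"x", "y"}))
    rule = ReadParquetFileRule(
        "data.parquet", "/home/myuser/",
        filters=[
            [("A", ">=", 10), ("B", "==", True)],  # inner list: conditions AND-ed together
            [("C", "in", ["x", "y"])],             # outer level: inner lists OR-ed together
        ],
    )
    rule.apply(data)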
WriteCSVFileRule (BaseWriteFileRule)

Writes an existing dataframe to a csv file (optionally compressed) on disk.

The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to disk.

Basic usage::

# writes a file data.csv and persists the main output of the previous rule to it
rule = WriteCSVFileRule("data.csv", "/home/myuser/")
rule.apply(data)

# writes a file test_data.csv and persists the dataframe named input_data into it
rule = WriteCSVFileRule("test_data.csv", "/home/myuser/", named_input="input_data")
rule.apply(data)

Parameters:

Name Type Description Default
file_name str

The name of the csv file to write to disk. It will be written in the directory specified by the file_dir parameter.

required
file_dir str

The file directory where the file_name should be written. Defaults to . (i.e. the current directory).

'.'
separator str

The single character to separate values in the csv file. Defaults to , (comma).

','
header bool

When True, the first line will contain the columns separated by the separator. When False, the columns will not be written and the first line contains data. Defaults to True.

True
compression Optional[str]

Compress the csv file using one of the supported compression algorithms. Optional. When the compression is specified, the file_name must end with the extension associated with that compression format. The following options are supported:

- zip: file_name must end with .zip (e.g. output.csv.zip), will produce a zipped csv file
- gzip: file_name must end with .gz (e.g. output.csv.gz), will produce a gzipped csv file
- bz2: file_name must end with .bz2 (e.g. output.csv.bz2), will produce a bzipped csv file
- xz: file_name must end with .xz (e.g. output.csv.xz), will produce an xz-compressed csv file

None
named_input Optional[str]

Select by name the dataframe to write from the input data. Optional. When not specified, the main output of the previous rule will be written.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True.

True
Source code in etlrules/backends/common/io/files.py
class WriteCSVFileRule(BaseWriteFileRule):
    """ Writes an existing dataframe to a csv file (optionally compressed) on disk.

    The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to disk.

    Basic usage::

        # writes a file data.csv and persists the main output of the previous rule to it
        rule = WriteCSVFileRule("data.csv", "/home/myuser/")
        rule.apply(data)

        # writes a file test_data.csv and persists the dataframe named input_data into it
        rule = WriteCSVFileRule("test_data.csv", "/home/myuser/", named_input="input_data")
        rule.apply(data)

    Args:
        file_name: The name of the csv file to write to disk. It will be written in the directory
            specified by the file_dir parameter.
        file_dir: The file directory where the file_name should be written.
            Defaults to . (i.e. the current directory).
        separator: The single character to separate values in the csv file. Defaults to , (comma).
        header: When True, the first line will contain the columns separated by the separator.
            When False, the columns will not be written and the first line contains data.
            Defaults to True.
        compression: Compress the csv file using one of the supported compression algorithms. Optional.
            When the compression is specified, the file_name must end with the extension associated with that
            compression format. The following options are supported:
            zip - file_name must end with .zip (e.g. output.csv.zip), will produce a zipped csv file
            gzip - file_name must end with .gz (e.g. output.csv.gz), will produce a gzipped csv file
            bz2 - file_name must end with .bz2 (e.g. output.csv.bz2), will produce a bzipped csv file
            xz - file_name must end with .xz (e.g. output.csv.xz), will produce an xz-compressed csv file

        named_input (Optional[str]): Select by name the dataframe to write from the input data.
            Optional. When not specified, the main output of the previous rule will be written.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True.
    """

    COMPRESSIONS = {
        'zip': '.zip',
        'gzip': '.gz',
        'bz2': '.bz2',
        'xz': '.xz',
    }

    def __init__(self, file_name: str, file_dir: str=".", separator: str=",", header: bool=True, compression: Optional[str]=None, named_input: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(
            file_name=file_name, file_dir=file_dir, named_input=named_input, 
            name=name, description=description, strict=strict)
        self.separator = separator
        self.header = header
        assert compression is None or compression in self.COMPRESSIONS.keys(), f"Unsupported compression '{compression}'. It must be one of: {self.COMPRESSIONS.keys()}."
        if compression:
            assert file_name.endswith(self.COMPRESSIONS[compression]), f"The file name {file_name} must have the extension {self.COMPRESSIONS[compression]} when the compression is set to {compression}."
        self.compression = compression
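A usage sketch of the compression option (the file name must carry the matching extension, as documented above)::

    # writes the input_data named output as a gzipped csv file
    rule = WriteCSVFileRule("output.csv.gz", "/home/myuser/", compression="gzip", named_input="input_data")
    rule.apply(data)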
WriteParquetFileRule (BaseWriteFileRule)

Writes an existing dataframe to a parquet file on disk.

The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to disk.

Basic usage::

# writes a file data.parquet and persists the main output of the previous rule to it
rule = WriteParquetFileRule("data.parquet", "/home/myuser/")
rule.apply(data)

# writes a file test_data.parquet and persists the dataframe named input_data into it
rule = WriteParquetFileRule("test_data.parquet", "/home/myuser/", named_input="input_data")
rule.apply(data)

Parameters:

Name Type Description Default
file_name str

The name of the parquet file to write to disk. It will be written in the directory specified by the file_dir parameter.

required
file_dir str

The file directory where the file_name should be written. Defaults to . (i.e. the current directory).

'.'
compression Optional[str]

Compress the parquet file using one of the supported compression algorithms. Optional. The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".

None
named_input Optional[str]

Select by name the dataframe to write from the input data. Optional. When not specified, the main output of the previous rule will be written.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True.

True
Source code in etlrules/backends/common/io/files.py
class WriteParquetFileRule(BaseWriteFileRule):
    """ Writes an existing dataframe to a parquet file on disk.

    The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to disk.

    Basic usage::

        # writes a file data.parquet and persists the main output of the previous rule to it
        rule = WriteParquetFileRule("data.parquet", "/home/myuser/")
        rule.apply(data)

        # writes a file test_data.parquet and persists the dataframe named input_data into it
        rule = WriteParquetFileRule("test_data.parquet", "/home/myuser/", named_input="input_data")
        rule.apply(data)

    Args:
        file_name: The name of the parquet file to write to disk. It will be written in the directory
            specified by the file_dir parameter.
        file_dir: The file directory where the file_name should be written.
            Defaults to . (i.e. the current directory).
        compression: Compress the parquet file using one of the supported compression algorithms. Optional.
            The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".

        named_input (Optional[str]): Select by name the dataframe to write from the input data.
            Optional. When not specified, the main output of the previous rule will be written.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True.
    """

    COMPRESSIONS = ("snappy", "gzip", "brotli", "lz4", "zstd")

    def __init__(self, file_name: str, file_dir: str=".", compression: Optional[str]=None, named_input: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(
            file_name=file_name, file_dir=file_dir, named_input=named_input, 
            name=name, description=description, strict=strict)
        assert compression is None or compression in self.COMPRESSIONS, f"Unsupported compression '{compression}'. It must be one of: {self.COMPRESSIONS}."
        self.compression = compression
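A usage sketch of the compression option (file name hypothetical)::

    # writes the main output of the previous rule as a snappy-compressed parquet file
    rule = WriteParquetFileRule("data.parquet", "/home/myuser/", compression="snappy")
    rule.apply(data)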

joins

BaseJoinRule (BinaryOpBaseRule)
Source code in etlrules/backends/common/joins.py
class BaseJoinRule(BinaryOpBaseRule):

    JOIN_TYPE = None

    def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], key_columns_left: Iterable[str], key_columns_right: Optional[Iterable[str]]=None, suffixes: Iterable[Optional[str]]=(None, "_r"), named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input_left=named_input_left, named_input_right=named_input_right, named_output=named_output, name=name, description=description, strict=strict)
        assert isinstance(key_columns_left, (list, tuple)) and key_columns_left and all(isinstance(col, str) for col in key_columns_left), "JoinRule: key_columns_left must be a non-empty list or tuple of str column names"
        self.key_columns_left = [col for col in key_columns_left]
        self.key_columns_right = [col for col in key_columns_right] if key_columns_right is not None else None
        assert isinstance(suffixes, (list, tuple)) and len(suffixes) == 2 and all(s is None or isinstance(s, str) for s in suffixes), "The suffixes must be a list or tuple of 2 elements"
        self.suffixes = suffixes

    def _get_key_columns(self):
        return self.key_columns_left, self.key_columns_right or self.key_columns_left

    def do_apply(self, left_df, right_df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        assert self.JOIN_TYPE in {"left", "right", "outer", "inner"}
        super().apply(data)
        left_df = self._get_input_df_left(data)
        right_df = self._get_input_df_right(data)
        left_on, right_on = self._get_key_columns()
        if not set(left_on) <= set(left_df.columns):
            raise MissingColumnError(f"Missing columns in join in the left dataframe: {set(left_on) - set(left_df.columns)}")
        if not set(right_on) <= set(right_df.columns):
            raise MissingColumnError(f"Missing columns in join in the right dataframe: {set(right_on) - set(right_df.columns)}")
        df = self.do_apply(left_df, right_df)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/joins.py
def apply(self, data):
    assert self.JOIN_TYPE in {"left", "right", "outer", "inner"}
    super().apply(data)
    left_df = self._get_input_df_left(data)
    right_df = self._get_input_df_right(data)
    left_on, right_on = self._get_key_columns()
    if not set(left_on) <= set(left_df.columns):
        raise MissingColumnError(f"Missing columns in join in the left dataframe: {set(left_on) - set(left_df.columns)}")
    if not set(right_on) <= set(right_df.columns):
        raise MissingColumnError(f"Missing columns in join in the right dataframe: {set(right_on) - set(right_df.columns)}")
    df = self.do_apply(left_df, right_df)
    self._set_output_df(data, df)
InnerJoinRule (BaseJoinRule)

Performs a database-style inner join operation on two data frames.

A join involves two data frames (left_df <join> right_df) with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.

An inner join specifies that only those rows that have key values in both left and right will be copied over and merged into the result data frame. Any rows without corresponding values on the other side (be it left or right) will be dropped from the result.

Examples:

left dataframe::

| A  | B  |
| 1  | a  |
| 2  | b  |

right dataframe::

| A  | C  |
| 1  | c  |
| 3  | d  |

result (key columns=["A"])::

| A  | B  | C  |
| 1  | a  | c  |

Parameters:

Name Type Description Default
named_input_left Optional[str]

Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_input_right Optional[str]

Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
key_columns_left Iterable[str]

A list or tuple of column names to join on (columns in the left data frame)

required
key_columns_right Optional[Iterable[str]]

A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too.

required
suffixes Iterable[Optional[str]]

A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns).

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if any columns (keys) are missing from any of the two input data frames.

Source code in etlrules/backends/common/joins.py
class InnerJoinRule(BaseJoinRule):
    """ Performs a database-style inner join operation on two data frames.

    A join involves two data frames left_df <join> right_df with the result performing a
    database style join or a merge of the two, with the resulting columns coming from both
    dataframes.
    For example, if the left dataframe has two columns A, B and the right dataframe has two
    columns A, C, and assuming A is the key column, the result will have three columns A, B, C.
    The rows that have the same value in the key column A will be merged on the same row in the
    result dataframe.

    An inner join specifies that only those rows that have key values in both left and right
    will be copied over and merged into the result data frame. Any rows without corresponding
    values on the other side (be it left or right) will be dropped from the result.

    Example:

    left dataframe::

        | A  | B  |
        | 1  | a  |
        | 2  | b  |

    right dataframe::

        | A  | C  |
        | 1  | c  |
        | 3  | d  |

    result (key columns=["A"])::

        | A  | B  | C  |
        | 1  | a  | c  |

    Args:
        named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
        key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
            If not set or set to None, the key_columns_left is used on the right dataframe too.
        suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
            result data frame for those columns that have the same name (and are not key columns).

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
    """

    JOIN_TYPE = "inner"
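A usage sketch for the example above (the named outputs left_data and right_data are hypothetical and assumed to have been produced by earlier rules)::

    # joins the "left_data" and "right_data" named outputs on column A;
    # with the default suffixes, clashing non-key columns on the right get the "_r" suffix
    rule = InnerJoinRule("left_data", "right_data", key_columns_left=["A"], named_output="joined")
    rule.apply(data)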
LeftJoinRule (BaseJoinRule)

Performs a database-style left join operation on two data frames.

A join involves two data frames (left_df <join> right_df) with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.

A left join specifies that all the rows in the left dataframe will be present in the result, irrespective of whether there's a corresponding row with the same values in the key columns in the right dataframe. The right columns will be populated with NaNs/None when there is no corresponding row on the right.

Examples:

left dataframe::

| A  | B  |
| 1  | a  |
| 2  | b  |

right dataframe::

| A  | C  |
| 1  | c  |
| 3  | d  |

result (key columns=["A"])::

| A  | B  | C  |
| 1  | a  | c  |
| 2  | b  | NA |

Parameters:

Name Type Description Default
named_input_left Optional[str]

Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_input_right Optional[str]

Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
key_columns_left Iterable[str]

A list or tuple of column names to join on (columns in the left data frame)

required
key_columns_right Optional[Iterable[str]]

A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too.

required
suffixes Iterable[Optional[str]]

A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns).

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if any columns (keys) are missing from any of the two input data frames.

Source code in etlrules/backends/common/joins.py
class LeftJoinRule(BaseJoinRule):
    """ Performs a database-style left join operation on two data frames.

    A join involves two data frames left_df <join> right_df with the result performing a
    database style join or a merge of the two, with the resulting columns coming from both
    dataframes.
    For example, if the left dataframe has two columns A, B and the right dataframe has two
    columns A, C, and assuming A is the key column, the result will have three columns: A, B, C.
    The rows that have the same value in the key column A will be merged on the same row in the
    result dataframe.

    A left join specifies that all the rows in the left dataframe will be present in the result,
    irrespective of whether there's a corresponding row with the same values in the key columns in
    the right dataframe. The right columns will be populated with NaNs/None when there is no
    corresponding row on the right.

    Example:

    left dataframe::

        | A  | B  |
        | 1  | a  |
        | 2  | b  |

    right dataframe::

        | A  | C  |
        | 1  | c  |
        | 3  | d  |

    result (key columns=["A"])::

        | A  | B  | C  |
        | 1  | a  | c  |
        | 2  | b  | NA |

    Args:
        named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
        key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
            If not set or set to None, the key_columns_left is used on the right dataframe too.
        suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
            result data frame for those columns that have the same name (and are not key columns).

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
    """

    JOIN_TYPE = "left"
OuterJoinRule (BaseJoinRule)

Performs a database-style outer join operation on two data frames.

A join involves two data frames, left_df <join> right_df, with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.

An outer join specifies that all the rows in both the left and right dataframes will be present in the result, irrespective of whether there's a corresponding row with the same values in the key columns in the other dataframe. The missing side will have its columns populated with NA when the rows are missing.

Examples:

left dataframe::

| A  | B  |
| 1  | a  |
| 2  | b  |

right dataframe::

| A  | C  |
| 1  | c  |
| 3  | d  |

result (key columns=["A"])::

| A  | B  | C  |
| 1  | a  | c  |
| 2  | b  | NA |
| 3  | NA | d  |

Parameters:

Name Type Description Default
named_input_left Optional[str]

Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_input_right Optional[str]

Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
key_columns_left Iterable[str]

A list or tuple of column names to join on (columns in the left data frame)

required
key_columns_right Optional[Iterable[str]]

A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too.

required
suffixes Iterable[Optional[str]]

A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns).

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if any columns (keys) are missing from any of the two input data frames.

Source code in etlrules/backends/common/joins.py
class OuterJoinRule(BaseJoinRule):
    """ Performs a database-style left join operation on two data frames.

    A join involves two data frames left_df <join> right_df with the result performing a
    database style join or a merge of the two, with the resulting columns coming from both
    dataframes.
    For example, if the left dataframe has two columns A, B and the right dataframe has two
    columns A, C, and assuming A is the key column, the result will have three columns: A, B, C.
    The rows that have the same value in the key column A will be merged on the same row in the
    result dataframe.

    An outer join specifies that all the rows in both the left and right dataframes will be present
    in the result, irrespective of whether there's a corresponding row with the same values in the
    key columns in the other dataframe. The missing side will have its columns populated with NA
    when the rows are missing.

    Example:

    left dataframe::

        | A  | B  |
        | 1  | a  |
        | 2  | b  |

    right dataframe::

        | A  | C  |
        | 1  | c  |
        | 3  | d  |

    result (key columns=["A"])::

        | A  | B  | C  |
        | 1  | a  | c  |
        | 2  | b  | NA |
        | 3  | NA | d  |

    Args:
        named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
        key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
            If not set or set to None, the key_columns_left is used on the right dataframe too.
        suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
            result data frame for those columns that have the same name (and are not key columns).

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
    """

    JOIN_TYPE = "outer"
RightJoinRule (BaseJoinRule)

Performs a database-style right join operation on two data frames.

A join involves two data frames, left_df <join> right_df, with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.

A right join specifies that all the rows in the right dataframe will be present in the result, irrespective of whether there's a corresponding row with the same values in the key columns in the left dataframe. The left columns will be populated with NA when there is no corresponding row on the left.

Examples:

left dataframe::

| A  | B  |
| 1  | a  |
| 2  | b  |

right dataframe::

| A  | C  |
| 1  | c  |
| 3  | d  |

result (key columns=["A"])::

| A  | B  | C  |
| 1  | a  | c  |
| 3  | NA | d  |

Note

A right join is equivalent to a left join with the dataframes inverted, ie: left_df <left_join> right_df is equivalent to right_df <right_join> left_df, although the order of the rows will be different.

Parameters:

Name Type Description Default
named_input_left Optional[str]

Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_input_right Optional[str]

Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule.

required
key_columns_left Iterable[str]

A list or tuple of column names to join on (columns in the left data frame)

required
key_columns_right Optional[Iterable[str]]

A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too.

required
suffixes Iterable[Optional[str]]

A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns).

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if any columns (keys) are missing from any of the two input data frames.

Source code in etlrules/backends/common/joins.py
class RightJoinRule(BaseJoinRule):
    """ Performs a database-style left join operation on two data frames.

    A join involves two data frames left_df <join> right_df with the result performing a
    database style join or a merge of the two, with the resulting columns coming from both
    dataframes.
    For example, if the left dataframe has two columns A, B and the right dataframe has two
    columns A, C, and assuming A is the key column, the result will have three columns: A, B, C.
    The rows that have the same value in the key column A will be merged on the same row in the
    result dataframe.

    A right join specifies that all the rows in the right dataframe will be present in the result,
    irrespective of whether there's a corresponding row with the same values in the key columns in
    the left dataframe. The left columns will be populated with NA when there is no
    corresponding row on the left.

    Example:

    left dataframe::

        | A  | B  |
        | 1  | a  |
        | 2  | b  |

    right dataframe::

        | A  | C  |
        | 1  | c  |
        | 3  | d  |

    result (key columns=["A"])::

        | A  | B  | C  |
        | 1  | a  | c  |
        | 3  | NA | d  |

    Note:
        A right join is equivalent to a left join with the dataframes inverted, ie:
        left_df <left_join> right_df
        is equivalent to
        right_df <right_join> left_df
        although the order of the rows will be different.

    Args:
        named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
        key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
            If not set or set to None, the key_columns_left is used on the right dataframe too.
        suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
            result data frame for those columns that have the same name (and are not key columns).

        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
    """

    JOIN_TYPE = "right"

newcolumns

AddNewColumnRule (UnaryOpBaseRule)

Adds a new column and sets it to the value of an evaluated expression.

Example::

Given df:
| A   | B  |
| 1   | 2  |
| 2   | 3  |
| 3   | 4  |

AddNewColumnRule("Sum", "df['A'] + df['B']").apply(df)

Result::

| A   | B  | Sum |
| 1   | 2  | 3   |
| 2   | 3  | 5   |
| 3   | 4  | 7   |

Parameters:

Name Type Description Default
output_column str

The name of the new column to be added.

required
column_expression str

An expression that gets evaluated and produces the value for the new column. The syntax: df["EXISTING_COL"] can be used in the expression to refer to other columns in the dataframe.

required
column_type Optional[str]

An optional type to convert the result to. If not specified, the type is determined from the output of the expression, which can sometimes differ based on the backend. If the input dataframe is empty, this type ensures the column will be of the specified type, rather than default to string type.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised in strict mode only if a column with the same name already exists in the dataframe.

ExpressionSyntaxError

raised if the column expression has a Python syntax error.

UnsupportedTypeError

raised if the column_type parameter is specified and not supported.

TypeError

raised if an operation is not supported between the types involved, or when the column type is specified but the conversion to that type fails.

NameError

raised if an unknown variable is used.

KeyError

raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])

Note

The implementation will try to use dataframe operations for performance, but when those are not supported it will fall back to row-level operations.

Note

NAs are treated slightly differently between dataframe-level and row-level operations. In dataframe-level operations, NAs will make the result NA. In row-level operations, NAs will generally raise a TypeError. To avoid such behavior, fill the NAs before performing operations.

Source code in etlrules/backends/common/newcolumns.py
class AddNewColumnRule(UnaryOpBaseRule):
    """ Adds a new column and sets it to the value of an evaluated expression.

    Example::

        Given df:
        | A   | B  |
        | 1   | 2  |
        | 2   | 3  |
        | 3   | 4  |

    > AddNewColumnRule("Sum", "df['A'] + df['B']").apply(df)

    Result::

        | A   | B  | Sum |
        | 1   | 2  | 3   |
        | 2   | 3  | 5   |
        | 3   | 4  | 7   |

    Args:
        output_column: The name of the new column to be added.
        column_expression: An expression that gets evaluated and produces the value for the new column.
            The syntax: df["EXISTING_COL"] can be used in the expression to refer to other columns in the dataframe.
        column_type: An optional type to convert the result to. If not specified, the type is determined from the
            output of the expression, which can sometimes differ based on the backend.
            If the input dataframe is empty, this type ensures the column will be of the specified type, rather than
            default to string type.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised in strict mode only if a column with the same name already exists in the dataframe.
        ExpressionSyntaxError: raised if the column expression has a Python syntax error.
        UnsupportedTypeError: raised if the column_type parameter is specified and not supported.
        TypeError: raised if an operation is not supported between the types involved, or when the column type is specified
            but the conversion to that type fails.
        NameError: raised if an unknown variable is used.
        KeyError: raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])

    Note:
        The implementation will try to use dataframe operations for performance, but when those are not supported it
        will fall back to row level operations.

    Note:
        NAs are treated slightly differently between dataframe-level and row-level operations.
        In dataframe-level operations, NAs will make the result NA.
        In row-level operations, NAs will generally raise a TypeError.
        To avoid such behavior, fill the NAs before performing operations.
    """

    EXCLUDE_FROM_COMPARE = ('_column_expression', )

    def __init__(self, output_column: str, column_expression: str, column_type: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.output_column = output_column
        self.column_expression = column_expression
        if column_type is not None and column_type not in SUPPORTED_TYPES:
            raise UnsupportedTypeError(f"Unsupported column type: '{column_type}'. It must be one of: {SUPPORTED_TYPES}")
        self.column_type = column_type
        self._column_expression = self.get_column_expression()

    def _validate_columns(self, df_columns):
        if self.strict and self.output_column in df_columns:
            raise ColumnAlreadyExistsError(f"Column {self.output_column} already exists in the input dataframe.")
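
The NA caveat from the notes above can be sketched in plain pandas (illustrative only; the column names are made up)::

    import pandas as pd

    df = pd.DataFrame({"A": [1, None, 3], "B": [2, 3, 4]})

    # dataframe-level arithmetic propagates NA: the middle row's Sum is NA
    df["Sum"] = df["A"] + df["B"]

    # filling the NAs first avoids NA results (and TypeErrors on row-level paths)
    df["Sum_filled"] = df["A"].fillna(0) + df["B"]
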
AddRowNumbersRule (UnaryOpBaseRule)

Adds a new column with row numbers.

Example::

Given df:
| A   | B  |
| 1   | 2  |
| 2   | 3  |
| 3   | 4  |

AddRowNumbersRule("Row_Number").apply(df)

Result::

| A   | B  | Row_Number |
| 1   | 2  | 0          |
| 2   | 3  | 1          |
| 3   | 4  | 2          |

Parameters:

Name Type Description Default
output_column str

The name of the new column to be added.

required
start int

The value to start the numbers from. Defaults to 0.

0
step int

The increment to be used between row numbers. Defaults to 1.

1
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
ColumnAlreadyExistsError

raised in strict mode only if a column with the same name already exists in the dataframe.

Source code in etlrules/backends/common/newcolumns.py
class AddRowNumbersRule(UnaryOpBaseRule):
    """ Adds a new column with row numbers.

    Example::

        Given df:
        | A   | B  |
        | 1   | 2  |
        | 2   | 3  |
        | 3   | 4  |

    > AddRowNumbersRule("Row_Number").apply(df)

    Result::

        | A   | B  | Row_Number |
        | 1   | 2  | 0          |
        | 2   | 3  | 1          |
        | 3   | 4  | 2          |

    Args:
        output_column: The name of the new column to be added.
        start: The value to start the numbers from. Defaults to 0.
        step: The increment to be used between row numbers. Defaults to 1.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised in strict mode only if a column with the same name already exists in the dataframe.
    """

    def __init__(self, output_column: str, start: int=0, step: int=1, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.output_column = output_column
        self.start = start
        self.step = step

    def _validate_columns(self, df_columns):
        if self.strict and self.output_column in df_columns:
            raise ColumnAlreadyExistsError(f"Column {self.output_column} already exists in the input dataframe.")
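
Since each row number is simply start + i * step for the i-th row, the two parameters can, for example, produce 1-based numbering (a short sketch using the documented constructor parameters)::

    # 1-based row numbers: 1, 2, 3, ...
    rule = AddRowNumbersRule("Row_Number", start=1, step=1)
    rule.apply(df)

    # spaced numbering is also possible, e.g. 0, 10, 20, ...
    rule = AddRowNumbersRule("Row_Number", start=0, step=10)
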

numeric

AbsRule (BaseAssignColumnRule)

Converts numbers to absolute values.

Basic usage::

rule = AbsRule("col_A")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The name of the column to convert to absolute values.

required
output_column Optional[str]

An optional new column with the absolute values. If provided, the existing column is unchanged and a new column is created with the absolute values. If not provided, the result is updated in place.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/numeric.py
class AbsRule(BaseAssignColumnRule):
    """ Converts numbers to absolute values.

    Basic usage::

        rule = AbsRule("col_A")
        rule.apply(data)

    Args:
        input_column (str): The name of the column to convert to absolute values.
        output_column (Optional[str]): An optional new column with the absolute values.
            If provided, the existing column is unchanged and a new column is created with the absolute values.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """
RoundRule (BaseAssignColumnRule)

Rounds a set of columns to specified decimal places.

Basic usage::

# rounds Col_A to 2dps
rule = RoundRule("Col_A", 2)
rule.apply(data)

# rounds Col_B to 0dps and outputs the results into Col_C; Col_B remains unchanged
rule = RoundRule("Col_B", 0, output_column="Col_C")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A column with values to round as per the specified scale.

required
scale Union[int, Sequence[int]]

An integer specifying the number of decimal places to round to.

required
output_column Optional[str]

An optional name for a new column with the rounded values. If provided, the existing column is unchanged and the new column is created with the results. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/numeric.py
class RoundRule(BaseAssignColumnRule):
    """ Rounds a set of columns to specified decimal places.

    Basic usage::

        # rounds Col_A to 2dps
        rule = RoundRule("Col_A", 2)
        rule.apply(data)

        # rounds Col_B to 0dps and outputs the results into Col_C; Col_B remains unchanged
        rule = RoundRule("Col_B", 0, output_column="Col_C")
        rule.apply(data)

    Args:
        input_column: A column with values to round as per the specified scale.
        scale: An integer specifying the number of decimal places to round to.
        output_column (Optional[str]): An optional name for a new column with the rounded values.
            If provided, the existing column is unchanged and the new column is created with the results.
            If not provided, the result is updated in place.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

    def __init__(self, input_column: str, scale: Union[int, Sequence[int]], output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert isinstance(scale, int), "scale must be an integer value"
        self.scale = scale

strings

StrCapitalizeRule (BaseAssignColumnRule)

Converts the values in a string column to capitalized values.

Capitalization will convert the first letter in the string to upper case and the rest of the letters to lower case.

Basic usage::

rule = StrCapitalizeRule("col_A")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A string column with the values to capitalize.

required
output_column Optional[str]

An optional new name for the column with the capitalized values. If provided, the existing column is unchanged, and a new column is created with the capitalized values. If not provided, the result is updated in place.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrCapitalizeRule(BaseAssignColumnRule):
    """ Converts the values in a string column to capitalized values.

    Capitalization will convert the first letter in the string to upper case and the rest of the letters
    to lower case.

    Basic usage::

        rule = StrCapitalizeRule("col_A")
        rule.apply(data)

    Args:
        input_column (str): A string column with the values to capitalize.
        output_column (Optional[str]): An optional new name for the column with the capitalized values.
            If provided, the existing column is unchanged, and a new column is created with the capitalized values.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """
StrExtractRule (UnaryOpBaseRule, ColumnsInOutMixin)

Extract substrings from strings columns using regular expressions.

Basic usage::

# extracts the number between start_ and _end
# ie: for an input value of start_1234_end - will extract 1234 in col_A
rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end")
rule.apply(data)

# extracts with multiple groups, extracting the single digit at the end as well
# for an input value of start_1234_end_9, col_1 will extract 1234, col_2 will extract 9
rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end_([\d])", output_columns=["col_1", "col_2"])
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A column to extract data from.

required
regular_expression str

The regular expression used to extract data. The regular expression must have 1 or more groups - ie sections between brackets. The groups do the actual extraction of data. If there is a single group, then the column can be modified in place (ie no output_columns are needed) but if there are multiple groups, then output_columns must be specified as each group will be extracted in a new output column.

required
keep_original_value bool

Only used in case there isn't a match, and it specifies whether NA or the original value should be used in the output. Default: False. If the regular expression has multiple groups and therefore multiple output_columns, only the first output column will keep the original value; the rest will be populated with NA.

False
output_columns Optional[Iterable[str]]

A list of new names for the result columns. Optional. If provided, it must have one output_column per regular expression group. For example, given the regular expression "a_([\d])_([\d])" with 2 groups, then the output columns must have 2 columns (one per group) - for example ["out_1", "out_2"]. The existing columns are unchanged, and new columns are created with extracted values. If not provided, the result is updated in place (only possible if the regular expression has a single group).

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if an output_column already exists in the dataframe.

ValueError

raised if output_columns is provided and not the same length as the number of groups in the regular expression.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrExtractRule(UnaryOpBaseRule, ColumnsInOutMixin):
    r""" Extract substrings from strings columns using regular expressions.

    Basic usage::

        # extracts the number between start_ and _end
        # ie: for an input value of start_1234_end - will extract 1234 in col_A
        rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end")
        rule.apply(data)

        # extracts with multiple groups, extracting the single digit at the end as well
        # for an input value of start_1234_end_9, col_1 will extract 1234, col_2 will extract 9
        rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end_([\d])", output_columns=["col_1", "col_2"])
        rule.apply(data)

    Args:
        input_column (str): A column to extract data from.
        regular_expression: The regular expression used to extract data.
            The regular expression must have 1 or more groups - ie sections between brackets.
            The groups do the actual extraction of data.
            If there is a single group, then the column can be modified in place (ie no output_columns are needed) but
            if there are multiple groups, then output_columns must be specified as each group will be extracted in a new
            output column.
        keep_original_value: Only used in case there isn't a match, and it specifies whether NA or the original value should be used in the output.
            Default: False.
            If the regular expression has multiple groups and therefore multiple output_columns, only the first output column
            will keep the original value; the rest will be populated with NA.
        output_columns (Optional[Iterable[str]]): A list of new names for the result columns.
            Optional. If provided, it must have one output_column per regular expression group.
            For example, given the regular expression "a_([\d])_([\d])" with 2 groups, then
            the output columns must have 2 columns (one per group) - for example ["out_1", "out_2"].
            The existing columns are unchanged, and new columns are created with extracted values.
            If not provided, the result is updated in place (only possible if the regular expression has a single group).

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if an output_column already exists in the dataframe.
        ValueError: raised if output_columns is provided and not the same length as the number of groups in the regular expression.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

    def __init__(self, input_column: str, regular_expression: str, keep_original_value: bool=False, output_columns:Optional[Iterable[str]]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        self.input_column = input_column
        self.output_columns = [out_col for out_col in output_columns] if output_columns else None
        self.regular_expression = regular_expression
        self._compiled_expr = re.compile(regular_expression)
        groups = self._compiled_expr.groups
        assert groups > 0, "The regular expression must have at least 1 group - ie a section in () - which gets extracted."
        if self.output_columns is not None:
            if len(self.output_columns) != groups:
                raise ValueError(f"The regular expression has {groups} group(s), the output_columns must have {groups} column(s).")
        if groups > 1 and self.output_columns is None:
            raise ValueError(f"The regular expression has more than 1 groups in which case output_columns must be specified (one per group).")
        self.keep_original_value = keep_original_value
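
What keep_original_value controls can be sketched in plain pandas terms (an illustration of the documented behaviour, not the rule's actual code)::

    import pandas as pd

    s = pd.Series(["start_1234_end", "no match here"])
    extracted = s.str.extract(r"start_([\d]*)_end")[0]

    # keep_original_value=False: unmatched rows are NA
    # keep_original_value=True: unmatched rows keep the original string
    kept = extracted.fillna(s)
    print(kept.tolist())  # ['1234', 'no match here']
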
StrLowerRule (BaseAssignColumnRule)

Converts the values in a string column to lower case.

Basic usage::

rule = StrLowerRule("col_A")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A string column to convert to lower case.

required
output_column Optional[str]

An optional new name for the column with the lower case values. If provided, the existing column is unchanged, and a new column is created with the lower case values. If not provided, the result is updated in place.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

required
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

required
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

required
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

required
strict bool

When set to True, the rule does a stricter validation. Default: True

required

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrLowerRule(BaseAssignColumnRule):
    """ Converts the values in a string column to lower case.

    Basic usage::

        rule = StrLowerRule("col_A")
        rule.apply(data)

    Args:
        input_column (str): A string column to convert to lower case.
        output_column (Optional[str]): An optional new name for the column with the lower case values.
            If provided, the existing column is unchanged, and a new column is created with the lower case values.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """
StrPadRule (BaseAssignColumnRule)

Makes strings of a given width (justifies) by padding left or right with a fill character.

Basic usage::

# a value of ABCD will become ABCD....
rule = StrPadRule("col_A", width=8, fill_character=".", how="right")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A string column to be padded.

required
width int

Pad with the fill_character to this width.

required
fill_character str

Character to fill with. Defaults to whitespace.

required
how Literal['left', 'right']

How should the padding be done. One of left or right. Left pads at the beginning of the string, right pads at the end. Default: left.

'left'
output_column Optional[str]

An optional new column with the padded results. If provided, the existing column is unchanged and a new column is created with the results. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrPadRule(BaseAssignColumnRule):
    """ Makes strings of a given width (justifies) by padding left or right with a fill character.

    Basic usage::

        # a value of ABCD will become ABCD....
        rule = StrPadRule("col_A", width=8, fill_character=".", how="right")
        rule.apply(data)

    Args:
        input_column (str): A string column to be padded.
        width: Pad with the fill_character to this width.
        fill_character: Character to fill with.
        how: How should the padding be done. One of left or right.
            Left pads at the beginning of the string, right pads at the end. Default: left.
        output_column (Optional[str]): An optional new column with the padded results.
            If provided, the existing column is unchanged and a new column is created with the results.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

    PAD_LEFT = 'left'
    PAD_RIGHT = 'right'

    def __init__(self, input_column: str, width: int, fill_character: str, how: Literal[PAD_LEFT, PAD_RIGHT]=PAD_LEFT, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        assert how in (self.PAD_LEFT, self.PAD_RIGHT), f"Unknown how parameter {how}. It must be one of: {(self.PAD_LEFT, self.PAD_RIGHT)}"
        self.how = how
        self.width = width
        self.fill_character = fill_character
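
For illustration, a minimal end-to-end sketch of the rule in use (assuming the pandas backend; the RuleData import path and constructor are assumptions here, while the rule's own signature follows the source above)::

import pandas as pd

from etlrules.backends.pandas import StrPadRule
from etlrules.data import RuleData  # assumed import path

df = pd.DataFrame({"col_A": ["ABCD", "XY"]}, dtype="string")
data = RuleData(df)  # assumed: wraps the dataframe as the main (unnamed) output

# right-pad with '.' to width 8: "ABCD" -> "ABCD....", "XY" -> "XY......"
rule = StrPadRule("col_A", width=8, fill_character=".", how="right")
rule.apply(data)
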
StrSplitRejoinRule (BaseAssignColumnRule)

Splits the values in a string column into an array of substrings based on a string separator, then rejoins them with a new separator, optionally sorting the substrings first.

Note

The output is an array of substrings which can optionally be limited via the limit parameter to only include the first limit substrings.

Basic usage::

# splits the col_A column on ,
# "b,d;a,c" will be split and rejoined as "b|c|d;a"
rule = StrSplitRejoinRule("col_A", separator=",", new_separator="|", sort="ascending")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

The column to split and rejoin.

required
separator str

A literal value to split the string by.

required
limit Optional[int]

A limit to the number of substrings. If specified, only the first limit substrings are returned, plus an additional remainder. At most limit + 1 substrings are returned, with the last being the remainder.

None
new_separator str

A new separator used to rejoin the substrings.

','
sort Optional[Literal['ascending', 'descending']]

Optionally sorts the substrings before rejoining using the new_separator. It can be set to either ascending or descending, sorting the substrings accordingly. When the value is set to None, there is no sorting.

None
output_column Optional[str]

An optional new column to hold the result. If provided, the existing column is unchanged and a new column is created with the result. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrSplitRejoinRule(BaseAssignColumnRule):
    """ Splits the values in a string column into an array of substrings based on a string separator then rejoin with a new separator, optionally sorting the substrings.

    Note:
        The output is an array of substrings which can optionally be limited via the limit parameter to only
        include the first <limit> number of substrings.

    Basic usage::

        # splits the col_A column on ,
        # "b,d;a,c" will be split and rejoined as "b|c|d;a"
        rule = StrSplitRejoinRule("col_A", separator=",", new_separator="|", sort="ascending")
        rule.apply(data)

    Args:
        input_column (str): The column to split and rejoin.
        separator: A literal value to split the string by.
        limit: A limit to the number of substrings. If specified, only the first <limit> substrings are returned
            plus an additional remainder. At most, limit + 1 substrings are returned with the last being the remainder.
        new_separator: A new separator used to rejoin the substrings.
        sort: Optionally sorts the substrings before rejoining using the new_separator.
            It can be set to either ascending or descending, sorting the substrings accordingly.
            When the value is set to None, there is no sorting.
        output_column (Optional[str]): An optional new column to hold the result.
            If provided, the existing column is unchanged and a new column is created with the result.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

    SORT_ASCENDING = "ascending"
    SORT_DESCENDING = "descending"

    def __init__(self, input_column: str, separator: str, limit:Optional[int]=None, new_separator:str=",", sort:Optional[Literal[SORT_ASCENDING, SORT_DESCENDING]]=None, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        assert separator and isinstance(separator, str)
        self.separator = separator
        self.limit = limit
        assert isinstance(new_separator, str) and new_separator
        self.new_separator = new_separator
        assert sort in (None, self.SORT_ASCENDING, self.SORT_DESCENDING)
        self.sort = sort
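
To make the per-value semantics concrete, here is a plain-Python sketch of what the rule computes for each value (an illustration of the documented behaviour, not the backend implementation)::

def split_rejoin(value, separator, new_separator=",", sort=None, limit=None):
    # split (honouring the optional limit), optionally sort, then rejoin
    parts = value.split(separator) if limit is None else value.split(separator, limit)
    if sort == "ascending":
        parts = sorted(parts)
    elif sort == "descending":
        parts = sorted(parts, reverse=True)
    return new_separator.join(parts)

# matches the docstring example above
assert split_rejoin("b,d;a,c", ",", new_separator="|", sort="ascending") == "b|c|d;a"
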
StrSplitRule (BaseAssignColumnRule)

Splits a string into an array of substrings based on a string separator.

Note

The output is an array of substrings which can optionally be limited via the limit parameter to only include the first limit substrings. If you need the output to be a string, perhaps joined on a different separator and optionally sorted, then use the StrSplitRejoinRule rule.

Basic usage::

# splits the col_A column on ,
# "a,b;c,d" will be split as ["a", "b;c", "d"]
rule = StrSplitRule("col_A", separator=",")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A string column to split.

required
separator str

A literal value to split the string by.

required
limit Optional[int]

A limit to the number of substrings. If specified, only the first limit substrings are returned, plus an additional remainder. At most limit + 1 substrings are returned, with the last being the remainder.

None
output_column Optional[str]

An optional column to hold the result of the split. If provided, the existing column is unchanged and a new column is created with the result. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if the input_column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrSplitRule(BaseAssignColumnRule):
    """ Splits a string into an array of substrings based on a string separator.

    Note:
        The output is an array of substrings which can optionally be limited via the limit parameter to only
        include the first <limit> number of substrings.
        If you need the output to be a string, perhaps joined on a different separator and optionally sorted
        then use the StrSplitRejoinRule rule.

    Basic usage::

        # splits the col_A column on ,
        # "a,b;c,d" will be split as ["a", "b;c", "d"]
        rule = StrSplitRule("col_A", separator=",")
        rule.apply(data)

    Args:
        input_column (str): A string column to split.
        separator: A literal value to split the string by.
        limit: A limit to the number of substrings. If specified, only the first <limit> substrings are returned
            plus an additional remainder. At most, limit + 1 substrings are returned with the last being the remainder.
        output_column (Optional[str]): An optional column to hold the result of the split.
            If provided, the existing column is unchanged and a new column is created with the result.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

    def __init__(self, input_column: str, separator: str, limit: Optional[int]=None, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        assert separator and isinstance(separator, str)
        self.separator = separator
        self.limit = limit
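
The limit semantics appear to mirror Python's str.split with maxsplit: at most limit + 1 substrings, the last being the unsplit remainder. A quick illustration::

"a,b;c,d".split(",")     # ['a', 'b;c', 'd'] -- the docstring example above
"a,b,c,d".split(",", 2)  # ['a', 'b', 'c,d'] -- limit=2 keeps the remainder intact
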
StrStripRule (BaseAssignColumnRule)

Strips leading, trailing, or both leading and trailing whitespace or other characters from the values in the input column.

Basic usage::

rule = StrStripRule("col_A", how="both")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

An input column to strip characters from its values.

required
how Literal['left', 'right', 'both']

How should the stripping be done. One of left, right or both. Left strips leading characters, right strips trailing characters, and both strips at both ends.

'both'
characters Optional[str]

If set, it contains a list of characters to be stripped. When not specified or when set to None, whitespace is removed.

None
output_column Optional[str]

An optional new column to hold the results. If provided, the existing column is unchanged and a new column is created with the results. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrStripRule(BaseAssignColumnRule):
    """ Strips leading, trailing or both whitespaces or other characters from the values in the input column.

    Basic usage::

        rule = StrStripRule("col_A", how="both")
        rule.apply(data)

    Args:
        input_column (str): An input column to strip characters from its values.
        how: How should the stripping be done. One of left, right or both.
            Left strips leading characters, right strips trailing characters, and both strips at both ends.
        characters: If set, it contains a list of characters to be stripped.
            When not specified or when set to None, whitespace is removed.
        output_column (Optional[str]): An optional new column to hold the results.
            If provided, the existing column is unchanged and a new column is created with the results.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

    STRIP_LEFT = 'left'
    STRIP_RIGHT = 'right'
    STRIP_BOTH = 'both'

    def __init__(self, input_column: str, how: Literal[STRIP_LEFT, STRIP_RIGHT, STRIP_BOTH]=STRIP_BOTH, characters: Optional[str]=None, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, 
                         name=name, description=description, strict=strict)
        assert how in (self.STRIP_BOTH, self.STRIP_LEFT, self.STRIP_RIGHT), f"Unknown how parameter {how}. It must be one of: {(self.STRIP_BOTH, self.STRIP_LEFT, self.STRIP_RIGHT)}"
        self.how = how
        self.characters = characters or None
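
Per value, the documented behaviour corresponds to Python's strip family of methods (a sketch of the semantics, not the backend code)::

"  x  ".strip()      # how="both",  characters=None -> whitespace stripped -> 'x'
"--x--".lstrip("-")  # how="left",  characters="-"  -> 'x--'
"--x--".rstrip("-")  # how="right", characters="-"  -> '--x'
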
StrUpperRule (BaseAssignColumnRule)

Converts the values in a string column to upper case.

Basic usage::

rule = StrUpperRule("col_A")
rule.apply(data)

Parameters:

Name Type Description Default
input_column str

A string column to convert to upper case.

required
output_column Optional[str]

An optional new name for the column with the upper case values. If provided, the existing column is unchanged, and a new column is created with the upper case values. If not provided, the result is updated in place.

None
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised if a column doesn't exist in the input dataframe.

ColumnAlreadyExistsError

raised in strict mode only if the output_column already exists in the dataframe.

Note

In non-strict mode, the overwriting of existing columns is ignored.

Source code in etlrules/backends/common/strings.py
class StrUpperRule(BaseAssignColumnRule):
    """ Converts the values in a string columns to upper case.

    Basic usage::

        rule = StrUpperRule("col_A")
        rule.apply(data)

    Args:
        input_column (str): A string column to convert to upper case.
        output_column (Optional[str]): An optional new name for the column with the upper case values.
            If provided, the existing column is unchanged, and a new column is created with the upper case values.
            If not provided, the result is updated in place.

        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if a column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, the overwriting of existing columns is ignored.
    """

types

TypeConversionRule (UnaryOpBaseRule)

Converts the type of a given set of columns to other types.

Basic usage::

# converts column A to int64, B to string and C to datetime
rule = TypeConversionRule({"A": "int64", "B": "string", "C": "datetime"})
rule.apply(data)

Parameters:

Name Type Description Default
mapper Mapping[str, str]

A dict with column names as keys and the new types as values. The supported types are: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64, string, boolean, datetime and timedelta.

required
named_input Optional[str]

Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule.

None
named_output Optional[str]

Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

None
name Optional[str]

Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

None
description Optional[str]

Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

None
strict bool

When set to True, the rule does a stricter validation. Default: True

True

Exceptions:

Type Description
MissingColumnError

raised when a column specified in the mapper doesn't exist in the input data frame.

UnsupportedTypeError

raised when an unknown type is specified in the values of the mapper.

ValueError

raised in strict mode if a value cannot be converted to the desired type. In non-strict mode, the exception is not raised and the value is converted to NA.

Source code in etlrules/backends/common/types.py
class TypeConversionRule(UnaryOpBaseRule):
    """ Converts the type of a given set of columns to other types.

    Basic usage::

        # converts column A to int64, B to string and C to datetime
        rule = TypeConversionRule({"A": "int64", "B": "string", "C": "datetime"})
        rule.apply(data)

    Args:
        mapper: A dict with column names as keys and the new types as values.
            The supported types are: int8, int16, int32, int64, uint8, uint16,
            uint32, uint64, float32, float64, string, boolean, datetime and timedelta.

        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised when a column specified in the mapper doesn't exist in the input data frame.
        UnsupportedTypeError: raised when an unknown type is specified in the values of the mapper.
        ValueError: raised in strict mode if a value cannot be converted to the desired type.
            In non-strict mode, the exception is not raised and the value is converted to NA.
    """

    def __init__(self, mapper: Mapping[str, str], named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        assert isinstance(mapper, dict), "mapper needs to be a dict {column_name:type}"
        assert all(isinstance(key, str) and isinstance(val, str) for key, val in mapper.items()), "mapper needs to be a dict {column_name:type} where the names are str"
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.mapper = mapper
        for column_name, type_str in self.mapper.items():
            if type_str not in SUPPORTED_TYPES:
                raise UnsupportedTypeError(f"Type '{type_str}' for column '{column_name}' is not currently supported.")


    def do_type_conversion(self, df, col, dtype):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        columns_set = set(df.columns)
        for column_name in self.mapper:
            if column_name not in columns_set:
                raise MissingColumnError(f"Column '{column_name}' is missing in the data frame. Available columns: {sorted(columns_set)}")
        df = self.assign_do_apply_dict(df, {
            column_name: self.do_type_conversion(df, df[column_name], type_str) 
                for column_name, type_str in self.mapper.items()
        })
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/common/types.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    columns_set = set(df.columns)
    for column_name in self.mapper:
        if column_name not in columns_set:
            raise MissingColumnError(f"Column '{column_name}' is missing in the data frame. Available columns: {sorted(columns_set)}")
    df = self.assign_do_apply_dict(df, {
        column_name: self.do_type_conversion(df, df[column_name], type_str) 
            for column_name, type_str in self.mapper.items()
    })
    self._set_output_df(data, df)
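
A usage sketch showing strict vs non-strict conversion (the rule's signature follows the source above; the RuleData import path and constructor are assumptions)::

import pandas as pd

from etlrules.backends.pandas import TypeConversionRule
from etlrules.data import RuleData  # assumed import path

data = RuleData(pd.DataFrame({"A": ["1", "2", "oops"]}))  # assumed constructor

# strict mode (default): "oops" cannot become int64, so a ValueError is raised;
# with strict=False the unconvertible value is converted to NA instead
rule = TypeConversionRule({"A": "int64"}, strict=False)
rule.apply(data)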

dask special

basic

ExplodeValuesRule (ExplodeValuesRule)
Source code in etlrules/backends/dask/basic.py
class ExplodeValuesRule(ExplodeValuesRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_input_column(df)
        result = df.explode(self.input_column)
        if self.column_type:
            result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
        self._set_output_df(data, result)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/basic.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_input_column(df)
    result = df.explode(self.input_column)
    if self.column_type:
        result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
    self._set_output_df(data, result)
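
A hedged usage sketch (assuming input_column holds list-like values; the constructor parameter names are inferred from the attributes used above)::

from etlrules.backends.dask import ExplodeValuesRule

# a row with col_A == [1, 2, 3] becomes three rows with col_A == 1, 2 and 3;
# column_type optionally casts the exploded column (via MAP_TYPES above)
rule = ExplodeValuesRule("col_A", column_type="int64")
rule.apply(data)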

conditions

FilterRule (FilterRule)
Source code in etlrules/backends/dask/conditions.py
class FilterRule(FilterRuleBase):

    def get_condition_expression(self):
        return Expression(self.condition_expression, filename="FilterRule.py")

    def apply(self, data):
        df = self._get_input_df(data)
        cond_series = self._condition_expression.eval(df)
        if self.discard_matching_rows:
            cond_series = ~cond_series
        self._set_output_df(data, df[cond_series].reset_index(drop=True))
        if self.named_output_discarded:
            data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/conditions.py
def apply(self, data):
    df = self._get_input_df(data)
    cond_series = self._condition_expression.eval(df)
    if self.discard_matching_rows:
        cond_series = ~cond_series
    self._set_output_df(data, df[cond_series].reset_index(drop=True))
    if self.named_output_discarded:
        data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
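
A hedged usage sketch (parameter names inferred from the attributes used in apply(); the expression syntax referencing df is an assumption)::

from etlrules.backends.dask import FilterRule

# keep rows where the condition holds; route the discarded rows
# to a named output for later inspection
rule = FilterRule(
    condition_expression="df['price'] > 100",  # assumed expression syntax
    named_output_discarded="discarded",
)
rule.apply(data)
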
IfThenElseRule (IfThenElseRule)
Source code in etlrules/backends/dask/conditions.py
class IfThenElseRule(IfThenElseRuleBase):

    def get_condition_expression(self):
        return Expression(self.condition_expression, filename=f'{self.output_column}.py')

    def apply(self, data):
        df = self._get_input_df(data)
        df_columns = set(df.columns)
        self._validate_columns(df_columns)
        cond_series = self._condition_expression.eval(df)
        then_value = self.then_value if self.then_value is not None else df[self.then_column]
        else_value = self.else_value if self.else_value is not None else df[self.else_column]
        df = df.assign(**{self.output_column: then_value})
        df = df.assign(**{self.output_column: df[self.output_column].where(cond_series, else_value)})
        if (isinstance(then_value, str) or isinstance(else_value, str)) and len(df.index) == 0:
            df = df.astype({self.output_column: "string"})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/conditions.py
def apply(self, data):
    df = self._get_input_df(data)
    df_columns = set(df.columns)
    self._validate_columns(df_columns)
    cond_series = self._condition_expression.eval(df)
    then_value = self.then_value if self.then_value is not None else df[self.then_column]
    else_value = self.else_value if self.else_value is not None else df[self.else_column]
    df = df.assign(**{self.output_column: then_value})
    df = df.assign(**{self.output_column: df[self.output_column].where(cond_series, else_value)})
    if (isinstance(then_value, str) or isinstance(else_value, str)) and len(df.index) == 0:
        df = df.astype({self.output_column: "string"})
    self._set_output_df(data, df)
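
A hedged usage sketch (parameter names inferred from the attributes used in apply() above; the expression syntax is an assumption)::

from etlrules.backends.dask import IfThenElseRule

# output_column gets then_value where the condition is True, else_value otherwise;
# per the code above, then_column/else_column can be used instead of literals
rule = IfThenElseRule(
    condition_expression="df['score'] >= 50",  # assumed expression syntax
    output_column="result",
    then_value="pass",
    else_value="fail",
)
rule.apply(data)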

datetime

DateTimeLocalNowRule (DateTimeLocalNowRule)
Source code in etlrules/backends/dask/datetime.py
class DateTimeLocalNowRule(DateTimeLocalNowRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        if self.strict and self.output_column in df.columns:
            raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
        df = df.assign(**{self.output_column: datetime.datetime.now()})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/datetime.py
def apply(self, data):
    df = self._get_input_df(data)
    if self.strict and self.output_column in df.columns:
        raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
    df = df.assign(**{self.output_column: datetime.datetime.now()})
    self._set_output_df(data, df)
DateTimeUTCNowRule (DateTimeUTCNowRule)
Source code in etlrules/backends/dask/datetime.py
class DateTimeUTCNowRule(DateTimeUTCNowRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        if self.strict and self.output_column in df.columns:
            raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
        df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/datetime.py
def apply(self, data):
    df = self._get_input_df(data)
    if self.strict and self.output_column in df.columns:
        raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
    df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
    self._set_output_df(data, df)
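
A brief sketch of the two datetime rules (output_column is inferred from the attribute used above; in strict mode an existing column raises ColumnAlreadyExistsError, per the code)::

from etlrules.backends.dask import DateTimeLocalNowRule, DateTimeUTCNowRule

DateTimeLocalNowRule(output_column="loaded_at").apply(data)    # local wall-clock now
DateTimeUTCNowRule(output_column="loaded_at_utc").apply(data)  # naive UTC now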

io special

db
WriteSQLTableRule (WriteSQLTableRule)
Source code in etlrules/backends/dask/io/db.py
class WriteSQLTableRule(WriteSQLTableRuleBase):

    METHOD = 'multi'

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        import sqlalchemy as sa
        try:
            df.to_sql(
                self._get_sql_table(),
                self._get_sql_engine(),
                if_exists=self.if_exists,
                index=False,
                method=self.METHOD
            )
        except sa.exc.SQLAlchemyError as exc:
            raise SQLError(str(exc))
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/io/db.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    import sqlalchemy as sa
    try:
        df.to_sql(
            self._get_sql_table(),
            self._get_sql_engine(),
            if_exists=self.if_exists,
            index=False,
            method=self.METHOD
        )
    except sa.exc.SQLAlchemyError as exc:
        raise SQLError(str(exc))
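
A hedged usage sketch; the sql_engine/sql_table parameter names are assumptions based on the _get_sql_engine/_get_sql_table accessors above, while if_exists follows the pandas to_sql convention ('fail', 'replace' or 'append')::

from etlrules.backends.dask.io.db import WriteSQLTableRule

rule = WriteSQLTableRule(
    sql_engine="sqlite:///etl.db",  # a SQLAlchemy connection URL (assumed parameter name)
    sql_table="output_table",       # assumed parameter name
    if_exists="replace",
)
rule.apply(data)  # any SQLAlchemyError is re-raised as SQLError, per the code above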

newcolumns

AddNewColumnRule (AddNewColumnRule)
Source code in etlrules/backends/dask/newcolumns.py
class AddNewColumnRule(AddNewColumnRuleBase):

    def get_column_expression(self):
        return Expression(self.column_expression, filename=f'{self.output_column}_expression.py')

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_columns(df.columns)
        result = self._column_expression.eval(df)
        if self.column_type:
            try:
                result = result.astype(MAP_TYPES[self.column_type])
            except ValueError as exc:
                raise TypeError(str(exc))
        df = df.assign(**{self.output_column: result})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/newcolumns.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_columns(df.columns)
    result = self._column_expression.eval(df)
    if self.column_type:
        try:
            result = result.astype(MAP_TYPES[self.column_type])
        except ValueError as exc:
            raise TypeError(str(exc))
    df = df.assign(**{self.output_column: result})
    self._set_output_df(data, df)
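
A hedged usage sketch (parameter names inferred from the attributes used in apply() above; the expression syntax referencing df is an assumption)::

from etlrules.backends.dask import AddNewColumnRule

# the expression is evaluated against the dataframe; column_type optionally
# casts the result, and a failed cast surfaces as TypeError per the code above
rule = AddNewColumnRule(
    output_column="total",
    column_expression="df['price'] * df['quantity']",  # assumed expression syntax
    column_type="float64",
)
rule.apply(data)
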
AddRowNumbersRule (AddRowNumbersRule)
Source code in etlrules/backends/dask/newcolumns.py
class AddRowNumbersRule(AddRowNumbersRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_columns(df.columns)
        stop = self.start + len(df.index) * self.step
        result = da.arange(self.start, stop, self.step)
        df = df.assign(**{self.output_column: result})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/newcolumns.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_columns(df.columns)
    stop = self.start + len(df.index) * self.step
    result = da.arange(self.start, stop, self.step)
    df = df.assign(**{self.output_column: result})
    self._set_output_df(data, df)
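
A brief sketch (start/step drive the generated sequence, per the arange call above; parameter names are assumed)::

from etlrules.backends.dask import AddRowNumbersRule

# adds a row_no column with values start, start+step, ... (here 1, 2, 3, ...)
AddRowNumbersRule(output_column="row_no", start=1, step=1).apply(data)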

strings

StrExtractRule (StrExtractRule, DaskMixin)
Source code in etlrules/backends/dask/strings.py
class StrExtractRule(StrExtractRuleBase, DaskMixin):
    def apply(self, data):
        df = self._get_input_df(data)
        columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
        new_cols_dict = {}
        groups = self._compiled_expr.groups
        for idx, col in enumerate(columns):
            new_col = df[col].str.extract(self._compiled_expr, expand=True)
            for group in range(groups):
                new_column = new_col[group]
                if group == 0 and self.keep_original_value:
                    # only the first new column keeps the value (in case of multiple groups)
                    new_column = new_column.fillna(value=df[col])
                new_cols_dict[output_columns[idx * groups + group]] = new_column
        df = df.assign(**new_cols_dict)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/dask/strings.py
def apply(self, data):
    df = self._get_input_df(data)
    columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
    new_cols_dict = {}
    groups = self._compiled_expr.groups
    for idx, col in enumerate(columns):
        new_col = df[col].str.extract(self._compiled_expr, expand=True)
        for group in range(groups):
            new_column = new_col[group]
            if group == 0 and self.keep_original_value:
                # only the first new column keeps the value (in case of multiple groups)
                new_column = new_column.fillna(value=df[col])
            new_cols_dict[output_columns[idx * groups + group]] = new_column
    df = df.assign(**new_cols_dict)
    self._set_output_df(data, df)
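
A hedged sketch of the extraction behaviour (the regular-expression parameter name is an assumption; output_columns and keep_original_value follow the attributes used above)::

from etlrules.backends.dask import StrExtractRule

# two capture groups -> two output columns; with keep_original_value=True,
# non-matching rows keep the original value in the first group's column
rule = StrExtractRule(
    "col_A",
    regular_expression=r"(\d+)-(\w+)",  # assumed parameter name
    output_columns=["number", "word"],
    keep_original_value=True,
)
rule.apply(data)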

pandas special

basic

ExplodeValuesRule (ExplodeValuesRule)
Source code in etlrules/backends/pandas/basic.py
class ExplodeValuesRule(ExplodeValuesRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_input_column(df)
        result = df.explode(self.input_column, ignore_index=True)
        if self.column_type:
            result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
        self._set_output_df(data, result)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/basic.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_input_column(df)
    result = df.explode(self.input_column, ignore_index=True)
    if self.column_type:
        result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
    self._set_output_df(data, result)

conditions

FilterRule (FilterRule)
Source code in etlrules/backends/pandas/conditions.py
class FilterRule(FilterRuleBase):

    def get_condition_expression(self):
        return Expression(self.condition_expression, filename="FilterRule.py")

    def apply(self, data):
        df = self._get_input_df(data)
        cond_series = self._condition_expression.eval(df)
        if self.discard_matching_rows:
            cond_series = ~cond_series
        self._set_output_df(data, df[cond_series].reset_index(drop=True))
        if self.named_output_discarded:
            data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/conditions.py
def apply(self, data):
    df = self._get_input_df(data)
    cond_series = self._condition_expression.eval(df)
    if self.discard_matching_rows:
        cond_series = ~cond_series
    self._set_output_df(data, df[cond_series].reset_index(drop=True))
    if self.named_output_discarded:
        data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
IfThenElseRule (IfThenElseRule)
Source code in etlrules/backends/pandas/conditions.py
class IfThenElseRule(IfThenElseRuleBase):

    def get_condition_expression(self):
        return Expression(self.condition_expression, filename=f'{self.output_column}.py')

    def apply(self, data):
        df = self._get_input_df(data)
        df_columns = set(df.columns)
        self._validate_columns(df_columns)
        cond_series = self._condition_expression.eval(df)
        then_value = self.then_value if self.then_value is not None else df[self.then_column]
        else_value = self.else_value if self.else_value is not None else df[self.else_column]
        result = np.where(cond_series, then_value, else_value)
        df = df.assign(**{self.output_column: result})
        if df.empty and (isinstance(then_value, str) or isinstance(else_value, str)):
            df = df.astype({self.output_column: "string"})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/conditions.py
def apply(self, data):
    df = self._get_input_df(data)
    df_columns = set(df.columns)
    self._validate_columns(df_columns)
    cond_series = self._condition_expression.eval(df)
    then_value = self.then_value if self.then_value is not None else df[self.then_column]
    else_value = self.else_value if self.else_value is not None else df[self.else_column]
    result = np.where(cond_series, then_value, else_value)
    df = df.assign(**{self.output_column: result})
    if df.empty and (isinstance(then_value, str) or isinstance(else_value, str)):
        df = df.astype({self.output_column: "string"})
    self._set_output_df(data, df)

datetime

DateTimeLocalNowRule (DateTimeLocalNowRule)
Source code in etlrules/backends/pandas/datetime.py
class DateTimeLocalNowRule(DateTimeLocalNowRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        if self.strict and self.output_column in df.columns:
            raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
        df = df.assign(**{self.output_column: datetime.datetime.now()})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/datetime.py
def apply(self, data):
    df = self._get_input_df(data)
    if self.strict and self.output_column in df.columns:
        raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
    df = df.assign(**{self.output_column: datetime.datetime.now()})
    self._set_output_df(data, df)
DateTimeUTCNowRule (DateTimeUTCNowRule)
Source code in etlrules/backends/pandas/datetime.py
class DateTimeUTCNowRule(DateTimeUTCNowRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        if self.strict and self.output_column in df.columns:
            raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
        df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/datetime.py
def apply(self, data):
    df = self._get_input_df(data)
    if self.strict and self.output_column in df.columns:
        raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
    df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
    self._set_output_df(data, df)

io special

db
WriteSQLTableRule (WriteSQLTableRule)
Source code in etlrules/backends/pandas/io/db.py
class WriteSQLTableRule(WriteSQLTableRuleBase):

    METHOD = 'multi'

    def _do_apply(self, connection, df):
        df.to_sql(
            self._get_sql_table(),
            connection,
            if_exists=self.if_exists,
            index=False,
            method=self.METHOD
        )

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        engine = SQLAlchemyEngines.get_engine(self._get_sql_engine())
        import sqlalchemy as sa
        with engine.connect() as connection:
            try:
                self._do_apply(connection, df)
            except sa.exc.SQLAlchemyError as exc:
                raise SQLError(str(exc))
            connection.commit()
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/io/db.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    engine = SQLAlchemyEngines.get_engine(self._get_sql_engine())
    import sqlalchemy as sa
    with engine.connect() as connection:
        try:
            self._do_apply(connection, df)
        except sa.exc.SQLAlchemyError as exc:
            raise SQLError(str(exc))
        connection.commit()
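
The rule delegates the write to pandas' DataFrame.to_sql, with method='multi' batching multiple rows into each INSERT statement. A standalone sketch of what _do_apply ends up executing (the engine URL and table name are placeholders)::

import pandas as pd
import sqlalchemy as sa

engine = sa.create_engine("sqlite:///example.db")  # placeholder URL
df = pd.DataFrame({"A": [1, 2], "B": ["x", "y"]})
with engine.connect() as connection:
    # method="multi" sends multiple rows per INSERT statement
    df.to_sql("my_table", connection, if_exists="append", index=False, method="multi")
    connection.commit()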

newcolumns

AddNewColumnRule (AddNewColumnRuleBase)
Source code in etlrules/backends/pandas/newcolumns.py
class AddNewColumnRule(AddNewColumnRuleBase):

    def get_column_expression(self):
        return Expression(self.column_expression, filename=f'{self.output_column}_expression.py')

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_columns(df.columns)
        result = self._column_expression.eval(df)
        if self.column_type:
            try:
                result = result.astype(MAP_TYPES[self.column_type])
            except ValueError as exc:
                raise TypeError(str(exc))
        df = df.assign(**{self.output_column: result})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/newcolumns.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_columns(df.columns)
    result = self._column_expression.eval(df)
    if self.column_type:
        try:
            result = result.astype(MAP_TYPES[self.column_type])
        except ValueError as exc:
            raise TypeError(str(exc))
    df = df.assign(**{self.output_column: result})
    self._set_output_df(data, df)
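
The column expression is evaluated against the dataframe to produce a series, which is then cast via astype when column_type is given. A sketch of the equivalent pandas steps (the df["A"] + df["B"] form is an assumption about how column expressions reference columns)::

import pandas as pd

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
result = df["A"] + df["B"]       # what evaluating a column expression might yield
result = result.astype("int64")  # the optional column_type cast
df = df.assign(C=result)         # the same assignment the rule performs
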
AddRowNumbersRule (AddRowNumbersRuleBase)
Source code in etlrules/backends/pandas/newcolumns.py
class AddRowNumbersRule(AddRowNumbersRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_columns(df.columns)
        stop = self.start + df.shape[0] * self.step
        result = np.arange(start=self.start, stop=stop, step=self.step)
        df = df.assign(**{self.output_column: result})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/newcolumns.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_columns(df.columns)
    stop = self.start + df.shape[0] * self.step
    result = np.arange(start=self.start, stop=stop, step=self.step)
    df = df.assign(**{self.output_column: result})
    self._set_output_df(data, df)
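
The row numbers are produced with numpy's arange over the dataframe length, so start and step behave like Python's range. For example, start=10 and step=5 on a 3-row dataframe::

import numpy as np

start, step, n_rows = 10, 5, 3
np.arange(start=start, stop=start + n_rows * step, step=step)
# array([10, 15, 20])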

strings

StrExtractRule (StrExtractRuleBase, PandasMixin)
Source code in etlrules/backends/pandas/strings.py
class StrExtractRule(StrExtractRuleBase, PandasMixin):
    def apply(self, data):
        df = self._get_input_df(data)
        columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
        new_cols_dict = {}
        groups = self._compiled_expr.groups
        for idx, col in enumerate(columns):
            new_col = df[col].str.extract(self._compiled_expr, expand=True)
            if self.keep_original_value:
                # only the first new column keeps the value (in case of multiple groups)
                new_col[0].fillna(value=df[col], inplace=True)
            for group in range(groups):
                new_cols_dict[output_columns[idx * groups + group]] = new_col[group]
        df = df.assign(**new_cols_dict)
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/pandas/strings.py
def apply(self, data):
    df = self._get_input_df(data)
    columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
    new_cols_dict = {}
    groups = self._compiled_expr.groups
    for idx, col in enumerate(columns):
        new_col = df[col].str.extract(self._compiled_expr, expand=True)
        if self.keep_original_value:
            # only the first new column keeps the value (in case of multiple groups)
            new_col[0].fillna(value=df[col], inplace=True)
        for group in range(groups):
            new_cols_dict[output_columns[idx * groups + group]] = new_col[group]
    df = df.assign(**new_cols_dict)
    self._set_output_df(data, df)
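
The extraction relies on pandas' Series.str.extract, which returns one column per regex capturing group; non-matching rows yield NaN, which is what keep_original_value replaces with the input value. For example::

import pandas as pd

s = pd.Series(["a-1", "b-2", "oops"])
s.str.extract(r"([a-z])-(\d)", expand=True)
#      0    1
# 0    a    1
# 1    b    2
# 2  NaN  NaN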

polars special

basic

ExplodeValuesRule (ExplodeValuesRuleBase)
Source code in etlrules/backends/polars/basic.py
class ExplodeValuesRule(ExplodeValuesRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_input_column(df)
        result = df.explode(self.input_column)
        if self.column_type:
            result = result.with_columns(
                **{self.input_column: pl.col(self.input_column).cast(MAP_TYPES[self.column_type])}
            )
        self._set_output_df(data, result)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/basic.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_input_column(df)
    result = df.explode(self.input_column)
    if self.column_type:
        result = result.with_columns(
            **{self.input_column: pl.col(self.input_column).cast(MAP_TYPES[self.column_type])}
        )
    self._set_output_df(data, result)
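
The rule builds on polars' DataFrame.explode, which expands each element of a list column into its own row, optionally followed by a cast. For example::

import polars as pl

df = pl.DataFrame({"A": [[1, 2], [3]]})
df.explode("A").with_columns(pl.col("A").cast(pl.Int64))
# 3 rows: 1, 2, 3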

conditions

FilterRule (FilterRuleBase)
Source code in etlrules/backends/polars/conditions.py
class FilterRule(FilterRuleBase):

    def get_condition_expression(self):
        return Expression(self.condition_expression, filename="FilterRule.py")

    def apply(self, data):
        df = self._get_input_df(data)
        try:
            cond_series = self._condition_expression.eval(df)
        except pl.exceptions.ColumnNotFoundError as exc:
            raise KeyError(str(exc))
        if self.discard_matching_rows:
            cond_series = ~cond_series
        result = df.filter(cond_series)
        self._set_output_df(data, result)
        if self.named_output_discarded:
            discarded_result = df.filter(~cond_series)
            data.set_named_output(self.named_output_discarded, discarded_result)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/conditions.py
def apply(self, data):
    df = self._get_input_df(data)
    try:
        cond_series = self._condition_expression.eval(df)
    except pl.exceptions.ColumnNotFoundError as exc:
        raise KeyError(str(exc))
    if self.discard_matching_rows:
        cond_series = ~cond_series
    result = df.filter(cond_series)
    self._set_output_df(data, result)
    if self.named_output_discarded:
        discarded_result = df.filter(~cond_series)
        data.set_named_output(self.named_output_discarded, discarded_result)
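
Both the kept and the discarded outputs are plain polars filters over the same boolean expression, one of them negated. For example::

import polars as pl

df = pl.DataFrame({"A": [1, 2, 3]})
cond = pl.col("A") > 1
kept = df.filter(cond)        # rows with A == 2, 3
discarded = df.filter(~cond)  # row with A == 1
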
IfThenElseRule (IfThenElseRuleBase)
Source code in etlrules/backends/polars/conditions.py
class IfThenElseRule(IfThenElseRuleBase):

    def get_condition_expression(self):
        return Expression(self.condition_expression, filename=f'{self.output_column}.py')

    def apply(self, data):
        df = self._get_input_df(data)
        df_columns = set(df.columns)
        self._validate_columns(df_columns)
        try:
            cond_series = self._condition_expression.eval(df)
        except pl.exceptions.ColumnNotFoundError as exc:
            raise KeyError(str(exc))
        then_value = pl.lit(self.then_value) if self.then_value is not None else pl.col(self.then_column)
        else_value = pl.lit(self.else_value) if self.else_value is not None else pl.col(self.else_column)
        result = pl.when(cond_series).then(then_value).otherwise(else_value)
        df = df.with_columns(**{self.output_column: result})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/conditions.py
def apply(self, data):
    df = self._get_input_df(data)
    df_columns = set(df.columns)
    self._validate_columns(df_columns)
    try:
        cond_series = self._condition_expression.eval(df)
    except pl.exceptions.ColumnNotFoundError as exc:
        raise KeyError(str(exc))
    then_value = pl.lit(self.then_value) if self.then_value is not None else pl.col(self.then_column)
    else_value = pl.lit(self.else_value) if self.else_value is not None else pl.col(self.else_column)
    result = pl.when(cond_series).then(then_value).otherwise(else_value)
    df = df.with_columns(**{self.output_column: result})
    self._set_output_df(data, df)
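
The conditional column is a standard polars when/then/otherwise expression, with literal values wrapped in pl.lit and column references in pl.col. For example::

import polars as pl

df = pl.DataFrame({"A": [1, 5]})
df.with_columns(result=pl.when(pl.col("A") > 3).then(pl.lit("big")).otherwise(pl.lit("small")))
# result: "small", "big"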

datetime

DateTimeLocalNowRule (DateTimeLocalNowRuleBase)
Source code in etlrules/backends/polars/datetime.py
class DateTimeLocalNowRule(DateTimeLocalNowRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        if self.strict and self.output_column in df.columns:
            raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
        df = df.with_columns(
            pl.lit(datetime.datetime.now()).alias(self.output_column)
        )
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/datetime.py
def apply(self, data):
    df = self._get_input_df(data)
    if self.strict and self.output_column in df.columns:
        raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
    df = df.with_columns(
        pl.lit(datetime.datetime.now()).alias(self.output_column)
    )
    self._set_output_df(data, df)
DateTimeUTCNowRule (DateTimeUTCNowRuleBase)
Source code in etlrules/backends/polars/datetime.py
class DateTimeUTCNowRule(DateTimeUTCNowRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        if self.strict and self.output_column in df.columns:
            raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
        df = df.with_columns(
            pl.lit(datetime.datetime.utcnow()).alias(self.output_column)
        )
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/datetime.py
def apply(self, data):
    df = self._get_input_df(data)
    if self.strict and self.output_column in df.columns:
        raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
    df = df.with_columns(
        pl.lit(datetime.datetime.utcnow()).alias(self.output_column)
    )
    self._set_output_df(data, df)

io special

db
WriteSQLTableRule (WriteSQLTableRuleBase)
Source code in etlrules/backends/polars/io/db.py
class WriteSQLTableRule(WriteSQLTableRuleBase):
    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        import sqlalchemy as sa
        try:
            df.write_database(
                self._get_sql_table(),
                self._get_sql_engine(),
                if_exists=self.if_exists
            )
        except sa.exc.SQLAlchemyError as exc:
            raise SQLError(str(exc))
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/io/db.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    import sqlalchemy as sa
    try:
        df.write_database(
            self._get_sql_table(),
            self._get_sql_engine(),
            if_exists=self.if_exists
        )
    except sa.exc.SQLAlchemyError as exc:
        raise SQLError(str(exc))

newcolumns

AddNewColumnRule (AddNewColumnRuleBase)
Source code in etlrules/backends/polars/newcolumns.py
class AddNewColumnRule(AddNewColumnRuleBase):

    def get_column_expression(self):
        return Expression(self.column_expression, filename=f'{self.output_column}_expression.py')

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_columns(df.columns)
        try:
            result = self._column_expression.eval(df)
        except pl.exceptions.ColumnNotFoundError as exc:
            raise KeyError(str(exc))
        if self.column_type:
            try:
                result = result.cast(MAP_TYPES[self.column_type])
            except pl.exceptions.ComputeError as exc:
                raise TypeError(exc)
        df = df.with_columns(**{self.output_column: result})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/newcolumns.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_columns(df.columns)
    try:
        result = self._column_expression.eval(df)
    except pl.exceptions.ColumnNotFoundError as exc:
        raise KeyError(str(exc))
    if self.column_type:
        try:
            result = result.cast(MAP_TYPES[self.column_type])
        except pl.exceptions.ComputeError as exc:
            raise TypeError(exc)
    df = df.with_columns(**{self.output_column: result})
    self._set_output_df(data, df)
AddRowNumbersRule (AddRowNumbersRuleBase)
Source code in etlrules/backends/polars/newcolumns.py
class AddRowNumbersRule(AddRowNumbersRuleBase):

    def apply(self, data):
        df = self._get_input_df(data)
        self._validate_columns(df.columns)
        stop = self.start + len(df) * self.step
        df = df.with_columns(**{self.output_column: pl.arange(start=self.start, end=stop, step=self.step)})
        self._set_output_df(data, df)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/newcolumns.py
def apply(self, data):
    df = self._get_input_df(data)
    self._validate_columns(df.columns)
    stop = self.start + len(df) * self.step
    df = df.with_columns(**{self.output_column: pl.arange(start=self.start, end=stop, step=self.step)})
    self._set_output_df(data, df)

strings

StrExtractRule (StrExtractRuleBase, PolarsMixin)
Source code in etlrules/backends/polars/strings.py
class StrExtractRule(StrExtractRuleBase, PolarsMixin):
    def apply(self, data):
        df = self._get_input_df(data)
        columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
        groups = self._compiled_expr.groups
        input_column = columns[0]
        ordered_cols = [col for col in df.columns]
        ordered_cols += [col for col in output_columns if col not in ordered_cols]
        if self.keep_original_value:
            res = df.with_columns(
                pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
            ).select(
                *([col for col in df.columns] + [pl.col("_tmp_col").struct[i].alias("_tmp_col2" if i == 0 else output_columns[i]) for i in range(groups)])
            )
        res = res.with_columns(
                pl.when(
                    pl.col("_tmp_col2").is_null()
                ).then(pl.col(input_column)).otherwise(pl.col("_tmp_col2")).alias(output_columns[0])
            )
        else:
            res = df.with_columns(
                pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
            ).select(
                *([col for col in df.columns if col not in output_columns] + [pl.col("_tmp_col").struct[i].alias(output_columns[i]) for i in range(groups)])
            )

        res = res[ordered_cols]
        self._set_output_df(data, res)
apply(self, data)

Applies the main rule logic to the input data.

This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same RuleData instance with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/backends/polars/strings.py
def apply(self, data):
    df = self._get_input_df(data)
    columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
    groups = self._compiled_expr.groups
    input_column = columns[0]
    ordered_cols = [col for col in df.columns]
    ordered_cols += [col for col in output_columns if col not in ordered_cols]
    if self.keep_original_value:
        res = df.with_columns(
            pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
        ).select(
            *([col for col in df.columns] + [pl.col("_tmp_col").struct[i].alias("_tmp_col2" if i == 0 else output_columns[i]) for i in range(groups)])
        )
    res = res.with_columns(
            pl.when(
                pl.col("_tmp_col2").is_null()
            ).then(pl.col(input_column)).otherwise(pl.col("_tmp_col2")).alias(output_columns[0])
        )
    else:
        res = df.with_columns(
            pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
        ).select(
            *([col for col in df.columns if col not in output_columns] + [pl.col("_tmp_col").struct[i].alias(output_columns[i]) for i in range(groups)])
        )

    res = res[ordered_cols]
    self._set_output_df(data, res)
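
polars' str.extract_groups returns a struct column with one field per capturing group, which the rule then unpacks via struct indexing as above. For example::

import polars as pl

df = pl.DataFrame({"A": ["a-1", "b-2"]})
df.with_columns(
    pl.col("A").str.extract_groups(r"([a-z])-(\d)").alias("g")
).select(
    pl.col("g").struct[0].alias("letter"),
    pl.col("g").struct[1].alias("digit"),
)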

engine

RuleEngine

Run a set of extract/transform/load rules over a dataframe.

Takes in a plan with the definition of the extract/transform/load rules and runs it over a RuleData instance. The RuleData instance can optionally be pre-populated with an input dataframe (in pipeline mode) or a sequence of named inputs (named dataframes).

The plan can have rules to extract data (i.e. add more dataframes to the RuleData). It can have transform rules which transform the existing dataframes (either in place or by producing new named dataframes). It can also have rules to load data into external systems, e.g. files, databases, API connections, etc.

At the end of a plan run, the RuleData instance passed in will contain the results of the run (i.e. new/transformed dataframes) which can be inspected/operated on outside of the rule engine.

Source code in etlrules/engine.py
class RuleEngine:
    """ Run a set of extract/transform/load rules over a dataframe.

    Takes in a plan with the definition of the extract/transform/load rules and runs it
    over a RuleData instance. The RuleData instance can optionally be pre-populated with
    an input dataframe (in pipeline mode) or a sequence of named inputs (named dataframes).

    The plan can have rules to extract data (i.e. add more dataframes to the RuleData). It can have
    transform rules which transform the existing dataframes (either in place or by producing new
    named dataframes). It can also have rules to load data into external systems, e.g. files,
    databases, API connections, etc.

    At the end of a plan run, the RuleData instance passed in will contain the results of the run
    (i.e. new/transformed dataframes) which can be inspected/operated on outside of the
    rule engine.
    """

    def __init__(self, plan: Plan):
        assert isinstance(plan, Plan)
        self.plan = plan

    def _get_context(self, data: RuleData) -> dict[str, Union[str, int, float, bool]]:
        context = {}
        context.update(self.plan.get_context())
        context.update(data.get_context())
        return context

    def run_pipeline(self, data: RuleData) -> RuleData:
        with context.set(self._get_context(data)):
            for rule in self.plan:
                rule.apply(data)
        return data

    def _get_topological_sorter(self, data: RuleData) -> graphlib.TopologicalSorter:
        g = graphlib.TopologicalSorter()
        existing_named_outputs = set(name for name, _ in data.get_named_outputs())
        named_outputs = {}
        for idx, rule in enumerate(self.plan):
            if rule.has_output():
                named_outputs_lst = list(rule.get_all_named_outputs())
                if not named_outputs_lst:
                    raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has no named outputs.")
                for named_output in named_outputs_lst:
                    if named_output is None:
                        raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has empty named output.")
                    existing_rule = named_outputs.get(named_output)
                    if existing_rule is not None:  
                        raise InvalidPlanError(f"Named output '{named_output}' is produced by multiple rules: {rule.__class__}/(name={rule.get_name()}) and {existing_rule[1].__class__}/(name={existing_rule[1].get_name()})")
                    named_outputs[named_output] = (idx, rule)
        named_output_clashes = existing_named_outputs & set(named_outputs.keys())
        if named_output_clashes:
            raise GraphRuntimeError(f"Named output clashes. The following named outputs are produced by rules in the plan but they also exist in the input data, leading to ambiguity: {named_output_clashes}")
        for idx, rule in enumerate(self.plan):
            if rule.has_input():
                named_inputs = list(rule.get_all_named_inputs())
                if not named_inputs:
                    raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has no named inputs.")
                for named_input in named_inputs:
                    if named_input is None:
                        raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has empty named input.")
                    if named_input in named_outputs:
                        g.add(idx, named_outputs[named_input][0])
                    elif named_input not in existing_named_outputs:
                        raise GraphRuntimeError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) requires a named_input={named_input} which doesn't exist in the input data and it's not produced as a named output by any of the rules in the graph.")
                    else:
                        g.add(idx)
            else:
                g.add(idx)
        return g

    def run_graph(self, data: RuleData) -> RuleData:
        g = self._get_topological_sorter(data)
        g.prepare()
        with context.set(self._get_context(data)):
            while g.is_active():
                for rule_idx in g.get_ready():
                    rule = self.plan.get_rule(rule_idx)
                    rule.apply(data)
                    g.done(rule_idx)
        return data

    def validate_pipeline(self, data: RuleData) -> Tuple[bool, Optional[str]]:
        return True, None

    def validate_graph(self, data: RuleData) -> Tuple[bool, Optional[str]]:
        try:
            self._get_topological_sorter(data)
        except (InvalidPlanError, GraphRuntimeError) as exc:
            return False, str(exc)
        return True, None

    def validate(self, data: RuleData) -> Tuple[bool, Optional[str]]:
        assert isinstance(data, RuleData)
        if self.plan.is_empty():
            return False, "An empty plan cannot be run."
        mode = self.plan.get_mode()
        if mode == PlanMode.PIPELINE:
            return self.validate_pipeline(data)
        elif mode == PlanMode.GRAPH:
            return self.validate_graph(data)
        return False, "Plan's mode cannot be determined."

    def run(self, data: RuleData) -> RuleData:
        assert isinstance(data, RuleData)
        if self.plan.is_empty():
            raise InvalidPlanError("An empty plan cannot be run.")
        mode = self.plan.get_mode()
        if mode == PlanMode.PIPELINE:
            return self.run_pipeline(data)
        elif mode == PlanMode.GRAPH:
            return self.run_graph(data)
        else:
            raise InvalidPlanError("Plan's mode cannot be determined.")
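
Example: an end-to-end run wiring a Plan into the engine; run dispatches to run_pipeline or run_graph based on the plan's mode. The RuleData import path/constructor and the backend rule imports are assumptions for illustration::

import pandas as pd
from etlrules.engine import RuleEngine
from etlrules.plan import Plan
from etlrules.data import RuleData  # assumed module path
from etlrules.backends.pandas import ProjectRule, SortRule  # assumed exports

plan = Plan()
plan.add_rule(SortRule(['A']))
plan.add_rule(ProjectRule(['A', 'B']))

data = RuleData(pd.DataFrame({"A": [2, 1], "B": ["y", "x"], "C": [0, 0]}))  # assumed constructor
RuleEngine(plan).run(data)  # pipeline mode: rules run in the order they were added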

exceptions

ColumnAlreadyExistsError (Exception)

An attempt to create a column that already exists in the dataframe.

Source code in etlrules/exceptions.py
class ColumnAlreadyExistsError(Exception):
    """ An attempt to create a column that already exists in the dataframe. """

ExpressionSyntaxError (SyntaxError)

A Python expression used to create a column, aggregate or other operations has a syntax error.

Source code in etlrules/exceptions.py
class ExpressionSyntaxError(SyntaxError):
    """ A Python expression used to create a column, aggregate or other operations has a syntax error. """

GraphRuntimeError (RuntimeError)

There was an error when running a graph-mode plan.

Source code in etlrules/exceptions.py
class GraphRuntimeError(RuntimeError):
    """ There was an error when running a graph-mode plan. """

InvalidPlanError (Exception)

The plan failed validation.

Source code in etlrules/exceptions.py
class InvalidPlanError(Exception):
    """ The plan failed validation. """

MissingColumnError (Exception)

An operation is being applied to a column that is not present in the input dataframe.

Source code in etlrules/exceptions.py
class MissingColumnError(Exception):
    """ An operation is being applied to a column that is not present in the input data frame. """

SQLError (RuntimeError)

There was an error during the execution of a SQL statement.

Source code in etlrules/exceptions.py
class SQLError(RuntimeError):
    """ There was an error during the execution of a sql statement. """

SchemaError (Exception)

An operation needs a certain schema for the dataframe which is not present.

Source code in etlrules/exceptions.py
class SchemaError(Exception):
    """ An operation needs a certain schema for the dataframe which is not present. """

UnsupportedTypeError (Exception)

A type conversion is attempted to a type that is not supported.

Source code in etlrules/exceptions.py
class UnsupportedTypeError(Exception):
    """ A type conversion is attempted to a type that is not supported. """

plan

Plan

A plan to manipulate one or multiple dataframes with a set of rules.

A plan is a blueprint for how to extract one or more dataframes from various sources (e.g. files or other data sources), how to transform those dataframes by adding calculated columns, joining different dataframes, aggregating, sorting, etc. and ultimately how to load the result into a data store (files or other data stores).

A plan can operate in two modes: pipeline or graph. A pipeline is a simple type of plan where each rule takes its input from the previous rule's output. A graph plan is more complex as it allows rules to produce named outputs which can then be used by other rules. This ultimately builds a dag (directed acyclic graph) of rule dependencies. A graph allows branching and joining back, enabling complex logic. Rules are executed in the order of dependency and not in the order they are added to the plan. By comparison, pipelines implement a single input/single output mode where rules are executed in the order they are added to the plan.

Pipeline example::

plan = Plan()
plan.add_rule(SortRule(['A']))
plan.add_rule(ProjectRule(['A', 'B']))
plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}))

Graph example::

plan = Plan()
plan.add_rule(SortRule(['A'], named_input="input", named_output="sorted_data"))
plan.add_rule(ProjectRule(['A', 'B'], named_input="sorted_data", named_output="projected_data"))
plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}, named_input="projected_data", named_output="renamed_data"))

Note

Rules that are used in graph mode should take a named_input and produce a named_output. Rules that use the pipeline mode must not use named inputs/outputs. The two types of rules cannot be used in the same plan as that leads to ambiguity.

Parameters:

Name Type Description Default
mode Optional[Literal['pipeline', 'graph']]

One of pipeline or graph, the mode of the plan. Optional. In pipeline mode, rules don't use named inputs/outputs and they are run in the same order they are added to the plan, with each rule taking the input from the previous rule. In graph mode, rules use named inputs/outputs which create a directed acyclic graph of dependencies. The rules are run in the order of dependency.

When not specified, it is inferred from the first rule in the plan.

None
name Optional[str]

A name for the plan. Optional.

None
description Optional[str]

Optional documentation for the plan. This can include what the plan does, its purpose and detailed information about how it works.

None
context Optional[Mapping[str, Union[str, int, float, bool]]]

An optional key-value mapping which can be used in rules via string substitutions. It acts as a set of arguments into the plan, allowing each run to be tweaked by providing different values for certain arguments. The values can be strings, ints, floats or booleans (True or False).

None
strict Optional[bool]

A hint about how the plan should be executed. When None, the plan provides no hint and it's up to the caller to decide whether to run it in strict mode or not.

None

Exceptions:

Type Description
InvalidPlanError

if pipeline mode rules are mixed with graph mode rules

Source code in etlrules/plan.py
class Plan:
    """ A plan to manipulate one or multiple dataframes with a set of rules.

    A plan is a blueprint for how to extract one or more dataframes from various sources (e.g. files or
    other data sources), how to transform those dataframes by adding calculated columns, joining
    different dataframes, aggregating, sorting, etc. and ultimately how to load the result into a data
    store (files or other data stores).

    A plan can operate in two modes: pipeline or graph. A pipeline is a simple type of plan where
    each rule takes its input from the previous rule's output. A graph plan is more complex as it allows
    rules to produce named outputs which can then be used by other rules. This ultimately builds a dag
    (directed acyclic graph) of rule dependencies. A graph allows branching and joining back, enabling
    complex logic. Rules are executed in the order of dependency and not in the order they are added to
    the plan. By comparison, pipelines implement a single input/single output mode where rules are
    executed in the order they are added to the plan.

    Pipeline example::

        plan = Plan()
        plan.add_rule(SortRule(['A']))
        plan.add_rule(ProjectRule(['A', 'B']))
        plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}))

    Graph example::

        plan = Plan()
        plan.add_rule(SortRule(['A'], named_input="input", named_output="sorted_data"))
        plan.add_rule(ProjectRule(['A', 'B'], named_input="sorted_data", named_output="projected_data"))
        plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}, named_input="projected_data", named_output="renamed_data"))

    Note:
        Rules that are used in graph mode should take a named_input and produce a named_output. Rules
        that use the pipeline mode must not use named inputs/outputs. The two types of rules cannot be
        used in the same plan as that leads to ambiguity.

    Args:
        mode: One of pipeline or graph, the mode of the plan. Optional.
            In pipeline mode, rules don't use named inputs/outputs and they are run in the same order they are
            added to the plan, with each rule taking the input from the previous rule.
            In graph mode, rules use named inputs/outputs which create a directed acyclic graph of
            dependencies. The rules are run in the order of dependency.

            When not specified, it is inferred from the first rule in the plan.
        name: A name for the plan. Optional.
        description: Optional documentation for the plan.
            This can include what the plan does, its purpose and detailed information about how it works.
        context: An optional key-value mapping which can be used in rules via string substitutions.
            It can be used as arguments into the plan to tweak the running of the plan by providing different
            values for certain arguments with each run.
            The values can be strings, ints, floats or booleans (True or False).
        strict: A hint about how the plan should be executed.
            When None, the plan provides no hint and it's up to the caller to decide whether to run it
            in strict mode or not.

    Raises:
        InvalidPlanError: if pipeline mode rules are mixed with graph mode rules
    """

    def __init__(
        self,
        mode: Optional[Literal['pipeline', 'graph']]=None,
        name: Optional[str]=None,
        description: Optional[str]=None,
        context: Optional[Mapping[str, Union[str, int, float, bool]]]=None,
        strict: Optional[bool]=None
    ):
        self.mode = mode
        self.name = name
        self.description = description
        self.context = {k: v for k, v in context.items()} if context is not None else {}
        self.strict = strict
        self.rules = []

    def _check_plan_mode(self, rule: BaseRule):
        mode = self.get_mode()
        if mode is not None:
            _new_rule_mode = plan_mode_from_rule(rule)
            if _new_rule_mode is not None and mode != _new_rule_mode:
                raise InvalidPlanError(f"Mixing of rules taking named inputs and rules with no named inputs is not supported. ({mode} vs. {rule.__class__}'s mode {_new_rule_mode})")

    def get_mode(self) -> Optional[Literal['pipeline', 'graph']]:
        """ Return the mode (pipeline or graph) of the plan. """
        if self.mode is None:
            self.mode = plan_mode_from_rules(self.rules)
        return self.mode

    def get_context(self) -> dict[str, Union[str, int, float, bool]]:
        return self.context

    def add_rule(self, rule: BaseRule) -> None:
        """ Add a new rule to the plan.

        Args:
            rule: A rule instance to add to the plan

        Raises:
            InvalidPlanError: if the rules are mixed (pipeline vs. graph - ie. mixing use of named inputs/outputs and not using them)
        """
        assert isinstance(rule, BaseRule)
        self._check_plan_mode(rule)
        self.rules.append(rule)

    def __iter__(self):
        yield from self.rules

    def get_rule(self, idx: int) -> BaseRule:
        """ Return the rule at a certain index as per order of addition to the plan. """
        return self.rules[idx]

    def is_empty(self) -> bool:
        """ Return True if the plan has no rules, False otherwise.

        Returns:
            A boolean to indicate if the plan is empty.
        """
        return not self.rules

    def to_dict(self) -> dict:
        """ Serialize the plan to a dict.

        Returns:
            A dictionary with the plan representation.
        """
        rules = [rule.to_dict() for rule in self.rules]
        return {
            "name": self.name,
            "description": self.description,
            "context": self.context,
            "strict": self.strict,
            "rules": rules
        }

    @classmethod
    def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
        """ Creates a plan instance from a python dictionary.

        Args:
            dct: A dictionary to create the plan from
            backend: One of the supported backends (ie pandas)
            additional_packages: Optional list of other packages to look for rules in
        Returns:
            A new instance of a Plan.
        """
        instance = Plan(
            name=dct.get("name"),
            description=dct.get("description"),
            context=dct.get("context"),
            strict=dct.get("strict"),
        )
        rules = dct.get("rules", ())
        for rule in rules:
            instance.add_rule(BaseRule.from_dict(rule, backend, additional_packages))
        return instance

    def to_yaml(self) -> str:
        """ Serialize the plan to yaml. """
        return yaml.safe_dump(self.to_dict())

    @classmethod
    def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
        """ Creates a plan from a yaml definition.

        Args:
            yml: The yaml string to create the plan from
            backend: A supported backend (ie pandas)
            additional_packages: Optional list of other packages to look for rules in

        Returns:
            A new instance of a Plan.
        """
        dct = yaml.safe_load(yml)
        return cls.from_dict(dct, backend, additional_packages)

    def __eq__(self, other: 'Plan') -> bool:
        return (
            type(self) == type(other) and 
            self.name == other.name and self.description == other.description and
            self.strict == other.strict and self.rules == other.rules
        )

add_rule(self, rule)

Add a new rule to the plan.

Parameters:

Name Type Description Default
rule BaseRule

A rule instance to add to the plan

required

Exceptions:

Type Description
InvalidPlanError

if the rules are mixed (pipeline vs. graph - ie. mixing use of named inputs/outputs and not using them)

Source code in etlrules/plan.py
def add_rule(self, rule: BaseRule) -> None:
    """ Add a new rule to the plan.

    Args:
        rule: A rule instance to add to the plan

    Raises:
        InvalidPlanError: if the rules are mixed (pipeline vs. graph - ie. mixing use of named inputs/outputs and not using them)
    """
    assert isinstance(rule, BaseRule)
    self._check_plan_mode(rule)
    self.rules.append(rule)

from_dict(dct, backend, additional_packages=None) classmethod

Creates a plan instance from a python dictionary.

Parameters:

Name Type Description Default
dct dict

A dictionary to create the plan from

required
backend str

One of the supported backends (ie pandas)

required
additional_packages Optional[Sequence[str]]

Optional list of other packages to look for rules in

None

Returns:

Type Description
Plan

A new instance of a Plan.

Source code in etlrules/plan.py
@classmethod
def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
    """ Creates a plan instance from a python dictionary.

    Args:
        dct: A dictionary to create the plan from
        backend: One of the supported backends (ie pandas)
        additional_packages: Optional list of other packages to look for rules in
    Returns:
        A new instance of a Plan.
    """
    instance = Plan(
        name=dct.get("name"),
        description=dct.get("description"),
        context=dct.get("context"),
        strict=dct.get("strict"),
    )
    rules = dct.get("rules", ())
    for rule in rules:
        instance.add_rule(BaseRule.from_dict(rule, backend, additional_packages))
    return instance

from_yaml(yml, backend, additional_packages=None) classmethod

Creates a plan from a yaml definition.

Parameters:

Name Type Description Default
yml str

The yaml string to create the plan from

required
backend str

A supported backend (ie pandas)

required
additional_packages Optional[Sequence[str]]

Optional list of other packages to look for rules in

None

Returns:

Type Description
Plan

A new instance of a Plan.

Source code in etlrules/plan.py
@classmethod
def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
    """ Creates a plan from a yaml definition.

    Args:
        yml: The yaml string to create the plan from
        backend: A supported backend (ie pandas)
        additional_packages: Optional list of other packages to look for rules in

    Returns:
        A new instance of a Plan.
    """
    dct = yaml.safe_load(yml)
    return cls.from_dict(dct, backend, additional_packages)

get_mode(self)

Return the mode (pipeline or graph) of the plan.

Source code in etlrules/plan.py
def get_mode(self) -> Optional[Literal['pipeline', 'graph']]:
    """ Return the mode (pipeline or graph) of the plan. """
    if self.mode is None:
        self.mode = plan_mode_from_rules(self.rules)
    return self.mode

get_rule(self, idx)

Return the rule at a certain index as per order of addition to the plan.

Source code in etlrules/plan.py
def get_rule(self, idx: int) -> BaseRule:
    """ Return the rule at a certain index as per order of addition to the plan. """
    return self.rules[idx]

is_empty(self)

Return True if the plan has no rules, False otherwise.

Returns:

Type Description
bool

A boolean to indicate if the plan is empty.

Source code in etlrules/plan.py
def is_empty(self) -> bool:
    """ Return True if the plan has no rules, False otherwise.

    Returns:
        A boolean to indicate if the plan is empty.
    """
    return not self.rules

to_dict(self)

Serialize the plan to a dict.

Returns:

Type Description
dict

A dictionary with the plan representation.

Source code in etlrules/plan.py
def to_dict(self) -> dict:
    """ Serialize the plan to a dict.

    Returns:
        A dictionary with the plan representation.
    """
    rules = [rule.to_dict() for rule in self.rules]
    return {
        "name": self.name,
        "description": self.description,
        "context": self.context,
        "strict": self.strict,
        "rules": rules
    }

to_yaml(self)

Serialize the plan to yaml.

Source code in etlrules/plan.py
def to_yaml(self) -> str:
    """ Serialize the plan to yaml. """
    return yaml.safe_dump(self.to_dict())
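
Serialization round-trips: to_yaml produces a yaml document which from_yaml can reconstruct, given a backend to resolve the rule classes. A sketch using SortRule as in the examples above::

plan = Plan(name="my_plan")
plan.add_rule(SortRule(['A'], named_input="input", named_output="sorted"))
yml = plan.to_yaml()
restored = Plan.from_yaml(yml, backend="pandas")
assert restored == plan  # __eq__ compares name, description, strict and rules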

rule

BaseRule

The base class for all rules.

Derive your custom rules from BaseRule in order to use them in a plan. Implement the following methods as needed:

apply: mandatory, it implements the functionality of the rule

has_input: defaults to True, override and return False if your rule reads data into the plan and therefore has no other dataframe input

has_output: defaults to True, override and return False if your rule writes data to a persistent repository and therefore has no dataframe output

get_all_named_inputs: override to return the named inputs (if any) as strings

get_all_named_outputs: override in case of multiple named outputs and return them as strings

named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.

name (Optional[str]): Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.

description (Optional[str]): Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule.

strict (bool): When set to True, the rule does a stricter validation. Default: True

Note

Add any class data members to the following list/tuples if needed:

EXCLUDE_FROM_COMPARE: Used in implementing equality between rules. Equality is mostly used in tests. By default, equality looks at all data members in the class' __dict__. You can exclude calculated or transient data members which should be excluded from equality. Alternatively, you can implement __eq__ in your own class and not rely on the __eq__ implementation in the base class.

EXCLUDE_FROM_SERIALIZE: Used to exclude data members from the serialization to dict and yaml. The serialization is implemented generically in the base class to serialize all data members in the class' __dict__ which do not start with an underscore. See the note on serialization below.

Note

When implementing serialization, the arguments into your class should be saved as they are in data members with the same name as the arguments. This is because the de-serialization passes those as args into the __init__. As such, make sure to use the same names and to exclude data members which are not in the __init__ from serialization by adding them to EXCLUDE_FROM_SERIALIZE.
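
Putting the notes together, a minimal custom rule might look like the sketch below. The data.get_main_output() accessor is an assumption about RuleData's API (only _set_output_df is shown in the base class); the hypothetical rule drops all-NA rows from the main dataframe using the pandas backend::

from etlrules.rule import BaseRule

class DropEmptyRowsRule(BaseRule):  # hypothetical example rule
    def apply(self, data):
        super().apply(data)  # asserts data is a RuleData instance
        df = data.get_main_output()  # assumed accessor; pipeline mode
        df = df.dropna(how="all")    # pandas backend assumed
        self._set_output_df(data, df)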

Source code in etlrules/rule.py
class BaseRule:
    """ The base class for all rules.

    Derive your custom rules from BaseRule in order to use them in a plan.
    Implement the following methods as needed:
    apply: mandatory, it implements the functionality of the rule
    has_input: defaults to True, override and return False if your rule reads data
        into the plan and therefore has no other dataframe input
    has_output: defaults to True, override and return False if your rule writes data
        to a persistent repository and therefore has no dataframe output
    get_all_named_inputs: override to return the named inputs (if any) as strings
    get_all_named_outputs: override in case of multiple named outputs and return them as strings

    named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
        When not set, the result of this rule will be available as the main output.
        When set to a name (string), the result will be available as that named output.
    name (Optional[str]): Give the rule a name. Optional.
        Named rules are more descriptive as to what they're trying to do/the intent.
    description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
        Together with the name, the description acts as the documentation of the rule.
    strict (bool): When set to True, the rule does a stricter validation. Default: True

    Note:
        Add any class data members to the following list/tuples if needed:
        EXCLUDE_FROM_COMPARE: Used in implementing equality between rules. Equality is
            mostly used in tests. By default, equality looks at all data members in the
            class' __dict__. You can exclude calculated or transient data members which
            should be excluded from equality. Alternatively, you can implement __eq__ in
            your own class and not rely on the __eq__ implementation in the base class.
        EXCLUDE_FROM_SERIALIZE: Used to exclude data members from the serialization to
            dict and yaml. The serialization is implemented generically in the base class
            to serialize all data members in the class' __dict__ which do not start with
            an underscore. See the note on serialization below.

    Note:
        When implementing serialization, the arguments into your class should be saved as
        they are in data members with the same name as the arguments. This is because the
        de-serialization passes those as args into the __init__. As such, make sure to use
        the same names and to exclude data members which are not in the __init__ from
        serialization by adding them to EXCLUDE_FROM_SERIALIZE.

    """

    EXCLUDE_FROM_COMPARE = ()
    EXCLUDE_FROM_SERIALIZE = ()

    def __init__(self, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        assert named_output is None or isinstance(named_output, str) and named_output
        self.named_output = named_output
        self.name = name
        self.description = description
        self.strict = strict

    def get_name(self) -> Optional[str]:
        """ Returns the name of the rule.

        The name is optional and it can be None.

        The name of the rule should indicate what the rule does and not how it's
        implemented. The names should read like documentation. As such, names like

        Remove duplicate first names from the list of addresses
        Only keep the names starting with A

        are preferable names to:

        DedupeRule
        ProjectRule

        Names are not used internally for anything other than your own (and your 
        end users') documentation, so use what makes sense.
        """
        return self.name

    def get_description(self) -> Optional[str]:
        """ A long description of what the rule does, why and optionally how it does it.

        The description is optional and it can be None.

        Similar to name, this long description acts as documentation for you and your users.
        It's particularly useful if your rule is serialized in a readable format like yaml
        and your users either do not have access to the documentation or they are not technical.

        Unlike the name, which should generally be a single line headline, the description is a
        long, multi-line description of the rule: the what, why, how of the rule.
        """
        return self.description

    def has_input(self) -> bool:
        """ Returns True if the rule needs a dataframe input to operate on, False otherwise.

        By default, it returns True. It should be overridden to return False for those
        rules which read data into the plan. For example, reading a csv file or reading a
        table from the DB. These are operations which do not need an input dataframe to
        operate on as they are sourcing data.
        """
        return True

    def has_output(self) -> bool:
        """ Returns True if the rule produces a dataframe, False otherwise.

        By default, it returns True. It should be overridden to return False for those
        rules which write data out of the plan. For example, writing a file or data into a
        database. These are operations which do not produce an output dataframe into
        the plan as they are writing data outside the plan.
        """
        return True

    def has_named_output(self) -> bool:
        return bool(self.named_output)

    def get_all_named_inputs(self) -> Generator[str, None, None]:
        """ Yields all the named inputs of this rule (as strings).

        By default, it yields nothing as this base rule doesn't store
        information about inputs. Some rules take no input, some take
        one or more inputs. Yield accordingly when you override.
        """
        yield from ()

    def get_all_named_outputs(self) -> Generator[str, None, None]:
        """ Yields all the named outputs of this rule (as strings).

        By default, it yields the single named_output passed into this
        rule as an argument. Some rules produce no output, some produce
        multiple outputs. Yield accordingly when you override.
        """
        yield self.named_output

    def _set_output_df(self, data, df):
        if self.named_output is None:
            data.set_main_output(df)
        else:
            data.set_named_output(self.named_output, df)

    def apply(self, data: RuleData) -> None:
        """ Applies the main rule logic to the input data.

        This is the main method that applies the rule's logic to the input data.
        The input data is an instance of RuleData, which can store a single, unnamed
        dataframe (in pipeline mode) or one or more named dataframes (in graph mode).
        The rule extracts the data it needs, applies its main logic and updates
        the same RuleData instance with the output, if any.

        This method doesn't do anything in the base class other than asserting that
        the data passed in is an instance of RuleData. Override this when you derive
        from BaseRule and implement the logic of your rule.

        """
        assert isinstance(data, RuleData)

    def to_dict(self) -> dict:
        """ Serializes this rule to a python dictionary.

        This is a generic implementation that should work for all derived
        classes and therefore you shouldn't need to override it, although you can do so.

        Because it aims to be generic and work correctly for all the derived classes,
        a few assumptions are made and must be respected when you implement your own
        rules derived from BaseRule.

        The class will serialize all the data attributes of a class which do not start with
        underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member
        of the class. As such, to exclude any of your internal data attributes, either name
        them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.

        The serializer will look into a class's __dict__ and therefore the class must have a
        __dict__.

        For the de-serialization to work generically, the names of the attributes must match the
        names of the arguments in the __init__. This is quite an important and restrictive
        constraint which is needed to avoid forcing every rule to implement its own serialize/deserialize.

        Note:
            Use the same name for attributes on self as the respective arguments in __init__.

        """
        dct = {
            "name": self.name,
            "description": self.description,
        }
        dct.update({
            attr: value for attr, value in self.__dict__.items() 
                if not attr.startswith("_") and attr not in self.EXCLUDE_FROM_SERIALIZE
                and attr not in dct.keys()
        })
        return {
            self.__class__.__name__: dct
        }

    @classmethod
    def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
        """ Creates a rule instance from a python dictionary.

        Args:
            dct: A dictionary to create the rule from
            backend: One of the supported backends (e.g. pandas)
            additional_packages: Optional list of other packages to look for rules in
        Returns:
            A new instance of a rule.
        """
        assert backend and isinstance(backend, str)
        keys = tuple(dct.keys())
        assert len(keys) == 1
        rule_name = keys[0]
        backend_pkgs = [f'etlrules.backends.{backend}']
        for additional_package in additional_packages or ():
            backend_pkgs.append(additional_package)
        modules = [importlib.import_module(backend_pkg, '') for backend_pkg in backend_pkgs]
        for mod in modules:
            clss = getattr(mod, rule_name, None)
            if clss is not None:
                break
        assert clss, f"Cannot find class {rule_name} in packages: {backend_pkgs}"
        if clss is not cls:
            return clss.from_dict(dct, backend, additional_packages)
        return clss(**dct[rule_name])

    def to_yaml(self):
        """ Serialize the rule to yaml. """
        return yaml.safe_dump(self.to_dict())

    @classmethod
    def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
        """ Creates a rule instance from a yaml definition.

        Args:
            yml: The yaml string to create the rule from
            backend: A supported backend (e.g. pandas)
            additional_packages: Optional list of other packages to look for rules in

        Returns:
            A new instance of a rule.
        """
        dct = yaml.safe_load(yml)
        return cls.from_dict(dct, backend, additional_packages)

    def __eq__(self, other) -> bool:
        return (
            type(self) == type(other) and 
            {k: v for k, v in self.__dict__.items() if k not in self.EXCLUDE_FROM_COMPARE} == 
            {k: v for k, v in other.__dict__.items() if k not in self.EXCLUDE_FROM_COMPARE}
        )

apply(self, data)

Applies the main rule logic to the input data.

This is the main method which applies the rule's logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.

This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.

Source code in etlrules/rule.py
def apply(self, data: RuleData) -> None:
    """ Applies the main rule logic to the input data.

    This is the main method which applies the rule's logic to the input data.
    The input data is an instance of RuleData which can store a single, unnamed
    dataframe (in pipeline mode) or one or many named dataframes (in graph mode).
    The rule extracts the data it needs from the data, applies its main logic
    and updates the same instance of RuleData with the output, if any.

    This method doesn't do anything in the base class other than asserting that
    the data passed in is an instance of RuleData. Override this when you derive
    from BaseRule and implement the logic of your rule.

    """
    assert isinstance(data, RuleData)
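
A typical override, as a minimal sketch: the rule class, the column name and the pandas-style transformation below are hypothetical and for illustration only; the UnaryOpBaseRule plumbing (_get_input_df, _set_output_df) is from this module:

from etlrules.rule import UnaryOpBaseRule

class UpperCaseNameRule(UnaryOpBaseRule):
    """ Hypothetical rule: upper-cases the 'name' column (pandas backend assumed). """

    def apply(self, data):
        super().apply(data)            # validates data is a RuleData instance
        df = self._get_input_df(data)  # the named input or the main output
        df = df.assign(name=df["name"].str.upper())  # the rule's own logic
        self._set_output_df(data, df)  # publish as the named output or the main output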

from_dict(dct, backend, additional_packages=None) classmethod

Creates a rule instance from a python dictionary.

Parameters:

Name Type Description Default
dct dict

A dictionary to create the rule from

required
backend str

One of the supported backends (e.g. pandas)

required
additional_packages Optional[Sequence[str]]

Optional list of other packages to look for rules in

None

Returns:

Type Description
BaseRule

A new instance of a rule.

Source code in etlrules/rule.py
@classmethod
def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
    """ Creates a rule instance from a python dictionary.

    Args:
        dct: A dictionary to create the rule from
        backend: One of the supported backends (e.g. pandas)
        additional_packages: Optional list of other packages to look for rules in
    Returns:
        A new instance of a rule.
    """
    assert backend and isinstance(backend, str)
    keys = tuple(dct.keys())
    assert len(keys) == 1
    rule_name = keys[0]
    backend_pkgs = [f'etlrules.backends.{backend}']
    for additional_package in additional_packages or ():
        backend_pkgs.append(additional_package)
    modules = [importlib.import_module(backend_pkg, '') for backend_pkg in backend_pkgs]
    for mod in modules:
        clss = getattr(mod, rule_name, None)
        if clss is not None:
            break
    assert clss, f"Cannot find class {rule_name} in packages: {backend_pkgs}"
    if clss is not cls:
        return clss.from_dict(dct, backend, additional_packages)
    return clss(**dct[rule_name])
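
As a hedged sketch of the expected dictionary shape: a single key (the rule class name) maps to the rule's __init__ keyword arguments. DedupeRule and its columns argument are assumptions for illustration; substitute any rule available in your backend:

from etlrules.backends.pandas import DedupeRule  # DedupeRule args assumed
from etlrules.rule import BaseRule

rule = DedupeRule(columns=["first_name"], named_output="deduped")
dct = rule.to_dict()   # {"DedupeRule": {"name": ..., "columns": [...], ...}}
rule2 = BaseRule.from_dict(dct, backend="pandas")
assert rule == rule2   # __eq__ compares the serializable attributes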

from_yaml(yml, backend, additional_packages=None) classmethod

Creates a rule instance from a yaml definition.

Parameters:

Name Type Description Default
yml str

The yaml string to create the rule from

required
backend str

A supported backend (e.g. pandas)

required
additional_packages Optional[Sequence[str]]

Optional list of other packages to look for rules in

None

Returns:

Type Description
BaseRule

A new instance of a rule.

Source code in etlrules/rule.py
@classmethod
def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
    """ Creates a rule instance from a yaml definition.

    Args:
        yml: The yaml string to create the rule from
        backend: A supported backend (ie pandas)
        additional_packages: Optional list of other packages to look for rules in

    Returns:
        A new instance of a rule.
    """
    dct = yaml.safe_load(yml)
    return cls.from_dict(dct, backend, additional_packages)
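
The yaml mirrors the dictionary form: a single top-level key naming the rule class, with the constructor arguments nested under it. The rule and arguments below are assumptions for illustration:

from etlrules.rule import BaseRule

yml = """
DedupeRule:
  columns: [first_name]
  named_output: deduped
"""
rule = BaseRule.from_yaml(yml, backend="pandas")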

get_all_named_inputs(self)

Yields all the named inputs of this rule (as strings).

By default, it yields nothing as this base rule doesn't store information about inputs. Some rules take no input, some take one or more inputs. Yield accordingly when you override.

Source code in etlrules/rule.py
def get_all_named_inputs(self) -> Generator[str, None, None]:
    """ Yields all the named inputs of this rule (as strings).

    By default, it yields nothing as this base rule doesn't store
    information about inputs. Some rules take no input, some take
    one or more inputs. Yield accordingly when you override.
    """
    yield from ()

get_all_named_outputs(self)

Yields all the named outputs of this rule (as strings).

By default, it yields the single named_output passed into this rule as an argument. Some rules produce no output, some produce multiple outputs. Yield accordingly when you override.

Source code in etlrules/rule.py
def get_all_named_outputs(self) -> Generator[str, None, None]:
    """ Yields all the named outputs of this rule (as strings).

    By default, it yields the single named_output passed into this
    rule as an argument. Some rules produce no output, some produce
    multiple outputs. Yield accordingly when you override.
    """
    yield self.named_output

get_description(self)

A long description of what the rule does, why and optionally how it does it.

The description is optional and it can be None.

Similar to name, this long description acts as documentation for you and your users. It's particularly useful if your rule is serialized in a readable format like yaml and your users either do not have access to the documentation or they are not technical.

Unlike the name, which should generally be a single line headline, the description is a long, multi-line description of the rule: the what, why, how of the rule.

Source code in etlrules/rule.py
def get_description(self) -> Optional[str]:
    """ A long description of what the rule does, why and optionally how it does it.

    The description is optional and it can be None.

    Similar to name, this long description acts as documentation for you and your users.
    It's particularly useful if your rule is serialized in a readable format like yaml
    and your users either do not have access to the documentation or they are not technical.

    Unlike the name, which should generally be a single line headline, the description is a
    long, multi-line description of the rule: the what, why, how of the rule.
    """
    return self.description

get_name(self)

Returns the name of the rule.

The name is optional and it can be None.

The name of the rule should indicate what the rule does and not how it's implemented. The names should read like documentation. As such, names like

Remove duplicate first names from the list of addresses
Only keep the names starting with A

are preferable to:

DedupeRule
ProjectRule

Names are not used internally for anything other than your own (and your end users') documentation, so use what makes sense.

Source code in etlrules/rule.py
def get_name(self) -> Optional[str]:
    """ Returns the name of the rule.

    The name is optional and it can be None.

    The name of the rule should indicate what the rule does and not how it's
    implemented. The names should read like documentation. As such, names like

    Remove duplicate first names from the list of addresses
    Only keep the names starting with A

    are preferable to:

    DedupeRule
    ProjectRule

    Names are not used internally for anything other than your own (and your 
    end users') documentation, so use what makes sense.
    """
    return self.name

has_input(self)

Returns True if the rule needs a dataframe input to operate on, False otherwise.

By default, it returns True. It should be overridden to return False for those rules which read data into the plan, for example reading a csv file or reading a table from a database. These are operations which do not need an input dataframe to operate on as they source data.

Source code in etlrules/rule.py
def has_input(self) -> bool:
    """ Returns True if the rule needs a dataframe input to operate on, False otherwise.

    By default, it returns True. It should be overridden to return False for those
    rules which read data into the plan, for example reading a csv file or reading a
    table from a database. These are operations which do not need an input dataframe
    to operate on as they source data.
    """
    return True
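
A minimal sketch of a source rule overriding has_input(); the rule class and the csv reading are illustrative assumptions, not part of this module:

import pandas as pd
from etlrules.rule import BaseRule

class ReadCSVRule(BaseRule):
    """ Hypothetical source rule: reads a csv file into the plan. """

    def __init__(self, path, named_output=None, name=None, description=None, strict=True):
        super().__init__(named_output=named_output, name=name, description=description, strict=strict)
        self.path = path

    def has_input(self):
        return False  # sources data, so no input dataframe is needed

    def apply(self, data):
        super().apply(data)
        self._set_output_df(data, pd.read_csv(self.path))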

has_output(self)

Returns True if the rule produces a dataframe, False otherwise.

By default, it returns True. It should be overridden to return False for those rules which write data out of the plan, for example writing a file or writing data into a database. These are operations which do not produce an output dataframe into the plan as they write data outside of it.

Source code in etlrules/rule.py
def has_output(self) -> bool:
    """ Returns True if the rule produces a dataframe, False otherwise.

    By default, it returns True. It should be overridden to return False for those
    rules which write data out of the plan, for example writing a file or writing data
    into a database. These are operations which do not produce an output dataframe
    into the plan as they write data outside of it.
    """
    return True
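
Conversely, a hedged sketch of a sink rule overriding has_output(); again, the rule class and the csv writing are illustrative assumptions:

from etlrules.rule import UnaryOpBaseRule

class WriteCSVRule(UnaryOpBaseRule):
    """ Hypothetical sink rule: writes the input dataframe to a csv file. """

    def __init__(self, path, named_input=None, name=None, description=None, strict=True):
        super().__init__(named_input=named_input, name=name, description=description, strict=strict)
        self.path = path

    def has_output(self):
        return False  # writes outside the plan, so no output dataframe is produced

    def apply(self, data):
        super().apply(data)
        self._get_input_df(data).to_csv(self.path, index=False)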

to_dict(self)

Serializes this rule to a python dictionary.

This is a generic implementation that should work for all derived classes and therefore you shouldn't need to override it, although you can do so.

Because it aims to be generic and work correctly for all the derived classes, a few assumptions are made and must be respected when you implement your own rules derived from BaseRule.

The class will serialize all the data attributes of a class which do not start with underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member of the class. As such, to exclude any of your internal data attributes, either name them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.

The serializer will look into a class's __dict__ and therefore the class must have a __dict__.

For the de-serialization to work generically, the names of the attributes must match the names of the arguments in the __init__. This is quite an important and restrictive constraint which is needed to avoid forcing every rule to implement its own serialize/deserialize.

Note

Use the same name for attributes on self as the respective arguments in __init__.

Source code in etlrules/rule.py
def to_dict(self) -> dict:
    """ Serializes this rule to a python dictionary.

    This is a generic implementation that should work for all derived
    classes and therefore you shouldn't need to override it, although you can do so.

    Because it aims to be generic and work correctly for all the derived classes,
    a few assumptions are made and must be respected when you implement your own
    rules derived from BaseRule.

    The class will serialize all the data attributes of a class which do not start with
    underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member
    of the class. As such, to exclude any of your internal data attributes, either name
    them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.

    The serializer will look into a class's __dict__ and therefore the class must have a
    __dict__.

    For the de-serialization to work generically, the names of the attributes must match the
    names of the arguments in the __init__. This is quite an important and restrictive
    constraint which is needed to avoid forcing every rule to implement its own serialize/deserialize.

    Note:
        Use the same name for attributes on self as the respective arguments in __init__.

    """
    dct = {
        "name": self.name,
        "description": self.description,
    }
    dct.update({
        attr: value for attr, value in self.__dict__.items() 
            if not attr.startswith("_") and attr not in self.EXCLUDE_FROM_SERIALIZE
            and attr not in dct.keys()
    })
    return {
        self.__class__.__name__: dct
    }
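
A minimal sketch of the naming constraint; the rule below is hypothetical and exists only to illustrate the serialization contract. Every __init__ argument is stored on self under the same name, so the generic to_dict/from_dict pair can rebuild the rule, while underscore-prefixed attributes stay out of the serialized form:

from etlrules.rule import UnaryOpBaseRule

class RenameColumnRule(UnaryOpBaseRule):
    """ Hypothetical rule illustrating the to_dict/from_dict contract. """

    def __init__(self, column, new_name, named_input=None, named_output=None, name=None, description=None, strict=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.column = column      # same name as the __init__ argument
        self.new_name = new_name  # same name as the __init__ argument
        self._cache = None        # leading underscore: excluded from to_dict

rule = RenameColumnRule("A", "B", name="Rename column A to B")
dct = rule.to_dict()
assert "RenameColumnRule" in dct                  # keyed by the class name
assert dct["RenameColumnRule"]["column"] == "A"   # public attribute serialized
assert "_cache" not in dct["RenameColumnRule"]    # underscore attribute excluded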

to_yaml(self)

Serialize the rule to yaml.

Source code in etlrules/rule.py
def to_yaml(self):
    """ Serialize the rule to yaml. """
    return yaml.safe_dump(self.to_dict())

BinaryOpBaseRule (BaseRule)

Base class for binary operation rules (ie operations taking two data frames as input).

Source code in etlrules/rule.py
class BinaryOpBaseRule(BaseRule):
    """ Base class for binary operation rules (ie operations taking two data frames as input). """

    def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_output=named_output, name=name, description=description, strict=strict)
        assert named_input_left is None or isinstance(named_input_left, str) and named_input_left
        assert named_input_right is None or isinstance(named_input_right, str) and named_input_right
        self.named_input_left = named_input_left
        self.named_input_right = named_input_right

    def _get_input_df_left(self, data: RuleData):
        if self.named_input_left is None:
            return data.get_main_output()
        return data.get_named_output(self.named_input_left)

    def _get_input_df_right(self, data: RuleData):
        if self.named_input_right is None:
            return data.get_main_output()
        return data.get_named_output(self.named_input_right)

    def get_all_named_inputs(self):
        yield self.named_input_left
        yield self.named_input_right

get_all_named_inputs(self)

Yields all the named inputs of this rule (as strings).

For binary rules, this override yields the left and right named inputs passed into the rule.

Source code in etlrules/rule.py
def get_all_named_inputs(self):
    yield self.named_input_left
    yield self.named_input_right
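
A hedged sketch of deriving a binary rule; the concatenation logic is a pandas-flavoured assumption for illustration, not a rule from this library:

import pandas as pd
from etlrules.rule import BinaryOpBaseRule

class NaiveConcatRule(BinaryOpBaseRule):
    """ Hypothetical rule: stacks the right dataframe under the left one. """

    def apply(self, data):
        super().apply(data)
        left = self._get_input_df_left(data)    # named_input_left or main output
        right = self._get_input_df_right(data)  # named_input_right or main output
        self._set_output_df(data, pd.concat([left, right], ignore_index=True))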

UnaryOpBaseRule (BaseRule)

Base class for unary operation rules (ie operations taking a single data frame as input).

Source code in etlrules/rule.py
class UnaryOpBaseRule(BaseRule):
    """ Base class for unary operation rules (ie operations taking a single data frame as input). """

    def __init__(self, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_output=named_output, name=name, description=description, strict=strict)
        assert named_input is None or isinstance(named_input, str) and named_input
        self.named_input = named_input

    def _get_input_df(self, data: RuleData):
        if self.named_input is None:
            return data.get_main_output()
        return data.get_named_output(self.named_input)

    def get_all_named_inputs(self):
        yield self.named_input

get_all_named_inputs(self)

Yields all the named inputs of this rule (as strings).

For unary rules, this override yields the single named input passed into the rule.

Source code in etlrules/rule.py
def get_all_named_inputs(self):
    yield self.named_input

runner

load_plan(plan_file, backend)

Load a plan from a yaml file.

Basic usage:

from etlrules import load_plan
plan = load_plan("/home/someuser/some_plan.yml", "pandas")

Parameters:

Name Type Description Default
plan_file str

A path to a yaml file with the plan definition

required
backend str

One of the supported backends (e.g. pandas, polars)

required

Returns:

Type Description
Plan

A Plan instance deserialized from the yaml file.

Source code in etlrules/runner.py
def load_plan(plan_file: str, backend: str) -> Plan:
    """ Load a plan from a yaml file.

    Basic usage:

        from etlrules import load_plan
        plan = load_plan("/home/someuser/some_plan.yml", "pandas")

    Args:
        plan_file: A path to a yaml file with the plan definition
        backend: One of the supported backends (e.g. pandas, polars)

    Returns:
        A Plan instance deserialized from the yaml file.
    """
    with open(plan_file, 'rt') as plan_f:
        contents = plan_f.read()
    return Plan.from_yaml(contents, backend)

run_plan(plan_file, backend)

Runs a plan from a yaml file with a given backend.

The backend refers to the underlying dataframe library used to run the plan.

Basic usage:

from etlrules import run_plan
data = run_plan("/home/someuser/some_plan.yml", "pandas")

Parameters:

Name Type Description Default
plan_file str

A path to a yaml file with the plan definition

required
backend str

One of the supported backends

required

Note

The supported backends: pandas, polars, dask (work in progress)

Returns:

Type Description
RuleData

A RuleData instance which contains the result dataframe(s).

Source code in etlrules/runner.py
def run_plan(plan_file: str, backend: str) -> RuleData:
    """ Runs a plan from a yaml file with a given backend.

    The backend refers to the underlying dataframe library used to run
    the plan.

    Basic usage:

        from etlrules import run_plan
        data = run_plan("/home/someuser/some_plan.yml", "pandas")

    Args:
        plan_file: A path to a yaml file with the plan definition
        backend: One of the supported backends

    Note:
        The supported backends:
            pandas, polars, dask (work in progress)

    Returns:
        A RuleData instance which contains the result dataframe(s).
    """
    plan = load_plan(plan_file, backend)
    args = get_args_parser(plan)
    context = {}
    context.update(args)
    etlrules_tempdir, etlrules_tempdir_cleanup = get_etlrules_temp_dir()
    context.update({
        "etlrules_tempdir": etlrules_tempdir,
        "etlrules_tempdir_cleanup": etlrules_tempdir_cleanup,
    })
    try:
        data = RuleData(context=context)
        engine = RuleEngine(plan)
        engine.run(data)
    finally:
        if etlrules_tempdir_cleanup:
            shutil.rmtree(etlrules_tempdir)

    return data
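
For more control, a plan loaded with load_plan can be run manually, mirroring what run_plan does internally; the etlrules.data and etlrules.engine module paths are assumptions, and the argument parsing and temp-dir handling shown in run_plan are omitted:

from etlrules import load_plan
from etlrules.data import RuleData      # module path assumed
from etlrules.engine import RuleEngine  # module path assumed

plan = load_plan("/home/someuser/some_plan.yml", "pandas")
data = RuleData()        # context omitted; run_plan passes one in
engine = RuleEngine(plan)
engine.run(data)
result = data.get_main_output()  # the resulting dataframe, if the plan produces one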