API Reference
Top-level package for ETLrules.
backends
common
aggregate
AggregateRule (UnaryOpBaseRule)
Performs a SQL-like groupby and aggregation.
It takes a list of columns to group by and the result will have one row for each unique combination
of values in the group_by columns.
The rest of the columns (those not in the group_by) can be aggregated using either pre-defined aggregations or custom Python expressions.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| group_by | Iterable[str] | A list of columns to group the result by | required |
| aggregations | Optional[Mapping[str, str]] | A mapping {column_name: aggregation_function} which specifies how to aggregate columns which are not in the group_by list. | None |
| aggregation_expressions | Optional[Mapping[str, str]] | A mapping {column_name: aggregation_expression} which specifies how to aggregate columns which are not in the group_by list. Example: {"C": "';'.join(str(v) for v in values if not isnull(v))"}. The dask backend doesn't support aggregation_expressions. | None |
| aggregation_types | Optional[Mapping[str, str]] | An optional mapping of {column_name: column_type} which converts the respective output column to the given type. The supported types are: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64, string, boolean, datetime and timedelta. | None |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. | None |
| name | Optional[str] | Give the rule a name. Optional. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| ColumnAlreadyExistsError | raised if a column appears in multiple places in group_by/aggregations/aggregation_expressions. |
| ExpressionSyntaxError | raised if any aggregation expression (if any are passed in) has a Python syntax error. |
| MissingColumnError | raised in strict mode only if a column specified in aggregations or aggregation_expressions is missing from the input dataframe. If aggregation_types are specified, it is raised in strict mode if a column in the aggregation_types is missing from the input dataframe. |
| UnsupportedTypeError | raised if a type specified in aggregation_types is not supported. |
| ValueError | raised if a column in aggregations is aggregated using an unknown aggregation function. |
| TypeError | raised if an operation is not supported between the types involved. |
| NameError | raised if an unknown variable is used. |
Note
Other Python exceptions can be raised when custom aggregation expressions are used, depending on what the expression is doing.
Note
Any columns not in the group_by list and not present in either aggregations or aggregation_expressions will be dropped from the result.
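A minimal usage sketch, assuming the pandas backend re-exports the rule as etlrules.backends.pandas.AggregateRule and that RuleData is importable from etlrules.data (adjust the imports to your installation)::

    import pandas as pd

    from etlrules.backends.pandas import AggregateRule  # assumed import path
    from etlrules.data import RuleData                  # assumed import path

    df = pd.DataFrame({"Customer": ["A", "A", "B"], "Amount": [10, 20, 5]})

    rule = AggregateRule(
        group_by=["Customer"],
        aggregations={"Amount": "sum"},         # pre-defined aggregation
        aggregation_types={"Amount": "int64"},  # optional output type conversion
    )
    data = RuleData(df)              # a single, unnamed dataframe (pipeline mode)
    rule.apply(data)
    result = data.get_main_output()  # one row per unique Customer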
Source code in etlrules/backends/common/aggregate.py
class AggregateRule(UnaryOpBaseRule):
    """Performs a SQL-like groupby and aggregation.

    It takes a list of columns to group by and the result will have one row for each unique combination
    of values in the group_by columns.
    The rest of the columns (those not in the group_by) can be aggregated using either pre-defined aggregations
    or custom Python expressions.

    Args:
        group_by: A list of columns to group the result by
        aggregations: A mapping {column_name: aggregation_function} which specifies how to aggregate
            columns which are not in the group_by list.
            The following aggregation functions are supported::

                min: minimum of the values in the group
                max: maximum of the values in the group
                mean: The mathematical mean value in the group
                count: How many values are in the group, including NA
                countNoNA: How many values are in the group, excluding NA
                sum: The sum of the values in the group
                first: The first value in the group
                last: The last value in the group
                list: Produces a python list with all the values in the group, excluding NA
                tuple: Like list above but produces a tuple
                csv: Produces a comma separated string of values, excluding NA

        aggregation_expressions: A mapping {column_name: aggregation_expression} which specifies how to aggregate
            columns which are not in the group_by list.
            The aggregation expression is a string representing a valid Python expression which gets evaluated.
            The input will be in a variable `values`. `isnull` can be used to filter out NA.

            Example::

                {"C": "';'.join(str(v) for v in values if not isnull(v))"}

            The above aggregates the column C by producing a ; separated string of values in the group, excluding NA.
            The dask backend doesn't support aggregation_expressions.
        aggregation_types: An optional mapping of {column_name: column_type} which converts the respective output
            column to the given type. The supported types are: int8, int16, int32, int64, uint8, uint16,
            uint32, uint64, float32, float64, string, boolean, datetime and timedelta.
        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised if a column appears in multiple places in group_by/aggregations/aggregation_expressions.
        ExpressionSyntaxError: raised if any aggregation expression (if any are passed in) has a Python syntax error.
        MissingColumnError: raised in strict mode only if a column specified in aggregations or aggregation_expressions
            is missing from the input dataframe. If aggregation_types are specified, it is raised in strict mode if a column
            in the aggregation_types is missing from the input dataframe.
        UnsupportedTypeError: raised if a type specified in aggregation_types is not supported.
        ValueError: raised if a column in aggregations is aggregated using an unknown aggregation function
        TypeError: raised if an operation is not supported between the types involved
        NameError: raised if an unknown variable is used

    Note:
        Other Python exceptions can be raised when custom aggregation expressions are used, depending on what the expression is doing.

    Note:
        Any columns not in the group_by list and not present in either aggregations or aggregation_expressions will be dropped from the result.
    """

    AGGREGATIONS = {}
    EXCLUDE_FROM_COMPARE = ("_aggs",)

    def __init__(
        self,
        group_by: Iterable[str],
        aggregations: Optional[Mapping[str, str]] = None,
        aggregation_expressions: Optional[Mapping[str, str]] = None,
        aggregation_types: Optional[Mapping[str, str]] = None,
        named_input: Optional[str] = None,
        named_output: Optional[str] = None,
        name: Optional[str] = None,
        description: Optional[str] = None,
        strict: bool = True,
    ):
        super().__init__(
            named_input=named_input, named_output=named_output, name=name,
            description=description, strict=strict
        )
        self.group_by = [col for col in group_by]
        assert aggregations or aggregation_expressions, "One of aggregations or aggregation_expressions must be specified."
        if aggregations is not None:
            self.aggregations = {}
            for col, agg_func in aggregations.items():
                if col in self.group_by:
                    raise ColumnAlreadyExistsError(f"Column {col} appears in group_by and cannot be aggregated.")
                if agg_func not in self.AGGREGATIONS:
                    raise ValueError(f"'{agg_func}' is not a supported aggregation function.")
                self.aggregations[col] = agg_func
        else:
            self.aggregations = None
        self._aggs = {}
        if self.aggregations:
            if agg_func in ('list', 'csv'):
                perf_logger.warning("Aggregation '%s' in AggregateRule is not vectorized and might hurt the overall performance", agg_func)
            self._aggs.update({
                key: self.AGGREGATIONS[agg_func]
                for key, agg_func in (aggregations or {}).items()
            })
        if aggregation_expressions is not None:
            self.aggregation_expressions = {}
            for col, agg_expr in aggregation_expressions.items():
                if col in self.group_by:
                    raise ColumnAlreadyExistsError(f"Column {col} appears in group_by and cannot be aggregated.")
                if col in self._aggs:
                    raise ColumnAlreadyExistsError(f"Column {col} is already being aggregated.")
                try:
                    perf_logger.warning("Aggregation expression '%s' in AggregateRule is not vectorized and might hurt the overall performance", agg_expr)
                    _ast_expr = ast.parse(agg_expr, filename=f"{col}_expression.py", mode="eval")
                    _compiled_expr = compile(_ast_expr, filename=f"{col}_expression.py", mode="eval")
                    self._aggs[col] = lambda values, bound_compiled_expr=_compiled_expr: eval(
                        bound_compiled_expr, {"isnull": isnull}, {"values": values}
                    )
                except SyntaxError as exc:
                    raise ExpressionSyntaxError(f"Error in aggregation expression for column '{col}': '{agg_expr}': {str(exc)}")
                self.aggregation_expressions[col] = agg_expr
        if aggregation_types is not None:
            self.aggregation_types = {}
            for col, col_type in aggregation_types.items():
                if col not in self._aggs and col not in self.group_by:
                    if self.strict:
                        raise MissingColumnError(f"Column {col} is neither in the group by columns nor in the aggregations.")
                    else:
                        continue
                if col_type not in SUPPORTED_TYPES:
                    raise UnsupportedTypeError(f"Unsupported type '{col_type}' for column '{col}'.")
                self.aggregation_types[col] = col_type
        else:
            self.aggregation_types = None

    def do_aggregate(self, df, aggs):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        df_columns_set = set(df.columns)
        if not set(self._aggs) <= df_columns_set:
            if self.strict:
                raise MissingColumnError(f"Missing columns to aggregate by: {set(self._aggs) - df_columns_set}.")
            aggs = {
                col: agg for col, agg in self._aggs.items() if col in df_columns_set
            }
        else:
            aggs = self._aggs
        df = self.do_aggregate(df, aggs)
        self._set_output_df(data, df)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/aggregate.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    df_columns_set = set(df.columns)
    if not set(self._aggs) <= df_columns_set:
        if self.strict:
            raise MissingColumnError(f"Missing columns to aggregate by: {set(self._aggs) - df_columns_set}.")
        aggs = {
            col: agg for col, agg in self._aggs.items() if col in df_columns_set
        }
    else:
        aggs = self._aggs
    df = self.do_aggregate(df, aggs)
    self._set_output_df(data, df)
base
BaseAssignColumnRule (UnaryOpBaseRule, ColumnsInOutMixin)
Source code in etlrules/backends/common/base.py
class BaseAssignColumnRule(UnaryOpBaseRule, ColumnsInOutMixin):

    def __init__(self, input_column: str, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        assert input_column and isinstance(input_column, str), "input_column must be a non-empty string."
        assert output_column is None or (output_column and isinstance(output_column, str)), "output_column must be None or a non-empty string."
        self.input_column = input_column
        self.output_column = output_column

    def do_apply(self, df, col):
        raise NotImplementedError()

    def apply(self, data: RuleData):
        df = self._get_input_df(data)
        input_column, output_column = self.validate_in_out_columns(df.columns, self.input_column, self.output_column, self.strict)
        df = self.assign_do_apply(df, input_column, output_column)
        self._set_output_df(data, df)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/base.py
def apply(self, data: RuleData):
    df = self._get_input_df(data)
    input_column, output_column = self.validate_in_out_columns(df.columns, self.input_column, self.output_column, self.strict)
    df = self.assign_do_apply(df, input_column, output_column)
    self._set_output_df(data, df)
basic
DedupeRule (UnaryOpBaseRule)
De-duplicates by dropping duplicates using a set of columns to determine the duplicates.
It has logic to keep the first, the last or none of the duplicates in a set of duplicates.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| columns | Iterable[str] | A subset of columns in the data frame which are used to determine the set of duplicates. Any rows that have the same values in these columns are considered to be duplicates. | required |
| keep | Literal['first', 'last', 'none'] | What to keep in the de-duplication process. One of: first: keeps the first row in the duplicate set; last: keeps the last row in the duplicate set; none: drops all the duplicates. | 'first' |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| MissingColumnError | raised when a column specified to deduplicate on doesn't exist in the input data frame. |
Note
MissingColumnError is raised in both strict and non-strict modes. This is because the rule cannot operate reliably without a correct set of columns.
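A short usage sketch, under the same assumed import paths as the AggregateRule example above::

    from etlrules.backends.pandas import DedupeRule  # assumed import path
    from etlrules.data import RuleData               # assumed import path

    # keep only the last row for each unique (FirstName, LastName) combination
    rule = DedupeRule(columns=["FirstName", "LastName"], keep="last")
    data = RuleData(df)
    rule.apply(data)
    deduped = data.get_main_output()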
Source code in etlrules/backends/common/basic.py
class DedupeRule(UnaryOpBaseRule):
    """ De-duplicates by dropping duplicates using a set of columns to determine the duplicates.

    It has logic to keep the first, the last or none of the duplicates in a set of duplicates.

    Args:
        columns: A subset of columns in the data frame which are used to determine the set of duplicates.
            Any rows that have the same values in these columns are considered to be duplicates.
        keep: What to keep in the de-duplication process. One of:
            first: keeps the first row in the duplicate set
            last: keeps the last row in the duplicate set
            none: drops all the duplicates
        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised when a column specified to deduplicate on doesn't exist in the input data frame.

    Note:
        MissingColumnError is raised in both strict and non-strict modes. This is because the rule cannot operate reliably without a correct set of columns.
    """

    KEEP_FIRST = 'first'
    KEEP_LAST = 'last'
    KEEP_NONE = 'none'
    ALL_KEEPS = (KEEP_FIRST, KEEP_LAST, KEEP_NONE)

    def __init__(self, columns: Iterable[str], keep: Literal[KEEP_FIRST, KEEP_LAST, KEEP_NONE]=KEEP_FIRST, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.columns = [col for col in columns]
        assert all(
            isinstance(col, str) for col in self.columns
        ), "DedupeRule: columns must be strings"
        assert keep in self.ALL_KEEPS, f"DedupeRule: keep must be one of: {self.ALL_KEEPS}"
        self.keep = keep

    def do_dedupe(self, df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        if not set(self.columns) <= set(df.columns):
            raise MissingColumnError(f"Missing column(s) to dedupe on: {set(self.columns) - set(df.columns)}")
        df = self.do_dedupe(df)
        self._set_output_df(data, df)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    if not set(self.columns) <= set(df.columns):
        raise MissingColumnError(f"Missing column(s) to dedupe on: {set(self.columns) - set(df.columns)}")
    df = self.do_dedupe(df)
    self._set_output_df(data, df)
ExplodeValuesRule (UnaryOpBaseRule, ColumnsInOutMixin)
Explodes a list of values into multiple rows, with each value on a separate row.
Example:

Given df:

| A         |
|-----------|
| [1, 2, 3] |
| [4, 5]    |
| [6]       |

> ExplodeValuesRule("A").apply(df)

Result:

| A |
|---|
| 1 |
| 2 |
| 3 |
| 4 |
| 5 |
| 6 |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_column | str | A column containing lists of values, each of which is exploded onto a separate row. | required |
| column_type | Optional[str] | An optional string with the type of the resulting exploded column. When not specified, the column_type is backend implementation specific. | None |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| MissingColumnError | raised if the input column doesn't exist in the input dataframe. |
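A sketch of exploding a list-valued column, under the same assumed import paths as the examples above::

    import pandas as pd

    from etlrules.backends.pandas import ExplodeValuesRule  # assumed import path
    from etlrules.data import RuleData                      # assumed import path

    df = pd.DataFrame({"A": [[1, 2, 3], [4, 5], [6]]})
    rule = ExplodeValuesRule("A", column_type="int64")  # one output row per list element
    data = RuleData(df)
    rule.apply(data)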
Source code in etlrules/backends/common/basic.py
class ExplodeValuesRule(UnaryOpBaseRule, ColumnsInOutMixin):
    """ Explodes a list of values into multiple rows, with each value on a separate row.

    Example::

        Given df:
        | A         |
        |-----------|
        | [1, 2, 3] |
        | [4, 5]    |
        | [6]       |

        > ExplodeValuesRule("A").apply(df)

        Result::
        | A |
        |---|
        | 1 |
        | 2 |
        | 3 |
        | 4 |
        | 5 |
        | 6 |

    Args:
        input_column: A column containing lists of values, each of which is exploded onto a separate row.
        column_type: An optional string with the type of the resulting exploded column. When not specified, the
            column_type is backend implementation specific.
        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input column doesn't exist in the input dataframe.
    """

    def __init__(self, input_column: str, column_type: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.input_column = input_column
        self.column_type = column_type
        if self.column_type is not None and self.column_type not in SUPPORTED_TYPES:
            raise UnsupportedTypeError(f"Type '{self.column_type}' is not supported.")

    def _validate_input_column(self, df):
        if self.input_column not in df.columns:
            raise MissingColumnError(f"Column '{self.input_column}' is not present in the input dataframe.")
ProjectRule (UnaryOpBaseRule)
Reshapes the data frame to keep, eliminate or re-order the set of columns.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| columns | Iterable[str] | The list of columns to keep or eliminate from the data frame. The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns. | required |
| exclude | bool | When set to True, the columns in the columns arg will be excluded from the data frame. Boolean. Default: False. In strict mode, if any column specified in the columns arg doesn't exist in the input data frame, a MissingColumnError exception is raised. In non strict mode, the missing columns are ignored. | False |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| MissingColumnError | raised in strict mode only, if any columns are missing from the input data frame. |
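A sketch illustrating the keep, exclude and re-order semantics, given a RuleData instance data as in the earlier examples (assumed import path)::

    from etlrules.backends.pandas import ProjectRule  # assumed import path

    # keep only B and A, in that order (also re-orders the columns)
    ProjectRule(columns=["B", "A"]).apply(data)

    # drop C and keep all the other columns in their existing order
    ProjectRule(columns=["C"], exclude=True).apply(data)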
Source code in etlrules/backends/common/basic.py
class ProjectRule(UnaryOpBaseRule):
    """ Reshapes the data frame to keep, eliminate or re-order the set of columns.

    Args:
        columns (Iterable[str]): The list of columns to keep or eliminate from the data frame.
            The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns.
        exclude (bool): When set to True, the columns in the columns arg will be excluded from the data frame. Boolean. Default: False
            In strict mode, if any column specified in the columns arg doesn't exist in the input data frame, a MissingColumnError exception is raised.
            In non strict mode, the missing columns are ignored.
        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised in strict mode only, if any columns are missing from the input data frame.
    """

    def __init__(self, columns: Iterable[str], exclude=False, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.columns = [col for col in columns]
        assert all(
            isinstance(col, str) for col in self.columns
        ), "ProjectRule: columns must be strings"
        self.exclude = exclude

    def _get_remaining_columns(self, df_column_names):
        columns_set = set(self.columns)
        df_column_names_set = set(df_column_names)
        if self.strict:
            if not columns_set <= df_column_names_set:
                raise MissingColumnError(f"No such columns: {columns_set - df_column_names_set}. Available columns: {df_column_names_set}.")
        if self.exclude:
            remaining_columns = [
                col for col in df_column_names if col not in columns_set
            ]
        else:
            remaining_columns = [
                col for col in self.columns if col in df_column_names_set
            ]
        return remaining_columns

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        remaining_columns = self._get_remaining_columns(df.columns)
        df = df[remaining_columns]
        self._set_output_df(data, df)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    remaining_columns = self._get_remaining_columns(df.columns)
    df = df[remaining_columns]
    self._set_output_df(data, df)
RenameRule (UnaryOpBaseRule)
Renames a set of columns in the data frame.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| mapper | Mapping[str, str] | A dictionary of old names (keys) and new names (values) to be used for the rename operation. The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns. | required |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| MissingColumnError | raised in strict mode only, if any columns (keys) are missing from the input data frame. |
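A sketch of a rename, given a RuleData instance data as in the earlier examples (assumed import path)::

    from etlrules.backends.pandas import RenameRule  # assumed import path

    # renames column A to CustomerID and column B to Amount
    rule = RenameRule(mapper={"A": "CustomerID", "B": "Amount"})
    rule.apply(data)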
Source code in etlrules/backends/common/basic.py
class RenameRule(UnaryOpBaseRule):
    """ Renames a set of columns in the data frame.

    Args:
        mapper: A dictionary of old names (keys) and new names (values) to be used for the rename operation.
            The order of column names will be reflected in the result data frame, so this rule can be used to re-order columns.
        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised in strict mode only, if any columns (keys) are missing from the input data frame.
    """

    def __init__(self, mapper: Mapping[str, str], named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        assert isinstance(mapper, dict), "mapper needs to be a dict {old_name:new_name}"
        assert all(isinstance(key, str) and isinstance(val, str) for key, val in mapper.items()), "mapper needs to be a dict {old_name:new_name} where the names are str"
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        self.mapper = mapper

    def do_rename(self, df, mapper):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        mapper = self.mapper
        df_columns = set(df.columns)
        if not set(self.mapper.keys()) <= df_columns:
            if self.strict:
                raise MissingColumnError(f"Missing columns to rename: {set(self.mapper.keys()) - df_columns}")
            else:
                mapper = {k: v for k, v in self.mapper.items() if k in df_columns}
        df = self.do_rename(df, mapper)
        self._set_output_df(data, df)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    mapper = self.mapper
    df_columns = set(df.columns)
    if not set(self.mapper.keys()) <= df_columns:
        if self.strict:
            raise MissingColumnError(f"Missing columns to rename: {set(self.mapper.keys()) - df_columns}")
        else:
            mapper = {k: v for k, v in self.mapper.items() if k in df_columns}
    df = self.do_rename(df, mapper)
    self._set_output_df(data, df)
ReplaceRule (BaseAssignColumnRule)
Replaces some values (or regular expressions) with another set of values (or regular expressions).
Basic usage:

    # replaces A with new_A and b with new_b in col_A
    rule = ReplaceRule("col_A", values=["A", "b"], new_values=["new_A", "new_b"])
    rule.apply(data)

    # replaces 1 with 3 and 2 with 4 in the col_I column
    rule = ReplaceRule("col_I", values=[1, 2], new_values=[3, 4])
    rule.apply(data)
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input_column | str | A column with the input values. | required |
| values | Iterable[Union[int, float, str]] | A sequence of values to replace. Regular expressions can be used to match values more widely, in which case the regex parameter must be set to True. Values can be any supported types but they should match the type of the columns. | required |
| new_values | Iterable[Union[int, float, str]] | A sequence of the same length as values. Each value within new_values will replace the corresponding value in values (at the same index). New values can be any supported types but they should match the type of the columns. | required |
| regex | bool | True if all the values and new_values are to be interpreted as regular expressions. Default: False. regex=True is only applicable to string columns. | False |
| output_column | Optional[str] | An optional column to hold the result with the new values. If provided, the input column is left unchanged and a new column is created with the new values. If not provided, the result is updated in place. | None |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| MissingColumnError | raised if the input_column column doesn't exist in the input dataframe. |
| ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe. |
Note
In non-strict mode, overwriting existing columns is ignored.
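A regex sketch (assumed import path; whether backreferences like \1 are honoured in new_values follows the backend's regex replace semantics, so treat this as an assumption)::

    from etlrules.backends.pandas import ReplaceRule  # assumed import path

    # strip a 'tmp_' prefix from string values in col_A, writing into a new column
    rule = ReplaceRule(
        "col_A",
        values=[r"^tmp_(.*)$"], new_values=[r"\1"],  # backreference support is assumed
        regex=True, output_column="col_A_clean",
    )
    rule.apply(data)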
Source code in etlrules/backends/common/basic.py
class ReplaceRule(BaseAssignColumnRule):
    """ Replaces some values (or regular expressions) with another set of values (or regular expressions).

    Basic usage::

        # replaces A with new_A and b with new_b in col_A
        rule = ReplaceRule("col_A", values=["A", "b"], new_values=["new_A", "new_b"])
        rule.apply(data)

        # replaces 1 with 3 and 2 with 4 in the col_I column
        rule = ReplaceRule("col_I", values=[1, 2], new_values=[3, 4])
        rule.apply(data)

    Args:
        input_column (str): A column with the input values.
        values: A sequence of values to replace. Regular expressions can be used to match values more widely,
            in which case the regex parameter must be set to True.
            Values can be any supported types but they should match the type of the columns.
        new_values: A sequence of the same length as values. Each value within new_values will replace the
            corresponding value in values (at the same index).
            New values can be any supported types but they should match the type of the columns.
        regex: True if all the values and new_values are to be interpreted as regular expressions. Default: False.
            regex=True is only applicable to string columns.
        output_column (Optional[str]): An optional column to hold the result with the new values.
            If provided, the input column is left unchanged and a new column is created with the new values.
            If not provided, the result is updated in place.
        named_input (Optional[str]): Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name (Optional[str]): Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict (bool): When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
        ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.

    Note:
        In non-strict mode, overwriting existing columns is ignored.
    """

    def __init__(self, input_column: str, values: Iterable[Union[int, float, str]], new_values: Iterable[Union[int, float, str]], regex=False, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
                         name=name, description=description, strict=strict)
        self.values = [val for val in values]
        self.new_values = [val for val in new_values]
        assert len(self.values) == len(self.new_values), "values and new_values must be of the same length."
        assert self.values, "values must not be empty."
        self.regex = regex
        if self.regex:
            assert all(isinstance(val, str) for val in self.values)
            assert all(isinstance(val, str) for val in self.new_values)

    def do_apply(self, df, col):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
RulesBlock (UnaryOpBaseRule)
Groups rules into encapsulated blocks or units of rules that achieve one thing. Blocks are reusable and encapsulated to reduce complexity.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| rules | Iterable[etlrules.rule.BaseRule] | An iterable of rules which are part of this block. The first rule in the block will take its input from the named_input of the RulesBlock (if any, if not from the main output of the previous rule). The last rule in the block will publish the output as the named_output of the RulesBlock (if any, or the main output of the block). Any named outputs in the block are not exposed to the rules outside of the block (proper encapsulation). | required |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
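A sketch composing a block from the rules documented above (assumed import path)::

    from etlrules.backends.pandas import ProjectRule, RulesBlock, SortRule  # assumed import path

    block = RulesBlock(
        rules=[
            ProjectRule(columns=["Customer", "Amount"]),
            SortRule(sort_by=["Amount"], ascending=False),
        ],
        named_output="top_amounts",
    )
    block.apply(data)  # named outputs created inside the block stay encapsulated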
Source code in etlrules/backends/common/basic.py
class RulesBlock(UnaryOpBaseRule):
    """ Groups rules into encapsulated blocks or units of rules that achieve one thing.
    Blocks are reusable and encapsulated to reduce complexity.

    Args:
        rules: An iterable of rules which are part of this block.
            The first rule in the block will take its input from the named_input of the RulesBlock (if any, if not from the main output of the previous rule).
            The last rule in the block will publish the output as the named_output of the RulesBlock (if any, or the main output of the block).
            Any named outputs in the block are not exposed to the rules outside of the block (proper encapsulation).
        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True
    """

    def __init__(self, rules: Iterable[BaseRule], named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        self._rules = [rule for rule in rules]
        assert self._rules, "RulesBlock: Empty rules set provided."
        assert all(isinstance(rule, BaseRule) for rule in self._rules), [rule for rule in self._rules if not isinstance(rule, BaseRule)]
        assert self._rules[0].named_input is None, "First rule in a RulesBlock must consume the main input/output"
        assert self._rules[-1].named_output is None, "Last rule in a RulesBlock must produce the main output"
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)

    def apply(self, data):
        super().apply(data)
        data2 = RuleData(
            main_input=self._get_input_df(data),
            named_inputs={k: v for k, v in data.get_named_outputs()},
            strict=self.strict
        )
        for rule in self._rules:
            rule.apply(data2)
        self._set_output_df(data, data2.get_main_output())

    def to_dict(self) -> dict:
        dct = super().to_dict()
        dct[self.__class__.__name__]["rules"] = [rule.to_dict() for rule in self._rules]
        return dct

    @classmethod
    def from_dict(cls, dct, backend, additional_packages: Optional[Sequence[str]]=None) -> 'RulesBlock':
        dct = dct["RulesBlock"]
        rules = [BaseRule.from_dict(rule, backend, additional_packages) for rule in dct.get("rules", ())]
        kwargs = {k: v for k, v in dct.items() if k != "rules"}
        return cls(rules=rules, **kwargs)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    data2 = RuleData(
        main_input=self._get_input_df(data),
        named_inputs={k: v for k, v in data.get_named_outputs()},
        strict=self.strict
    )
    for rule in self._rules:
        rule.apply(data2)
    self._set_output_df(data, data2.get_main_output())
from_dict(dct, backend, additional_packages=None)
classmethod
Creates a rule instance from a Python dictionary.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| dct | | A dictionary to create the rule from | required |
| backend | | One of the supported backends (i.e. pandas) | required |
| additional_packages | Optional[Sequence[str]] | Optional list of other packages to look for rules in | None |

Returns:

| Type | Description |
|---|---|
| RulesBlock | A new instance of a RulesBlock. |
Source code in etlrules/backends/common/basic.py
@classmethod
def from_dict(cls, dct, backend, additional_packages: Optional[Sequence[str]]=None) -> 'RulesBlock':
    dct = dct["RulesBlock"]
    rules = [BaseRule.from_dict(rule, backend, additional_packages) for rule in dct.get("rules", ())]
    kwargs = {k: v for k, v in dct.items() if k != "rules"}
    return cls(rules=rules, **kwargs)
to_dict(self)
Serializes this rule to a Python dictionary.
This is a generic implementation that should work for all derived classes and therefore you shouldn't need to override it, although you can do so.
Because it aims to be generic and work correctly for all the derived classes, a few assumptions are made and must be respected when you implement your own rules derived from BaseRule.
The class will serialize all the data attributes of a class which do not start with an underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member of the class. As such, to exclude any of your internal data attributes, either name them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.
The serialization will look into the class's __dict__ and therefore the class must have a __dict__.
For the de-serialization to work generically, the names of the attributes must match the names of the arguments in the __init__. This is quite an important and restrictive constraint which is needed to avoid forcing every rule to implement a serialize/deserialize.
Note
Use the same names for attributes on self as the respective arguments in __init__.
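A serialization round-trip sketch, grounded in the signatures above (to_dict keys the payload by the class name, which from_dict expects; "pandas" as the backend name is per the parameters table for from_dict)::

    dct = block.to_dict()  # e.g. {"RulesBlock": {"rules": [...], "named_output": "top_amounts", ...}}
    clone = RulesBlock.from_dict(dct, backend="pandas")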
Source code in etlrules/backends/common/basic.py
def to_dict(self) -> dict:
    dct = super().to_dict()
    dct[self.__class__.__name__]["rules"] = [rule.to_dict() for rule in self._rules]
    return dct
SortRule (UnaryOpBaseRule)
Sort the input dataframe by the given columns, either ascending or descending.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| sort_by | Iterable[str] | Either a single column specified as a string, or a list or tuple of columns to sort by | required |
| ascending | Union[bool, Iterable[bool]] | Whether to sort ascending or descending. Boolean. Default: True | True |
| named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| MissingColumnError | raised when a column in the sort_by doesn't exist in the input dataframe. |
Note
When multiple columns are specified, the first column decides the sort order. For any rows that have the same value in the first column, the second column is used to decide the sort order within that group and so on.
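A multi-column sort sketch (assumed import path)::

    from etlrules.backends.pandas import SortRule  # assumed import path

    # sort by Customer ascending, then by Amount descending within each customer
    rule = SortRule(sort_by=["Customer", "Amount"], ascending=[True, False])
    rule.apply(data)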
Source code in etlrules/backends/common/basic.py
class SortRule(UnaryOpBaseRule):
    """ Sort the input dataframe by the given columns, either ascending or descending.

    Args:
        sort_by: Either a single column specified as a string, or a list or tuple of columns to sort by
        ascending: Whether to sort ascending or descending. Boolean. Default: True
        named_input: Which dataframe to use as the input. Optional.
            When not set, the input is taken from the main output.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        MissingColumnError: raised when a column in the sort_by doesn't exist in the input dataframe.

    Note:
        When multiple columns are specified, the first column decides the sort order.
        For any rows that have the same value in the first column, the second column is used to decide the sort order within that group and so on.
    """

    def __init__(self, sort_by: Iterable[str], ascending: Union[bool, Iterable[bool]]=True, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
        if isinstance(sort_by, str):
            self.sort_by = [sort_by]
        else:
            self.sort_by = [s for s in sort_by]
        assert isinstance(ascending, bool) or (isinstance(ascending, (list, tuple)) and all(isinstance(val, bool) for val in ascending) and len(ascending) == len(self.sort_by)), "ascending must be a bool or a list of bool of the same len as sort_by"
        self.ascending = ascending

    def do_sort(self, df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)
        if not set(self.sort_by) <= set(df.columns):
            raise MissingColumnError(f"Column(s) {set(self.sort_by) - set(df.columns)} are missing from the input dataframe.")
        df = self.do_sort(df)
        self._set_output_df(data, df)
apply(self, data)
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/basic.py
def apply(self, data):
    super().apply(data)
    df = self._get_input_df(data)
    if not set(self.sort_by) <= set(df.columns):
        raise MissingColumnError(f"Column(s) {set(self.sort_by) - set(df.columns)} are missing from the input dataframe.")
    df = self.do_sort(df)
    self._set_output_df(data, df)
concat
HConcatRule (BinaryOpBaseRule)
Horizontally concatenates two dataframes, with the result having the columns from the left dataframe followed by the columns from the right dataframe.
The two dataframes must not have columns with the same name.
Example:

Left dataframe:

| A | B |
|---|---|
| a | 1 |
| b | 2 |
| c | 3 |

Right dataframe:

| C | D |
|---|---|
| d | 4 |
| e | 5 |
| f | 6 |

After a concat(left, right), the result will look like:

| A | B | C | D |
|---|---|---|---|
| a | 1 | d | 4 |
| b | 2 | e | 5 |
| c | 3 | f | 6 |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| named_input_left | Optional[str] | Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. | required |
| named_input_right | Optional[str] | Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. | required |
| named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None |
| name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None |
| description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None |
| strict | bool | When set to True, the rule does a stricter validation. Default: True | True |
Exceptions:

| Type | Description |
|---|---|
| ColumnAlreadyExistsError | raised if the two dataframes have columns with the same name. |
| SchemaError | raised in strict mode only if the two dataframes have different number of rows. |
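A sketch wiring two named inputs into a horizontal concat (assumed import path)::

    from etlrules.backends.pandas import HConcatRule  # assumed import path

    rule = HConcatRule(
        named_input_left="left_df", named_input_right="right_df",
        named_output="combined",
    )
    rule.apply(data)  # in strict mode, raises SchemaError if the row counts differ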
Source code in etlrules/backends/common/concat.py
class HConcatRule(BinaryOpBaseRule):
    """ Horizontally concatenates two dataframes, with the result having the columns from the left dataframe followed by the columns from the right dataframe.

    The columns from the left dataframe will be followed by the columns from the right dataframe in the result dataframe.
    The two dataframes must not have columns with the same name.

    Example::

        Left dataframe:
        | A | B |
        | a | 1 |
        | b | 2 |
        | c | 3 |

        Right dataframe:
        | C | D |
        | d | 4 |
        | e | 5 |
        | f | 6 |

        After a concat(left, right), the result will look like::

        | A | B | C | D |
        | a | 1 | d | 4 |
        | b | 2 | e | 5 |
        | c | 3 | f | 6 |

    Args:
        named_input_left: Which dataframe to use as the input on the left side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_input_right: Which dataframe to use as the input on the right side of the join.
            When set to None, the input is taken from the main output of the previous rule.
            Set it to a string value, the name of an output dataframe of a previous rule.
        named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
            When not set, the result of this rule will be available as the main output.
            When set to a name (string), the result will be available as that named output.
        name: Give the rule a name. Optional.
            Named rules are more descriptive as to what they're trying to do/the intent.
        description: Describe in detail what the rule does and how it does it. Optional.
            Together with the name, the description acts as the documentation of the rule.
        strict: When set to True, the rule does a stricter validation. Default: True

    Raises:
        ColumnAlreadyExistsError: raised if the two dataframes have columns with the same name.
        SchemaError: raised in strict mode only if the two dataframes have different number of rows.
    """

    def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        # This __init__ is not really needed but the type annotations are extracted from it
        super().__init__(named_input_left=named_input_left, named_input_right=named_input_right, named_output=named_output, name=name, description=description, strict=strict)

    def do_concat(self, left_df, right_df):
        raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")

    def apply(self, data):
        super().apply(data)
        left_df = self._get_input_df_left(data)
        right_df = self._get_input_df_right(data)
        overlapping_names = set(left_df.columns) & set(right_df.columns)
        if overlapping_names:
            raise ColumnAlreadyExistsError(f"Column(s) {overlapping_names} exist in both dataframes.")
        if self.strict:
            if len(left_df) != len(right_df):
                raise SchemaError(f"HConcat needs the two dataframes to have the same number of rows. left df={len(left_df)} rows, right df={len(right_df)} rows.")
        df = self.do_concat(left_df, right_df)
        self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/concat.py
def apply(self, data):
super().apply(data)
left_df = self._get_input_df_left(data)
right_df = self._get_input_df_right(data)
overlapping_names = set(left_df.columns) & set(right_df.columns)
if overlapping_names:
raise ColumnAlreadyExistsError(f"Column(s) {overlapping_names} exist in both dataframes.")
if self.strict:
if len(left_df) != len(right_df):
raise SchemaError(f"HConcat needs the two dataframe to have the same number of rows. left df={len(left_df)} rows, right df={len(right_df)} rows.")
df = self.do_concat(left_df, right_df)
self._set_output_df(data, df)
VConcatRule (BinaryOpBaseRule)
¶
Vertically concatenates two dataframes, with the result having the rows from the left dataframe followed by the rows from the right dataframe.
The rows of the right dataframe are added at the bottom of the rows from the left dataframe in the result dataframe.
Example::

    Left dataframe:
    | A | B |
    | a | 1 |
    | b | 2 |
    | c | 3 |

    Right dataframe:
    | A | B |
    | d | 4 |
    | e | 5 |
    | f | 6 |

After a concat(left, right), the result will look like::

    | A | B |
    | a | 1 |
    | b | 2 |
    | c | 3 |
    | d | 4 |
    | e | 5 |
    | f | 6 |
Parameters:

Name | Type | Description | Default
---|---|---|---
named_input_left | Optional[str] | Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. | required
named_input_right | Optional[str] | Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. | required
subset_columns | Optional[Iterable[str]] | A subset list of columns available in both dataframes. Only these columns will be concatenated. The effect is similar to doing a ProjectRule(subset_columns) on both dataframes before the concat. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
MissingColumnError | raised if any of the subset columns are missing from either dataframe.
SchemaError | raised in strict mode only if the columns differ between the two dataframes and subset_columns is not specified.
Note
In strict mode, as described above, SchemaError is raised if the columns are not the same (by name; types can be inferred). In non-strict mode, columns are not checked and missing values are filled with NA.
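A sketch of subset_columns, which restricts the concatenation to the shared columns so that strict mode doesn't raise SchemaError (same assumed import paths and RuleData accessors as the HConcatRule example above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import VConcatRule  # assumed backend module

    left = pd.DataFrame({"A": ["a", "b"], "B": [1, 2], "Extra": [0, 0]})
    right = pd.DataFrame({"A": ["c", "d"], "B": [3, 4]})

    data = RuleData(named_inputs={"left": left, "right": right})  # assumed constructor
    # only A and B are concatenated; the Extra column is dropped before the concat
    rule = VConcatRule(named_input_left="left", named_input_right="right",
                       subset_columns=["A", "B"], named_output="result")
    rule.apply(data)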
Source code in etlrules/backends/common/concat.py
class VConcatRule(BinaryOpBaseRule):
""" Vertically concatenates two dataframe with the result having the rows from the left dataframe followed by the rows from the right dataframe.
The rows of the right dataframe are added at the bottom of the rows from the left dataframe in the result dataframe.
Example::
Left dataframe:
| A | B |
| a | 1 |
| b | 2 |
| c | 3 |
Right dataframe:
| A | B |
| d | 4 |
| e | 5 |
| f | 6 |
After a concat(left, right), the result will look like::
| A | B |
| a | 1 |
| b | 2 |
| c | 3 |
| d | 4 |
| e | 5 |
| f | 6 |
Args:
named_input_left: Which dataframe to use as the input on the left side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
named_input_right: Which dataframe to use as the input on the right side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
subset_columns: A subset list of columns available in both dataframes.
Only these columns will be concatenated.
The effect is similar to doing a ProjectRule(subset_columns) on both dataframes before the concat.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any of the subset columns are missing from either dataframe.
SchemaError: raised in strict mode only if the columns differ between the two dataframes and subset_columns is not specified.
Note:
In strict mode, as described above, SchemaError is raised if the columns are not the same (by name; types can be inferred).
In non-strict mode, columns are not checked and values are filled with NA when missing.
"""
def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], subset_columns: Optional[Iterable[str]]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input_left=named_input_left, named_input_right=named_input_right, named_output=named_output, name=name, description=description, strict=strict)
self.subset_columns = [col for col in subset_columns] if subset_columns is not None else None
def do_concat(self, left_df, right_df):
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def apply(self, data):
super().apply(data)
left_df = self._get_input_df_left(data)
right_df = self._get_input_df_right(data)
if self.subset_columns:
if not set(self.subset_columns) <= set(left_df.columns):
raise MissingColumnError(f"Missing columns in the left dataframe of the concat operation: {set(self.subset_columns) - set(left_df.columns)}")
if not set(self.subset_columns) <= set(right_df.columns):
raise MissingColumnError(f"Missing columns in the right dataframe of the concat operation: {set(self.subset_columns) - set(right_df.columns)}")
left_df = left_df[self.subset_columns]
right_df = right_df[self.subset_columns]
if self.strict:
if set(left_df.columns) != set(right_df.columns):
raise SchemaError(f"VConcat needs both dataframe have the same schema. Missing columns in the right df: {set(right_df.columns) - set(left_df.columns)}. Missing columns in the left df: {set(left_df.columns) - set(right_df.columns)}")
df = self.do_concat(left_df, right_df)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/concat.py
def apply(self, data):
super().apply(data)
left_df = self._get_input_df_left(data)
right_df = self._get_input_df_right(data)
if self.subset_columns:
if not set(self.subset_columns) <= set(left_df.columns):
raise MissingColumnError(f"Missing columns in the left dataframe of the concat operation: {set(self.subset_columns) - set(left_df.columns)}")
if not set(self.subset_columns) <= set(right_df.columns):
raise MissingColumnError(f"Missing columns in the right dataframe of the concat operation: {set(self.subset_columns) - set(right_df.columns)}")
left_df = left_df[self.subset_columns]
right_df = right_df[self.subset_columns]
if self.strict:
if set(left_df.columns) != set(right_df.columns):
raise SchemaError(f"VConcat needs both dataframe have the same schema. Missing columns in the right df: {set(right_df.columns) - set(left_df.columns)}. Missing columns in the left df: {set(left_df.columns) - set(right_df.columns)}")
df = self.do_concat(left_df, right_df)
self._set_output_df(data, df)
conditions
¶
FilterRule (UnaryOpBaseRule)
¶
Exclude rows based on a condition.
Example::

    Given df:
    | A | B |
    | 1 | 2 |
    | 5 | 3 |
    | 3 | 4 |

    rule = FilterRule("df['A'] > df['B']")
    rule.apply(df)

Result::

    | A | B |
    | 5 | 3 |

Same example using discard_matching_rows=True::

    rule = FilterRule("df['A'] > df['B']", discard_matching_rows=True)
    rule.apply(df)

Result::

    | A | B |
    | 1 | 2 |
    | 3 | 4 |
Parameters:

Name | Type | Description | Default
---|---|---|---
condition_expression | str | An expression as a string. The expression must evaluate to a boolean scalar or a boolean series. | required
discard_matching_rows | bool | By default the rows matching the condition (i.e. where the condition is True) are kept, with the rest of the rows dropped from the result. Setting this parameter to True essentially inverts the condition, so the rows matching the condition are discarded and the rest of the rows kept. Default: False. | False
named_output_discarded | Optional[str] | A named output for the records being discarded, if those need to be kept for further processing. Default: None, which doesn't keep track of discarded records. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
ExpressionSyntaxError | raised if the column expression has a Python syntax error.
TypeError | raised if an operation is not supported between the types involved.
NameError | raised if an unknown variable is used.
KeyError | raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN']).
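A sketch of keeping the discarded rows around via named_output_discarded (pipeline mode; the RuleData constructor and accessor names used here are assumptions for illustration)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import FilterRule  # assumed backend module

    df = pd.DataFrame({"A": [1, 5, 3], "B": [2, 3, 4]})
    data = RuleData(df)  # assumed: a single unnamed dataframe (pipeline mode)

    rule = FilterRule("df['A'] > df['B']", named_output_discarded="rejects")
    rule.apply(data)
    # the main output keeps the rows where A > B; the dropped rows remain
    # available under the "rejects" named output for auditing
    rejects = data.get_named_output("rejects")  # assumed accessor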
Source code in etlrules/backends/common/conditions.py
class FilterRule(UnaryOpBaseRule):
""" Exclude rows based on a condition.
Example::
Given df:
| A | B |
| 1 | 2 |
| 5 | 3 |
| 3 | 4 |
rule = FilterRule("df['A'] > df['B']")
rule.apply(df)
Result::
| A | B |
| 5 | 3 |
Same example using discard_matching_rows=True::
rule = FilterRule("df['A'] > df['B']", discard_matching_rows=True)
rule.apply(df)
Result::
| A | B |
| 1 | 2 |
| 3 | 4 |
Args:
condition_expression: An expression as a string. The expression must evaluate to a boolean scalar or a boolean series.
discard_matching_rows: By default the rows matching the condition (ie where the condition is True) are kept, the rest of the
rows being dropped from the result. Setting this parameter to True essentially inverts the condition, so the rows
matching the condition are discarded and the rest of the rows kept. Default: False.
named_output_discarded: A named output for the records being discarded if those need to be kept for further processing.
Default: None, which doesn't keep track of discarded records.
named_input: Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
ExpressionSyntaxError: raised if the column expression has a Python syntax error.
TypeError: raised if an operation is not supported between the types involved
NameError: raised if an unknown variable is used
KeyError: raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
"""
EXCLUDE_FROM_COMPARE = ('_condition_expression', )
def __init__(self, condition_expression: str, discard_matching_rows: bool=False, named_output_discarded: Optional[str]=None,
named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert condition_expression, "condition_expression cannot be empty"
self.condition_expression = condition_expression
self.discard_matching_rows = discard_matching_rows
self.named_output_discarded = named_output_discarded
self._condition_expression = self.get_condition_expression()
IfThenElseRule (UnaryOpBaseRule)
¶
Calculates the output based on a condition (If Cond is true Then use then_value Else use else_value).
Example::

    Given df:
    | A | B |
    | 1 | 2 |
    | 5 | 3 |
    | 3 | 4 |

    rule = IfThenElseRule("df['A'] > df['B']", output_column="C", then_value="A is greater", else_value="B is greater")
    rule.apply(df)

Result::

    | A | B | C |
    | 1 | 2 | B is greater |
    | 5 | 3 | A is greater |
    | 3 | 4 | B is greater |
Parameters:

Name | Type | Description | Default
---|---|---|---
condition_expression | str | An expression as a string. The expression must evaluate to a boolean scalar or a boolean series. | required
then_value | Union[int, float, bool, str] | The value to use if the condition is true. | None
then_column | Optional[str] | Use the value from the then_column if the condition is true. One and only one of then_value and then_column can be used. | None
else_value | Union[int, float, bool, str] | The value to use if the condition is false. | None
else_column | Optional[str] | Use the value from the else_column if the condition is false. One and only one of else_value and else_column can be used. | None
output_column | str | The column name of the result column which will be added to the dataframe. | required
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
ColumnAlreadyExistsError | raised in strict mode only if a column with the same name already exists in the dataframe.
ExpressionSyntaxError | raised if the column expression has a Python syntax error.
MissingColumnError | raised when then_column or else_column are used but they are missing from the input dataframe.
TypeError | raised if an operation is not supported between the types involved.
NameError | raised if an unknown variable is used.
KeyError | raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN']).
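A sketch of the column-based variant, where the output takes its values from existing columns rather than constants, effectively computing max(A, B) row by row (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import IfThenElseRule  # assumed backend module

    df = pd.DataFrame({"A": [1, 5, 3], "B": [2, 3, 4]})
    data = RuleData(df)  # assumed constructor

    # C takes the value of A where A > B, otherwise the value of B
    rule = IfThenElseRule("df['A'] > df['B']", output_column="C",
                          then_column="A", else_column="B")
    rule.apply(data)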
Source code in etlrules/backends/common/conditions.py
class IfThenElseRule(UnaryOpBaseRule):
""" Calculates the ouput based on a condition (If Cond is true Then use then_value Else use else_value).
Example::
Given df:
| A | B |
| 1 | 2 |
| 5 | 3 |
| 3 | 4 |
rule = IfThenElseRule("df['A'] > df['B']", output_column="C", then_value="A is greater", else_value="B is greater")
rule.apply(df)
Result::
| A | B | C |
| 1 | 2 | B is greater |
| 5 | 3 | A is greater |
| 3 | 4 | B is greater |
Args:
condition_expression: An expression as a string. The expression must evaluate to a boolean scalar or a boolean series.
then_value: The value to use if the condition is true.
then_column: Use the value from the then_column if the condition is true.
One and only one of then_value and then_column can be used.
else_value: The value to use if the condition is false.
else_column: Use the value from the else_column if the condition is false.
One and only one of the else_value and else_column can be used.
output_column: The column name of the result column which will be added to the dataframe.
named_input: Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
ColumnAlreadyExistsError: raised in strict mode only if a column with the same name already exists in the dataframe.
ExpressionSyntaxError: raised if the column expression has a Python syntax error.
MissingColumnError: raised when then_column or else_column are used but they are missing from the input dataframe.
TypeError: raised if an operation is not supported between the types involved
NameError: raised if an unknown variable is used
KeyError: raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
"""
EXCLUDE_FROM_COMPARE = ('_condition_expression', )
def __init__(self, condition_expression: str, output_column: str, then_value: Optional[Union[int,float,bool,str]]=None, then_column: Optional[str]=None,
else_value: Optional[Union[int,float,bool,str]]=None, else_column: Optional[str]=None,
named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert bool(then_value is None) != bool(then_column is None), "One and only one of then_value and then_column can be specified."
assert bool(else_value is None) != bool(else_column is None), "One and only one of else_value and else_column can be specified."
assert condition_expression, "condition_expression cannot be empty"
assert output_column, "output_column cannot be empty"
self.condition_expression = condition_expression
self.output_column = output_column
self.then_value = then_value
self.then_column = then_column
self.else_value = else_value
self.else_column = else_column
self._condition_expression = self.get_condition_expression()
def _validate_columns(self, df_columns):
if self.strict and self.output_column in df_columns:
raise ColumnAlreadyExistsError(f"Column {self.output_column} already exists in the input dataframe.")
if self.then_column is not None and self.then_column not in df_columns:
raise MissingColumnError(f"Column {self.then_column} is missing from the input dataframe.")
if self.else_column is not None and self.else_column not in df_columns:
raise MissingColumnError(f"Column {self.else_column} is missing from the input dataframe.")
datetime
¶
DateTimeAddRule (BaseAssignColumnRule)
¶
Adds a number of units (days, hours, minutes, etc.) to a datetime column.
Basic usage::

    # adds 2 days to the A column
    rule = DateTimeAddRule("A", 2, "days")
    rule.apply(data)

    # adds 2 hours to the A column
    rule = DateTimeAddRule("A", 2, "hours")
    rule.apply(data)
Parameters:

Name | Type | Description | Default
---|---|---|---
input_column | str | The name of a datetime column to add to. | required
unit_value | Union[int, float, str] | The number of units to add to the datetime column. The unit_value can be negative, in which case this rule performs a subtraction. A name of an existing column can be passed into unit_value, in which case that column will be added to the input_column. If the column is a timedelta, it is added as-is; if it's a numeric column, it is interpreted based on the unit parameter (e.g. years/days/hours/etc.). In this case, if the column specified in the unit_value doesn't exist, MissingColumnError is raised. | required
unit | str | Specifies what unit the unit_value is in. Supported values are: years, months, weeks, weekdays, days, hours, minutes, seconds, microseconds, nanoseconds. weekdays skips weekends (i.e. Saturdays and Sundays). | required
output_column | Optional[str] | The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any). | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
MissingColumnError | raised if the input_column doesn't exist in the input dataframe.
MissingColumnError | raised if unit_value is a name of a column but it doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
ValueError | raised if unit_value is a column which is not a timedelta column and the unit parameter is not specified.
Note
In non-strict mode, missing columns or overwriting existing columns are ignored.
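A sketch of passing a column name as unit_value, so each row is offset by its own amount (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import DateTimeAddRule  # assumed backend module

    df = pd.DataFrame({
        "Start": pd.to_datetime(["2023-05-01", "2023-05-02"]),
        "Offset": [1, 10],  # a numeric column, interpreted via the unit parameter
    })
    data = RuleData(df)  # assumed constructor

    # adds Offset days to Start, writing the result to a new End column
    rule = DateTimeAddRule("Start", unit_value="Offset", unit="days", output_column="End")
    rule.apply(data)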
Source code in etlrules/backends/common/datetime.py
class DateTimeAddRule(BaseAssignColumnRule):
""" Adds a number of units (days, hours, minutes, etc.) to a datetime column.
Basic usage::
# adds 2 days to the A column
rule = DateTimeAddRule("A", 2, "days")
rule.apply(data)
# adds 2 hours to the A column
rule = DateTimeAddRule("A", 2, "hours")
rule.apply(data)
Args:
input_column (str): The name of a datetime column to add to.
unit_value (Union[int,float,str]): The number of units to add to the datetime column.
The unit_value can be negative, in which case this rule performs a subtraction.
A name of an existing column can be passed into unit_value, in which case, that
column will be added to the input_column.
If the column is a timedelta, it will be added as is, if it's a numeric column,
then it will be interpreted based on the unit parameter (e.g. years/days/hours/etc.).
In this case, if the column specified in the unit_value doesn't exist,
MissingColumnError is raised.
unit (str): Specifies what unit the unit_value is in. Supported values are:
years, months, weeks, weekdays, days, hours, minutes, seconds, microseconds, nanoseconds.
weekdays skips weekends (ie Saturdays and Sundays).
output_column (Optional[str]): The name of a new column with the result. Optional.
If not provided, the result is updated in place.
In strict mode, if provided, the output_column must not exist in the input dataframe.
In non-strict mode, if provided, the output_column will overwrite a column with
the same name in the input dataframe (if any).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
MissingColumnError: raised if unit_value is a name of a column but it doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
ValueError: raised if unit_value is a column which is not a timedelta column and the unit parameter is not specified.
Note:
In non-strict mode, missing columns or overwriting existing columns are ignored.
"""
def __init__(self, input_column: str, unit_value: Union[int, float, str],
unit: Optional[Literal["years", "months", "weeks", "weekdays", "days", "hours", "minutes", "seconds", "milliseconds", "microseconds", "nanoseconds"]],
output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.unit_value = unit_value
if not isinstance(self.unit_value, str):
assert unit in DT_ARITHMETIC_UNITS, f"Unsupported unit: '{unit}'. It must be one of {DT_ARITHMETIC_UNITS}"
self.unit = unit
DateTimeDiffRule (BaseAssignColumnRule)
¶
Calculates the difference between two datetime columns, optionally extracting it in the specified unit.
Basic usage::

    # calculates A - B in days
    rule = DateTimeDiffRule("A", "B", unit="days")
    rule.apply(data)
Parameters:

Name | Type | Description | Default
---|---|---|---
input_column | str | The name of a datetime column. | required
input_column2 | str | The name of the second datetime column. The result will be input_column - input_column2. | required
unit | Optional[str] | If specified, it will extract the given component of the difference: days, hours, minutes, seconds, microseconds, nanoseconds or total_seconds. | required
output_column | Optional[str] | The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any). | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
MissingColumnError | raised if either input_column or input_column2 don't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
For best results, round the datetime columns using one of the rounding rules before calculating the difference. Otherwise, this rule will tend to truncate/round down. For example: 2023-05-05 10:00:00 - 2023-05-04 10:00:01 will result in 0 days even though the difference is 23:59:59. In cases like this one, it might be preferable to round, in this case perhaps round to "day" using DateTimeRoundRule or DateTimeRoundDownRule. This will result in a 2023-05-05 00:00:00 - 2023-05-04 00:00:00 which results in 1 day.
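Following the note above, a sketch that rounds both columns down to the day before taking the difference, so a near-24-hour gap counts as 1 day rather than truncating to 0 (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import DateTimeDiffRule, DateTimeRoundDownRule  # assumed backend module

    df = pd.DataFrame({
        "A": pd.to_datetime(["2023-05-05 10:00:00"]),
        "B": pd.to_datetime(["2023-05-04 10:00:01"]),
    })
    data = RuleData(df)  # assumed constructor

    # each rule reads the main output of the previous one (pipeline mode)
    DateTimeRoundDownRule("A", "day").apply(data)
    DateTimeRoundDownRule("B", "day").apply(data)
    DateTimeDiffRule("A", "B", unit="days", output_column="DiffDays").apply(data)  # 1 day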
Source code in etlrules/backends/common/datetime.py
class DateTimeDiffRule(BaseAssignColumnRule):
""" Calculates the difference between two datetime columns, optionally extracting it in the specified unit.
Basic usage::
# calculates the A - B in days
rule = DateTimeDiffRule("A", "B", unit="days")
rule.apply(data)
Args:
input_column (str): The name of a datetime column.
input_column2 (str): The name of the second datetime column.
The result will be input_column - input_column2
unit (Optional[str]): If specified, it will extract the given component of the difference:
days, hours, minutes, seconds, microseconds, nanoseconds or total_seconds.
output_column (Optional[str]): The name of a new column with the result. Optional.
If not provided, the result is updated in place.
In strict mode, if provided, the output_column must not exist in the input dataframe.
In non-strict mode, if provided, the output_column will overwrite a column with
the same name in the input dataframe (if any).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if either input_column or input_column2 don't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
For best results, round the datetime columns using one of the rounding rules before
calculating the difference. Otherwise, this rule will tend to truncate/round down.
For example: 2023-05-05 10:00:00 - 2023-05-04 10:00:01 will result in 0 days even though
the difference is 23:59:59. In cases like this one, it might be preferable to round, in this
case perhaps round to "day" using DateTimeRoundRule or DateTimeRoundDownRule. This will result
in a 2023-05-05 00:00:00 - 2023-05-04 00:00:00 which results in 1 day.
"""
SUPPORTED_COMPONENTS = {
"days", "hours", "minutes", "seconds",
"microseconds", "nanoseconds", "total_seconds",
}
SIGN = -1
EXCLUDE_FROM_SERIALIZE = ('unit_value', )
def __init__(self, input_column: str, input_column2: str,
unit: Optional[Literal["days", "hours", "minutes", "seconds", "milliseconds", "microseconds", "nanoseconds", "total_seconds"]],
output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
assert input_column2 and isinstance(input_column2, str), "input_column2 must be a non-empty string representing the name of a column."
assert unit is None or unit in self.SUPPORTED_COMPONENTS, f"unit must be None or one of: {self.SUPPORTED_COMPONENTS}"
super().__init__(input_column=input_column, output_column=output_column,
named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.input_column2 = input_column2
self.unit = unit
DateTimeExtractComponentRule (BaseAssignColumnRule)
¶
Extract an individual component of a date/time (e.g. year, month, day, hour, etc.).
Basic usage::

    # extracts the year component from col_A, e.g. 2023-05-05 10:00:00 extracts 2023
    rule = DateTimeExtractComponentRule("col_A", component="year", locale=None)
    rule.apply(data)
Parameters:

Name | Type | Description | Default
---|---|---|---
input_column | str | A datetime column to extract the given component from. | required
component | str | The component to extract from the datetime. When the component is one of (year, month, day, hour, minute, second, microsecond), the extracted component will be an integer with the respective component of the datetime. When the component is weekday, the result will be an integer with the values 0-6, with Monday being 0 and Sunday 6. When the component is day_name or month_name, the result column will be a string column with the names of the weekdays (e.g. Monday, Tuesday, etc.) or month names respectively (e.g. January, February, etc.). The names will be printed in the language specified in the locale parameter (or English as the default). | required
locale | Optional[str] | An optional locale string applicable to day_name and month_name. When specified, the names will use the given locale to print the names in the given language. Default: en_US.utf8 will print the names in English. Use the command `locale -a` on your terminal on Unix systems to find your locale language code. Trying to set the locale to a value that doesn't appear under the `locale -a` output will fail with ValueError: Unsupported locale. | required
output_column | Optional[str] | An optional column name to contain the result. If provided, the existing column is unchanged and a new column is created with the component extracted. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
MissingColumnError | raised if the input_column column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
ValueError | raised if a locale is specified which is not supported or available on the machine running the scripts.
Note
In non-strict mode, overwriting existing columns is ignored.
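A sketch of extracting month names in a non-default locale; the locale must be installed on the machine running the script (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import DateTimeExtractComponentRule  # assumed backend module

    df = pd.DataFrame({"col_A": pd.to_datetime(["2023-01-15", "2023-05-05"])})
    data = RuleData(df)  # assumed constructor

    # month names in German (Januar, Mai); in strict mode this raises
    # ValueError: Unsupported locale if de_DE.utf8 isn't available (check `locale -a`)
    rule = DateTimeExtractComponentRule("col_A", component="month_name",
                                        locale="de_DE.utf8", output_column="MonthName")
    rule.apply(data)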
Source code in etlrules/backends/common/datetime.py
class DateTimeExtractComponentRule(BaseAssignColumnRule):
""" Extract an individual component of a date/time (e.g. year, month, day, hour, etc.).
Basic usage::
# extracts the year component from col_A. E.g. 2023-05-05 10:00:00 will extract 2023
rule = DateTimeExtractComponentRule("col_A", component="year", locale=None)
rule.apply(data)
Args:
input_column (str): A datetime column to extract the given component from.
component: The component to extract from the datetime.
When the component is one of (year, month, day, hour, minute, second, microsecond) then
the extracted component will be an integer with the respective component of the datetime.
When component is weekday, the component will be an integer with the values 0-6, with
Monday being 0 and Sunday 6.
When the component is day_name or month_name, the result column will be a string
column with the names of the weekdays (e.g. Monday, Tuesday, etc.) or month names
respectively (e.g. January, February, etc.). The names will be printed in the language
specified in the locale parameter (or English as the default).
locale: An optional locale string applicable to weekday_name and month_name. When specified,
the names will use the given locale to print the names in the given language.
Default: en_US.utf8 will print the names in English.
Use the command `locale -a` on your terminal on Unix systems to find your locale language code.
Trying to set the locale to a value that doesn't appear under the `locale -a` output will fail
with ValueError: Unsupported locale.
output_column (Optional[str]): An optional column name to contain the result.
If provided, the existing column is unchanged and a new column is created with the component extracted.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
ValueError: raised if a locale is specified which is not supported or available on the machine running the scripts.
Note:
In non-strict mode, overwriting existing columns is ignored.
"""
SUPPORTED_COMPONENTS = {
"year", "month", "day", "hour", "minute", "second",
"microsecond", "nanosecond",
"weekday", "day_name", "month_name",
}
def __init__(self, input_column: str, component: str, locale: Optional[str], output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
self.component = component
assert self.component in self.SUPPORTED_COMPONENTS, f"Unsupported component={self.component}. Must be one of: {self.SUPPORTED_COMPONENTS}"
self.locale = locale
self._locale = self.locale
if self.locale and self._cannot_set_locale(locale):
if self.strict:
raise ValueError(f"Unsupported locale: {locale}")
self._locale = None
def _cannot_set_locale(self, locale):
return False
DateTimeLocalNowRule (UnaryOpBaseRule)
¶
Adds a new column with the local date/time.
Basic usage::

    rule = DateTimeLocalNowRule(output_column="LocalTimeNow")
    rule.apply(data)
Parameters:

Name | Type | Description | Default
---|---|---|---
output_column | str | The name of the column to be added to the dataframe. This column will be populated with the local date/time at the time of the call. The same value will be populated for all rows. The date/time populated is a "naive" datetime, i.e. it doesn't have timezone information. | required
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the input dataframe.
Note
In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
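A sketch of stamping every row with the load time (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import DateTimeLocalNowRule  # assumed backend module

    df = pd.DataFrame({"A": [1, 2, 3]})
    data = RuleData(df)  # assumed constructor

    rule = DateTimeLocalNowRule(output_column="LoadedAt")
    rule.apply(data)  # every row gets the same naive local timestamp in LoadedAt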
Source code in etlrules/backends/common/datetime.py
class DateTimeLocalNowRule(UnaryOpBaseRule):
""" Adds a new column with the local date/time.
Basic usage::
rule = DateTimeLocalNowRule(output_column="LocalTimeNow")
rule.apply(data)
Args:
output_column: The name of the column to be added to the dataframe.
This column will be populated with the local date/time at the time of the call.
The same value will be populated for all rows.
The date/time populated is a "naive" datetime ie: doesn't have a timezone information.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the input dataframe.
Note:
In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
"""
def __init__(self, output_column, named_input:Optional[str]=None, named_output:Optional[str]=None, name:Optional[str]=None, description:Optional[str]=None, strict:bool=True):
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert output_column and isinstance(output_column, str)
self.output_column = output_column
DateTimeRoundDownRule (BaseAssignColumnRule)
¶
Rounds down (truncates) a datetime column to the specified granularity (day, hour, minute, etc.).
Basic usage::

    # rounds down the A column to the second
    rule = DateTimeRoundDownRule("A", "second")
    rule.apply(data)

    # rounds down the A column to the day
    rule = DateTimeRoundDownRule("A", "day")
    rule.apply(data)
Parameters:

Name | Type | Description | Default
---|---|---|---
input_column | str | The column name to round down according to the unit specified. | required
unit | str | Specifies the granularity to round down (truncate) to. The supported units are: day (removes the hours/minutes/etc.), hour (removes the minutes/seconds/etc.), minute (removes the seconds/etc.), second (removes the milliseconds/etc.), millisecond (removes the microseconds), microsecond (removes the nanoseconds, if any). | required
output_column | Optional[str] | The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any). | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
MissingColumnError | raised if the input_column column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, overwriting existing columns is ignored.
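A sketch of truncating timestamps to the day, e.g. to bucket events by date (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import DateTimeRoundDownRule  # assumed backend module

    df = pd.DataFrame({"Event": pd.to_datetime(["2023-05-05 10:30:00", "2023-05-05 23:59:59"])})
    data = RuleData(df)  # assumed constructor

    # both timestamps truncate to 2023-05-05 00:00:00, regardless of the time of day
    rule = DateTimeRoundDownRule("Event", "day", output_column="EventDate")
    rule.apply(data)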
Source code in etlrules/backends/common/datetime.py
class DateTimeRoundDownRule(BaseAssignColumnRule):
""" Rounds down (truncates) a set of datetime columns to the specified granularity (day, hour, minute, etc.).
Basic usage::
# rounds the A column to the nearest second
rule = DateTimeRoundDownRule("A", "second")
rule.apply(data)
# rounds the A column to days
rule = DateTimeRoundDownRule("A", "day")
rule.apply(data)
Args:
input_column (str): The column name to round according to the unit specified.
unit (str): Specifies the unit of rounding.
That is: rounding to the nearest day, hour, minute, etc.
The supported units are:
day: removes the hours/minutes/etc.
hour: removes the minutes/seconds etc.
minute: removes the seconds/etc.
second: removes the milliseconds/etc.
millisecond: removes the microseconds
microsecond: removes nanoseconds (if any)
output_column (Optional[str]): The name of a new column with the result. Optional.
If not provided, the result is updated in place.
In strict mode, if provided, the output_column must not exist in the input dataframe.
In non-strict mode, if provided, the output_column with overwrite a column with
the same name in the input dataframe (if any).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, overwriting existing columns is ignored.
"""
def __init__(self, input_column: str, unit: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert isinstance(unit, str) and unit in ROUND_TRUNC_UNITS, f"unit must be one of {ROUND_TRUNC_UNITS} and not '{unit}'"
self.unit = unit
DateTimeRoundRule (BaseAssignColumnRule)
¶
Rounds a datetime column to the specified granularity (day, hour, minute, etc.).
Basic usage::

    # rounds the A column to the nearest second
    rule = DateTimeRoundRule("A", "second")
    rule.apply(data)

    # rounds the A column to the nearest day
    rule = DateTimeRoundRule("A", "day")
    rule.apply(data)
Parameters:

Name | Type | Description | Default
---|---|---|---
input_column | str | The column name to round according to the unit specified. | required
unit | str | Specifies the unit of rounding, i.e. rounding to the nearest day, hour, minute, etc. The supported units are: day (anything up to 12:00:00 rounds down to the current day, after that up to the next day), hour (anything up to the 30th minute rounds down to the current hour, after that up to the next hour), minute (anything up to the 30th second rounds down to the current minute, after that up to the next minute), second (rounds to the nearest second, if the column has milliseconds), millisecond (rounds to the nearest millisecond, if the column has microseconds), microsecond (rounds to the nearest microsecond), nanosecond (rounds to the nearest nanosecond). | required
output_column | Optional[str] | The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any). | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:

Type | Description
---|---
MissingColumnError | raised if the input_column column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, overwriting existing columns is ignored.
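A sketch contrasting round-to-nearest with the truncating DateTimeRoundDownRule above (assumed import paths and RuleData names as above)::

    import pandas as pd
    from etlrules.data import RuleData  # assumed import path
    from etlrules.backends.pandas import DateTimeRoundRule  # assumed backend module

    df = pd.DataFrame({"A": pd.to_datetime(["2023-05-05 11:59:00", "2023-05-05 12:01:00"])})
    data = RuleData(df)  # assumed constructor

    # 11:59 rounds down to 2023-05-05 00:00:00; 12:01 rounds up to 2023-05-06 00:00:00
    rule = DateTimeRoundRule("A", "day", output_column="Rounded")
    rule.apply(data)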
Source code in etlrules/backends/common/datetime.py
class DateTimeRoundRule(BaseAssignColumnRule):
""" Rounds a set of datetime columns to the specified granularity (day, hour, minute, etc.).
Basic usage::
# rounds the A column to the nearest second
rule = DateTimeRoundRule("A", "second")
rule.apply(data)
# rounds the A column to the nearest day
rule = DateTimeRoundRule("A", "day")
rule.apply(data)
Args:
input_column (str): The column name to round according to the unit specified.
unit (str): Specifies the unit of rounding.
That is: rounding to the nearest day, hour, minute, etc.
The supported units are:
day: anything up to 12:00:00 rounds down to the current day, after that up to the next day
hour: anything up to 30th minute rounds down to the current hour, after that up to the next hour
minute: anything up to 30th second rounds down to the current minute, after that up to the next minute
second: rounds to the nearest second (if the column has milliseconds)
millisecond: rounds to the nearest millisecond (if the column has microseconds)
microsecond: rounds to the nearest microsecond
nanosecond: rounds to the nearest nanosecond
output_column (Optional[str]): The name of a new column with the result. Optional.
If not provided, the result is updated in place.
In strict mode, if provided, the output_column must not exist in the input dataframe.
In non-strict mode, if provided, the output_column will overwrite a column with
the same name in the input dataframe (if any).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, overwriting existing columns is ignored.
"""
def __init__(self, input_column: str, unit: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert isinstance(unit, str) and unit in ROUND_TRUNC_UNITS, f"unit must be one of {ROUND_TRUNC_UNITS} and not '{unit}'"
self.unit = unit
DateTimeRoundUpRule (BaseAssignColumnRule)
¶
Rounds up a datetime column to the specified granularity (day, hour, minute, etc.).
Basic usage::
# rounds up the A column to the next second
rule = DateTimeRoundUpRule("A", "second")
rule.apply(data)
# rounds up the A column to the next day
rule = DateTimeRoundUpRule("A", "day")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_column |
str |
The column name to round according to the unit specified. |
required |
unit |
str |
Specifies the unit of rounding. That is: rounding up to the next day, hour, minute, etc. The supported units are: day: Rounds up to the next day if there are any hours/minutes/etc. hour: Rounds up to the next hour if there are any minutes/etc. minute: Rounds up to the next minute if there are any seconds/etc. second: Rounds up to the next second if there are any milliseconds/etc. millisecond: Rounds up to the next millisecond if there are any microseconds microsecond: Rounds up to the next microsecond if there are any nanoseconds |
required |
output_column |
Optional[str] |
The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any). |
None |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if the input_column column doesn't exist in the input dataframe. |
ColumnAlreadyExistsError |
raised in strict mode only if the output_column already exists in the dataframe. |
Note
In non-strict mode, overwriting existing columns is ignored.
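To make the round-up semantics concrete, here is a minimal sketch. The pandas backend module and the RuleData import path are assumptions based on the package layout, and the sample values are made up::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import DateTimeRoundUpRule  # assumed backend module

df = pd.DataFrame({"A": pd.to_datetime(["2023-05-19 10:20:30", "2023-05-19 10:20:00"])})
data = RuleData(df)
# 10:20:30 rounds up to 10:21:00; 10:20:00 is already on the minute, so it is unchanged
rule = DateTimeRoundUpRule("A", "minute", output_column="A_rounded_up")
rule.apply(data)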
Source code in etlrules/backends/common/datetime.py
class DateTimeRoundUpRule(BaseAssignColumnRule):
""" Rounds up a set of datetime columns to the specified granularity (day, hour, minute, etc.).
Basic usage::
# rounds up the A column to the next second
rule = DateTimeRoundUpRule("A", "second")
rule.apply(data)
# rounds up the A column to the next day
rule = DateTimeRoundUpRule("A", "day")
rule.apply(data)
Args:
input_column (str): The column name to round according to the unit specified.
unit (str): Specifies the unit of rounding.
That is: rounding up to the next day, hour, minute, etc.
The supported units are:
day: Rounds up to the next day if there are any hours/minutes/etc.
hour: Rounds up to the next hour if there are any minutes/etc.
minute: Rounds up to the next minute if there are any seconds/etc.
second: Rounds up to the next second if there are any milliseconds/etc.
millisecond: Rounds up to the next millisecond if there are any microseconds
microsecond: Rounds up to the next microsecond if there are any nanoseconds
output_column (Optional[str]): The name of a new column with the result. Optional.
If not provided, the result is updated in place.
In strict mode, if provided, the output_column must not exist in the input dataframe.
In non-strict mode, if provided, the output_column will overwrite a column with
the same name in the input dataframe (if any).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, overwriting existing columns is ignored.
"""
def __init__(self, input_column: str, unit: str, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert isinstance(unit, str) and unit in ROUND_TRUNC_UNITS, f"unit must be one of {ROUND_TRUNC_UNITS} and not '{unit}'"
self.unit = unit
DateTimeSubstractRule (BaseAssignColumnRule)
¶
Subtracts a number of units (days, hours, minutes, etc.) from a datetime column.
Basic usage::
# subtracts 2 days from the A column
rule = DateTimeSubstractRule("A", 2, "days")
rule.apply(data)
# subtracts 2 hours from the A column
rule = DateTimeSubstractRule("A", 2, "hours")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_column |
str |
The name of a datetime column to subtract from. |
required |
unit_value |
Union[int,float,str] |
The number of units to subtract from the datetime column. The unit_value can be negative, in which case this rule performs an addition. A name of an existing column can be passed into unit_value, in which case that column will be subtracted from the input_column. If the column is a timedelta, it is subtracted as-is; if it's a numeric column, it is interpreted based on the unit parameter (e.g. days/hours/etc.). In this case, if the column specified in unit_value doesn't exist, MissingColumnError is raised. |
required |
unit |
str |
Specifies what unit the unit_value is in. Supported values are: days, hours, minutes, seconds, microseconds, nanoseconds. |
required |
output_column |
Optional[str] |
The name of a new column with the result. Optional. If not provided, the result is updated in place. In strict mode, if provided, the output_column must not exist in the input dataframe. In non-strict mode, if provided, the output_column will overwrite a column with the same name in the input dataframe (if any). |
None |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if the input_column doesn't exist in the input dataframe. |
MissingColumnError |
raised if unit_value is a name of a column but it doesn't exist in the input dataframe. |
ColumnAlreadyExistsError |
raised in strict mode only if the output_column already exists in the dataframe. |
Note
In non-strict mode, missing columns or overwriting existing columns are ignored.
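A minimal sketch of the column-based form described above, where the amount to subtract comes from another column. The backend module, the RuleData import path and the sample data are assumptions::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import DateTimeSubstractRule  # assumed backend module

df = pd.DataFrame({
    "start": pd.to_datetime(["2023-05-19 10:00:00", "2023-05-20 12:00:00"]),
    "offset": [1, 3],  # a numeric column, interpreted via the unit parameter
})
data = RuleData(df)
# subtracts offset (interpreted as days) from start, writing the result to a new column
rule = DateTimeSubstractRule("start", "offset", "days", output_column="adjusted")
rule.apply(data)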
Source code in etlrules/backends/common/datetime.py
class DateTimeSubstractRule(BaseAssignColumnRule):
""" Substracts a number of units (days, hours, minutes, etc.) from a datetime column.
Basic usage::
# subtracts 2 days from the A column
rule = DateTimeSubstractRule("A", 2, "days")
rule.apply(data)
# subtracts 2 hours from the A column
rule = DateTimeSubstractRule("A", 2, "hours")
rule.apply(data)
Args:
input_column (str): The name of a datetime column to subtract from.
unit_value (Union[int,float,str]): The number of units to subtract from the datetime column.
The unit_value can be negative, in which case this rule performs an addition.
A name of an existing column can be passed into unit_value, in which case that
column will be subtracted from the input_column.
If the column is a timedelta, it is subtracted as-is; if it's a numeric column,
it is interpreted based on the unit parameter (e.g. days/hours/etc.).
In this case, if the column specified in the unit_value doesn't exist,
MissingColumnError is raised.
unit (str): Specifies what unit the unit_value is in. Supported values are:
days, hours, minutes, seconds, microseconds, nanoseconds.
output_column (Optional[str]): The name of a new column with the result. Optional.
If not provided, the result is updated in place.
In strict mode, if provided, the output_column must not exist in the input dataframe.
In non-strict mode, if provided, the output_column will overwrite a column with
the same name in the input dataframe (if any).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
MissingColumnError: raised if unit_value is a name of a column but it doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, missing columns or overwriting existing columns are ignored.
"""
def __init__(self, input_column: str, unit_value: Union[int, float, str],
unit: Optional[Literal["years", "months", "weeks", "weekdays", "days", "hours", "minutes", "seconds", "milliseconds", "microseconds", "nanoseconds"]],
output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.unit_value = unit_value
if not isinstance(self.unit_value, str):
assert unit in DT_ARITHMETIC_UNITS, f"Unsupported unit: '{unit}'. It must be one of {DT_ARITHMETIC_UNITS}"
self.unit = unit
DateTimeToStrFormatRule (BaseAssignColumnRule)
¶
Formats a datetime column to a string representation according to a specified format.
Basic usage::
# displays the dates in column col_A in the %Y-%m-%d format, e.g. 2023-05-19
rule = DateTimeToStrFormatRule("col_A", format="%Y-%m-%d")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
input_column |
str |
The datetime column with the values to format to string. |
required |
format |
str |
The format used to display the date/time. E.g. %Y-%m-%d For the directives accepted in the format, have a look at: https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior |
required |
output_column |
Optional[str] |
An optional column to hold the formatted results. If provided, the existing column is unchanged, and a new column with this name is created. If not provided, the result is updated in place. |
None |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if the input column doesn't exist in the input dataframe. |
ColumnAlreadyExistsError |
raised in strict mode only if the output_column already exists in the dataframe. |
Note
In non-strict mode, overwriting existing columns is ignored.
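A minimal sketch that writes the formatted strings to a new column so the original datetime column is preserved; the backend module and the RuleData import path are assumptions::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import DateTimeToStrFormatRule  # assumed backend module

df = pd.DataFrame({"col_A": pd.to_datetime(["2023-05-19 10:20:30"])})
data = RuleData(df)
# col_A stays a datetime; col_A_str holds the string "19/05/2023 10:20"
rule = DateTimeToStrFormatRule("col_A", format="%d/%m/%Y %H:%M", output_column="col_A_str")
rule.apply(data)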
Source code in etlrules/backends/common/datetime.py
class DateTimeToStrFormatRule(BaseAssignColumnRule):
""" Formats a datetime column to a string representation according to a specified format.
Basic usage::
# displays the dates in column col_A in the %Y-%m-%d format, e.g. 2023-05-19
rule = DateTimeToStrFormatRule("col_A", format="%Y-%m-%d")
rule.apply(data)
Args:
input_column (str): The datetime column with the values to format to string.
format: The format used to display the date/time.
E.g. %Y-%m-%d
For the directives accepted in the format, have a look at:
https://docs.python.org/3/library/datetime.html#strftime-and-strptime-behavior
output_column (Optional[str]): An optional column to hold the formatted results.
If provided, the existing column is unchanged, and a new column with this name
is created.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, overwriting existing columns is ignored.
"""
def __init__(self, input_column: str, format: str, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
self.format = format
DateTimeUTCNowRule (UnaryOpBaseRule)
¶
Adds a new column with the UTC date/time.
Basic usage::
rule = DateTimeUTCNowRule(output_column="UTCTimeNow")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_column |
The name of the column to be added to the dataframe. This column will be populated with the UTC date/time at the time of the call. The same value will be populated for all rows. The date/time populated is a "naive" datetime, i.e. it doesn't carry timezone information. |
required | |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
ColumnAlreadyExistsError |
raised in strict mode only if the output_column already exists in the input dataframe. |
Note
In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
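A minimal sketch of the strict vs non-strict behavior when the output column already exists; the backend module and the RuleData import path are assumptions::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import DateTimeUTCNowRule  # assumed backend module

df = pd.DataFrame({"UTCTimeNow": [None, None]})
# strict mode (the default) would raise ColumnAlreadyExistsError here:
# DateTimeUTCNowRule(output_column="UTCTimeNow").apply(RuleData(df))
# non-strict mode overwrites the existing column with the current UTC time
rule = DateTimeUTCNowRule(output_column="UTCTimeNow", strict=False)
rule.apply(RuleData(df))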
Source code in etlrules/backends/common/datetime.py
class DateTimeUTCNowRule(UnaryOpBaseRule):
""" Adds a new column with the UTC date/time.
Basic usage::
rule = DateTimeUTCNowRule(output_column="UTCTimeNow")
rule.apply(data)
Args:
output_column: The name of the column to be added to the dataframe.
This column will be populated with the UTC date/time at the time of the call.
The same value will be populated for all rows.
The date/time populated is a "naive" datetime ie: doesn't have a timezone information.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the input dataframe.
Note:
In non-strict mode, if the output_column exists in the input dataframe, it will be overwritten.
"""
def __init__(self, output_column, named_input:Optional[str]=None, named_output:Optional[str]=None, name:Optional[str]=None, description:Optional[str]=None, strict:bool=True):
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert output_column and isinstance(output_column, str)
self.output_column = output_column
fill
¶
BackFillRule (BaseFillRule)
¶
Replaces NAs/missing values with the next non-NA value, optionally sorting and grouping the data.
Example::
| A | B |
| a | NA |
| b | 2 |
| a | NA |
After a back fill::
| A | B |
| a | 2 |
| b | 2 |
| a | NA |
After a back fill with group_by=["A"]::
| A | B |
| a | NA |
| b | 2 |
| a | NA |
The "a" group has no non-NA value, so it is not filled. The "b" group has a non-NA value of 2 but no other NA values, so there is nothing to fill.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns |
Iterable[str] |
The list of columns to replace NAs for. The rest of the columns in the dataframe are not affected. |
required |
sort_by |
Optional[Iterable[str]] |
The list of columns to sort by before the fill operation. Optional. Given the next non-NA values are used, sorting can make a difference in the values used. |
None |
sort_ascending |
bool |
When sort_by is specified, True means sort ascending, False sort descending. |
True |
group_by |
Optional[Iterable[str]] |
The list of columns to group by before the fill operation. Optional. The fill values are only used within a group, other adjacent groups are not filled. Useful when you want to copy (fill) data at a certain group level. |
None |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe. |
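The grouped example above can be reproduced with a minimal sketch like the one below; the backend module, the RuleData import path and the sample data are assumptions::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import BackFillRule  # assumed backend module

df = pd.DataFrame({"A": ["a", "b", "a"], "B": [None, 2.0, None]})
data = RuleData(df)
# fills B backwards within each A group; the "a" group has no non-NA value, so it stays NA
rule = BackFillRule(columns=["B"], group_by=["A"])
rule.apply(data)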
Source code in etlrules/backends/common/fill.py
class BackFillRule(BaseFillRule):
""" Replaces NAs/missing values with the next non-NA value, optionally sorting and grouping the data.
Example::
| A | B |
| a | NA |
| b | 2 |
| a | NA |
After a back fill::
| A | B |
| a | 2 |
| b | 2 |
| a | NA |
After a back fill with group_by=["A"]::
| A | B |
| a | NA |
| b | 2 |
| a | NA |
The "a" group has no non-NA value, so it is not filled.
The "b" group has a non-NA value of 2 but not other NA values, so nothing to fill.
Args:
columns (Iterable[str]): The list of columns to replace NAs for.
The rest of the columns in the dataframe are not affected.
sort_by (Optional[Iterable[str]]): The list of columns to sort by before the fill operation. Optional.
Given the next non-NA values are used, sorting can make a difference in the values used.
sort_ascending (bool): When sort_by is specified, True means sort ascending, False sort descending.
group_by (Optional[Iterable[str]]): The list of columns to group by before the fill operation. Optional.
The fill values are only used within a group, other adjacent groups are not filled.
Useful when you want to copy (fill) data at a certain group level.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe.
"""
BaseFillRule (UnaryOpBaseRule)
¶
Source code in etlrules/backends/common/fill.py
class BaseFillRule(UnaryOpBaseRule):
FILL_METHOD = None
def __init__(self, columns: Iterable[str], sort_by: Optional[Iterable[str]]=None, sort_ascending: bool=True, group_by: Optional[Iterable[str]]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
assert self.FILL_METHOD is not None
assert columns, "Columns need to be specified for fill rules."
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.columns = [col for col in columns]
assert all(isinstance(col, str) for col in self.columns), "All columns must be strings in fill rules."
self.sort_by = sort_by
if self.sort_by is not None:
self.sort_by = [col for col in self.sort_by]
assert all(isinstance(col, str) for col in self.sort_by), "All sort_by columns must be strings in fill rules when specified."
self.sort_ascending = sort_ascending
self.group_by = group_by
if self.group_by is not None:
self.group_by = [col for col in self.group_by]
assert all(isinstance(col, str) for col in self.group_by), "All group_by columns must be strings in fill rules when specified."
def do_apply(self, df):
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
df_columns = [col for col in df.columns]
if self.sort_by:
if not set(self.sort_by) <= set(df_columns):
raise MissingColumnError(f"Missing sort_by column(s) in fill operation: {set(self.sort_by) - set(df_columns)}")
if self.group_by:
if not set(self.group_by) <= set(df_columns):
raise MissingColumnError(f"Missing group_by column(s) in fill operation: {set(self.group_by) - set(df_columns)}")
df = self.do_apply(df)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
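As a hedged illustration of the two modes, the sketch below runs the same rule in pipeline mode and in graph mode; the RuleData import path and its constructor arguments are assumptions::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import ForwardFillRule  # assumed backend module

# pipeline mode: a single unnamed dataframe acts as the main input/output
data = RuleData(pd.DataFrame({"B": [1.0, None]}))
ForwardFillRule(columns=["B"]).apply(data)

# graph mode: rules read and write named dataframes
data = RuleData(named_inputs={"in": pd.DataFrame({"B": [1.0, None]})})  # assumed keyword argument
ForwardFillRule(columns=["B"], named_input="in", named_output="filled").apply(data)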
Source code in etlrules/backends/common/fill.py
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
df_columns = [col for col in df.columns]
if self.sort_by:
if not set(self.sort_by) <= set(df_columns):
raise MissingColumnError(f"Missing sort_by column(s) in fill operation: {set(self.sort_by) - set(df_columns)}")
if self.group_by:
if not set(self.group_by) <= set(df_columns):
raise MissingColumnError(f"Missing group_by column(s) in fill operation: {set(self.group_by) - set(df_columns)}")
df = self.do_apply(df)
self._set_output_df(data, df)
ForwardFillRule (BaseFillRule)
¶
Replaces NAs/missing values with the previous non-NA value, optionally sorting and grouping the data.
Example::
| A | B |
| a | 1 |
| b | NA |
| a | NA |
After a fill forward::
| A | B |
| a | 1 |
| b | 1 |
| a | 1 |
After a fill forward with group_by=["A"]::
| A | B |
| a | 1 |
| b | NA |
| a | 1 |
The "a" group has 1 as its first non-NA value and that value is carried "forward" to fill the 3rd row. The "b" group has no non-NA values, so there is nothing to fill.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
columns |
Iterable[str] |
The list of columns to replace NAs for. The rest of the columns in the dataframe are not affected. |
required |
sort_by |
Optional[Iterable[str]] |
The list of columns to sort by before the fill operation. Optional. Given the previous non-NA values are used, sorting can make a difference in the values used. |
None |
sort_ascending |
bool |
When sort_by is specified, True means sort ascending, False sort descending. |
True |
group_by |
Optional[Iterable[str]] |
The list of columns to group by before the fill operation. Optional. The fill values are only used within a group, other adjacent groups are not filled. Useful when you want to copy (fill) data at a certain group level. |
None |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe. |
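A minimal sketch reproducing the grouped forward fill above; the backend module, the RuleData import path and the sample data are assumptions::
import pandas as pd
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import ForwardFillRule  # assumed backend module

df = pd.DataFrame({"A": ["a", "b", "a"], "B": [1.0, None, None]})
data = RuleData(df)
# carries the last non-NA value of B forward within each A group:
# the 3rd row (group "a") becomes 1, while the "b" row stays NA
rule = ForwardFillRule(columns=["B"], group_by=["A"])
rule.apply(data)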
Source code in etlrules/backends/common/fill.py
class ForwardFillRule(BaseFillRule):
""" Replaces NAs/missing values with the next non-NA value, optionally sorting and grouping the data.
Example::
| A | B |
| a | 1 |
| b | NA |
| a | NA |
After a fill forward::
| A | B |
| a | 1 |
| b | 1 |
| a | 1 |
After a fill forward with group_by=["A"]::
| A | B |
| a | 1 |
| b | NA |
| a | 1 |
The "a" group has the first non-NA value as 1 and that is used "forward" to fill the 3rd row.
The "b" group has no non-NA values, so nothing to fill.
Args:
columns (Iterable[str]): The list of columns to replace NAs for.
The rest of the columns in the dataframe are not affected.
sort_by (Optional[Iterable[str]]): The list of columns to sort by before the fill operation. Optional.
Given the previous non-NA values are used, sorting can make a difference in the values used.
sort_ascending (bool): When sort_by is specified, True means sort ascending, False sort descending.
group_by (Optional[Iterable[str]]): The list of columns to group by before the fill operation. Optional.
The fill values are only used within a group, other adjacent groups are not filled.
Useful when you want to copy (fill) data at a certain group level.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any columns specified in either columns, sort_by or group_by are missing from the dataframe.
"""
io
special
¶
db
¶
ReadSQLQueryRule (BaseRule)
¶
Runs a SQL query and reads the results back into a dataframe.
Basic usage::
# reads all the data from a sqlite db called mydb.db, from the table MyTable
# saves the dataframe as the main output of the rule which subsequent rules can use as their main input
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable")
rule.apply(data)
# reads all the data from a sqlite db called mydb.db, from the table MyTable
# saves the dataframe as a named output called MyData which subsequent rules can use by name
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", named_output="MyData")
rule.apply(data)
# same as the first example, but uses explicit column types rather than relying on type inference
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", column_types={"ColA": "int64", "ColB": "string"})
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql_engine |
str |
A sqlalchemy engine string. This is typically in the form: dialect+driver://username:password@host:port/database For more information, please refer to the sqlalchemy documentation here: https://docs.sqlalchemy.org/en/20/core/engines.html In order to support users and passwords in the sql_engine string, substitution of environment variables is supported using the {env.VARIABLE_NAME} form. For example, adding the USER and PASSWORD environment variables in the sql string could be done as: sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb" In this example, when you run, env.USER, env.PASSWORD and env.DB_HOST will be replaced with the respective environment variables, allowing you to not hardcode them in the plan for security reasons but also for configurability. A similar substitution can be achieved using the plan context via the context.property form, e.g. sql_engine = "postgres://{context.USER}:{env.PASSWORD}@{context.DB_HOST}/mydb" It's not recommended to store passwords in plain text in the plan. |
required |
sql_query |
str |
A SQL SELECT statement that will specify the columns, table and optionally any WHERE, GROUP BY, ORDER BY clauses. The SQL statement must be valid for the SQL engine specified in the sql_engine parameter. The env and context substitutions work in the sql_query too. E.g.: SELECT * from {env.SCHEMA}.{context.TABLE_NAME} WHERE {context.FILTER} This allows you to parameterize the plan at run time. |
required |
column_types |
Optional[Mapping[str, str]] |
A mapping of column names and their types. Column types are inferred from the data when this parameter is not specified. For empty result sets, this inference is not possible, so specifying the column types allows the users to control the types in that scenario and not fall back onto backend defaults. |
None |
batch_size |
int |
An optional batch size (number of rows) to use when reading the results. Default: 50000. Some backends ignore this option; others use it to partition the data. |
50000 |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
SQLError |
raised if there's an error running the sql statement. |
UnsupportedTypeError |
raised if column_types are specified and any of them are not supported. |
Note
The implementation uses sqlalchemy, which must be installed as an optional dependency of etlrules.
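A minimal sketch of the {env.VARIABLE_NAME} substitution described above, keeping credentials out of the plan. The environment variable names are hypothetical, and the backend module and the RuleData import path are assumptions::
import os
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import ReadSQLQueryRule  # assumed backend module

# hypothetical environment variables, set here only to make the example self-contained
os.environ["DB_USER"] = "reader"
os.environ["DB_PASSWORD"] = "secret"
os.environ["DB_HOST"] = "localhost"

# the {env.NAME} placeholders are substituted at run time
rule = ReadSQLQueryRule(
    "postgresql://{env.DB_USER}:{env.DB_PASSWORD}@{env.DB_HOST}/mydb",
    "SELECT * FROM MyTable",
    named_output="MyData",
)
data = RuleData()  # assumed: read rules need no input dataframe (has_input() is False)
rule.apply(data)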
Source code in etlrules/backends/common/io/db.py
class ReadSQLQueryRule(BaseRule):
""" Runs a SQL query and reads the results back into a dataframe.
Basic usage::
# reads all the data from a sqlite db called mydb.db, from the table MyTable
# saves the dataframe as the main output of the rule which subsequent rules can use as their main input
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable")
rule.apply(data)
# reads all the data from a sqlite db called mydb.db, from the table MyTable
# saves the dataframe as a named output called MyData which subsequent rules can use by name
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", named_output="MyData")
rule.apply(data)
# same as the first example, but uses explicit column types rather than relying on type inference
rule = ReadSQLQueryRule("sqlite:///mydb.db", "SELECT * FROM MyTable", column_types={"ColA": "int64", "ColB": "string"})
rule.apply(data)
Args:
sql_engine: A sqlalchemy engine string. This is typically in the form:
dialect+driver://username:password@host:port/database
For more information, please refer to the sqlalchemy documentation here:
https://docs.sqlalchemy.org/en/20/core/engines.html
In order to support users and passwords in the sql_engine string, substitution of environment variables
is supported using the {env.VARIABLE_NAME} form.
For example, adding the USER and PASSWORD environment variables in the sql string could be done as:
sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb"
In this example, when you run, env.USER, env.PASSWORD and env.DB_HOST will be replaced with the respective
environment variables, allowing you to not hardcode them in the plan for security reasons but also for
configurability.
A similar substitution can be achieved using the plan context via the context.property form, e.g.
sql_engine = "postgres://{context.USER}:{env.PASSWORD}@{context.DB_HOST}/mydb"
It's not recommended to store passwords in plain text in the plan.
sql_query: A SQL SELECT statement that will specify the columns, table and optionally any WHERE, GROUP BY, ORDER BY clauses.
The SQL statement must be valid for the SQL engine specified in the sql_engine parameter.
The env and context substitutions work in the sql_query too. E.g.:
SELECT * from {env.SCHEMA}.{context.TABLE_NAME} WHERE {context.FILTER}
This allows you to parameterize the plan at run time.
column_types: A mapping of column names and their types. Column types are inferred from the data when this parameter
is not specified. For empty result sets, this inference is not possible, so specifying the column types allows
the users to control the types in that scenario and not fall back onto backend defaults.
batch_size: An optional batch size (number of rows) to use when reading the results. Default: 50000.
Some backends ignore this option; others use it to partition the data.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
SQLError: raised if there's an error running the sql statement.
UnsupportedTypeError: raised if column_types are specified and any of them are not supported.
Note:
The implementation uses sqlalchemy, which must be installed as an optional dependency of etlrules.
"""
def __init__(self, sql_engine: str, sql_query: str, column_types: Optional[Mapping[str, str]]=None, batch_size: int=50_000, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_output=named_output, name=name, description=description, strict=strict)
self.sql_engine = sql_engine
self.sql_query = sql_query
if not self.sql_engine or not isinstance(self.sql_engine, str):
raise ValueError("The sql_engine parameter must be a non-empty string.")
if not self.sql_query or not isinstance(self.sql_query, str):
raise ValueError("The sql_query parameter must be a non-empty string.")
self.column_types = column_types
self._validate_column_types()
self.batch_size = batch_size
def _validate_column_types(self):
if self.column_types is not None:
for column, column_type in self.column_types.items():
if column_type not in SUPPORTED_TYPES:
raise UnsupportedTypeError(f"Type '{column_type}' for column '{column}' is not supported.")
def has_input(self):
return False
def _do_apply(self, connection):
raise NotImplementedError("Can't instantiate base class.")
def _get_sql_engine(self) -> str:
sql_engine = subst_string(self.sql_engine)
if not sql_engine:
raise ValueError("The sql_engine parameter must be a non-empty string.")
return sql_engine
def _get_sql_query(self) -> str:
sql_query = subst_string(self.sql_query)
if not sql_query:
raise ValueError("The sql_query parameter must be a non-empty string.")
return sql_query
def apply(self, data):
super().apply(data)
sql_engine = self._get_sql_engine()
engine = SQLAlchemyEngines.get_engine(sql_engine)
with engine.connect() as connection:
try:
result = self._do_apply(connection)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
self._set_output_df(data, result)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/io/db.py
def apply(self, data):
super().apply(data)
sql_engine = self._get_sql_engine()
engine = SQLAlchemyEngines.get_engine(sql_engine)
with engine.connect() as connection:
try:
result = self._do_apply(connection)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
self._set_output_df(data, result)
has_input(self)
¶
Returns True if the rule needs a dataframe input to operate on, False otherwise.
By default, it returns True. It should be overridden to return False for those rules which read data into the plan. For example, reading a csv file or reading a table from the DB. These are operations which do not need an input dataframe to operate on as they are sourcing data.
Source code in etlrules/backends/common/io/db.py
def has_input(self):
return False
WriteSQLTableRule (UnaryOpBaseRule)
¶
Writes the data from the input dataframe into a SQL table in a database.
The rule is a final rule, which means it produces no additional outputs; it takes any of the existing outputs and writes it to the DB. If the named_input is specified, the input with that name is written; otherwise, it takes the main output of the preceding rule.
Basic usage::
# writes the main input to a table called MyTable in a sqlite DB called mydb.db
# If the table already exists, it replaces it
rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="replace")
rule.apply(data)
# writes the dataframe input called 'input_data' to a table MyTable in a sqlite DB mydb.db
# If the table already exists, it appends the data to it
rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="append", named_input="input_data")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sql_engine |
str |
A sqlalchemy engine string. This is typically in the form: dialect+driver://username:password@host:port/database For more information, please refer to the sqlalchemy documentation here: https://docs.sqlalchemy.org/en/20/core/engines.html In order to support users and passwords in the sql_engine string, substitution of environment variables is supported using the {env.VARIABLE_NAME} form. For example, adding the USER and PASSWORD environment variables in the sql string could be done as: sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb" In this example, when you run, env.USER, env.PASSWORD and env.DB_HOST will be replaced with the respective environment variables, allowing you to not hardcode them in the plan for security reasons but also for configurability. |
required |
sql_table |
str |
The name of the sql table to write to. |
required |
if_exists |
str |
Specifies what to do in case the table already exists in the database. The options are: - replace: drops all the existing data and inserts the data in the input dataframe - append: adds the data in the input dataframe to the existing data in the table - fail: Raises a ValueError exception Default: fail. |
'fail' |
named_input |
Optional[str] |
Select by name the dataframe to write from the input data. Optional. When not specified, the main output of the previous rule will be written. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does and how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True. |
True |
Exceptions:
Type | Description |
---|---|
ValueError |
raised if the table already exists when if_exists is fail. ValueError is also raised if any of the arguments passed into the rule are not strings or are empty strings. |
SQLError |
raised if there's any problem writing the data into the database. For example: If the schema doesn't match the schema of the table written to (for existing tables). |
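Since the rule is final, a typical use is at the end of a short chain. A minimal sketch, with the backend module and the RuleData import path assumed and the file and table names hypothetical::
from etlrules.data import RuleData  # assumed import path
from etlrules.backends.pandas import ReadCSVFileRule, WriteSQLTableRule  # assumed backend module

data = RuleData()
# read a csv into the main output, then write it to a sqlite table, replacing any existing data
ReadCSVFileRule("data.csv", "/home/myuser/").apply(data)
WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="replace").apply(data)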
Source code in etlrules/backends/common/io/db.py
class WriteSQLTableRule(UnaryOpBaseRule):
""" Writes the data from the input dataframe into a SQL table in a database.
The rule is a final rule, which means it produces no additional outputs; it takes any of the existing
outputs and writes it to the DB. If the named_input is specified, the input with that name
is written; otherwise, it takes the main output of the preceding rule.
Basic usage::
# writes the main input to a table called MyTable in a sqlite DB called mydb.db
# If the table already exists, it replaces it
rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="replace")
rule.apply(data)
# writes the dataframe input called 'input_data' to a table MyTable in a sqlite DB mydb.db
# If the table already exists, it appends the data to it
rule = WriteSQLTableRule("sqlite:///mydb.db", "MyTable", if_exists="append", named_input="input_data")
rule.apply(data)
Args:
sql_engine: A sqlalchemy engine string. This is typically in the form:
dialect+driver://username:password@host:port/database
For more information, please refer to the sqlalchemy documentation here:
https://docs.sqlalchemy.org/en/20/core/engines.html
In order to support users and passwords in the sql_engine string, substitution of environment variables
is supported using the {env.VARIABLE_NAME} form.
For example, adding the USER and PASSWORD environment variables in the sql string could be done as:
sql_engine = "postgres://{env.USER}:{env.PASSWORD}@{env.DB_HOST}/mydb"
In this example, when you run, env.USER, env.PASSWORD and env.DB_HOST will be replaced with the respective
environment variables, allowing you to not hardcode them in the plan for security reasons but also for
configurability.
sql_table: The name of the sql table to write to.
if_exists: Specifies what to do in case the table already exists in the database.
The options are:
- replace: drops all the existing data and inserts the data in the input dataframe
- append: adds the data in the input dataframe to the existing data in the table
- fail: Raises a ValueError exception
Default: fail.
named_input (Optional[str]): Select by name the dataframe to write from the input data.
Optional. When not specified, the main output of the previous rule will be written.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does and how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True.
Raises:
ValueError: raised if the table already exists when the if_exists is fail.
ValueError is also raised if any of the arguments passed into the rule are not strings or are empty strings.
SQLError: raised if there's any problem writing the data into the database.
For example: If the schema doesn't match the schema of the table written to (for existing tables).
"""
class IF_EXISTS_OPTIONS:
APPEND = 'append'
REPLACE = 'replace'
FAIL = 'fail'
ALL_IF_EXISTS_OPTIONS = {IF_EXISTS_OPTIONS.APPEND, IF_EXISTS_OPTIONS.REPLACE, IF_EXISTS_OPTIONS.FAIL}
EXCLUDE_FROM_SERIALIZE = ("named_output", )
def __init__(self, sql_engine: str, sql_table: str, if_exists: str='fail', named_input=None, name=None, description=None, strict=True):
super().__init__(named_input=named_input, named_output=None, name=name, description=description, strict=strict)
self.sql_engine = sql_engine
if not self.sql_engine or not isinstance(self.sql_engine, str):
raise ValueError("The sql_engine parameter must be a non-empty string.")
self.sql_table = sql_table
if not self.sql_table or not isinstance(self.sql_table, str):
raise ValueError("The sql_table parameter must be a non-empty string.")
self.if_exists = if_exists
if self.if_exists not in self.ALL_IF_EXISTS_OPTIONS:
raise ValueError(f"'{if_exists}' is not a valid value for the if_exists parameter. It must be one of: '{self.ALL_IF_EXISTS_OPTIONS}'")
def has_output(self):
return False
def _get_sql_engine(self) -> str:
sql_engine = subst_string(self.sql_engine)
if not sql_engine:
raise ValueError("The sql_engine parameter must be a non-empty string.")
return sql_engine
def _get_sql_table(self) -> str:
sql_table = subst_string(self.sql_table)
if not sql_table:
raise ValueError("The sql_table parameter must be a non-empty string.")
return sql_table
has_output(self)
¶
Returns True if the rule produces a dataframe, False otherwise.
By default, it returns True. It should be overridden to return False for those rules which write data out of the plan. For example, writing a file or data into a database. These are operations which do not produce an output dataframe into the plan as they are writing data outside the plan.
Source code in etlrules/backends/common/io/db.py
def has_output(self):
return False
files
¶
BaseReadFileRule (BaseRule)
¶
Source code in etlrules/backends/common/io/files.py
class BaseReadFileRule(BaseRule):
def __init__(self, file_name: str, file_dir: Optional[str]=None, regex: bool=False, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_output=named_output, name=name, description=description, strict=strict)
self.file_name = file_name
self.file_dir = file_dir
self.regex = bool(regex)
if self._is_uri() and self.regex:
raise ValueError("Regex read not supported for URIs.")
def _is_uri(self):
file_name = self.file_name.lower()
return file_name.startswith("http://") or file_name.startswith("https://")
def has_input(self):
return False
def _get_full_file_paths(self):
file_name = subst_string(self.file_name)
file_dir = subst_string(self.file_dir or "")
if self.regex:
pattern = re.compile(file_name)
for fn in os.listdir(file_dir):
if pattern.match(fn):
yield os.path.join(file_dir, fn)
else:
if self._is_uri():
yield file_name
else:
yield os.path.join(file_dir, file_name)
def do_read(self, file_path: str):
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def do_concat(self, left_df, right_df):
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def apply(self, data):
super().apply(data)
result = None
for file_path in self._get_full_file_paths():
df = self.do_read(file_path)
if result is None:
result = df
else:
result = self.do_concat(result, df)
self._set_output_df(data, result)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/io/files.py
def apply(self, data):
super().apply(data)
result = None
for file_path in self._get_full_file_paths():
df = self.do_read(file_path)
if result is None:
result = df
else:
result = self.do_concat(result, df)
self._set_output_df(data, result)
has_input(self)
¶
Returns True if the rule needs a dataframe input to operate on, False otherwise.
By default, it returns True. It should be overridden to return False for those rules which read data into the plan. For example, reading a csv file or reading a table from the DB. These are operations which do not need an input dataframe to operate on as they are sourcing data.
Source code in etlrules/backends/common/io/files.py
def has_input(self):
return False
BaseWriteFileRule (UnaryOpBaseRule)
¶
Source code in etlrules/backends/common/io/files.py
class BaseWriteFileRule(UnaryOpBaseRule):
EXCLUDE_FROM_SERIALIZE = ("named_output", )
def __init__(self, file_name, file_dir=".", named_input=None, name=None, description=None, strict=True):
super().__init__(named_input=named_input, named_output=None, name=name, description=description, strict=strict)
self.file_name = file_name
self.file_dir = file_dir
def has_output(self):
return False
def do_write(self, file_name: str, file_dir: str, df) -> None:
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
self.do_write(subst_string(self.file_name), subst_string(self.file_dir), df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/io/files.py
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
self.do_write(subst_string(self.file_name), subst_string(self.file_dir), df)
has_output(self)
¶
Returns True if the rule produces a dataframe, False otherwise.
By default, it returns True. It should be overridden to return False for those rules which write data out of the plan. For example, writing a file or data into a database. These are operations which do not produce an output dataframe into the plan as they are writing data outside the plan.
Source code in etlrules/backends/common/io/files.py
def has_output(self):
return False
ReadCSVFileRule (BaseReadFileRule)
¶
Reads one or multiple csv files from a directory and persists them as a dataframe for subsequent rules to operate on.
Basic usage::
# reads a file data.csv and persists it as the main output of the rule
rule = ReadCSVFileRule("data.csv", "/home/myuser/")
rule.apply(data)
# reads a file test_data.csv and persists it as the input_data named output
# other rules can specify input_data as their named_input to operate on it
rule = ReadCSVFileRule("test_data.csv", "/home/myuser/", named_output="input_data")
rule.apply(data)
# extracts all files starting with data followed by 4 digits and concatenates them
# e.g. data1234.csv, data5678.csv, etc.
rule = ReadCSVFileRule("data[0-9]{4}.csv", "/home/myuser/", regex=True, named_output="input_data")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_name |
str |
The name of the csv file to load. The format will be inferred from the extension of the file. A simple text csv file will be inferred from the .csv extension. The extensions like .zip, .gz, .bz2, .xz will extract a single compressed csv file from the given input compressed file. file_name can also be a regular expression (specify regex=True in that case). The reader will find all the files in the file_dir directory that match the regular expression, extract all those csv files and concatenate them into a single dataframe. For example, file_name=".*\.csv", file_dir=".", regex=True will extract all the files with the .csv extension from the current directory. It can also be a URI (e.g. https://example.com/mycsv.csv) |
required |
file_dir |
Optional[str] |
The file directory where the file_name is located. When file_name is a regular expression and the regex parameter is True, file_dir is the directory that is inspected for any files that match the regular expression. Optional. For files it defaults to . (i.e. the current directory). Ignored for URIs. |
None |
regex |
bool |
When True, the file_name is interpreted as a regular expression. Defaults to False. |
False |
separator |
str |
The single character to be used as separator in the csv file. Defaults to , (comma). |
',' |
header |
bool |
When True, the first line is interpreted as the header and the column names are extracted from it. When False, the first line is part of the data and the columns will have names like 0, 1, 2, etc. Defaults to True. |
True |
skip_header_rows |
Optional[int] |
Optional number of rows to skip at the top of the file, before the header. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
IOError |
raised when the file is not found. |
Source code in etlrules/backends/common/io/files.py
class ReadCSVFileRule(BaseReadFileRule):
r""" Reads one or multiple csv files from a directory and persists it as a dataframe for subsequent rules to operate on.
Basic usage::
# reads a file data.csv and persists it as the main output of the rule
rule = ReadCSVFileRule("data.csv", "/home/myuser/")
rule.apply(data)
# reads a file test_data.csv and persists it as the input_data named output
# other rules can specify input_data as their named_input to operate on it
rule = ReadCSVFileRule("test_data.csv", "/home/myuser/", named_output="input_data")
rule.apply(data)
# extracts all files starting with data followed by 4 digits and concatenate them
# e.g. data1234.csv, data5678.csv, etc.
rule = ReadCSVFileRule("data[0-9]{4}.csv", "/home/myuser/", regex=True, named_output="input_data")
rule.apply(data)
Args:
file_name: The name of the csv file to load. The format will be inferred from the extension of the file.
A simple text csv file will be inferred from the .csv extension. The extensions like .zip, .gz, .bz2, .xz
will extract a single compressed csv file from the given input compressed file.
file_name can also be a regular expression (specify regex=True in that case).
The reader will find all the files in the file_dir directory that match the regular expression and extract
all those csv files and concatenate them into a single dataframe.
For example, file_name=".*\.csv", file_dir=".", regex=True will extract all the files with the .csv extension
from the current directory.
It can also be a URI (e.g. https://example.com/mycsv.csv)
file_dir: The file directory where the file_name is located. When file_name is a regular expression and
the regex parameter is True, file_dir is the directory that is inspected for any files that match the
regular expression. Optional.
For files it defaults to . (ie the current directory). Ignored for URIs.
regex: When True, the file_name is interpreted as a regular expression. Defaults to False.
separator: The single character to be used as separator in the csv file. Defaults to , (comma).
header: When True, the first line is interpreted as the header and the column names are extracted from it.
When False, the first line is part of the data and the columns will have names like 0, 1, 2, etc.
Defaults to True.
skip_header_rows: Optional number of rows to skip at the top of the file, before the header.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
IOError: raised when the file is not found.
"""
def __init__(self, file_name: str, file_dir: Optional[str]=None, regex: bool=False, separator: str=",",
header: bool=True, skip_header_rows: Optional[int]=None,
named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(file_name=file_name, file_dir=file_dir, regex=regex, named_output=named_output, name=name, description=description, strict=strict)
self.separator = separator
self.header = header
self.skip_header_rows = skip_header_rows
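The constructor combines the csv-specific options with the base read options. A short usage sketch (file names and directory are hypothetical)::

    rule = ReadCSVFileRule(
        "survey.csv.gz",        # compression inferred from the .gz extension
        file_dir="/data/in",
        separator=";",          # semicolon-separated file
        skip_header_rows=2,     # skip two banner rows before the header line
        named_output="survey",
    )
    rule.apply(data)            # data: an existing RuleData instance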
ReadParquetFileRule (BaseReadFileRule)
¶
Reads one or multiple parquet files from a directory and persists them as a dataframe for subsequent rules to operate on.
Basic usage::
# reads a file data.parquet and persists it as the main output of the rule
rule = ReadParquetFileRule("data.parquet", "/home/myuser/")
rule.apply(data)

# reads a file test_data.parquet and persists it as the input_data named output
# other rules can specify input_data as their named_input to operate on it
rule = ReadParquetFileRule("test_data.parquet", "/home/myuser/", named_output="input_data")
rule.apply(data)

# reads all the files with the .parquet extension from the home dir of myuser and
# concatenates them into a single dataframe
rule = ReadParquetFileRule(".*\.parquet", "/home/myuser/", regex=True, named_output="input_data")
rule.apply(data)

# reads only the A,B,C columns from the file data.parquet
rule = ReadParquetFileRule("data.parquet", "/home/myuser/", columns=["A", "B", "C"])
rule.apply(data)

# reads only those rows where column A is greater than or equal to 10 and column B is True
rule = ReadParquetFileRule("data.parquet", "/home/myuser/", filters=[["A", ">=", 10], ["B", "==", True]])
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_name |
str |
The name of the parquet file to load. The format will be inferred from the extension of the file. file_name can also be a regular expression (specify regex=True in that case). The reader will find all the files in the file_dir directory that match the regular expression, extract all those parquet files and concatenate them into a single dataframe. For example, file_name=".*\.parquet", file_dir=".", regex=True will extract all the files with the .parquet extension from the current directory. |
required |
file_dir |
str |
The file directory where the file_name is located. When file_name is a regular expression and the regex parameter is True, file_dir is the directory that is inspected for any files that match the regular expression. Defaults to . (ie the current directory). |
'.' |
regex |
bool |
When True, the file_name is interpreted as a regular expression. Defaults to False. |
False |
columns |
Optional[Sequence[str]] |
A subset of the columns in the parquet file to load. |
None |
filters |
Union[List[Tuple], List[List[Tuple]]] |
A list of filters to apply to filter the rows returned. Rows which do not match the filter conditions will be removed from the scanned data. When passed as a List[List[Tuple]], the conditions in the inner lists are AND-ed together, with the top-level conditions OR-ed together. E.g.: ((cond1 AND cond2...) OR (cond3 AND cond4...)...) When passed as a List[Tuple], the conditions are AND-ed together. E.g.: cond1 AND cond2 AND cond3... Each condition is specified as a tuple of 3 elements: (column, operation, value). Column is the name of a column in the input dataframe. Operation is one of: "==", "=", ">", ">=", "<", "<=", "!=", "in", "not in". Value is a scalar value: int, float, string, etc. When the operation is in or not in, the value must be a list, tuple or set of values. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
IOError |
raised when the file is not found. |
ValueError |
raised if filters are specified but the format is incorrect. |
MissingColumnError |
raised if a column is specified in columns or filters but it doesn't exist in the input dataframe. |
Note
The parquet file can be compressed in which case the compression will be inferred from the file. The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".
Source code in etlrules/backends/common/io/files.py
class ReadParquetFileRule(BaseReadFileRule):
r""" Reads one or multiple parquet files from a directory and persists it as a dataframe for subsequent rules to operate on.
Basic usage::
# reads a file data.parquet and persists it as the main output of the rule
rule = ReadParquetFileRule("data.parquet", "/home/myuser/")
rule.apply(data)
# reads a file test_data.parquet and persists it as the input_data named output
# other rules can specify input_data as their named_input to operate on it
rule = ReadParquetFileRule("test_data.parquet", "/home/myuser/", named_output="input_data")
rule.apply(data)
# reads all the files with the .parquet extension from the home dir of myuser and
# concatenates them into a single dataframe
rule = ReadParquetFileRule(".*\.parquet", "/home/myuser/", named_output="input_data")
rule.apply(data)
# reads only the A,B,C columns from the file data.parquet
rule = ReadParquetFileRule("data.parquet", "/home/myuser/", columns=["A", "B", "C"])
rule.apply(data)
# reads only those rows where column A is greater than or equal to 10 and column B is True
rule = ReadParquetFileRule("data.parquet", "/home/myuser/", filters=[["A", ">=", 10], ["B", "==", True]])
rule.apply(data)
Args:
file_name: The name of the parquet file to load. The format will be inferred from the extension of the file.
file_name can also be a regular expression (specify regex=True in that case).
The reader will find all the files in the file_dir directory that match the regular expression and extract
all those parquet files and concatenate them into a single dataframe.
For example, file_name=".*\.parquet", file_dir=".", regex=True will extract all the files with the .parquet extension
from the current directory.
file_dir: The file directory where the file_name is located. When file_name is a regular expression and
the regex parameter is True, file_dir is the directory that is inspected for any files that match the
regular expression.
Defaults to . (ie the current directory).
regex: When True, the file_name is interpreted as a regular expression. Defaults to False.
columns: A subset of the columns in the parquet file to load.
filters: A list of filters to apply to filter the rows returned. Rows which do not match the filter conditions
will be removed from scanned data.
When passed as a List[List[Tuple]], the conditions in the inner lists are AND-ed together, with the top-level
conditions OR-ed together. E.g.: ((cond1 AND cond2...) OR (cond3 AND cond4...)...)
When passed as a List[Tuple], the conditions are AND-ed together. E.g.: cond1 AND cond2 AND cond3...
Each condition is specified as a tuple of 3 elements: (column, operation, value).
Column is the name of a column in the input dataframe.
Operation is one of: "==", "=", ">", ">=", "<", "<=", "!=", "in", "not in".
Value is a scalar value, int, float, string, etc. When the operation is in or not in, the value must be a list, tuple or set of values.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
IOError: raised when the file is not found.
ValueError: raised if filters are specified but the format is incorrect.
MissingColumnError: raised if a column is specified in columns or filters but it doesn't exist in the input dataframe.
Note:
The parquet file can be compressed in which case the compression will be inferred from the file.
The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".
"""
SUPPORTED_FILTERS_OPS = {"==", "=", ">", ">=", "<", "<=", "!=", "in", "not in"}
def __init__(self, file_name: str, file_dir: str=".", columns: Optional[Sequence[str]]=None, filters:Optional[Union[List[Tuple], List[List[Tuple]]]]=None, regex: bool=False, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(
file_name=file_name, file_dir=file_dir, regex=regex, named_output=named_output,
name=name, description=description, strict=strict)
self.columns = columns
self.filters = self._get_filters(filters) if filters is not None else None
def _raise_filters_invalid(self, error: str) -> NoReturn:
raise ValueError(f"Invalid filters. It must be a List[Tuple] or List[List[Tuple]] with each Tuple being (column, op, value): {error}")
def _validate_tuple(self, tpl):
if len(tpl) != 3 or not isinstance(tpl[0], str) or not isinstance(tpl[1], str):
self._raise_filters_invalid(f"Third level expected a list/tuple (cond, op, value), got: {tpl}.")
op = tpl[1]
if op not in self.SUPPORTED_FILTERS_OPS:
self._raise_filters_invalid(f"Invalid operator {op} in {tpl}. Must be one of: {self.SUPPORTED_FILTERS_OPS}.")
value = tpl[2]
if op in ("in", "not in"):
if not isinstance(value, (list, tuple, set)):
self._raise_filters_invalid(f"Invalid value type for {value} for {op} operand in {tpl}. Must be list/tuple/set.")
else:
value = list(value)
return (tpl[0], op, value)
def _get_filters(self, filters):
lst = []
if isinstance(filters, (list, tuple)):
if not filters:
return None
for filter2 in filters:
if isinstance(filter2, (list, tuple)) and filter2:
if len(filter2) == 3 and isinstance(filter2[0], str):
# List[Tuple] form
tpl = self._validate_tuple(filter2)
lst.append(tpl)
else:
lst2 = []
for filter3 in filter2:
if isinstance(filter3, (list, tuple)) and filter3:
tpl = self._validate_tuple(filter3)
lst2.append(tpl)
else:
self._raise_filters_invalid(f"Third level expected a list/tuple, got: {filter3}.")
lst.append(lst2)
else:
self._raise_filters_invalid(f"Second level expected a list/tuple, got: {filter2}.")
else:
self._raise_filters_invalid(f"Top level expected a list/tuple, got: {filters}")
return lst
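To illustrate the List[List[Tuple]] form accepted by _get_filters, a sketch expressing (A >= 10 AND B == True) OR (C in {1, 2, 3}); the file name, columns and values are hypothetical::

    rule = ReadParquetFileRule(
        "data.parquet", "/home/myuser/",
        filters=[
            [("A", ">=", 10), ("B", "==", True)],   # AND-ed inner list
            [("C", "in", [1, 2, 3])],               # OR-ed with the list above
        ],
    )
    rule.apply(data)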
WriteCSVFileRule (BaseWriteFileRule)
¶
Writes an existing dataframe to a csv file (optionally compressed) on disk.
The rule is a final rule, which means it produces no additional outputs; it takes any of the existing outputs and writes it to disk.
Basic usage::
# writes a file data.csv and persists the main output of the previous rule to it
rule = WriteCSVFileRule("data.csv", "/home/myuser/")
rule.apply(data)

# writes a file test_data.csv and persists the dataframe named input_data into it
rule = WriteCSVFileRule("test_data.csv", "/home/myuser/", named_input="input_data")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_name |
str |
The name of the csv file to write to disk. It will be written in the directory specified by the file_dir parameter. |
required |
file_dir |
str |
The file directory where the file_name should be written. Defaults to . (ie the current directory). |
'.' |
separator |
str |
The single character to separate values in the csv file. Defaults to , (comma). |
',' |
header |
bool |
When True, the first line will contain the columns separated by the separator. When False, the columns will not be written and the first line contains data. Defaults to True. |
True |
compression |
Optional[str] |
Compress the csv file using a supported compression algorithm. Optional. When the compression is specified, the file_name must end with the extension associated with that compression format. The following options are supported: zip - file_name must end with .zip (e.g. output.csv.zip), will produce a zipped csv file; gzip - file_name must end with .gz (e.g. output.csv.gz), will produce a gzipped csv file; bz2 - file_name must end with .bz2 (e.g. output.csv.bz2), will produce a bzipped csv file; xz - file_name must end with .xz (e.g. output.csv.xz), will produce an xz-compressed csv file |
None |
named_input |
Optional[str] |
Select by name the dataframe to write from the input data. Optional. When not specified, the main output of the previous rule will be written. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True. |
True |
Source code in etlrules/backends/common/io/files.py
class WriteCSVFileRule(BaseWriteFileRule):
""" Writes an existing dataframe to a csv file (optionally compressed) on disk.
The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to disk.
Basic usage::
# writes a file data.csv and persists the main output of the previous rule to it
rule = WriteCSVFileRule("data.csv", "/home/myuser/")
rule.apply(data)
# writes a file test_data.csv and persists the dataframe named input_data into it
rule = WriteCSVFileRule("test_data.csv", "/home/myuser/", named_input="input_data")
rule.apply(data)
Args:
file_name: The name of the csv file to write to disk. It will be written in the directory
specified by the file_dir parameter.
file_dir: The file directory where the file_name should be written.
Defaults to . (ie the current directory).
separator: The single character to separate values in the csv file. Defaults to , (comma).
header: When True, the first line will contain the columns separated by the separator.
When False, the columns will not be written and the first line contains data.
Defaults to True.
compression: Compress the csv file using a supported compression algorithm. Optional.
When the compression is specified, the file_name must end with the extension associated with that
compression format. The following options are supported:
zip - file_name must end with .zip (e.g. output.csv.zip), will produce a zipped csv file
gzip - file_name must end with .gz (e.g. output.csv.gz), will produce a gzipped csv file
bz2 - file_name must end with .bz2 (e.g. output.csv.bz2), will produce a bzipped csv file
xz - file_name must end with .xz (e.g. output.csv.xz), will produce an xz-compressed csv file
named_input (Optional[str]): Select by name the dataframe to write from the input data.
Optional. When not specified, the main output of the previous rule will be written.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True.
"""
COMPRESSIONS = {
'zip': '.zip',
'gzip': '.gz',
'bz2': '.bz2',
'xz': '.xz',
}
def __init__(self, file_name: str, file_dir: str=".", separator: str=",", header: bool=True, compression: Optional[str]=None, named_input: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(
file_name=file_name, file_dir=file_dir, named_input=named_input,
name=name, description=description, strict=strict)
self.separator = separator
self.header = header
assert compression is None or compression in self.COMPRESSIONS.keys(), f"Unsupported compression '{compression}'. It must be one of: {self.COMPRESSIONS.keys()}."
if compression:
assert file_name.endswith(self.COMPRESSIONS[compression]), f"The file name {file_name} must have the extension {self.COMPRESSIONS[compression]} when the compression is set to {compression}."
self.compression = compression
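Given the compression-to-extension mapping enforced in the constructor, a sketch of writing a gzipped csv; the file names and directory are hypothetical::

    rule = WriteCSVFileRule(
        "output.csv.gz",        # must end in .gz because compression="gzip"
        file_dir="/data/out",
        compression="gzip",
        named_input="survey",   # write the dataframe registered as "survey"
    )
    rule.apply(data)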
WriteParquetFileRule (BaseWriteFileRule)
¶
Writes an existing dataframe to a parquet file on disk.
The rule is a final rule, which means it produces no additional outputs; it takes any of the existing outputs and writes it to disk.
Basic usage::
# writes a file data.parquet and persists the main output of the previous rule to it
rule = WriteParquetFileRule("data.parquet", "/home/myuser/")
rule.apply(data)

# writes a file test_data.parquet and persists the dataframe named input_data into it
rule = WriteParquetFileRule("test_data.parquet", "/home/myuser/", named_input="input_data")
rule.apply(data)
Parameters:
Name | Type | Description | Default |
---|---|---|---|
file_name |
str |
The name of the parquet file to write to disk. It will be written in the directory specified by the file_dir parameter. |
required |
file_dir |
str |
The file directory where the file_name should be written. Defaults to . (ie the current directory). |
'.' |
compression |
Optional[str] |
Compress the parquet file using a supported compression algorithm. Optional. The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd". |
None |
named_input |
Optional[str] |
Select by name the dataframe to write from the input data. Optional. When not specified, the main output of the previous rule will be written. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True. |
True |
Source code in etlrules/backends/common/io/files.py
class WriteParquetFileRule(BaseWriteFileRule):
""" Writes an existing dataframe to a parquet file on disk.
The rule is a final rule, which means it produces no additional outputs, it takes any of the existing outputs and writes it to disk.
Basic usage::
# writes a file data.parquet and persists the main output of the previous rule to it
rule = WriteParquetFileRule("data.parquet", "/home/myuser/")
rule.apply(data)
# writes a file test_data.parquet and persists the dataframe named input_data into it
rule = WriteParquetFileRule("test_data.parquet", "/home/myuser/", named_input="input_data")
rule.apply(data)
Args:
file_name: The name of the parquet file to write to disk. It will be written in the directory
specified by the file_dir parameter.
file_dir: The file directory where the file_name should be written.
Defaults to . (ie the current directory).
compression: Compress the parquet file using a supported compression algorithm. Optional.
The following compression algorithms are supported: "snappy", "gzip", "brotli", "lz4", "zstd".
named_input (Optional[str]): Select by name the dataframe to write from the input data.
Optional. When not specified, the main output of the previous rule will be written.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True.
"""
COMPRESSIONS = ("snappy", "gzip", "brotli", "lz4", "zstd")
def __init__(self, file_name: str, file_dir: str=".", compression: Optional[str]=None, named_input: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(
file_name=file_name, file_dir=file_dir, named_input=named_input,
name=name, description=description, strict=strict)
assert compression is None or compression in self.COMPRESSIONS, f"Unsupported compression '{compression}'. It must be one of: {self.COMPRESSIONS}."
self.compression = compression
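A sketch of the parquet counterpart, picking one of the supported compression algorithms; the file names and directory are hypothetical::

    rule = WriteParquetFileRule(
        "output.parquet",
        file_dir="/data/out",
        compression="zstd",     # one of: snappy, gzip, brotli, lz4, zstd
        named_input="survey",
    )
    rule.apply(data)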
joins
¶
BaseJoinRule (BinaryOpBaseRule)
¶
Source code in etlrules/backends/common/joins.py
class BaseJoinRule(BinaryOpBaseRule):
JOIN_TYPE = None
def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], key_columns_left: Iterable[str], key_columns_right: Optional[Iterable[str]]=None, suffixes: Iterable[Optional[str]]=(None, "_r"), named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input_left=named_input_left, named_input_right=named_input_right, named_output=named_output, name=name, description=description, strict=strict)
assert isinstance(key_columns_left, (list, tuple)) and key_columns_left and all(isinstance(col, str) for col in key_columns_left), "JoinRule: key_columns_left must be a non-empty list or tuple of str column names"
self.key_columns_left = [col for col in key_columns_left]
self.key_columns_right = [col for col in key_columns_right] if key_columns_right is not None else None
assert isinstance(suffixes, (list, tuple)) and len(suffixes) == 2 and all(s is None or isinstance(s, str) for s in suffixes), "The suffixes must be a list or tuple of 2 elements"
self.suffixes = suffixes
def _get_key_columns(self):
return self.key_columns_left, self.key_columns_right or self.key_columns_left
def do_apply(self, left_df, right_df):
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def apply(self, data):
assert self.JOIN_TYPE in {"left", "right", "outer", "inner"}
super().apply(data)
left_df = self._get_input_df_left(data)
right_df = self._get_input_df_right(data)
left_on, right_on = self._get_key_columns()
if not set(left_on) <= set(left_df.columns):
raise MissingColumnError(f"Missing columns in join in the left dataframe: {set(left_on) - set(left_df.columns)}")
if not set(right_on) <= set(right_df.columns):
raise MissingColumnError(f"Missing columns in join in the right dataframe: {set(right_on) - set(right_df.columns)}")
df = self.do_apply(left_df, right_df)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the dataframes it needs from the input data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/joins.py
def apply(self, data):
assert self.JOIN_TYPE in {"left", "right", "outer", "inner"}
super().apply(data)
left_df = self._get_input_df_left(data)
right_df = self._get_input_df_right(data)
left_on, right_on = self._get_key_columns()
if not set(left_on) <= set(left_df.columns):
raise MissingColumnError(f"Missing columns in join in the left dataframe: {set(left_on) - set(left_df.columns)}")
if not set(right_on) <= set(right_df.columns):
raise MissingColumnError(f"Missing columns in join in the right dataframe: {set(right_on) - set(right_df.columns)}")
df = self.do_apply(left_df, right_df)
self._set_output_df(data, df)
InnerJoinRule (BaseJoinRule)
¶
Performs a database-style inner join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.
An inner join specifies that only those rows that have key values in both left and right will be copied over and merged into the result data frame. Any rows without corresponding values on the other side (be it left or right) will be dropped from the result.
Examples:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
named_input_left |
Optional[str] |
Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
named_input_right |
Optional[str] |
Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
key_columns_left |
Iterable[str] |
A list or tuple of column names to join on (columns in the left data frame) |
required |
key_columns_right |
Optional[Iterable[str]] |
A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too. |
required |
suffixes |
Iterable[Optional[str]] |
A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns). |
required |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
required |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
required |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
required |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
required |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if any columns (keys) are missing from any of the two input data frames. |
Source code in etlrules/backends/common/joins.py
class InnerJoinRule(BaseJoinRule):
""" Performs a database-style inner join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a
database style join or a merge of the two, with the resulting columns coming from both
dataframes.
For example, if the left dataframe has two columns A, B and the right dataframe has two
columns A, C, and assuming A is the key column the result will have three columns A, B, C.
The rows that have the same value in the key column A will be merged on the same row in the
result dataframe.
An inner join specifies that only those rows that have key values in both left and right
will be copied over and merged into the result data frame. Any rows without corresponding
values on the other side (be it left or right) will be dropped from the result.
Example:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
Args:
named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
If not set or set to None, the key_columns_left is used on the right dataframe too.
suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
result data frame for those columns that have the same name (and are not key columns).
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
"""
JOIN_TYPE = "inner"
LeftJoinRule (BaseJoinRule)
¶
Performs a database-style left join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.
A left join specifies that all the rows in the left dataframe will be present in the result, irrespective of whether there's a corresponding row with the same values in the key columns in the right dataframe. The right columns will be populated with NaNs/None when there is no corresponding row on the right.
Examples:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
| 2 | b | NA |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
named_input_left |
Optional[str] |
Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
named_input_right |
Optional[str] |
Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
key_columns_left |
Iterable[str] |
A list or tuple of column names to join on (columns in the left data frame) |
required |
key_columns_right |
Optional[Iterable[str]] |
A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too. |
required |
suffixes |
Iterable[Optional[str]] |
A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns). |
required |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
required |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
required |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
required |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
required |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if any columns (keys) are missing from any of the two input data frames. |
Source code in etlrules/backends/common/joins.py
class LeftJoinRule(BaseJoinRule):
""" Performs a database-style left join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a
database style join or a merge of the two, with the resulting columns coming from both
dataframes.
For example, if the left dataframe has two columns A, B and the right dataframe has two
columns A, C, and assuming A is the key column the result will have three columns A, B, C.
The rows that have the same value in the key column A will be merged on the same row in the
result dataframe.
A left join specifies that all the rows in the left dataframe will be present in the result,
irrespective of whether there's a corresponding row with the same values in the key columns in
the right dataframe. The right columns will be populated with NaNs/None when there is no
corresponding row on the right.
Example:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
| 2 | b | NA |
Args:
named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
If not set or set to None, the key_columns_left is used on the right dataframe too.
suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
result data frame for those columns that have the same name (and are not key columns).
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
"""
JOIN_TYPE = "left"
OuterJoinRule (BaseJoinRule)
¶
Performs a database-style outer join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.
An outer join specifies that all the rows in the both left and right dataframes will be present in the result, irrespective of whether there's a corresponding row with the same values in the key columns in the other dataframe. The missing side will have its columns populated with NA when the rows are missing.
Examples:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
| 2 | b | NA |
| 3 | NA | d |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
named_input_left |
Optional[str] |
Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
named_input_right |
Optional[str] |
Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
key_columns_left |
Iterable[str] |
A list or tuple of column names to join on (columns in the left data frame) |
required |
key_columns_right |
Optional[Iterable[str]] |
A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too. |
required |
suffixes |
Iterable[Optional[str]] |
A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns). |
required |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
required |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
required |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
required |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
required |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if any columns (keys) are missing from any of the two input data frames. |
Source code in etlrules/backends/common/joins.py
class OuterJoinRule(BaseJoinRule):
""" Performs a database-style left join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a
database style join or a merge of the two, with the resulting columns coming from both
dataframes.
For example, if the left dataframe has two columns A, B and the right dataframe has two
columns A, C, and assuming A is the key column the result will have three columns A, B, C.
The rows that have the same value in the key column A will be merged on the same row in the
result dataframe.
An outer join specifies that all the rows in the both left and right dataframes will be present
in the result, irrespective of whether there's a corresponding row with the same values in the
key columns in the other dataframe. The missing side will have its columns populated with NA
when the rows are missing.
Example:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
| 2 | b | NA |
| 3 | NA | d |
Args:
named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
If not set or set to None, the key_columns_left is used on the right dataframe too.
suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
result data frame for those columns that have the same name (and are not key columns).
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
"""
JOIN_TYPE = "outer"
RightJoinRule (BaseJoinRule)
¶
Performs a database-style right join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a database style join or a merge of the two, with the resulting columns coming from both dataframes. For example, if the left dataframe has two columns A, B and the right dataframe has two columns A, C, and assuming A is the key column, the result will have three columns A, B, C. The rows that have the same value in the key column A will be merged on the same row in the result dataframe.
A right join specifies that all the rows in the right dataframe will be present in the result, irrespective of whether there's a corresponding row with the same values in the key columns in the left dataframe. The left columns will be populated with NA when there is no corresponding row on the left.
Examples:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
| 3 | NA | d |
Note
A right join is equivalent to a left join with the dataframes inverted, i.e.:
left_df <left_join> right_df
is equivalent to
right_df <right_join> left_df
although the order of the rows will be different.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
named_input_left |
Optional[str] |
Which dataframe to use as the input on the left side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
named_input_right |
Optional[str] |
Which dataframe to use as the input on the right side of the join. When set to None, the input is taken from the main output of the previous rule. Set it to a string value, the name of an output dataframe of a previous rule. |
required |
key_columns_left |
Iterable[str] |
A list or tuple of column names to join on (columns in the left data frame) |
required |
key_columns_right |
Optional[Iterable[str]] |
A list or tuple of column names to join on (columns in the right data frame). If not set or set to None, the key_columns_left is used on the right dataframe too. |
required |
suffixes |
Iterable[Optional[str]] |
A list or tuple of two values which will be set as suffixes for the columns in the result data frame for those columns that have the same name (and are not key columns). |
required |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
required |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
required |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
required |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
required |
Exceptions:
Type | Description |
---|---|
MissingColumnError |
raised if any columns (keys) are missing from any of the two input data frames. |
Source code in etlrules/backends/common/joins.py
class RightJoinRule(BaseJoinRule):
""" Performs a database-style left join operation on two data frames.
A join involves two data frames left_df <join> right_df with the result performing a
database style join or a merge of the two, with the resulting columns coming from both
dataframes.
For example, if the left dataframe has two columns A, B and the right dataframe has two
columns A, C, and assuming A is the key column the result will have three columns A, B, C.
The rows that have the same value in the key column A will be merged on the same row in the
result dataframe.
A right join specifies that all the rows in the right dataframe will be present in the result,
irrespective of whether there's a corresponding row with the same values in the key columns in
the left dataframe. The left columns will be populated with NA when there is no
corresponding row on the left.
Example:
left dataframe::
| A | B |
| 1 | a |
| 2 | b |
right dataframe::
| A | C |
| 1 | c |
| 3 | d |
result (key columns=["A"])::
| A | B | C |
| 1 | a | c |
| 3 | NA | d |
Note:
A right join is equivalent to a left join with the dataframes inverted, ie:
left_df <left_join> right_df
is equivalent to
right_df <right_join> left_df
although the order of the rows will be different.
Args:
named_input_left (Optional[str]): Which dataframe to use as the input on the left side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
named_input_right (Optional[str]): Which dataframe to use as the input on the right side of the join.
When set to None, the input is taken from the main output of the previous rule.
Set it to a string value, the name of an output dataframe of a previous rule.
key_columns_left (Iterable[str]): A list or tuple of column names to join on (columns in the left data frame)
key_columns_right (Optional[Iterable[str]]): A list or tuple of column names to join on (columns in the right data frame).
If not set or set to None, the key_columns_left is used on the right dataframe too.
suffixes (Iterable[Optional[str]]): A list or tuple of two values which will be set as suffixes for the columns in the
result data frame for those columns that have the same name (and are not key columns).
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if any columns (keys) are missing from any of the two input data frames.
"""
JOIN_TYPE = "right"
newcolumns
¶
AddNewColumnRule (UnaryOpBaseRule)
¶
Adds a new column and sets it to the value of an evaluated expression.
Example::
Given df:
| A | B |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
AddNewColumnRule("Sum", "df['A'] + df['B']").apply(df)
Result::
| A | B | Sum |
| 1 | 2 | 3 |
| 2 | 3 | 5 |
| 3 | 4 | 7 |
Parameters:
Name | Type | Description | Default |
---|---|---|---|
output_column |
str |
The name of the new column to be added. |
required |
column_expression |
str |
An expression that gets evaluated and produces the value for the new column. The syntax: df["EXISTING_COL"] can be used in the expression to refer to other columns in the dataframe. |
required |
column_type |
Optional[str] |
An optional type to convert the result to. If not specified, the type is determined from the output of the expression, which can sometimes differ based on the backend. If the input dataframe is empty, this type ensures the column will be of the specified type, rather than default to string type. |
None |
named_input |
Optional[str] |
Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. |
None |
named_output |
Optional[str] |
Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. |
None |
name |
Optional[str] |
Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. |
None |
description |
Optional[str] |
Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. |
None |
strict |
bool |
When set to True, the rule does a stricter validation. Default: True |
True |
Exceptions:
Type | Description |
---|---|
ColumnAlreadyExistsError |
raised in strict mode only if a column with the same name already exists in the dataframe. |
ExpressionSyntaxError |
raised if the column expression has a Python syntax error. |
UnsupportedTypeError |
raised if the column_type parameter is specified and not supported. |
TypeError |
raised if an operation is not supported between the types involved. raised when the column type is specified but the conversion to that type fails. |
NameError |
raised if an unknown variable is used |
KeyError |
raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN']) |
Note
The implementation will try to use dataframe operations for performance, but when those are not supported it will fallback to row level operations.
Note
NAs are treated slightly differently between dataframe-level and row-level operations. In dataframe-level operations, NAs in the inputs will make the result NA. In row-level operations, NAs will generally raise a TypeError. To avoid such behavior, fill the NAs before performing operations.
Source code in etlrules/backends/common/newcolumns.py
class AddNewColumnRule(UnaryOpBaseRule):
""" Adds a new column and sets it to the value of an evaluated expression.
Example::
Given df:
| A | B |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
> AddNewColumnRule("Sum", "df['A'] + df['B']").apply(df)
Result::
| A | B | Sum |
| 1 | 2 | 3 |
| 2 | 3 | 5 |
| 3 | 4 | 7 |
Args:
output_column: The name of the new column to be added.
column_expression: An expression that gets evaluated and produces the value for the new column.
The syntax: df["EXISTING_COL"] can be used in the expression to refer to other columns in the dataframe.
column_type: An optional type to convert the result to. If not specified, the type is determined from the
output of the expression, which can sometimes differ based on the backend.
If the input dataframe is empty, this type ensures the column will be of the specified type, rather than
default to string type.
named_input: Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
ColumnAlreadyExistsError: raised in strict mode only if a column with the same name already exists in the dataframe.
ExpressionSyntaxError: raised if the column expression has a Python syntax error.
UnsupportedTypeError: raised if the column_type parameter is specified and not supported.
TypeError: raised if an operation is not supported between the types involved. raised when the column type is specified
but the conversion to that type fails.
NameError: raised if an unknown variable is used
KeyError: raised if you try to use an unknown column (i.e. df['ANY_UNKNOWN_COLUMN'])
Note:
The implementation will try to use dataframe operations for performance, but when those are not supported it
will fallback to row level operations.
Note:
NA are treated slightly differently between dataframe level operations and row level.
At dataframe level operations, NAs in operations will make the result be NA.
In row level operations, NAs will generally raise a TypeError.
To avoid such behavior, fill the NAs before performing operations.
"""
EXCLUDE_FROM_COMPARE = ('_column_expression', )
def __init__(self, output_column: str, column_expression: str, column_type: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.output_column = output_column
self.column_expression = column_expression
if column_type is not None and column_type not in SUPPORTED_TYPES:
raise UnsupportedTypeError(f"Unsupported column type: '{column_type}'. It must be one of: {SUPPORTED_TYPES}")
self.column_type = column_type
self._column_expression = self.get_column_expression()
def _validate_columns(self, df_columns):
if self.strict and self.output_column in df_columns:
raise ColumnAlreadyExistsError(f"Column {self.output_column} already exists in the input dataframe.")
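A sketch combining a column expression with an explicit column_type, so an empty input still yields a float64 column; the input/output names are hypothetical::

    rule = AddNewColumnRule(
        "Ratio", "df['A'] / df['B']",
        column_type="float64",       # pin the result type
        named_input="input_data",
        named_output="with_ratio",
    )
    rule.apply(data)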
AddRowNumbersRule (UnaryOpBaseRule)
¶
Adds a new column with row numbers.
Example::
Given df:
| A | B |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
AddRowNumbersRule("Row_Number").apply(df)
Result::
| A | B | Row_Number |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
| 3 | 4 | 2 |
Parameters:
Name | Type | Description | Default
---|---|---|---
output_column | str | The name of the new column to be added. | required
start | int | The value to start the numbers from. Defaults to 0. | 0
step | int | The increment to be used between row numbers. Defaults to 1. | 1
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
ColumnAlreadyExistsError | raised in strict mode only if a column with the same name already exists in the dataframe.
Source code in etlrules/backends/common/newcolumns.py
class AddRowNumbersRule(UnaryOpBaseRule):
""" Adds a new column with row numbers.
Example::
Given df:
| A | B |
| 1 | 2 |
| 2 | 3 |
| 3 | 4 |
> AddRowNumbersRule("Row_Number").apply(df)
Result::
| A | B | Row_Number |
| 1 | 2 | 0 |
| 2 | 3 | 1 |
| 3 | 4 | 2 |
Args:
output_column: The name of the new column to be added.
start: The value to start the numbers from. Defaults to 0.
step: The increment to be used between row numbers. Defaults to 1.
named_input: Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
ColumnAlreadyExistsError: raised in strict mode only if a column with the same name already exists in the dataframe.
"""
def __init__(self, output_column: str, start: int=0, step: int=1, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.output_column = output_column
self.start = start
self.step = step
def _validate_columns(self, df_columns):
if self.strict and self.output_column in df_columns:
raise ColumnAlreadyExistsError(f"Column {self.output_column} already exists in the input dataframe.")
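A short sketch of the start/step parameters (the etlrules.backends.pandas import path is an assumption; substitute your concrete backend)::

    from etlrules.backends.pandas import AddRowNumbersRule

    # numbers the rows 1, 3, 5, ... instead of the default 0, 1, 2, ...
    rule = AddRowNumbersRule("Row_Number", start=1, step=2)
    rule.apply(data)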
numeric
¶
AbsRule (BaseAssignColumnRule)
¶
Converts numbers to absolute values.
Basic usage::
    rule = AbsRule("col_A")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | The name of the column to convert to absolute values. | required
output_column | Optional[str] | An optional new column with the absolute values. If provided, the existing column is unchanged and a new column is created with the absolute values. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/numeric.py
class AbsRule(BaseAssignColumnRule):
""" Converts numbers to absolute values.
Basic usage::
rule = AbsRule("col_A")
rule.apply(data)
Args:
input_column (str): The name of the column to convert to absolute values.
output_column (Optional[str]): An optional new column with the absolute values.
If provided the existing column is unchanged and a new column is created with the absolute values.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
RoundRule (BaseAssignColumnRule)
¶
Rounds the values in a column to the specified number of decimal places.
Basic usage::
    # rounds Col_A to 2dps
    rule = RoundRule("Col_A", 2)
    rule.apply(data)

    # rounds Col_B to 0dps and outputs the results into Col_C; Col_B remains unchanged
    rule = RoundRule("Col_B", 0, output_column="Col_C")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A column with values to round as per the specified scale. | required
scale | Union[int, Sequence[int]] | An integer specifying the number of decimal places to round to. | required
output_column | Optional[str] | An optional name for a new column with the rounded values. If provided, the existing column is unchanged and the new column is created with the results. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/numeric.py
class RoundRule(BaseAssignColumnRule):
""" Rounds a set of columns to specified decimal places.
Basic usage::
# rounds Col_A to 2dps
rule = RoundRule("Col_A", 2)
rule.apply(data)
# rounds Col_B to 0dps and outputs the results into Col_C; Col_B remains unchanged
rule = RoundRule("Col_B", 0, output_column="Col_C")
rule.apply(data)
Args:
input_column: A column with values to round as per the specified scale.
scale: An integer specifying the number of decimal places to round to.
output_column (Optional[str]): An optional name for a new column with the rounded values.
If provided, the existing column is unchanged and the new column is created with the results.
If not provided, the result is updated in place.
named_input: Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
def __init__(self, input_column: str, scale: Union[int, Sequence[int]], output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
assert isinstance(scale, int), "scale must be an integer value"
self.scale = scale
strings
¶
StrCapitalizeRule (BaseAssignColumnRule)
¶
Converts the values in a string column to capitalized values.
Capitalization will convert the first letter in the string to upper case and the rest of the letters to lower case.
Basic usage::
    rule = StrCapitalizeRule("col_A")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A string column with the values to capitalize. | required
output_column | Optional[str] | An optional new name for the column with the capitalized values. If provided, the existing column is unchanged, and a new column is created with the capitalized values. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrCapitalizeRule(BaseAssignColumnRule):
""" Converts the values in a string column to capitalized values.
Capitalization will convert the first letter in the string to upper case and the rest of the letters
to lower case.
Basic usage::
rule = StrCapitalizeRule("col_A")
rule.apply(data)
Args:
input_column (str): A string column with the values to capitalize.
output_column (Optional[str]): An optional new name for the column with the capitalized values.
If provided, the existing column is unchanged, and a new column is created with the capitalized values.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
StrExtractRule (UnaryOpBaseRule, ColumnsInOutMixin)
¶
Extract substrings from strings columns using regular expressions.
Basic usage::
    # extracts the number between start_ and _end
    # ie: for an input value of start_1234_end - will extract 1234 in col_A
    rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end")
    rule.apply(data)

    # extracts with multiple groups, extracting the single digit at the end as well
    # for an input value of start_1234_end_9, col_1 will extract 1234, col_2 will extract 9
    rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end_([\d])", output_columns=["col_1", "col_2"])
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A column to extract data from. | required
regular_expression | str | The regular expression used to extract data. The regular expression must have 1 or more groups - ie sections between brackets. The groups do the actual extraction of data. If there is a single group, then the column can be modified in place (ie no output_columns are needed) but if there are multiple groups, then output_columns must be specified as each group will be extracted in a new output column. | required
keep_original_value | bool | Only used in case there isn't a match; it specifies whether NA or the original value should be used in the output. Default: False. If the regular expression has multiple groups and therefore multiple output_columns, only the first output column will keep the original value, the rest will be populated with NA. | False
output_columns | Optional[Iterable[str]] | A list of new names for the result columns. Optional. If provided, it must have one output_column per regular expression group. For example, given the regular expression "a_([\d])_([\d])" with 2 groups, the output_columns must have 2 columns (one per group) - for example ["out_1", "out_2"]. The existing columns are unchanged, and new columns are created with the extracted values. If not provided, the result is updated in place (only possible if the regular expression has a single group). | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if an output_column already exists in the dataframe.
ValueError | raised if output_columns is provided and not the same length as the number of groups in the regular expression.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrExtractRule(UnaryOpBaseRule, ColumnsInOutMixin):
r""" Extract substrings from strings columns using regular expressions.
Basic usage::
# extracts the number between start_ and _end
# ie: for an input value of start_1234_end - will extract 1234 in col_A
rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end")
rule.apply(data)
# extracts with multiple groups, extracting the single digit at the end as well
# for an input value of start_1234_end_9, col_1 will extract 1234, col_2 will extract 9
rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end_([\d])", output_columns=["col_1", "col_2"])
rule.apply(data)
Args:
input_column (str): A column to extract data from.
regular_expression: The regular expression used to extract data.
The regular expression must have 1 or more groups - ie sections between brackets.
The groups do the actual extraction of data.
If there is a single group, then the column can be modified in place (ie no output_columns are needed) but
if there are multiple groups, then output_columns must be specified as each group will be extracted in a new
output column.
keep_original_value: Only used in case there isn't a match; it specifies whether NA or the original value should be used in the output.
Default: False.
If the regular expression has multiple groups and therefore multiple output_columns, only the first output column
will keep the original value, the rest will be populated with NA.
output_columns (Optional[Iterable[str]]): A list of new names for the result columns.
Optional. If provided, it must have one output_column per regular expression group.
For example, given the regular expression "a_([\d])_([\d])" with 2 groups, then
the output columns must have 2 columns (one per group) - for example ["out_1", "out_2"].
The existing columns are unchanged, and new columns are created with extracted values.
If not provided, the result is updated in place (only possible if the regular expression has a single group).
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if an output_column already exists in the dataframe.
ValueError: raised if output_columns is provided and not the same length as the number of groups in the regular expression.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
def __init__(self, input_column: str, regular_expression: str, keep_original_value: bool=False, output_columns:Optional[Iterable[str]]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
self.input_column = input_column
self.output_columns = [out_col for out_col in output_columns] if output_columns else None
self.regular_expression = regular_expression
self._compiled_expr = re.compile(regular_expression)
groups = self._compiled_expr.groups
assert groups > 0, "The regular expression must have at least 1 group - ie a section in () - which gets extracted."
if self.output_columns is not None:
if len(self.output_columns) != groups:
raise ValueError(f"The regular expression has {groups} group(s), the output_columns must have {groups} column(s).")
if groups > 1 and self.output_columns is None:
raise ValueError(f"The regular expression has more than 1 groups in which case output_columns must be specified (one per group).")
self.keep_original_value = keep_original_value
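A sketch of keep_original_value (the etlrules.backends.pandas import path is an assumption)::

    from etlrules.backends.pandas import StrExtractRule

    # when the pattern doesn't match, the original value is kept in the
    # output column instead of NA
    rule = StrExtractRule("col_A", regular_expression=r"start_([\d]*)_end",
                          keep_original_value=True)
    rule.apply(data)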
StrLowerRule (BaseAssignColumnRule)
¶
Converts the values in a string column to lower case.
Basic usage::
    rule = StrLowerRule("col_A")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A string column to convert to lower case. | required
output_column | Optional[str] | An optional new name for the column with the lower case values. If provided, the existing column is unchanged, and a new column is created with the lower case values. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrLowerRule(BaseAssignColumnRule):
""" Converts the values in a string column to lower case.
Basic usage::
rule = StrLowerRule("col_A")
rule.apply(data)
Args:
input_column (str): A string column to convert to lower case.
output_column (Optional[str]): An optional new name for the column with the lower case values.
If provided, the existing column is unchanged, and a new column is created with the lower case values.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
StrPadRule (BaseAssignColumnRule)
¶
Makes strings of a given width (justifies) by padding left or right with a fill character.
Basic usage::
    # a value of ABCD will become ABCD....
    rule = StrPadRule("col_A", width=8, fill_character=".", how="right")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A string column to be padded. | required
width | int | Pad with the fill_character to this width. | required
fill_character | str | The character to pad with. | required
how | Literal['left', 'right'] | How the padding should be done. One of left or right. Left pads at the beginning of the string, right pads at the end. Default: left. | 'left'
output_column | Optional[str] | An optional new column with the padded results. If provided, the existing column is unchanged and a new column is created with the results. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrPadRule(BaseAssignColumnRule):
""" Makes strings of a given width (justifies) by padding left or right with a fill character.
Basic usage::
# a value of ABCD will become ABCD....
rule = StrPadRule("col_A", width=8, fill_character=".", how="right")
rule.apply(data)
Args:
input_column (str): A string column to be padded.
width: Pad with the fill_character to this width.
fill_character: The character to pad with.
how: How the padding should be done. One of left or right.
Left pads at the beginning of the string, right pads at the end. Default: left.
output_column (Optional[str]): An optional new column with the padded results.
If provided, the existing column is unchanged and a new column is created with the results.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
PAD_LEFT = 'left'
PAD_RIGHT = 'right'
def __init__(self, input_column: str, width: int, fill_character: str, how: Literal[PAD_LEFT, PAD_RIGHT]=PAD_LEFT, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
assert how in (self.PAD_LEFT, self.PAD_RIGHT), f"Unknown how parameter {how}. It must be one of: {(self.PAD_LEFT, self.PAD_RIGHT)}"
self.how = how
self.width = width
self.fill_character = fill_character
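A common use is zero-padding numeric strings; a short sketch (the etlrules.backends.pandas import path is an assumption)::

    from etlrules.backends.pandas import StrPadRule

    # left-pads to width 6 with zeros, e.g. "42" becomes "000042"
    rule = StrPadRule("col_A", width=6, fill_character="0", how="left")
    rule.apply(data)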
StrSplitRejoinRule (BaseAssignColumnRule)
¶
Splits the values in a string column into substrings based on a string separator, then rejoins them with a new separator, optionally sorting the substrings first.
Note
The output is an array of substrings which can optionally be limited via the limit parameter to only
include the first <limit> number of substrings.
Basic usage::
    # splits the col_A column on ,
    # "b,d;a,c" will be split and rejoined as "b|c|d;a"
    rule = StrSplitRejoinRule("col_A", separator=",", new_separator="|", sort="ascending")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | The column to split and rejoin. | required
separator | str | A literal value to split the string by. | required
limit | Optional[int] | A limit to the number of substrings. If specified, only the first <limit> substrings are returned, plus an additional remainder. At most limit + 1 substrings are returned, with the last being the remainder. | None
new_separator | str | A new separator used to rejoin the substrings. | ','
sort | Optional[Literal['ascending', 'descending']] | Optionally sorts the substrings before rejoining using the new_separator. It can be set to either ascending or descending, sorting the substrings accordingly. When the value is set to None, there is no sorting. | None
output_column | Optional[str] | An optional new column to hold the result. If provided, the existing column is unchanged and a new column is created with the result. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrSplitRejoinRule(BaseAssignColumnRule):
""" Splits the values in a string column into an array of substrings based on a string separator then rejoin with a new separator, optionally sorting the substrings.
Note:
The output is an array of substrings which can optionally be limited via the limit parameter to only
include the first <limit> number of substrings.
Basic usage::
# splits the col_A column on ,
# "b,d;a,c" will be split and rejoined as "b|c|d;a"
rule = StrSplitRejoinRule("col_A", separator=",", new_separator="|", sort="ascending")
rule.apply(data)
Args:
input_column (str): The column to split and rejoin.
separator: A literal value to split the string by.
limit: A limit to the number of substrings. If specified, only the first <limit> substrings are returned
plus an additional remainder. At most, limit + 1 substrings are returned, with the last being the remainder.
new_separator: A new separator used to rejoin the substrings.
sort: Optionally sorts the substrings before rejoining using the new_separator.
It can be set to either ascending or descending, sorting the substrings accordingly.
When the value is set to None, there is no sorting.
output_column (Optional[str]): An optional new column to hold the result.
If provided, the existing column is unchanged and a new column is created with the result.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
SORT_ASCENDING = "ascending"
SORT_DESCENDING = "descending"
def __init__(self, input_column: str, separator: str, limit:Optional[int]=None, new_separator:str=",", sort:Optional[Literal[SORT_ASCENDING, SORT_DESCENDING]]=None, output_column:Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
assert separator and isinstance(separator, str)
self.separator = separator
self.limit = limit
assert isinstance(new_separator, str) and new_separator
self.new_separator = new_separator
assert sort in (None, self.SORT_ASCENDING, self.SORT_DESCENDING)
self.sort = sort
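A sketch combining limit with a descending sort (the etlrules.backends.pandas import path is an assumption)::

    from etlrules.backends.pandas import StrSplitRejoinRule

    # splits on ",", keeping at most limit + 1 substrings, sorts them
    # descending and rejoins them on "|"
    rule = StrSplitRejoinRule("col_A", separator=",", limit=2,
                              new_separator="|", sort="descending")
    rule.apply(data)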
StrSplitRule (BaseAssignColumnRule)
¶
Splits a string into an array of substrings based on a string separator.
Note
The output is an array of substrings which can optionally be limited via the limit parameter to only
include the first <limit> number of substrings.
If you need the output to be a string, perhaps joined on a different separator and optionally sorted,
then use the StrSplitRejoinRule rule.
Basic usage::
    # splits the col_A column on ,
    # "a,b;c,d" will be split as ["a", "b;c", "d"]
    rule = StrSplitRule("col_A", separator=",")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A string column to split. | required
separator | str | A literal value to split the string by. | required
limit | Optional[int] | A limit to the number of substrings. If specified, only the first <limit> substrings are returned, plus an additional remainder. At most limit + 1 substrings are returned, with the last being the remainder. | None
output_column | Optional[str] | An optional column to hold the result of the split. If provided, the existing column is unchanged and a new column is created with the result. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrSplitRule(BaseAssignColumnRule):
""" Splits a string into an array of substrings based on a string separator.
Note:
The output is an array of substrings which can optionally be limited via the limit parameter to only
include the first <limit> number of substrings.
If you need the output to be a string, perhaps joined on a different separator and optionally sorted
then use the StrSplitRejoinRule rule.
Basic usage::
# splits the col_A column on ,
# "a,b;c,d" will be split as ["a", "b;c", "d"]
rule = StrSplitRule("col_A", separator=",")
rule.apply(data)
Args:
input_column (str): A string column to split.
separator: A literal value to split the string by.
limit: A limit to the number of substrings. If specified, only the first <limit> substrings are returned
plus an additional remainder. At most, limit + 1 substrings are returned, with the last being the remainder.
output_column (Optional[str]): An optional column to hold the result of the split.
If provided, the existing column is unchanged and a new column is created with the result.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if the input_column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
def __init__(self, input_column: str, separator: str, limit: Optional[int]=None, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
assert separator and isinstance(separator, str)
self.separator = separator
self.limit = limit
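A sketch of the limit parameter (the etlrules.backends.pandas import path is an assumption)::

    from etlrules.backends.pandas import StrSplitRule

    # with limit=2, "a,b,c,d" splits into ["a", "b", "c,d"] - the first
    # two substrings plus the remainder
    rule = StrSplitRule("col_A", separator=",", limit=2)
    rule.apply(data)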
StrStripRule (BaseAssignColumnRule)
¶
Strips leading, trailing or both leading and trailing whitespace or other characters from the values in the input column.
Basic usage::
    rule = StrStripRule("col_A", how="both")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | An input column to strip characters from its values. | required
how | Literal['left', 'right', 'both'] | How the stripping should be done. One of left, right, both. Left strips leading characters, right strips trailing characters and both strips at both ends. | 'both'
characters | Optional[str] | If set, it contains the characters to be stripped. When not specified or when set to None, whitespace is removed. | None
output_column | Optional[str] | An optional new column to hold the results. If provided, the existing column is unchanged and a new column is created with the results. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrStripRule(BaseAssignColumnRule):
""" Strips leading, trailing or both whitespaces or other characters from the values in the input column.
Basic usage::
rule = StrStripRule("col_A", how="both")
rule.apply(data)
Args:
input_column (str): An input column to strip characters from its values.
how: How the stripping should be done. One of left, right, both.
Left strips leading characters, right strips trailing characters and both strips at both ends.
characters: If set, it contains a list of characters to be stripped.
When not specified or when set to None, whitespace is removed.
output_column (Optional[str]): An optional new column to hold the results.
If provided, the existing column is unchanged and a new column is created with the results.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
STRIP_LEFT = 'left'
STRIP_RIGHT = 'right'
STRIP_BOTH = 'both'
def __init__(self, input_column: str, how: Literal[STRIP_LEFT, STRIP_RIGHT, STRIP_BOTH]=STRIP_BOTH, characters: Optional[str]=None, output_column: Optional[str]=None, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(input_column=input_column, output_column=output_column, named_input=named_input, named_output=named_output,
name=name, description=description, strict=strict)
assert how in (self.STRIP_BOTH, self.STRIP_LEFT, self.STRIP_RIGHT), f"Unknown how parameter {how}. It must be one of: {(self.STRIP_BOTH, self.STRIP_LEFT, self.STRIP_RIGHT)}"
self.how = how
self.characters = characters or None
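A sketch of stripping specific characters rather than whitespace (the etlrules.backends.pandas import path is an assumption)::

    from etlrules.backends.pandas import StrStripRule

    # removes any leading/trailing dots and dashes from the values in col_A
    rule = StrStripRule("col_A", how="both", characters=".-")
    rule.apply(data)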
StrUpperRule (BaseAssignColumnRule)
¶
Converts the values in a string column to upper case.
Basic usage::
    rule = StrUpperRule("col_A")
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
input_column | str | A string column to convert to upper case. | required
output_column | Optional[str] | An optional new name for the column with the upper case values. If provided, the existing column is unchanged, and a new column is created with the upper case values. If not provided, the result is updated in place. | None
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError | raised in strict mode only if the output_column already exists in the dataframe.
Note
In non-strict mode, the overwriting of existing columns is ignored.
Source code in etlrules/backends/common/strings.py
class StrUpperRule(BaseAssignColumnRule):
""" Converts the values in a string columns to upper case.
Basic usage::
rule = StrUpperRule("col_A")
rule.apply(data)
Args:
input_column (str): A string column to convert to upper case.
output_column (Optional[str]): An optional new name for the column with the upper case values.
If provided, the existing column is unchanged, and a new column is created with the upper case values.
If not provided, the result is updated in place.
named_input (Optional[str]): Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised if a column doesn't exist in the input dataframe.
ColumnAlreadyExistsError: raised in strict mode only if the output_column already exists in the dataframe.
Note:
In non-strict mode, the overwriting of existing columns is ignored.
"""
types
¶
TypeConversionRule (UnaryOpBaseRule)
¶
Converts the type of a given set of columns to other types.
Basic usage::
    # converts column A to int64, B to string and C to datetime
    rule = TypeConversionRule({"A": "int64", "B": "string", "C": "datetime"})
    rule.apply(data)
Parameters:
Name | Type | Description | Default
---|---|---|---
mapper | Mapping[str, str] | A dict with column names as keys and the new types as values. The supported types are: int8, int16, int32, int64, uint8, uint16, uint32, uint64, float32, float64, string, boolean, datetime and timedelta. | required
named_input | Optional[str] | Which dataframe to use as the input. Optional. When not set, the input is taken from the main output. Set it to a string value, the name of an output dataframe of a previous rule. | None
named_output | Optional[str] | Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output. | None
name | Optional[str] | Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent. | None
description | Optional[str] | Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule. | None
strict | bool | When set to True, the rule does a stricter validation. Default: True | True
Exceptions:
Type | Description
---|---
MissingColumnError | raised when a column specified in the mapper doesn't exist in the input data frame.
UnsupportedTypeError | raised when an unknown type is specified in the values of the mapper.
ValueError | raised in strict mode if a value cannot be converted to the desired type. In non-strict mode, the exception is not raised and the value is converted to NA.
Source code in etlrules/backends/common/types.py
class TypeConversionRule(UnaryOpBaseRule):
""" Converts the type of a given set of columns to other types.
Basic usage::
# converts column A to int64, B to string and C to datetime
rule = TypeConversionRule({"A": "int64", "B": "string", "C": "datetime"})
rule.apply(data)
Args:
mapper: A dict with column names as keys and the new types as values.
The supported types are: int8, int16, int32, int64, uint8, uint16,
uint32, uint64, float32, float64, string, boolean, datetime and timedelta.
named_input: Which dataframe to use as the input. Optional.
When not set, the input is taken from the main output.
Set it to a string value, the name of an output dataframe of a previous rule.
named_output: Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name: Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description: Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict: When set to True, the rule does a stricter validation. Default: True
Raises:
MissingColumnError: raised when a column specified in the mapper doesn't exist in the input data frame.
UnsupportedTypeError: raised when an unknown type is specified in the values of the mapper.
ValueError: raised in strict mode if a value cannot be converted to the desired type.
In non-strict mode, the exception is not raised and the value is converted to NA.
"""
def __init__(self, mapper: Mapping[str, str], named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
assert isinstance(mapper, dict), "mapper needs to be a dict {column_name:type}"
assert all(isinstance(key, str) and isinstance(val, str) for key, val in mapper.items()), "mapper needs to be a dict {column_name:type} where the names are str"
super().__init__(named_input=named_input, named_output=named_output, name=name, description=description, strict=strict)
self.mapper = mapper
for column_name, type_str in self.mapper.items():
if type_str not in SUPPORTED_TYPES:
raise UnsupportedTypeError(f"Type '{type_str}' for column '{column_name}' is not currently supported.")
def do_type_conversion(self, df, col, dtype):
raise NotImplementedError("Have you imported the rules from etlrules.backends.<your_backend> and not common?")
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
columns_set = set(df.columns)
for column_name in self.mapper:
if column_name not in columns_set:
raise MissingColumnError(f"Column '{column_name}' is missing in the data frame. Available columns: {sorted(columns_set)}")
df = self.assign_do_apply_dict(df, {
column_name: self.do_type_conversion(df, df[column_name], type_str)
for column_name, type_str in self.mapper.items()
})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/common/types.py
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
columns_set = set(df.columns)
for column_name in self.mapper:
if column_name not in columns_set:
raise MissingColumnError(f"Column '{column_name}' is missing in the data frame. Available columns: {sorted(columns_set)}")
df = self.assign_do_apply_dict(df, {
column_name: self.do_type_conversion(df, df[column_name], type_str)
for column_name, type_str in self.mapper.items()
})
self._set_output_df(data, df)
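A sketch of the non-strict conversion behaviour (the etlrules.backends.pandas import path is an assumption)::

    from etlrules.backends.pandas import TypeConversionRule

    # in non-strict mode, values which cannot be converted become NA
    # instead of raising ValueError
    rule = TypeConversionRule({"A": "int64"}, strict=False)
    rule.apply(data)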
dask
special
¶
basic
¶
ExplodeValuesRule (ExplodeValuesRule)
¶
Source code in etlrules/backends/dask/basic.py
class ExplodeValuesRule(ExplodeValuesRuleBase):
def apply(self, data):
df = self._get_input_df(data)
self._validate_input_column(df)
result = df.explode(self.input_column)
if self.column_type:
result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
self._set_output_df(data, result)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single, unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/backends/dask/basic.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_input_column(df)
result = df.explode(self.input_column)
if self.column_type:
result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
self._set_output_df(data, result)
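A usage sketch (the parameter names are inferred from the attributes used in apply above; the common ExplodeValuesRule base with the full signature is not shown on this page)::

    from etlrules.backends.dask.basic import ExplodeValuesRule

    # explodes the list values in col_A into one row per element,
    # converting the resulting column to int64
    rule = ExplodeValuesRule("col_A", column_type="int64")
    rule.apply(data)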
conditions
¶
FilterRule (FilterRule)
¶
Source code in etlrules/backends/dask/conditions.py
class FilterRule(FilterRuleBase):
def get_condition_expression(self):
return Expression(self.condition_expression, filename="FilterRule.py")
def apply(self, data):
df = self._get_input_df(data)
cond_series = self._condition_expression.eval(df)
if self.discard_matching_rows:
cond_series = ~cond_series
self._set_output_df(data, df[cond_series].reset_index(drop=True))
if self.named_output_discarded:
data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/conditions.py
def apply(self, data):
df = self._get_input_df(data)
cond_series = self._condition_expression.eval(df)
if self.discard_matching_rows:
cond_series = ~cond_series
self._set_output_df(data, df[cond_series].reset_index(drop=True))
if self.named_output_discarded:
data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
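A usage sketch, assuming the constructor parameters mirror the attributes used in the source above (condition_expression, discard_matching_rows, named_output_discarded); the expression string is illustrative:
from etlrules.backends.dask import FilterRule

# Keep rows with a positive price; route the discarded rows to a
# separate named output for later inspection
rule = FilterRule(
    condition_expression="df['price'] > 0",
    named_input="raw",
    named_output="valid_rows",
    named_output_discarded="rejected_rows",
)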
IfThenElseRule (IfThenElseRule)
¶
Source code in etlrules/backends/dask/conditions.py
class IfThenElseRule(IfThenElseRuleBase):
def get_condition_expression(self):
return Expression(self.condition_expression, filename=f'{self.output_column}.py')
def apply(self, data):
df = self._get_input_df(data)
df_columns = set(df.columns)
self._validate_columns(df_columns)
cond_series = self._condition_expression.eval(df)
then_value = self.then_value if self.then_value is not None else df[self.then_column]
else_value = self.else_value if self.else_value is not None else df[self.else_column]
df = df.assign(**{self.output_column: then_value})
df = df.assign(**{self.output_column: df[self.output_column].where(cond_series, else_value)})
if (isinstance(then_value, str) or isinstance(else_value, str)) and len(df.index) == 0:
df = df.astype({self.output_column: "string"})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/conditions.py
def apply(self, data):
df = self._get_input_df(data)
df_columns = set(df.columns)
self._validate_columns(df_columns)
cond_series = self._condition_expression.eval(df)
then_value = self.then_value if self.then_value is not None else df[self.then_column]
else_value = self.else_value if self.else_value is not None else df[self.else_column]
df = df.assign(**{self.output_column: then_value})
df = df.assign(**{self.output_column: df[self.output_column].where(cond_series, else_value)})
if (isinstance(then_value, str) or isinstance(else_value, str)) and len(df.index) == 0:
df = df.astype({self.output_column: "string"})
self._set_output_df(data, df)
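A usage sketch (per the source above, then_value/else_value take literals while then_column/else_column take values from other columns; the parameter names and expression syntax are assumptions based on that source):
from etlrules.backends.dask import IfThenElseRule

# Write "pass" or "fail" into a new "grade" column based on the condition
rule = IfThenElseRule(
    condition_expression="df['score'] >= 50",
    output_column="grade",
    then_value="pass",
    else_value="fail",
    named_input="scores",
    named_output="graded",
)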
datetime
¶
DateTimeLocalNowRule (DateTimeLocalNowRule)
¶
Source code in etlrules/backends/dask/datetime.py
class DateTimeLocalNowRule(DateTimeLocalNowRuleBase):
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.now()})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/datetime.py
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.now()})
self._set_output_df(data, df)
DateTimeUTCNowRule (DateTimeUTCNowRule)
¶
Source code in etlrules/backends/dask/datetime.py
class DateTimeUTCNowRule(DateTimeUTCNowRuleBase):
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/datetime.py
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
self._set_output_df(data, df)
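A usage sketch for the two rules (output_column is taken from the source above; in strict mode an existing output column raises ColumnAlreadyExistsError):
from etlrules.backends.dask import DateTimeLocalNowRule, DateTimeUTCNowRule

# Stamp every row with the local and UTC timestamps of the run
local_rule = DateTimeLocalNowRule(
    output_column="run_at_local", named_input="events", named_output="stamped"
)
utc_rule = DateTimeUTCNowRule(
    output_column="run_at_utc", named_input="stamped", named_output="stamped_utc"
)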
io
special
¶
db
¶
WriteSQLTableRule (WriteSQLTableRule)
¶
Source code in etlrules/backends/dask/io/db.py
class WriteSQLTableRule(WriteSQLTableRuleBase):
METHOD = 'multi'
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
import sqlalchemy as sa
try:
df.to_sql(
self._get_sql_table(),
self._get_sql_engine(),
if_exists=self.if_exists,
index=False,
method=self.METHOD
)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/io/db.py
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
import sqlalchemy as sa
try:
df.to_sql(
self._get_sql_table(),
self._get_sql_engine(),
if_exists=self.if_exists,
index=False,
method=self.METHOD
)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
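A usage sketch; the sql_engine/sql_table parameter names are assumptions inferred from the _get_sql_engine/_get_sql_table accessors above, so check the rule's signature in your version:
from etlrules.backends.dask import WriteSQLTableRule

# Write the "final" named output to a SQL table, replacing it if it exists;
# SQLAlchemy errors are re-raised as SQLError
rule = WriteSQLTableRule(
    sql_engine="sqlite:///./output.db",  # assumed: a SQLAlchemy connection string
    sql_table="processed_events",
    if_exists="replace",
    named_input="final",
)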
newcolumns
¶
AddNewColumnRule (AddNewColumnRule)
¶
Source code in etlrules/backends/dask/newcolumns.py
class AddNewColumnRule(AddNewColumnRuleBase):
def get_column_expression(self):
return Expression(self.column_expression, filename=f'{self.output_column}_expression.py')
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
result = self._column_expression.eval(df)
if self.column_type:
try:
result = result.astype(MAP_TYPES[self.column_type])
except ValueError as exc:
raise TypeError(str(exc))
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/newcolumns.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
result = self._column_expression.eval(df)
if self.column_type:
try:
result = result.astype(MAP_TYPES[self.column_type])
except ValueError as exc:
raise TypeError(str(exc))
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
AddRowNumbersRule (AddRowNumbersRule)
¶
Source code in etlrules/backends/dask/newcolumns.py
class AddRowNumbersRule(AddRowNumbersRuleBase):
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
stop = self.start + len(df.index) * self.step
result = da.arange(self.start, stop, self.step)
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/newcolumns.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
stop = self.start + len(df.index) * self.step
result = da.arange(self.start, stop, self.step)
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
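A usage sketch for the two rules above (parameter names mirror the attributes in the source and are assumptions; the expression string is illustrative):
from etlrules.backends.dask import AddNewColumnRule, AddRowNumbersRule

# Derive a "total" column from an expression, then number the rows from 1
add_total = AddNewColumnRule(
    output_column="total",
    column_expression="df['price'] * df['quantity']",
    column_type="float64",
    named_input="orders",
    named_output="with_total",
)
add_row_numbers = AddRowNumbersRule(
    output_column="row_number",
    start=1,
    step=1,
    named_input="with_total",
    named_output="numbered",
)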
strings
¶
StrExtractRule (StrExtractRule, DaskMixin)
¶
Source code in etlrules/backends/dask/strings.py
class StrExtractRule(StrExtractRuleBase, DaskMixin):
def apply(self, data):
df = self._get_input_df(data)
columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
new_cols_dict = {}
groups = self._compiled_expr.groups
for idx, col in enumerate(columns):
new_col = df[col].str.extract(self._compiled_expr, expand=True)
for group in range(groups):
new_column = new_col[group]
if group == 0 and self.keep_original_value:
# only the first new column keeps the value (in case of multiple groups)
new_column = new_column.fillna(value=df[col])
new_cols_dict[output_columns[idx * groups + group]] = new_column
df = df.assign(**new_cols_dict)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/dask/strings.py
def apply(self, data):
df = self._get_input_df(data)
columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
new_cols_dict = {}
groups = self._compiled_expr.groups
for idx, col in enumerate(columns):
new_col = df[col].str.extract(self._compiled_expr, expand=True)
for group in range(groups):
new_column = new_col[group]
if group == 0 and self.keep_original_value:
# only the first new column keeps the value (in case of multiple groups)
new_column = new_column.fillna(value=df[col])
new_cols_dict[output_columns[idx * groups + group]] = new_column
df = df.assign(**new_cols_dict)
self._set_output_df(data, df)
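A usage sketch (per the source above, one output column is needed per regex capture group, and with keep_original_value=True the first output column falls back to the original value when the pattern does not match; parameter names are assumptions based on that source):
from etlrules.backends.dask import StrExtractRule

# Split codes like "AB-123" into a "prefix" and a "number" column
rule = StrExtractRule(
    input_column="code",
    output_columns=["prefix", "number"],
    regular_expression=r"([A-Z]+)-([0-9]+)",
    keep_original_value=True,
    named_input="items",
    named_output="extracted",
)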
pandas
special
¶
basic
¶
ExplodeValuesRule (ExplodeValuesRule)
¶
Source code in etlrules/backends/pandas/basic.py
class ExplodeValuesRule(ExplodeValuesRuleBase):
def apply(self, data):
df = self._get_input_df(data)
self._validate_input_column(df)
result = df.explode(self.input_column, ignore_index=True)
if self.column_type:
result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
self._set_output_df(data, result)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/basic.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_input_column(df)
result = df.explode(self.input_column, ignore_index=True)
if self.column_type:
result = result.astype({self.input_column: MAP_TYPES[self.column_type]})
self._set_output_df(data, result)
conditions
¶
FilterRule (FilterRule)
¶
Source code in etlrules/backends/pandas/conditions.py
class FilterRule(FilterRuleBase):
def get_condition_expression(self):
return Expression(self.condition_expression, filename="FilterRule.py")
def apply(self, data):
df = self._get_input_df(data)
cond_series = self._condition_expression.eval(df)
if self.discard_matching_rows:
cond_series = ~cond_series
self._set_output_df(data, df[cond_series].reset_index(drop=True))
if self.named_output_discarded:
data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/conditions.py
def apply(self, data):
df = self._get_input_df(data)
cond_series = self._condition_expression.eval(df)
if self.discard_matching_rows:
cond_series = ~cond_series
self._set_output_df(data, df[cond_series].reset_index(drop=True))
if self.named_output_discarded:
data.set_named_output(self.named_output_discarded, df[~cond_series].reset_index(drop=True))
IfThenElseRule (IfThenElseRule)
¶
Source code in etlrules/backends/pandas/conditions.py
class IfThenElseRule(IfThenElseRuleBase):
def get_condition_expression(self):
return Expression(self.condition_expression, filename=f'{self.output_column}.py')
def apply(self, data):
df = self._get_input_df(data)
df_columns = set(df.columns)
self._validate_columns(df_columns)
cond_series = self._condition_expression.eval(df)
then_value = self.then_value if self.then_value is not None else df[self.then_column]
else_value = self.else_value if self.else_value is not None else df[self.else_column]
result = np.where(cond_series, then_value, else_value)
df = df.assign(**{self.output_column: result})
if df.empty and (isinstance(then_value, str) or isinstance(else_value, str)):
df = df.astype({self.output_column: "string"})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/conditions.py
def apply(self, data):
df = self._get_input_df(data)
df_columns = set(df.columns)
self._validate_columns(df_columns)
cond_series = self._condition_expression.eval(df)
then_value = self.then_value if self.then_value is not None else df[self.then_column]
else_value = self.else_value if self.else_value is not None else df[self.else_column]
result = np.where(cond_series, then_value, else_value)
df = df.assign(**{self.output_column: result})
if df.empty and (isinstance(then_value, str) or isinstance(else_value, str)):
df = df.astype({self.output_column: "string"})
self._set_output_df(data, df)
datetime
¶
DateTimeLocalNowRule (DateTimeLocalNowRule)
¶
Source code in etlrules/backends/pandas/datetime.py
class DateTimeLocalNowRule(DateTimeLocalNowRuleBase):
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.now()})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/datetime.py
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.now()})
self._set_output_df(data, df)
DateTimeUTCNowRule (DateTimeUTCNowRule)
¶
Source code in etlrules/backends/pandas/datetime.py
class DateTimeUTCNowRule(DateTimeUTCNowRuleBase):
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/datetime.py
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.assign(**{self.output_column: datetime.datetime.utcnow()})
self._set_output_df(data, df)
io
special
¶
db
¶
WriteSQLTableRule (WriteSQLTableRule)
¶
Source code in etlrules/backends/pandas/io/db.py
class WriteSQLTableRule(WriteSQLTableRuleBase):
METHOD = 'multi'
def _do_apply(self, connection, df):
df.to_sql(
self._get_sql_table(),
connection,
if_exists=self.if_exists,
index=False,
method=self.METHOD
)
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
engine = SQLAlchemyEngines.get_engine(self._get_sql_engine())
import sqlalchemy as sa
with engine.connect() as connection:
try:
self._do_apply(connection, df)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
connection.commit()
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/io/db.py
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
engine = SQLAlchemyEngines.get_engine(self._get_sql_engine())
import sqlalchemy as sa
with engine.connect() as connection:
try:
self._do_apply(connection, df)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
connection.commit()
newcolumns
¶
AddNewColumnRule (AddNewColumnRule)
¶
Source code in etlrules/backends/pandas/newcolumns.py
class AddNewColumnRule(AddNewColumnRuleBase):
def get_column_expression(self):
return Expression(self.column_expression, filename=f'{self.output_column}_expression.py')
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
result = self._column_expression.eval(df)
if self.column_type:
try:
result = result.astype(MAP_TYPES[self.column_type])
except ValueError as exc:
raise TypeError(str(exc))
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/newcolumns.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
result = self._column_expression.eval(df)
if self.column_type:
try:
result = result.astype(MAP_TYPES[self.column_type])
except ValueError as exc:
raise TypeError(str(exc))
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
AddRowNumbersRule (AddRowNumbersRule)
¶
Source code in etlrules/backends/pandas/newcolumns.py
class AddRowNumbersRule(AddRowNumbersRuleBase):
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
stop = self.start + df.shape[0] * self.step
result = np.arange(start=self.start, stop=stop, step=self.step)
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/newcolumns.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
stop = self.start + df.shape[0] * self.step
result = np.arange(start=self.start, stop=stop, step=self.step)
df = df.assign(**{self.output_column: result})
self._set_output_df(data, df)
strings
¶
StrExtractRule (StrExtractRule, PandasMixin)
¶
Source code in etlrules/backends/pandas/strings.py
class StrExtractRule(StrExtractRuleBase, PandasMixin):
def apply(self, data):
df = self._get_input_df(data)
columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
new_cols_dict = {}
groups = self._compiled_expr.groups
for idx, col in enumerate(columns):
new_col = df[col].str.extract(self._compiled_expr, expand=True)
if self.keep_original_value:
# only the first new column keeps the value (in case of multiple groups)
new_col[0].fillna(value=df[col], inplace=True)
for group in range(groups):
new_cols_dict[output_columns[idx * groups + group]] = new_col[group]
df = df.assign(**new_cols_dict)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/pandas/strings.py
def apply(self, data):
df = self._get_input_df(data)
columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
new_cols_dict = {}
groups = self._compiled_expr.groups
for idx, col in enumerate(columns):
new_col = df[col].str.extract(self._compiled_expr, expand=True)
if self.keep_original_value:
# only the first new column keeps the value (in case of multiple groups)
new_col[0].fillna(value=df[col], inplace=True)
for group in range(groups):
new_cols_dict[output_columns[idx * groups + group]] = new_col[group]
df = df.assign(**new_cols_dict)
self._set_output_df(data, df)
polars
special
¶
basic
¶
ExplodeValuesRule (ExplodeValuesRule)
¶
Source code in etlrules/backends/polars/basic.py
class ExplodeValuesRule(ExplodeValuesRuleBase):
def apply(self, data):
df = self._get_input_df(data)
self._validate_input_column(df)
result = df.explode(self.input_column)
if self.column_type:
result = result.with_columns(
**{self.input_column: pl.col(self.input_column).cast(MAP_TYPES[self.column_type])}
)
self._set_output_df(data, result)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/basic.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_input_column(df)
result = df.explode(self.input_column)
if self.column_type:
result = result.with_columns(
**{self.input_column: pl.col(self.input_column).cast(MAP_TYPES[self.column_type])}
)
self._set_output_df(data, result)
conditions
¶
FilterRule (FilterRule)
¶
Source code in etlrules/backends/polars/conditions.py
class FilterRule(FilterRuleBase):
def get_condition_expression(self):
return Expression(self.condition_expression, filename="FilterRule.py")
def apply(self, data):
df = self._get_input_df(data)
try:
cond_series = self._condition_expression.eval(df)
except pl.exceptions.ColumnNotFoundError as exc:
raise KeyError(str(exc))
if self.discard_matching_rows:
cond_series = ~cond_series
result = df.filter(cond_series)
self._set_output_df(data, result)
if self.named_output_discarded:
discarded_result = df.filter(~cond_series)
data.set_named_output(self.named_output_discarded, discarded_result)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/conditions.py
def apply(self, data):
df = self._get_input_df(data)
try:
cond_series = self._condition_expression.eval(df)
except pl.exceptions.ColumnNotFoundError as exc:
raise KeyError(str(exc))
if self.discard_matching_rows:
cond_series = ~cond_series
result = df.filter(cond_series)
self._set_output_df(data, result)
if self.named_output_discarded:
discarded_result = df.filter(~cond_series)
data.set_named_output(self.named_output_discarded, discarded_result)
IfThenElseRule (IfThenElseRule)
¶
Source code in etlrules/backends/polars/conditions.py
class IfThenElseRule(IfThenElseRuleBase):
def get_condition_expression(self):
return Expression(self.condition_expression, filename=f'{self.output_column}.py')
def apply(self, data):
df = self._get_input_df(data)
df_columns = set(df.columns)
self._validate_columns(df_columns)
try:
cond_series = self._condition_expression.eval(df)
except pl.exceptions.ColumnNotFoundError as exc:
raise KeyError(str(exc))
then_value = pl.lit(self.then_value) if self.then_value is not None else pl.col(self.then_column)
else_value = pl.lit(self.else_value) if self.else_value is not None else pl.col(self.else_column)
result = pl.when(cond_series).then(then_value).otherwise(else_value)
df = df.with_columns(**{self.output_column: result})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/conditions.py
def apply(self, data):
df = self._get_input_df(data)
df_columns = set(df.columns)
self._validate_columns(df_columns)
try:
cond_series = self._condition_expression.eval(df)
except pl.exceptions.ColumnNotFoundError as exc:
raise KeyError(str(exc))
then_value = pl.lit(self.then_value) if self.then_value is not None else pl.col(self.then_column)
else_value = pl.lit(self.else_value) if self.else_value is not None else pl.col(self.else_column)
result = pl.when(cond_series).then(then_value).otherwise(else_value)
df = df.with_columns(**{self.output_column: result})
self._set_output_df(data, df)
datetime
¶
DateTimeLocalNowRule (DateTimeLocalNowRule)
¶
Source code in etlrules/backends/polars/datetime.py
class DateTimeLocalNowRule(DateTimeLocalNowRuleBase):
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.with_columns(
pl.lit(datetime.datetime.now()).alias(self.output_column)
)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/datetime.py
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.with_columns(
pl.lit(datetime.datetime.now()).alias(self.output_column)
)
self._set_output_df(data, df)
DateTimeUTCNowRule (DateTimeUTCNowRule)
¶
Source code in etlrules/backends/polars/datetime.py
class DateTimeUTCNowRule(DateTimeUTCNowRuleBase):
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.with_columns(
pl.lit(datetime.datetime.utcnow()).alias(self.output_column)
)
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/datetime.py
def apply(self, data):
df = self._get_input_df(data)
if self.strict and self.output_column in df.columns:
raise ColumnAlreadyExistsError(f"{self.output_column} already exists in the input dataframe.")
df = df.with_columns(
pl.lit(datetime.datetime.utcnow()).alias(self.output_column)
)
self._set_output_df(data, df)
io
special
¶
db
¶
WriteSQLTableRule (WriteSQLTableRule)
¶
Source code in etlrules/backends/polars/io/db.py
class WriteSQLTableRule(WriteSQLTableRuleBase):
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
import sqlalchemy as sa
try:
df.write_database(
self._get_sql_table(),
self._get_sql_engine(),
if_exists=self.if_exists
)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/io/db.py
def apply(self, data):
super().apply(data)
df = self._get_input_df(data)
import sqlalchemy as sa
try:
df.write_database(
self._get_sql_table(),
self._get_sql_engine(),
if_exists=self.if_exists
)
except sa.exc.SQLAlchemyError as exc:
raise SQLError(str(exc))
newcolumns
¶
AddNewColumnRule (AddNewColumnRule)
¶
Source code in etlrules/backends/polars/newcolumns.py
class AddNewColumnRule(AddNewColumnRuleBase):
def get_column_expression(self):
return Expression(self.column_expression, filename=f'{self.output_column}_expression.py')
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
try:
result = self._column_expression.eval(df)
except pl.exceptions.ColumnNotFoundError as exc:
raise KeyError(str(exc))
if self.column_type:
try:
result = result.cast(MAP_TYPES[self.column_type])
except pl.exceptions.ComputeError as exc:
raise TypeError(exc)
df = df.with_columns(**{self.output_column: result})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/newcolumns.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
try:
result = self._column_expression.eval(df)
except pl.exceptions.ColumnNotFoundError as exc:
raise KeyError(str(exc))
if self.column_type:
try:
result = result.cast(MAP_TYPES[self.column_type])
except pl.exceptions.ComputeError as exc:
raise TypeError(exc)
df = df.with_columns(**{self.output_column: result})
self._set_output_df(data, df)
AddRowNumbersRule (AddRowNumbersRule)
¶
Source code in etlrules/backends/polars/newcolumns.py
class AddRowNumbersRule(AddRowNumbersRuleBase):
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
stop = self.start + len(df) * self.step
df = df.with_columns(**{self.output_column: pl.arange(start=self.start, end=stop, step=self.step)})
self._set_output_df(data, df)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/newcolumns.py
def apply(self, data):
df = self._get_input_df(data)
self._validate_columns(df.columns)
stop = self.start + len(df) * self.step
df = df.with_columns(**{self.output_column: pl.arange(start=self.start, end=stop, step=self.step)})
self._set_output_df(data, df)
strings
¶
StrExtractRule (StrExtractRule, PolarsMixin)
¶
Source code in etlrules/backends/polars/strings.py
class StrExtractRule(StrExtractRuleBase, PolarsMixin):
def apply(self, data):
df = self._get_input_df(data)
columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
groups = self._compiled_expr.groups
input_column = columns[0]
ordered_cols = [col for col in df.columns]
ordered_cols += [col for col in output_columns if col not in ordered_cols]
if self.keep_original_value:
res = df.with_columns(
pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
).select(
*([col for col in df.columns] + [pl.col("_tmp_col").struct[i].alias("_tmp_col2" if i == 0 else output_columns[i]) for i in range(groups)])
)
res = res.with_columns(
pl.when(
pl.col("_tmp_col2").is_null()
).then(pl.col(input_column)).otherwise(pl.col("_tmp_col2")).alias(output_columns[0])
)
else:
res = df.with_columns(
pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
).select(
*([col for col in df.columns if col not in output_columns] + [pl.col("_tmp_col").struct[i].alias(output_columns[i]) for i in range(groups)])
)
res = res[ordered_cols]
self._set_output_df(data, res)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule's logic to the input data. The input data is an instance of RuleData, which can store a single unnamed dataframe (in pipeline mode) or one or more named dataframes (in graph mode). The rule extracts the data it needs from the RuleData instance, applies its main logic and updates the same instance with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override it when you derive from BaseRule to implement the logic of your rule.
Source code in etlrules/backends/polars/strings.py
def apply(self, data):
df = self._get_input_df(data)
columns, output_columns = self.validate_columns_in_out(df.columns, [self.input_column], self.output_columns, self.strict, validate_length=False)
groups = self._compiled_expr.groups
input_column = columns[0]
ordered_cols = [col for col in df.columns]
ordered_cols += [col for col in output_columns if col not in ordered_cols]
if self.keep_original_value:
res = df.with_columns(
pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
).select(
*([col for col in df.columns] + [pl.col("_tmp_col").struct[i].alias("_tmp_col2" if i == 0 else output_columns[i]) for i in range(groups)])
)
res = res.with_columns(
pl.when(
pl.col("_tmp_col2").is_null()
).then(pl.col(input_column)).otherwise(pl.col("_tmp_col2")).alias(output_columns[0])
)
else:
res = df.with_columns(
pl.col(input_column).str.extract_groups(self.regular_expression).alias("_tmp_col")
).select(
*([col for col in df.columns if col not in output_columns] + [pl.col("_tmp_col").struct[i].alias(output_columns[i]) for i in range(groups)])
)
res = res[ordered_cols]
self._set_output_df(data, res)
engine
¶
RuleEngine
¶
Run a set of extract/transform/load rules over a dataframe.
Takes in a plan with the definition of the extract/transform/load rules and runs it over a RuleData instance. The RuleData instance can optionally be pre-populated with an input dataframe (in pipeline mode) or a sequence of named inputs (named dataframes).
The plan can have rules to extract data (i.e. add more dataframes to the RuleData). It can have transform rules which transform the existing dataframes (either in place or by producing new named dataframes). It can also have rules to load data into external systems, e.g. files, databases, API connections, etc.
At the end of a plan run, the RuleData instance passed in will contain the results of the run (i.e. new/transformed dataframes), which can be inspected and operated on outside of the rule engine.
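As a minimal pipeline-mode sketch (the RuleData import path and its constructor taking a single dataframe are assumptions; SortRule and ProjectRule are the rules used in the Plan examples below):
import pandas as pd

from etlrules.data import RuleData  # assumed import path
from etlrules.engine import RuleEngine
from etlrules.plan import Plan
from etlrules.backends.pandas import ProjectRule, SortRule

# Build a pipeline plan: each rule feeds the next
plan = Plan(mode="pipeline")
plan.add_rule(SortRule(["A"]))
plan.add_rule(ProjectRule(["A", "B"]))

# Wrap the input dataframe, run the plan, then inspect the RuleData instance
data = RuleData(pd.DataFrame({"A": [3, 1, 2], "B": ["x", "y", "z"]}))
RuleEngine(plan).run(data)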
Source code in etlrules/engine.py
class RuleEngine:
""" Run a set of extract/transform/load rules over a dataframe.
Takes in a plan with the definition of the extract/transform/load rules and runs it
over a RuleData instance. The RuleData instance can optionally be pre-populated with
an input dataframe (in pipeline mode) or a sequence of named inputs (named dataframes).
The plan can have rules to extract data (i.e. add more dataframes to the RuleData). It can have
transform rules which transform the existing dataframes (either in place or by producing new
named dataframes). It can also have rules to load data into external systems, e.g. files,
databases, API connections, etc.
At the end of a plan run, the RuleData instance passed in will contain the results of the run
(i.e. new/transformed dataframes) which can be inspected and operated on outside of the
rule engine.
"""
def __init__(self, plan: Plan):
assert isinstance(plan, Plan)
self.plan = plan
def _get_context(self, data: RuleData) -> dict[str, Union[str, int, float, bool]]:
context = {}
context.update(self.plan.get_context())
context.update(data.get_context())
return context
def run_pipeline(self, data: RuleData) -> RuleData:
with context.set(self._get_context(data)):
for rule in self.plan:
rule.apply(data)
return data
def _get_topological_sorter(self, data: RuleData) -> graphlib.TopologicalSorter:
g = graphlib.TopologicalSorter()
existing_named_outputs = set(name for name, _ in data.get_named_outputs())
named_outputs = {}
for idx, rule in enumerate(self.plan):
if rule.has_output():
named_outputs_lst = list(rule.get_all_named_outputs())
if not named_outputs_lst:
raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has no named outputs.")
for named_output in named_outputs_lst:
if named_output is None:
raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has empty named output.")
existing_rule = named_outputs.get(named_output)
if existing_rule is not None:
raise InvalidPlanError(f"Named output '{named_output}' is produced by multiple rules: {rule.__class__}/(name={rule.get_name()}) and {existing_rule[1].__class__}/(name={existing_rule[1].get_name()})")
named_outputs[named_output] = (idx, rule)
named_output_clashes = existing_named_outputs & set(named_outputs.keys())
if named_output_clashes:
raise GraphRuntimeError(f"Named output clashes. The following named outputs are produced by rules in the plan but they also exist in the input data, leading to ambiguity: {named_output_clashes}")
for idx, rule in enumerate(self.plan):
if rule.has_input():
named_inputs = list(rule.get_all_named_inputs())
if not named_inputs:
raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has no named inputs.")
for named_input in named_inputs:
if named_input is None:
raise InvalidPlanError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) has empty named input.")
if named_input in named_outputs:
g.add(idx, named_outputs[named_input][0])
elif named_input not in existing_named_outputs:
raise GraphRuntimeError(f"Rule {rule.__class__}/(name={rule.get_name()}, index={idx}) requires a named_input={named_input} which doesn't exist in the input data and it's not produced as a named output by any of the rules in the graph.")
else:
g.add(idx)
else:
g.add(idx)
return g
def run_graph(self, data: RuleData) -> RuleData:
g = self._get_topological_sorter(data)
g.prepare()
with context.set(self._get_context(data)):
while g.is_active():
for rule_idx in g.get_ready():
rule = self.plan.get_rule(rule_idx)
rule.apply(data)
g.done(rule_idx)
return data
def validate_pipeline(self, data: RuleData) -> Tuple[bool, Optional[str]]:
return True, None
def validate_graph(self, data: RuleData) -> Tuple[bool, Optional[str]]:
try:
self._get_topological_sorter(data)
except (InvalidPlanError, GraphRuntimeError) as exc:
return False, str(exc)
return True, None
def validate(self, data: RuleData) -> Tuple[bool, Optional[str]]:
assert isinstance(data, RuleData)
if self.plan.is_empty():
return False, "An empty plan cannot be run."
mode = self.plan.get_mode()
if mode == PlanMode.PIPELINE:
return self.validate_pipeline(data)
elif mode == PlanMode.GRAPH:
return self.validate_graph(data)
return False, "Plan's mode cannot be determined."
def run(self, data: RuleData) -> RuleData:
assert isinstance(data, RuleData)
if self.plan.is_empty():
raise InvalidPlanError("An empty plan cannot be run.")
mode = self.plan.get_mode()
if mode == PlanMode.PIPELINE:
return self.run_pipeline(data)
elif mode == PlanMode.GRAPH:
return self.run_graph(data)
else:
raise InvalidPlanError("Plan's mode cannot be determined.")
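For graph mode, validate and run can be paired; a sketch assuming RuleData accepts named inputs as a mapping (an assumption, check your version):
import pandas as pd

from etlrules.data import RuleData  # assumed import path
from etlrules.engine import RuleEngine

# graph_plan is a graph-mode Plan whose rules consume named_input="input"
data = RuleData(named_inputs={"input": pd.DataFrame({"A": [2, 1], "B": ["x", "y"]})})
engine = RuleEngine(graph_plan)
ok, error = engine.validate(data)  # returns (True, None) or (False, reason)
if ok:
    engine.run(data)  # rules execute in dependency order, not insertion order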
exceptions
¶
ColumnAlreadyExistsError (Exception)
¶
An attempt to create a column that already exists in the dataframe.
Source code in etlrules/exceptions.py
class ColumnAlreadyExistsError(Exception):
""" An attempt to create a column that already exists in the dataframe. """
ExpressionSyntaxError (SyntaxError)
¶
A Python expression used to create a column, aggregate or other operations has a syntax error.
Source code in etlrules/exceptions.py
class ExpressionSyntaxError(SyntaxError):
""" A Python expression used to create a column, aggregate or other operations has a syntax error. """
GraphRuntimeError (RuntimeError)
¶
There was an error when running a graph-mode plan.
Source code in etlrules/exceptions.py
class GraphRuntimeError(RuntimeError):
""" There was an error when running a graph-mode plan. """
InvalidPlanError (Exception)
¶
The plan failed validation.
Source code in etlrules/exceptions.py
class InvalidPlanError(Exception):
""" The plan failed validation. """
MissingColumnError (Exception)
¶
An operation is being applied to a column that is not present in the input data frame.
Source code in etlrules/exceptions.py
class MissingColumnError(Exception):
""" An operation is being applied to a column that is not present in the input data frame. """
SQLError (RuntimeError)
¶
There was an error during the execution of a sql statement.
Source code in etlrules/exceptions.py
class SQLError(RuntimeError):
""" There was an error during the execution of a sql statement. """
SchemaError (Exception)
¶
An operation needs a certain schema for the dataframe which is not present.
Source code in etlrules/exceptions.py
class SchemaError(Exception):
""" An operation needs a certain schema for the dataframe which is not present. """
UnsupportedTypeError (Exception)
¶
A type conversion is attempted to a type that is not supported.
Source code in etlrules/exceptions.py
class UnsupportedTypeError(Exception):
""" A type conversion is attempted to a type that is not supported. """
plan
¶
Plan
¶
A plan to manipulate one or multiple dataframes with a set of rules.
A plan is a blueprint for how to extract one or more dataframes from various sources (e.g. files or other data sources), how to transform those dataframes by adding calculated columns, joining different dataframes, aggregating, sorting, etc., and ultimately how to load the results into a data store (files or other data stores).
A plan can operate in two modes: pipeline or graph. A pipeline is a simple type of plan where each rule takes its input from the previous rule's output. A graph plan is more complex, as it allows rules to produce named outputs which can then be used by other rules. This ultimately builds a DAG (directed acyclic graph) of rule dependencies. A graph allows branching and joining back, enabling complex logic. Rules are executed in the order of dependency, not in the order they are added to the plan. By comparison, pipelines implement a single-input/single-output mode where rules are executed in the order they are added to the plan.
Pipeline example::
    plan = Plan()
    plan.add_rule(SortRule(['A']))
    plan.add_rule(ProjectRule(['A', 'B']))
    plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}))
Graph example::
    plan = Plan()
    plan.add_rule(SortRule(['A'], named_input="input", named_output="sorted_data"))
    plan.add_rule(ProjectRule(['A', 'B'], named_input="sorted_data", named_output="projected_data"))
    plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}, named_input="projected_data", named_output="renamed_data"))
Note
Rules that are used in graph mode should take a named_input and produce a named_output. Rules that use the pipeline mode must not use named inputs/outputs. The two types of rules cannot be used in the same plan, as that leads to ambiguity.
Parameters:
Name | Type | Description | Default
---|---|---|---
mode | Optional[Literal['pipeline', 'graph']] | One of pipeline or graph, the type of the plan. Optional. In pipeline mode, rules don't use named inputs/outputs and they run in the order they are added to the plan, with each rule taking its input from the previous rule. In graph mode, rules use named inputs/outputs which create a directed acyclic graph of dependencies; the rules run in the order of dependency. When not specified, the mode is inferred from the first rule in the plan. | None
name | Optional[str] | A name for the plan. Optional. | None
description | Optional[str] | Optional documentation for the plan. This can include what the plan does, its purpose and detailed information about how it works. | None
context | Optional[Mapping[str, Union[str, int, float, bool]]] | An optional key-value mapping which can be used in rules via string substitutions. It can be used to parameterize the plan, tweaking each run by providing different values for certain arguments. The value types can be: str, int, float and bool (True or False). | None
strict | Optional[bool] | A hint about how the plan should be executed. When None, the plan has no hint to provide and it's up to the caller to decide whether to run it in strict mode. | None
Exceptions:
Type | Description
---|---
InvalidPlanError | raised if pipeline mode rules are mixed with graph mode rules
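A construction and serialization sketch (SortRule/ProjectRule are the rules from the examples above; to_yaml and from_yaml are shown in the source below):
from etlrules.plan import Plan
from etlrules.backends.pandas import ProjectRule, SortRule

plan = Plan(
    mode="graph",
    name="example_plan",
    context={"threshold": 10},  # available to rules via string substitution
)
plan.add_rule(SortRule(['A'], named_input="input", named_output="sorted_data"))
plan.add_rule(ProjectRule(['A', 'B'], named_input="sorted_data", named_output="projected_data"))

# Round-trip the plan through YAML, binding it to the pandas backend on load
yml = plan.to_yaml()
restored = Plan.from_yaml(yml, backend="pandas")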
Source code in etlrules/plan.py
class Plan:
""" A plan to manipulate one or multiple dataframes with a set of rules.
A plan is a blueprint on how to extract one or more dataframes from various sources (e.g. files or
other data sources), how to transform those dataframes by adding calculated columns, joining
different dataframe, aggregating, sorting, etc. and ultimately how to load that into a data store
(files or other data stores).
A plan can operate in two modes: pipeline or graph. A pipeline is a simple type of plan where
each rule takes its input from the previous rule's output. A graph plan is more complex as it allows
rules to produce named outputs which can then be used by other rules. This ultimately builds a DAG
(directed acyclic graph) of rule dependencies. A graph allows branching and joining back, enabling
complex logic. Rules are executed in the order of dependency and not in the order they are added to
the plan. By comparison, pipelines implement a single input/single output mode where rules are
executed in the order they are added to the plan.
Pipeline example::
plan = Plan()
plan.add_rule(SortRule(['A']))
plan.add_rule(ProjectRule(['A', 'B']))
plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}))
Graph example::
plan = Plan()
plan.add_rule(SortRule(['A'], named_input="input", named_output="sorted_data"))
plan.add_rule(ProjectRule(['A', 'B'], named_input="sorted_data", named_output="projected_data"))
plan.add_rule(RenameRule({'A': 'AA', 'B': 'BB'}, named_input="projected_data", named_output="renamed_data"))
Note:
Rules that are used in graph mode should take a named_input and produce a named_output. Rules
that use the pipeline mode must not use named inputs/outputs. The two types of rules cannot be
used in the same plan as that leads to ambiguity.
Args:
mode: One of pipeline or graph, the type of the plan. Optional.
In pipeline mode, rules don't use named inputs/outputs and they are run in the same order they are
added to the plan, with each rule taking the input from the previous rule.
In graph mode, rules use named inputs/outputs which create a directed acyclic graph of
dependency. The rules are run in the order of dependency.
When not specified, it is inferred from the first rule in the plan.
name: A name for the plan. Optional.
description: An optional documentation for the plan.
This can include what the plan does, its purpose and detailed information about how it works.
context: An optional key-value mapping which can be used in rules via string substitutions.
It can be used to parameterize the plan, tweaking each run by providing different values
for certain arguments.
The types of the values can be: strings, int, float, boolean (True or False).
strict: A hint about how the plan should be executed.
When None, the plan provides no hint and it is up to the caller to decide whether to run it
in strict mode or not.
Raises:
InvalidPlanError: if pipeline mode rules are mixed with graph mode rules
"""
def __init__(
self,
mode: Optional[Literal['pipeline', 'graph']]=None,
name: Optional[str]=None,
description: Optional[str]=None,
context: Optional[Mapping[str, Union[str, int, float, bool]]]=None,
strict: Optional[bool]=None
):
self.mode = mode
self.name = name
self.description = description
self.context = {k: v for k, v in context.items()} if context is not None else {}
self.strict = strict
self.rules = []
def _check_plan_mode(self, rule: BaseRule):
mode = self.get_mode()
if mode is not None:
_new_rule_mode = plan_mode_from_rule(rule)
if _new_rule_mode is not None and mode != _new_rule_mode:
raise InvalidPlanError(f"Mixing of rules taking named inputs and rules with no named inputs is not supported. ({mode} vs. {rule.__class__}'s mode {_new_rule_mode})")
def get_mode(self) -> Optional[Literal['pipeline', 'graph']]:
""" Return the mode (pipeline or graph) of the plan. """
if self.mode is None:
self.mode = plan_mode_from_rules(self.rules)
return self.mode
def get_context(self) -> dict[str, Union[str, int, float, bool]]:
return self.context
def add_rule(self, rule: BaseRule) -> None:
""" Add a new rule to the plan.
Args:
rule: A rule instance to add to the plan
Raises:
InvalidPlanError: if the rules are mixed (pipeline vs. graph - ie. mixing use of named inputs/outputs and not using them)
"""
assert isinstance(rule, BaseRule)
self._check_plan_mode(rule)
self.rules.append(rule)
def __iter__(self):
yield from self.rules
def get_rule(self, idx: int) -> BaseRule:
""" Return the rule at a certain index as per order of addition to the plan. """
return self.rules[idx]
def is_empty(self) -> bool:
""" Return True if the plan has no rules, False otherwise.
Returns:
A boolean to indicate if the plan is empty.
"""
return not self.rules
def to_dict(self) -> dict:
""" Serialize the plan to a dict.
Returns:
A dictionary with the plan representation.
"""
rules = [rule.to_dict() for rule in self.rules]
return {
"name": self.name,
"description": self.description,
"context": self.context,
"strict": self.strict,
"rules": rules
}
@classmethod
def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
""" Creates a plan instance from a python dictionary.
Args:
dct: A dictionary to create the plan from
backend: One of the supported backends (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a Plan.
"""
instance = Plan(
name=dct.get("name"),
description=dct.get("description"),
context=dct.get("context"),
strict=dct.get("strict"),
)
rules = dct.get("rules", ())
for rule in rules:
instance.add_rule(BaseRule.from_dict(rule, backend, additional_packages))
return instance
def to_yaml(self) -> str:
""" Serialize the plan to yaml. """
return yaml.safe_dump(self.to_dict())
@classmethod
def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
""" Creates a plan from a yaml definition.
Args:
yml: The yaml string to create the plan from
backend: A supported backend (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a Plan.
"""
dct = yaml.safe_load(yml)
return cls.from_dict(dct, backend, additional_packages)
def __eq__(self, other: 'Plan') -> bool:
return (
type(self) == type(other) and
self.name == other.name and self.description == other.description and
self.strict == other.strict and self.rules == other.rules
)
add_rule(self, rule)
¶
Add a new rule to the plan.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
rule |
BaseRule |
A rule instance to add to the plan |
required |
Exceptions:
Type | Description |
---|---|
InvalidPlanError |
if the rules are mixed (pipeline vs. graph - ie. mixing use of named inputs/outputs and not using them) |
Source code in etlrules/plan.py
def add_rule(self, rule: BaseRule) -> None:
""" Add a new rule to the plan.
Args:
rule: A rule instance to add to the plan
Raises:
InvalidPlanError: if the rules are mixed (pipeline vs. graph - ie. mixing use of named inputs/outputs and not using them)
"""
assert isinstance(rule, BaseRule)
self._check_plan_mode(rule)
self.rules.append(rule)
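A hedged sketch of the mixing error described above, assuming SortRule and ProjectRule are importable from the pandas backend as in the class examples:
from etlrules.backends.pandas import ProjectRule, SortRule
from etlrules.plan import Plan

plan = Plan()
plan.add_rule(SortRule(['A']))  # pipeline-style: no named inputs/outputs
# Adding a graph-style rule (named inputs/outputs) to the same plan
# should now raise InvalidPlanError:
plan.add_rule(ProjectRule(['A', 'B'], named_input="input", named_output="projected"))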
from_dict(dct, backend, additional_packages=None)
classmethod
¶
Creates a plan instance from a python dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dct |
dict |
A dictionary to create the plan from |
required |
backend |
str |
One of the supported backends (ie pandas) |
required |
additional_packages |
Optional[Sequence[str]] |
Optional list of other packages to look for rules in |
None |
Returns:
Type | Description |
---|---|
Plan |
A new instance of a Plan. |
Source code in etlrules/plan.py
@classmethod
def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
""" Creates a plan instance from a python dictionary.
Args:
dct: A dictionary to create the plan from
backend: One of the supported backends (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a Plan.
"""
instance = Plan(
name=dct.get("name"),
description=dct.get("description"),
context=dct.get("context"),
strict=dct.get("strict"),
)
rules = dct.get("rules", ())
for rule in rules:
instance.add_rule(BaseRule.from_dict(rule, backend, additional_packages))
return instance
from_yaml(yml, backend, additional_packages=None)
classmethod
¶
Creates a plan from a yaml definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
yml |
str |
The yaml string to create the plan from |
required |
backend |
str |
A supported backend (ie pandas) |
required |
additional_packages |
Optional[Sequence[str]] |
Optional list of other packages to look for rules in |
None |
Returns:
Type | Description |
---|---|
Plan |
A new instance of a Plan. |
Source code in etlrules/plan.py
@classmethod
def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'Plan':
""" Creates a plan from a yaml definition.
Args:
yml: The yaml string to create the plan from
backend: A supported backend (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a Plan.
"""
dct = yaml.safe_load(yml)
return cls.from_dict(dct, backend, additional_packages)
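A small round-trip sketch under the same assumptions as above (rule classes from the pandas backend); note that equality compares the name, description, strict flag and rules:
from etlrules.backends.pandas import SortRule
from etlrules.plan import Plan

plan = Plan(name="example")
plan.add_rule(SortRule(['A']))
yml = plan.to_yaml()                              # serialize to a yaml string
restored = Plan.from_yaml(yml, backend="pandas")  # rebuild for a given backend
assert restored == plan                           # __eq__ ignores the context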
get_mode(self)
¶
Return the mode (pipeline or graph) of the plan.
Source code in etlrules/plan.py
def get_mode(self) -> Optional[Literal['pipeline', 'graph']]:
""" Return the mode (pipeline or graph) of the plan. """
if self.mode is None:
self.mode = plan_mode_from_rules(self.rules)
return self.mode
get_rule(self, idx)
¶
Return the rule at a certain index as per order of addition to the plan.
Source code in etlrules/plan.py
def get_rule(self, idx: int) -> BaseRule:
""" Return the rule at a certain index as per order of addition to the plan. """
return self.rules[idx]
is_empty(self)
¶
Return True if the plan has no rules, False otherwise.
Returns:
Type | Description |
---|---|
bool |
A boolean to indicate if the plan is empty. |
Source code in etlrules/plan.py
def is_empty(self) -> bool:
""" Return True if the plan has no rules, False otherwise.
Returns:
A boolean to indicate if the plan is empty.
"""
return not self.rules
to_dict(self)
¶
Serialize the plan to a dict.
Returns:
Type | Description |
---|---|
dict |
A dictionary with the plan representation. |
Source code in etlrules/plan.py
def to_dict(self) -> dict:
""" Serialize the plan to a dict.
Returns:
A dictionary with the plan representation.
"""
rules = [rule.to_dict() for rule in self.rules]
return {
"name": self.name,
"description": self.description,
"context": self.context,
"strict": self.strict,
"rules": rules
}
to_yaml(self)
¶
Serialize the plan to yaml.
Source code in etlrules/plan.py
def to_yaml(self) -> str:
""" Serialize the plan to yaml. """
return yaml.safe_dump(self.to_dict())
rule
¶
BaseRule
¶
The base class for all rules.
Derive your custom rules from BaseRule in order to use them in a plan. Implement the following methods as needed:
apply: mandatory, it implements the functionality of the rule
has_input: defaults to True, override and return False if your rule reads data into the plan and therefore has no other dataframe input
has_output: defaults to True, override and return False if your rule writes data to a persistent repository and therefore has no dataframe output
get_all_named_inputs: override to return the named inputs (if any) as strings
get_all_named_outputs: override in case of multiple named outputs and return them as strings
Parameters:
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional. When not set, the result of this rule will be available as the main output. When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional. Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional. Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Note
Add any class data members to the following list/tuples if needed:
EXCLUDE_FROM_COMPARE: Used in implementing equality between rules. Equality is mostly used in tests. By default, equality looks at all data members in the class' __dict__. You can exclude calculated or transient data members which should be excluded from equality. Alternatively, you can implement __eq__ in your own class and not rely on the __eq__ implementation in the base class.
EXCLUDE_FROM_SERIALIZE: Used to exclude data members from the serialization to dict and yaml. The serialization is implemented generically in the base class to serialize all data members in the class' __dict__ which do not start with an underscore. See the note on serialization below.
Note
When implementing serialization, the arguments into your class should be saved as they are in data members with the same name as the arguments. This is because the de-serialization passes those as args into the __init__. As such, make sure to use the same names and to exclude data members which are not in the __init__ from serialization by adding them to EXCLUDE_FROM_SERIALIZE.
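A hypothetical rule illustrating these constraints; MultiplyRule and its arguments are made up for the example. The attribute names match the __init__ arguments, and a transient member is excluded from serialization:
from typing import Optional
from etlrules.rule import BaseRule

class MultiplyRule(BaseRule):
    EXCLUDE_FROM_SERIALIZE = ("last_run_count",)  # transient, not an __init__ argument

    def __init__(self, column: str, factor: float, named_output: Optional[str]=None,
                 name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
        super().__init__(named_output=named_output, name=name, description=description, strict=strict)
        self.column = column         # saved under the same name as the __init__ argument
        self.factor = factor         # saved under the same name as the __init__ argument
        self.last_run_count = None   # transient: excluded from serialization above

    def apply(self, data):
        super().apply(data)          # asserts data is a RuleData instance
        ...                          # the backend-specific transformation would go here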
Source code in etlrules/rule.py
class BaseRule:
""" The base class for all rules.
Derive your custom rules from BaseRule in order to use them in a plan.
Implement the following methods as needed:
apply: mandatory, it implements the functionality of the rule
has_input: defaults to True, override and return False if your rule reads data
into the plan and therefore has no other dataframe input
has_output: defaults to True, override and return False if your rule writes data
to a persistent repository and therefore has no dataframe output
get_all_named_inputs: override to return the named inputs (if any) as strings
get_all_named_outputs: override in case of multiple named outputs and return them as strings
named_output (Optional[str]): Give the output of this rule a name so it can be used by another rule as a named input. Optional.
When not set, the result of this rule will be available as the main output.
When set to a name (string), the result will be available as that named output.
name (Optional[str]): Give the rule a name. Optional.
Named rules are more descriptive as to what they're trying to do/the intent.
description (Optional[str]): Describe in detail what the rule does, how it does it. Optional.
Together with the name, the description acts as the documentation of the rule.
strict (bool): When set to True, the rule does a stricter validation. Default: True
Note:
Add any class data members to the following list/tuples if needed:
EXCLUDE_FROM_COMPARE: Used in implementing equality between rules. Equality is
mostly used in tests. By default, equality looks at all data members in the
class' __dict__. You can exclude calculated or transient data members which
should be excluded from equality. Alternatively, you can implement __eq__ in
your own class and not rely on the __eq__ implementation in the base class.
EXCLUDE_FROM_SERIALIZE: Used to exclude data members from the serialization to
dict and yaml. The serialization is implemented generically in the base class
to serialize all data members in the class' __dict__ which do not start with
an underscore. See the note on serialization below.
Note:
When implementing serialization, the arguments into your class should be saved as
they are in data members with the same name as the arguments. This is because the
de-serialization passes those as args into the __init__. As such, make sure to use
the same names and to exclude data members which are not in the __init__ from
serialization by adding them to EXCLUDE_FROM_SERIALIZE.
"""
EXCLUDE_FROM_COMPARE = ()
EXCLUDE_FROM_SERIALIZE = ()
def __init__(self, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
assert named_output is None or isinstance(named_output, str) and named_output
self.named_output = named_output
self.name = name
self.description = description
self.strict = strict
def get_name(self) -> Optional[str]:
""" Returns the name of the rule.
The name is optional and it can be None.
The name of the rule should indicate what the rule does and not how it's
implemented. The names should read like documentation. As such, names like
Remove duplicate first names from the list of addresses
Only keep the names starting with A
are preferable names to:
DedupeRule
ProjectRule
Names are not used internally for anything other than your own (and your
end users') documentation, so use what makes sense.
"""
return self.name
def get_description(self) -> Optional[str]:
""" A long description of what the rule does, why and optionally how it does it.
The description is optional and it can be None.
Similar to name, this long description acts as documentation for you and your users.
It's particularly useful if your rule is serialized in a readable format like yaml
and your users either do not have access to the documentation or they are not technical.
Unlike the name, which should generally be a single line headline, the description is a
long, multi-line description of the rule: the what, why, how of the rule.
"""
return self.description
def has_input(self) -> bool:
""" Returns True if the rule needs a dataframe input to operate on, False otherwise.
By default, it returns True. It should be overridden to return False for those
rules which read data into the plan. For example, reading a csv file or reading a
table from the DB. These are operations which do not need an input dataframe to
operate on as they are sourcing data.
"""
return True
def has_output(self) -> bool:
""" Returns True if the rule produces a dataframe, False otherwise.
By default, it returns True. It should be overridden to return False for those
rules which write data out of the plan. For example, writing a file or data into a
database. These are operations which do not produce an output dataframe into
the plan as they are writing data outside the plan.
"""
return True
def has_named_output(self) -> bool:
return bool(self.named_output)
def get_all_named_inputs(self) -> Generator[str, None, None]:
""" Yields all the named inputs of this rule (as strings).
By default, it yields nothing as this base rule doesn't store
information about inputs. Some rules take no input, some take
one or more inputs. Yield accordingly when you override.
"""
yield from ()
def get_all_named_outputs(self) -> Generator[str, None, None]:
""" Yields all the named outputs of this rule (as strings).
By default, it yields the single named_output passed into this
rule as an argument. Some rules produce no output, some produce
multiple outputs. Yield accordingly when you override.
"""
yield self.named_output
def _set_output_df(self, data, df):
if self.named_output is None:
data.set_main_output(df)
else:
data.set_named_output(self.named_output, df)
def apply(self, data: RuleData) -> None:
""" Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data.
The input data is an instance of RuleData which can store a single, unnamed
dataframe (in pipeline mode) or one or many named dataframes (in graph mode).
The rule extracts the data it needs from the data, applies its main logic
and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that
the data passed in is an instance of RuleData. Override this when you derive
from BaseRule and implement the logic of your rule.
"""
assert isinstance(data, RuleData)
def to_dict(self) -> dict:
""" Serializes this rule to a python dictionary.
This is a generic implementation that should work for all derived
classes and therefore you shouldn't need to override, although you can do so.
Because it aims to be generic and work correctly for all the derived classes,
a few assumptions are made and must be respected when you implement your own
rules derived from BaseRule.
The class will serialize all the data attributes of a class which do not start with
underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member
of the class. As such, to exclude any of your internal data attributes, either name
them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.
The serializer will look into a class's __dict__ and therefore the class must have a
__dict__.
For the de-serialization to work generically, the name of the attributes must match the
names of the arguments in the __init__. This is quite an important and restrictive
constraint which is needed to avoid forcing every rule to implement a serialize/deserialize.
Note:
Use the same name for attributes on self as the respective arguments in __init__.
"""
dct = {
"name": self.name,
"description": self.description,
}
dct.update({
attr: value for attr, value in self.__dict__.items()
if not attr.startswith("_") and attr not in self.EXCLUDE_FROM_SERIALIZE
and attr not in dct.keys()
})
return {
self.__class__.__name__: dct
}
@classmethod
def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
""" Creates a rule instance from a python dictionary.
Args:
dct: A dictionary to create the rule from
backend: One of the supported backends (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a rule.
"""
assert backend and isinstance(backend, str)
keys = tuple(dct.keys())
assert len(keys) == 1
rule_name = keys[0]
backend_pkgs = [f'etlrules.backends.{backend}']
for additional_package in additional_packages or ():
backend_pkgs.append(additional_package)
modules = [importlib.import_module(backend_pkg, '') for backend_pkg in backend_pkgs]
for mod in modules:
clss = getattr(mod, rule_name, None)
if clss is not None:
break
assert clss, f"Cannot find class {rule_name} in packages: {backend_pkgs}"
if clss is not cls:
return clss.from_dict(dct, backend, additional_packages)
return clss(**dct[rule_name])
def to_yaml(self):
""" Serialize the rule to yaml. """
return yaml.safe_dump(self.to_dict())
@classmethod
def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
""" Creates a rule instance from a yaml definition.
Args:
yml: The yaml string to create the rule from
backend: A supported backend (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a rule.
"""
dct = yaml.safe_load(yml)
return cls.from_dict(dct, backend, additional_packages)
def __eq__(self, other) -> bool:
return (
type(self) == type(other) and
{k: v for k, v in self.__dict__.items() if k not in self.EXCLUDE_FROM_COMPARE} ==
{k: v for k, v in other.__dict__.items() if k not in self.EXCLUDE_FROM_COMPARE}
)
apply(self, data)
¶
Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data. The input data is an instance of RuleData which can store a single, unnamed dataframe (in pipeline mode) or one or many named dataframes (in graph mode). The rule extracts the data it needs from the data, applies its main logic and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that the data passed in is an instance of RuleData. Override this when you derive from BaseRule and implement the logic of your rule.
Source code in etlrules/rule.py
def apply(self, data: RuleData) -> None:
""" Applies the main rule logic to the input data.
This is the main method that applies the rule logic to the input data.
The input data is an instance of RuleData which can store a single, unnamed
dataframe (in pipeline mode) or one or many named dataframes (in graph mode).
The rule extracts the data it needs from the data, applies its main logic
and updates the same instance of RuleData with the output, if any.
This method doesn't do anything in the base class other than asserting that
the data passed in is an instance of RuleData. Override this when you derive
from BaseRule and implement the logic of your rule.
"""
assert isinstance(data, RuleData)
from_dict(dct, backend, additional_packages=None)
classmethod
¶
Creates a rule instance from a python dictionary.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dct |
dict |
A dictionary to create the rule from |
required |
backend |
str |
One of the supported backends (ie pandas) |
required |
additional_packages |
Optional[Sequence[str]] |
Optional list of other packages to look for rules in |
None |
Returns:
Type | Description |
---|---|
BaseRule |
A new instance of a rule. |
Source code in etlrules/rule.py
@classmethod
def from_dict(cls, dct: dict, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
""" Creates a rule instance from a python dictionary.
Args:
dct: A dictionary to create the rule from
backend: One of the supported backends (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a rule.
"""
assert backend and isinstance(backend, str)
keys = tuple(dct.keys())
assert len(keys) == 1
rule_name = keys[0]
backend_pkgs = [f'etlrules.backends.{backend}']
for additional_package in additional_packages or ():
backend_pkgs.append(additional_package)
modules = [importlib.import_module(backend_pkg, '') for backend_pkg in backend_pkgs]
for mod in modules:
clss = getattr(mod, rule_name, None)
if clss is not None:
break
assert clss, f"Cannot find class {rule_name} in packages: {backend_pkgs}"
if clss is not cls:
return clss.from_dict(dct, backend, additional_packages)
return clss(**dct[rule_name])
from_yaml(yml, backend, additional_packages=None)
classmethod
¶
Creates a rule instance from a yaml definition.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
yml |
str |
The yaml string to create the rule from |
required |
backend |
str |
A supported backend (ie pandas) |
required |
additional_packages |
Optional[Sequence[str]] |
Optional list of other packages to look for rules in |
None |
Returns:
Type | Description |
---|---|
BaseRule |
A new instance of a rule. |
Source code in etlrules/rule.py
@classmethod
def from_yaml(cls, yml: str, backend: str, additional_packages: Optional[Sequence[str]]=None) -> 'BaseRule':
""" Creates a rule instance from a yaml definition.
Args:
yml: The yaml string to create the rule from
backend: A supported backend (ie pandas)
additional_packages: Optional list of other packages to look for rules in
Returns:
A new instance of a rule.
"""
dct = yaml.safe_load(yml)
return cls.from_dict(dct, backend, additional_packages)
get_all_named_inputs(self)
¶
Yields all the named inputs of this rule (as strings).
By default, it yields nothing as this base rule doesn't store information about inputs. Some rules take no input, some take one or more inputs. Yield accordingly when you override.
Source code in etlrules/rule.py
def get_all_named_inputs(self) -> Generator[str, None, None]:
""" Yields all the named inputs of this rule (as strings).
By default, it yields nothing as this base rule doesn't store
information about inputs. Some rules take no input, some take
one or more inputs. Yield accordingly when you override.
"""
yield from ()
get_all_named_outputs(self)
¶
Yields all the named outputs of this rule (as strings).
By default, it yields the single named_output passed into this rule as an argument. Some rules produce no output, some produce multiple outputs. Yield accordingly when you override.
Source code in etlrules/rule.py
def get_all_named_outputs(self) -> Generator[str, None, None]:
""" Yields all the named outputs of this rule (as strings).
By default, it yields the single named_output passed into this
rule as an argument. Some rules produce no output, some produce
multiple outputs. Yield accordingly when you override.
"""
yield self.named_output
get_description(self)
¶
A long description of what the rule does, why and optionally how it does it.
The description is optional and it can be None.
Similar to name, this long description acts as documentation for you and your users. It's particularly useful if your rule is serialized in a readable format like yaml and your users either do not have access to the documentation or they are not technical.
Unlike the name, which should generally be a single line headline, the description is a long, multi-line description of the rule: the what, why, how of the rule.
Source code in etlrules/rule.py
def get_description(self) -> Optional[str]:
""" A long description of what the rule does, why and optionally how it does it.
The description is optional and it can be None.
Similar to name, this long description acts as documentation for you and your users.
It's particularly useful if your rule is serialized in a readable format like yaml
and your users either do not have access to the documentation or they are not technical.
Unlike the name, which should generally be a single line headline, the description is a
long, multi-line description of the rule: the what, why, how of the rule.
"""
return self.description
get_name(self)
¶
Returns the name of the rule.
The name is optional and it can be None.
The name of the rule should indicate what the rule does and not how it's implemented. The names should read like documentation. As such, names like
"Remove duplicate first names from the list of addresses"
"Only keep the names starting with A"
are preferable names to:
DedupeRule
ProjectRule
Names are not used internally for anything other than your own (and your end users') documentation, so use what makes sense.
Source code in etlrules/rule.py
def get_name(self) -> Optional[str]:
""" Returns the name of the rule.
The name is optional and it can be None.
The name of the rule should indicate what the rule does and not how it's
implemented. The names should read like documentation. As such, names like
Remove duplicate first names from the list of addresses
Only keep the names starting with A
are preferable names to:
DedupeRule
ProjectRule
Names are not used internally for anything other than your own (and your
end users') documentation, so use what makes sense.
"""
return self.name
has_input(self)
¶
Returns True if the rule needs a dataframe input to operate on, False otherwise.
By default, it returns True. It should be overridden to return False for those rules which read data into the plan. For example, reading a csv file or reading a table from the DB. These are operations which do not need an input dataframe to operate on as they are sourcing data.
Source code in etlrules/rule.py
def has_input(self) -> bool:
""" Returns True if the rule needs a dataframe input to operate on, False otherwise.
By default, it returns True. It should be overridden to return False for those
rules which read data into the plan. For example, reading a csv file or reading a
table from the DB. These are operations which do not need an input dataframe to
operate on as they are sourcing data.
"""
return True
has_output(self)
¶
Returns True if the rule produces a dataframe, False otherwise.
By default, it returns True. It should be overridden to return False for those rules which write data out of the plan. For example, writing a file or data into a database. These are operations which do not produce an output dataframe into the plan as they are writing data outside the plan.
Source code in etlrules/rule.py
def has_output(self) -> bool:
""" Returns True if the rule produces a dataframe, False otherwise.
By default, it returns True. It should be overridden to return False for those
rules which write data out of the plan. For example, writing a file or data into a
database. These are operations which do not produce an output dataframe into
the plan as they are writing data outside the plan.
"""
return True
to_dict(self)
¶
Serializes this rule to a python dictionary.
This is a generic implementation that should work for all derived classes and therefore you shouldn't need to override, although you can do so.
Because it aims to be generic and work correctly for all the derived classes, a few assumptions are made and must be respected when you implement your own rules derived from BaseRule.
The class will serialize all the data attributes of a class which do not start with an underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member of the class. As such, to exclude any of your internal data attributes, either name them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.
The serializer will look into a class's __dict__ and therefore the class must have a __dict__.
For the de-serialization to work generically, the names of the attributes must match the names of the arguments in the __init__. This is quite an important and restrictive constraint which is needed to avoid forcing every rule to implement its own serialize/deserialize.
Note
Use the same name for attributes on self as the respective arguments in __init__.
Source code in etlrules/rule.py
def to_dict(self) -> dict:
""" Serializes this rule to a python dictionary.
This is a generic implementation that should work for all derived
classes and therefore you shouldn't need to override, although you can do so.
Because it aims to be generic and work correctly for all the derived classes,
a few assumptions are made and must be respected when you implement your own
rules derived from BaseRule.
The class will serialize all the data attributes of a class which do not start with
underscore and are not explicitly listed in the EXCLUDE_FROM_SERIALIZE static member
of the class. As such, to exclude any of your internal data attributes, either name
them so they start with an underscore or add them explicitly to EXCLUDE_FROM_SERIALIZE.
The serializer will look into a class's __dict__ and therefore the class must have a
__dict__.
For the de-serialization to work generically, the name of the attributes must match the
names of the arguments in the __init__. This is quite an important and restrictive
constraint which is needed to avoid forcing every rule to implement a serialize/deserialize.
Note:
Use the same name for attributes on self as the respective arguments in __init__.
"""
dct = {
"name": self.name,
"description": self.description,
}
dct.update({
attr: value for attr, value in self.__dict__.items()
if not attr.startswith("_") and attr not in self.EXCLUDE_FROM_SERIALIZE
and attr not in dct.keys()
})
return {
self.__class__.__name__: dct
}
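Continuing the hypothetical MultiplyRule sketch from earlier, the serialized shape would presumably look like this (name and description are emitted first, then the remaining non-underscore attributes):
rule = MultiplyRule("price", 1.2, name="Apply VAT")
rule.to_dict()
# {'MultiplyRule': {'name': 'Apply VAT', 'description': None,
#                   'named_output': None, 'strict': True,
#                   'column': 'price', 'factor': 1.2}}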
to_yaml(self)
¶
Serialize the rule to yaml.
Source code in etlrules/rule.py
def to_yaml(self):
""" Serialize the rule to yaml. """
return yaml.safe_dump(self.to_dict())
BinaryOpBaseRule (BaseRule)
¶
Base class for binary operation rules (ie operations taking two data frames as input).
Source code in etlrules/rule.py
class BinaryOpBaseRule(BaseRule):
""" Base class for binary operation rules (ie operations taking two data frames as input). """
def __init__(self, named_input_left: Optional[str], named_input_right: Optional[str], named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_output=named_output, name=name, description=description, strict=strict)
assert named_input_left is None or isinstance(named_input_left, str) and named_input_left
assert named_input_right is None or isinstance(named_input_right, str) and named_input_right
self.named_input_left = named_input_left
self.named_input_right = named_input_right
def _get_input_df_left(self, data: RuleData):
if self.named_input_left is None:
return data.get_main_output()
return data.get_named_output(self.named_input_left)
def _get_input_df_right(self, data: RuleData):
if self.named_input_right is None:
return data.get_main_output()
return data.get_named_output(self.named_input_right)
def get_all_named_inputs(self):
yield self.named_input_left
yield self.named_input_right
get_all_named_inputs(self)
¶
Yields all the named inputs of this rule (as strings).
This override yields the left and right named inputs of the binary rule, in that order.
Source code in etlrules/rule.py
def get_all_named_inputs(self):
yield self.named_input_left
yield self.named_input_right
UnaryOpBaseRule (BaseRule)
¶
Base class for unary operation rules (ie operations taking a single data frame as input).
Source code in etlrules/rule.py
class UnaryOpBaseRule(BaseRule):
""" Base class for unary operation rules (ie operations taking a single data frame as input). """
def __init__(self, named_input: Optional[str]=None, named_output: Optional[str]=None, name: Optional[str]=None, description: Optional[str]=None, strict: bool=True):
super().__init__(named_output=named_output, name=name, description=description, strict=strict)
assert named_input is None or isinstance(named_input, str) and named_input
self.named_input = named_input
def _get_input_df(self, data: RuleData):
if self.named_input is None:
return data.get_main_output()
return data.get_named_output(self.named_input)
def get_all_named_inputs(self):
yield self.named_input
get_all_named_inputs(self)
¶
Yields all the named inputs of this rule (as strings).
This override yields the single named input of the unary rule.
Source code in etlrules/rule.py
def get_all_named_inputs(self):
yield self.named_input
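A minimal sketch of a concrete unary rule, assuming the pandas backend; the rule name and its logic are hypothetical, and DataFrame.assign is the pandas API used:
from etlrules.rule import UnaryOpBaseRule

class AddConstantColumnRule(UnaryOpBaseRule):
    def __init__(self, column, value, named_input=None, named_output=None,
                 name=None, description=None, strict=True):
        super().__init__(named_input=named_input, named_output=named_output,
                         name=name, description=description, strict=strict)
        self.column = column
        self.value = value

    def apply(self, data):
        super().apply(data)
        df = self._get_input_df(data)                 # main or named input
        df = df.assign(**{self.column: self.value})   # pandas: add a constant column
        self._set_output_df(data, df)                 # main or named output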
runner
¶
load_plan(plan_file, backend)
¶
Load a plan from a yaml file.
Basic usage:
from etlrules import load_plan
plan = load_plan("/home/someuser/some_plan.yml", "pandas")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
plan_file |
str |
A path to a yaml file with the plan definition |
required |
backend |
str |
One of the supported backends (e.g. pandas, polars, etc.) |
required |
Returns:
Type | Description |
---|---|
Plan |
A Plan instance deserialized from the yaml file. |
Source code in etlrules/runner.py
def load_plan(plan_file: str, backend: str) -> Plan:
""" Load a plan from a yaml file.
Basic usage:
from etlrules import load_plan
plan = load_plan("/home/someuser/some_plan.yml", "pandas")
Args:
plan_file: A path to a yaml file with the plan definition
backend: One of the supported backends (e.g. pandas, polars, etc.)
Returns:
A Plan instance deserialized from the yaml file.
"""
with open(plan_file, 'rt') as plan_f:
contents = plan_f.read()
return Plan.from_yaml(contents, backend)
run_plan(plan_file, backend)
¶
Runs a plan from a yaml file with a given backend.
The backend refers to the underlying dataframe library used to run the plan.
Basic usage:
from etlrules import run_plan
data = run_plan("/home/someuser/some_plan.yml", "pandas")
Parameters:
Name | Type | Description | Default |
---|---|---|---|
plan_file |
str |
A path to a yaml file with the plan definition |
required |
backend |
str |
One of the supported backends |
required |
Note
The supported backends: pandas, polars, dask (work in progress)
Returns:
Type | Description |
---|---|
RuleData |
A RuleData instance which contains the result dataframe(s). |
Source code in etlrules/runner.py
def run_plan(plan_file: str, backend: str) -> RuleData:
""" Runs a plan from a yaml file with a given backend.
The backend refers to the underlying dataframe library used to run
the plan.
Basic usage:
from etlrules import run_plan
data = run_plan("/home/someuser/some_plan.yml", "pandas")
Args:
plan_file: A path to a yaml file with the plan definition
backend: One of the supported backends
Note:
The supported backends:
pandas, polars, dask (work in progress)
Returns:
A RuleData instance which contains the result dataframe(s).
"""
plan = load_plan(plan_file, backend)
args = get_args_parser(plan)
context = {}
context.update(args)
etlrules_tempdir, etlrules_tempdir_cleanup = get_etlrules_temp_dir()
context.update({
"etlrules_tempdir": etlrules_tempdir,
"etlrules_tempdir_cleanup": etlrules_tempdir_cleanup,
})
try:
data = RuleData(context=context)
engine = RuleEngine(plan)
engine.run(data)
finally:
if etlrules_tempdir_cleanup:
shutil.rmtree(etlrules_tempdir)
return data
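A short usage sketch; retrieving results via get_main_output/get_named_output is an assumption based on the RuleData accessors used in the rule sources above:
from etlrules import run_plan

data = run_plan("/home/someuser/some_plan.yml", "pandas")
result = data.get_main_output()                   # pipeline-mode plans
# graph-mode plans expose their named outputs instead:
# result = data.get_named_output("renamed_data")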