What is a partition mapper
A PartitionMapper
is a component able to split itself to make the execution more efficient.
This concept is borrowed from big data and useful in this context only (BEAM
executions).
The idea is to divide the work before executing it in order to reduce the overall execution time.
The process is the following:
-
The size of the data you work on is estimated. This part can be heuristic and not very precise.
-
From that size, the execution engine (runner for Beam) requests the mapper to split itself in N mappers with a subset of the overall work.
-
The leaf (final) mapper is used as a
Producer
(actual reader) factory.
This kind of component must be Serializable to be distributable.
|
Implementing a partition mapper
A partition mapper requires three methods marked with specific annotations:
-
@Assessor
for the evaluating method -
@Split
for the dividing method -
@Emitter
for theProducer
factory
@Assessor
The Assessor method returns the estimated size of the data related to the component (depending its configuration).
It must return a Number
and must not take any parameter.
For example:
@Assessor
public long estimateDataSetByteSize() {
return ....;
}
@Split
The Split method returns a collection of partition mappers and can take optionally a @PartitionSize
long value as parameter, which is the requested size of the dataset per sub partition mapper.
For example:
@Split
public List<MyMapper> split(@PartitionSize final long desiredSize) {
return ....;
}