Components Definition

Talend Component framework relies on several primitive components.

They can all use @PostConstruct and @PreDestroy to initialize/release some underlying resource at the beginning/end of the processing.

in distributed environments class' constructor will be called on cluster manager node, methods annotated with @PostConstruct and @PreDestroy annotations will be called on worker nodes. Thus, partition plan computation and pipeline task will be performed on different nodes.

deployment diagram

  1. Created task consists of Jar file, containing class, which describes pipeline(flow) which should be processed in cluster.

  2. During partition plan computation step pipeline is analyzed and split into stages. Cluster Manager node instantiates mappers/processors gets estimated data size using mappers, splits created mappers according to the estimated data size. All instances are serialized and sent to Worker nodes afterwards.

  3. Serialized instances are received and deserialized, methods annotated with @PostConstruct annotation are called. After that, pipeline execution is started. Processor’s @BeforeGroup annotated method is called before processing first element in chunk. After processing number of records estimated as chunk size, Processor’s @AfterGroup annotated method called. Chunk size is calculated depending on environment the pipeline is processed by. After pipeline is processed, methods annotated with @PreDestroy annotation are called.

driver processing workflow

worker processing workflow

all framework managed methods MUST be public too. Private methods are ignored.
in term of design the framework tries to be as declarative as possible but also to stay extensible not using fixed interfaces or method signatures. This will allow to add incrementally new features of the underlying implementations.

PartitionMapper

A PartitionMapper is a component able to split itself to make the execution more efficient.

This concept is borrowed to big data world and useful only in this context (BEAM executions). Overall idea is to divide the work before executing it to try to reduce the overall execution time.

The process is the following:

  1. Estimate the size of the data you will work on. This part is often heuristic and not very precise.

  2. From that size the execution engine (runner for beam) will request the mapper to split itself in N mappers with a subset of the overall work.

  3. The leaf (final) mappers will be used as a Producer (actual reader) factory.

this kind of component MUST be Serializable to be distributable.

Definition

A partition mapper requires 3 methods marked with specific annotations:

  1. @Assessor for the evaluating method

  2. @Split for the dividing method

  3. @Emitter for the Producer factory

@Assessor

The assessor method will return the estimated size of the data related to the component (depending its configuration). It MUST return a Number and MUST not take any parameter.

Here is an example:

@Assessor
public long estimateDataSetByteSize() {
    return ....;
}

@Split

The split method will return a collection of partition mappers and can take optionally a @PartitionSize long value which is the requested size of the dataset per sub partition mapper.

Here is an example:

@Split
public List<MyMapper> split(@PartitionSize final long desiredSize) {
    return ....;
}

@Emitter

The emitter method MUST not have any parameter and MUST return a producer. It generally uses the partition mapper configuration to instantiate/configure the producer.

Here is an example:

@Emitter
public MyProducer create() {
    return ....;
}

Producer

Producer is the component interacting with a physical source. It produces input data for the processing flow.

A producer is a very simple component which MUST have a @Producer method without any parameter and returning any data:

@Producer
public MyData produces() {
    return ...;
}

Processor

A Processor is a component responsible to convert an incoming data to another model.

A processor MUST have a method decorated with @ElementListener taking an incoming data and returning the processed data:

@ElementListener
public MyNewData map(final MyData data) {
    return ...;
}
this kind of component MUST be Serializable since it is distributed.
if you don’t care much of the type of the parameter and need to access data on a "map like" based rule set, then you can use JsonObject as parameter type and Talend Component will just wrap the data to enable you to access it as a map. The parameter type is not enforced, i.e. if you know you will get a SuperCustomDto then you can use that as parameter type but for generic component reusable in any chain it is more than highly encouraged to use JsonObject until you have your an evaluation language based processor (which has its own way to access component). Here is an example:
@ElementListener
public MyNewData map(final JsonObject incomingData) {
    String name = incomingData.getString("name");
    int name = incomingData.getInt("age");
    return ...;
}

// equivalent to (using POJO subclassing)

public class Person {
    private String age;
    private int age;

    // getters/setters
}

@ElementListener
public MyNewData map(final Person person) {
    String name = person.getName();
    int name = person.getAge();
    return ...;
}

A processor also supports @BeforeGroup and @AfterGroup which MUST be methods without parameters and returning void (result would be ignored). This is used by the runtime to mark a chunk of the data in a way which is estimated good for the execution flow size.

this is estimated so you don’t have any guarantee on the size of a group. You can literally have groups of size 1.

The common usage is to batch records for performance reasons:

@BeforeGroup
public void initBatch() {
    // ...
}

@AfterGroup
public void endBatch() {
    // ...
}
it is a good practise to support a maxBatchSize here and potentially commit before the end of the group in case of a computed size which is way too big for your backend.

Multiple outputs

In some case you may want to split the output of a processor in two. A common example is "main" and "reject" branches where part of the incoming data are put in a specific bucket to be processed later.

This can be done using @Output. This can be used as a replacement of the returned value:

@ElementListener
public void map(final MyData data, @Output final OutputEmitter<MyNewData> output) {
    output.emit(createNewData(data));
}

Or you can pass it a string which will represent the new branch:

@ElementListener
public void map(final MyData data,
                @Output final OutputEmitter<MyNewData> main,
                @Output("rejected") final OutputEmitter<MyNewDataWithError> rejected) {
    if (isRejected(data)) {
        rejected.emit(createNewData(data));
    } else {
        main.emit(createNewData(data));
    }
}

// or simply

@ElementListener
public MyNewData map(final MyData data,
                    @Output("rejected") final OutputEmitter<MyNewDataWithError> rejected) {
    if (isSuspicious(data)) {
        rejected.emit(createNewData(data));
        return createNewData(data); // in this case we continue the processing anyway but notified another channel
    }
    return createNewData(data);
}

Multiple inputs

Having multiple inputs is closeto the output case excep it doesn’t require a wrapper OutputEmitter:

@ElementListener
public MyNewData map(@Input final MyData data, @Input("input2") final MyData2 data2) {
    return createNewData(data1, data2);
}

@Input takes the input name as parameter, if not set it uses the main (default) input branch.

due to the work required to not use the default branch it is recommended to use it when possible and not name its branches depending on the component semantic.

Output

An Output is a Processor returning no data.

Conceptually an output is a listener of data. It perfectly matches the concept of processor. Being the last of the execution chain or returning no data will make your processor an output:

@ElementListener
public void store(final MyData data) {
    // ...
}

Combiners?

For now Talend Component doesn’t enable you to define a Combiner. It would be the symmetric part of the partition mapper and allow to aggregate results in a single one.

Scroll to top