Component definition

Talend Component Kit framework relies on several primitive components.

All components can use @PostConstruct and @PreDestroy annotations to initialize or release some underlying resource at the beginning and the end of a processing.

In distributed environments, class constructor are called on cluster manager node. Methods annotated with @PostConstruct and @PreDestroy are called on worker nodes. Thus, partition plan computation and pipeline tasks are performed on different nodes.

deployment diagram

  1. The created task is a JAR file containing class information, which describes the pipeline (flow) that should be processed in cluster.

  2. During the partition plan computation step, the pipeline is analyzed and split into stages. The cluster manager node instantiates mappers/processors, gets estimated data size using mappers, and splits created mappers according to the estimated data size.
    All instances are then serialized and sent to the worker node.

  3. Serialized instances are received and deserialized. Methods annotated with @PostConstruct are called. After that, pipeline execution starts. The @BeforeGroup annotated method of the processor is called before processing the first element in chunk.
    After processing the number of records estimated as chunk size, the @AfterGroup annotated method of the processor is called. Chunk size is calculated depending on the environment the pipeline is processed by. Once the pipeline is processed, methods annotated with @PreDestroy are called.

All the methods managed by the framework must be public. Private methods are ignored.

driver processing workflow

worker processing workflow

The framework is designed to be as declarative as possible but also to stay extensible by not using fixed interfaces or method signatures. This allows to incrementally add new features of the underlying implementations.

PartitionMapper

A PartitionMapper is a component able to split itself to make the execution more efficient.

This concept is borrowed from big data and useful in this context only (BEAM executions). The idea is to divide the work before executing it in order to reduce the overall execution time.

The process is the following:

  1. The size of the data you work on is estimated. This part can be heuristic and not very precise.

  2. From that size, the execution engine (runner for Beam) requests the mapper to split itself in N mappers with a subset of the overall work.

  3. The leaf (final) mapper is used as a Producer (actual reader) factory.

This kind of component must be Serializable to be distributable.

Definition

A partition mapper requires three methods marked with specific annotations:

  1. @Assessor for the evaluating method

  2. @Split for the dividing method

  3. @Emitter for the Producer factory

@Assessor

The Assessor method returns the estimated size of the data related to the component (depending its configuration). It must return a Number and must not take any parameter.

For example:

@Assessor
public long estimateDataSetByteSize() {
    return ....;
}

@Split

The Split method returns a collection of partition mappers and can take optionally a @PartitionSize long value as parameter, which is the requested size of the dataset per sub partition mapper.

For example:

@Split
public List<MyMapper> split(@PartitionSize final long desiredSize) {
    return ....;
}

@Emitter

The Emitter method must not have any parameter and must return a producer. It uses the partition mapper configuration to instantiate and configure the producer.

For example:

@Emitter
public MyProducer create() {
    return ....;
}

Producer

Producer is a component interacting with a physical source. It produces input data for the processing flow.

A producer is a simple component that must have a @Producer method without any parameter. It can return any data:

@Producer
public MyData produces() {
    return ...;
}

Processor

A Processor is a component that converts incoming data to a different model.

A processor must have a method decorated with @ElementListener taking an incoming data and returning the processed data:

@ElementListener
public MyNewData map(final MyData data) {
    return ...;
}

Processors must be Serializable because they are distributed components.

If you just need to access data on a map-based ruleset, you can use JsonObject as parameter type.
From there, Talend Component Kit wraps the data to allow you to access it as a map. The parameter type is not enforced.
This means that if you know you will get a SuperCustomDto, then you can use it as parameter type. But for generic components that are reusable in any chain, it is highly encouraged to use JsonObject until you have an evaluation language-based processor that has its own way to access components.

For example:

@ElementListener
public MyNewData map(final JsonObject incomingData) {
    String name = incomingData.getString("name");
    int name = incomingData.getInt("age");
    return ...;
}

// equivalent to (using POJO subclassing)

public class Person {
    private String age;
    private int age;

    // getters/setters
}

@ElementListener
public MyNewData map(final Person person) {
    String name = person.getName();
    int age = person.getAge();
    return ...;
}

A processor also supports @BeforeGroup and @AfterGroup methods, which must not have any parameter and return void values. Any other result would be ignored. These methods are used by the runtime to mark a chunk of the data in a way which is estimated good for the execution flow size.

Because the size is estimated, the size of a group can vary. It is even possible to have groups of size 1.

It is recommended to batch records, for performance reasons:

@BeforeGroup
public void initBatch() {
    // ...
}

@AfterGroup
public void endBatch() {
    // ...
}
It is a good practice to support a maxBatchSize here and to commit before the end of the group, in case of a computed size that is way too big for your backend to handle.

Multiple outputs

In some cases, you may need to split the output of a processor in two. A common example is to have "main" and "reject" branches where part of the incoming data are passed to a specific bucket to be processed later.

To do that, you can use @Output as replacement of the returned value:

@ElementListener
public void map(final MyData data, @Output final OutputEmitter<MyNewData> output) {
    output.emit(createNewData(data));
}

Alternatively, you can pass a string that represents the new branch:

@ElementListener
public void map(final MyData data,
                @Output final OutputEmitter<MyNewData> main,
                @Output("rejected") final OutputEmitter<MyNewDataWithError> rejected) {
    if (isRejected(data)) {
        rejected.emit(createNewData(data));
    } else {
        main.emit(createNewData(data));
    }
}

// or

@ElementListener
public MyNewData map(final MyData data,
                    @Output("rejected") final OutputEmitter<MyNewDataWithError> rejected) {
    if (isSuspicious(data)) {
        rejected.emit(createNewData(data));
        return createNewData(data); // in this case the processing continues but notifies another channel
    }
    return createNewData(data);
}

Multiple inputs

Having multiple inputs is similar to having multiple outputs, except that an OutputEmitter wrapper is not needed:

@ElementListener
public MyNewData map(@Input final MyData data, @Input("input2") final MyData2 data2) {
    return createNewData(data1, data2);
}

@Input takes the input name as parameter. If no name is set, it defaults to the "main (default)" input branch. It is recommended to use the default branch when possible and to avoid naming branches according to the component semantic.

Output

An Output is a Processor that does not return any data.

Conceptually, an output is a data listener. It matches the concept of processor. Being the last component of the execution chain or returning no data makes your processor an output component:

@ElementListener
public void store(final MyData data) {
    // ...
}

Combiners

Currently, Talend Component Kit does not allow you to define a Combiner. A combiner is the symmetric part of a partition mapper and allows to aggregate results in a single partition.

Family and component icons

Every component family and component needs to have a representative icon.
You can use one of the icons provided by the framework or you can use a custom icon.

For the component family the icon is defined in the package-info.java file. For the component itself, you need to declare it in the component class.

To use a custom icon, you need to have the icon file placed in the resources/icons folder of the project. The icon file needs to have a name following the convention IconName_icon32.png, where you can replace IconName by the name of your choice.

@Icon(value = Icon.IconType.CUSTOM, custom = "IconName")
Scroll to top