Job Configuration Basics
A job configuration file is a text file with the extension .pull or .job that defines the job properties, which can be loaded into a Java Properties object. Gobblin uses commons-configuration to allow variable substitutions in job configuration files. You can find some example Gobblin job configuration files here.
A job configuration file typically includes the following properties, in addition to any mandatory configuration properties required by the custom Gobblin Constructs classes. For a complete reference of all configuration properties supported by Gobblin, please refer to Configuration Properties Glossary.
- job.name: the job name.
- job.group: the group the job belongs to.
- source.class: the Source class the job uses.
- converter.classes: a comma-separated list of Converter classes to use in the job. This property is optional.
- Quality checker related configuration properties: a Gobblin job typically has both row-level and task-level quality checkers specified. Please refer to Quality Checker Properties for configuration properties related to quality checkers.
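As an illustration of the properties listed above, a minimal job configuration file might look like the following sketch. The property keys job.name, job.group, source.class, and converter.classes come from the list above; the file name, every value, the com.example class names, and the helper key example.package are hypothetical placeholders. The ${...} reference assumes the commons-configuration style of variable substitution mentioned earlier.

```properties
# example.pull (hypothetical file name; all values below are placeholders)
job.name=ExampleJob
job.group=ExampleGroup

# Hypothetical user-defined key, reused below through variable substitution
example.package=com.example.gobblin

# Hypothetical Source implementation
source.class=${example.package}.ExampleSource

# Optional: hypothetical Converter implementations, comma-separated
converter.classes=${example.package}.ExampleConverterA,${example.package}.ExampleConverterB
```

A real job file would also carry whatever mandatory properties the chosen Source, Converter, and quality checker classes require.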
Hierarchical Structure of Job Configuration Files
A Gobblin instance often runs many jobs and manages the job configuration files corresponding to those jobs. The jobs may belong to different job groups and target different data sources. It is also highly likely that jobs for the same data source share many common properties. It is therefore useful to support the following features:
- Job configuration files can be grouped by the job groups they belong to and put into different subdirectories under the root job configuration file directory.
- Common job properties shared among multiple jobs can be extracted into a common properties file that is applied to the job configurations of all these jobs.
Gobblin supports the above features using a hierarchical structure to organize job configuration files under the root job configuration file directory. The basic idea is that subdirectories can be nested arbitrarily deep under the root job configuration file directory. Each directory, regardless of how deep it is, can have a single .properties file storing common properties that will be included when loading the job configuration files in the same directory or in any of its subdirectories. Below is an example directory structure.
root_job_config_dir/
  common.properties
  foo/
    foo1.job
    foo2.job
    foo.properties
  bar/
    bar1.job
    bar2.job
    bar.properties
    baz/
      baz1.pull
      baz2.pull
      baz.properties
In this example, common.properties will be included when loading foo1.job, foo2.job, bar1.job, bar2.job, baz1.pull, and baz2.pull. foo.properties will be included when loading foo1.job and foo2.job; properties set there are considered more specific and will overwrite the same properties defined in common.properties. Similarly, bar.properties will be included when loading bar1.job and bar2.job, as well as baz1.pull and baz2.pull. baz.properties will be included when loading baz1.pull and baz2.pull and will overwrite the same properties defined in bar.properties and common.properties.
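To make the override rule concrete, the sketch below shows possible contents for three of the files in the example tree and the properties that would result when foo1.job is loaded. The file contents are invented for illustration: example.owner is a hypothetical key, and all values are placeholders.

```properties
# --- root_job_config_dir/common.properties ---
job.group=DefaultGroup
example.owner=data-team

# --- root_job_config_dir/foo/foo.properties ---
job.group=FooGroup

# --- root_job_config_dir/foo/foo1.job ---
job.name=Foo1

# --- Effective properties when foo1.job is loaded ---
# job.name=Foo1
# job.group=FooGroup        (foo.properties overwrites common.properties)
# example.owner=data-team   (inherited from common.properties)
```

Keeping shared defaults at the root and overriding them only where a group differs keeps the individual job files short.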
Password Encryption
To avoid storing passwords in configuration files in plain text, Gobblin supports encryption of the password configuration properties. All such properties can be encrypted (and decrypted) using a master password. The master password is stored in a file available at runtime. The file can be on a local file system or HDFS and has restricted access.
The URI of the master password file is controlled by the configuration option encrypt.key.loc. By default, Gobblin will use org.jasypt.util.text.BasicTextEncryptor. If you have installed the JCE Unlimited Strength Policy, you can set
encrypt.use.strong.encryptor=true
which will configure Gobblin to use org.jasypt.util.text.StrongTextEncryptor.
Encrypted passwords can be generated using the CLIPasswordEncryptor tool.
$ gradle :gobblin-utility:assemble
$ cd build/gobblin-utility/distributions/
$ tar -zxf gobblin-utility.tar.gz
$ bin/gobblin_password_encryptor.sh
usage:
 -f <master password file>   file that contains the master password used
                             to encrypt the plain password
 -h                          print this message
 -m <master password>        master password used to encrypt the plain
                             password
 -p <plain password>         plain password to be encrypted
 -s                          use strong encryptor
$ bin/gobblin_password_encryptor.sh -m Hello -p Bye
ENC(AQWoQ2Ybe8KXDXwPOA1Ziw==)
If you are extending Gobblin and want some of your configuration properties (e.g. the ones containing credentials) to support encryption, you can use one of the gobblin.password.PasswordManager.getInstance() methods to get an instance of PasswordManager. You can then call PasswordManager.readPassword(String), which transparently decrypts the value if needed, i.e. if it is in the form ENC(...) and a master password is provided.
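Putting the pieces together, a configuration that uses an encrypted password might look like the sketch below. The keys encrypt.key.loc and encrypt.use.strong.encryptor are the ones described above, and the ENC(...) value is the output of the command-line example. The file path and the property name example.conn.password are hypothetical, and the sketch assumes the master password file contains the same master password (Hello) that was used to produce the encrypted value.

```properties
# Hypothetical location of the master password file (local file system or HDFS)
encrypt.key.loc=/path/to/master/password/file

# Uncomment only if the value was encrypted with the -s option
# and the JCE Unlimited Strength Policy is installed
# encrypt.use.strong.encryptor=true

# Hypothetical password property holding the encrypted value
example.conn.password=ENC(AQWoQ2Ybe8KXDXwPOA1Ziw==)
```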
Adding or Changing Job Configuration Files
The Gobblin job scheduler in the standalone deployment monitors any changes to the job configuration file directory and reloads any new or updated job configuration files when detected. This allows adding new job configuration files or making changes to existing ones without bringing down the standalone instance. Currently, the following types of changes are monitored and supported:
- Adding a new job configuration file with a .job or .pull extension. The new job configuration file is loaded once it is detected. In the example hierarchical structure above, if a new job configuration file baz3.pull is added under bar/baz, it is loaded with properties included from common.properties, bar.properties, and baz.properties, in that order.
- Changing an existing job configuration file with a .job or .pull extension. The job configuration file is reloaded once the change is detected. In the example above, if a change is made to foo2.job, it is reloaded with properties included from common.properties and foo.properties, in that order.
- Changing an existing common properties file with a .properties extension. All job configuration files that include properties from the common properties file will be reloaded once the change is detected. In the example above, if bar.properties is updated, the job configuration files bar1.job, bar2.job, baz1.pull, and baz2.pull will be reloaded. Properties from bar.properties will be included when loading bar1.job and bar2.job. Properties from bar.properties and baz.properties will be included, in that order, when loading baz1.pull and baz2.pull.
Note that this job configuration file change monitoring mechanism uses the FileAlterationMonitor of Apache's commons-io with a custom FileAlterationListener. Regardless of how close together two adjacent file system checks are, there is still a chance that more than one file is changed between them. If more than one file, including at least one common properties file, is changed between two adjacent checks, the reloading of the affected job configuration files may be intermixed and applied in an undesirable order. This is because the order in which the listener is called on the changes is controlled by the monitor itself, not by Gobblin. The best practice when using this feature is therefore to avoid making multiple changes within a short period of time.
Scheduled Jobs
Gobblin ships with a job scheduler backed by a Quartz scheduler and supports Quartz's cron triggers. A job that is to be scheduled should have a cron schedule defined using the property job.schedule. Here is an example cron schedule that triggers every two minutes:
job.schedule=0 0/2 * * * ?
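For orientation, the sketch below annotates the example above with the Quartz cron field order and lists a few alternative schedules. The alternatives are illustrative choices, not schedules taken from Gobblin itself, and are commented out so that only one job.schedule value is active.

```properties
# Quartz cron fields, in order:
#   seconds  minutes  hours  day-of-month  month  day-of-week  [year]
# One of day-of-month and day-of-week must be "?".

# Every two minutes, starting at minute 0
job.schedule=0 0/2 * * * ?

# Alternative: every hour, on the hour
# job.schedule=0 0 * * * ?

# Alternative: every day at 02:30
# job.schedule=0 30 2 * * ?

# Alternative: every 15 minutes, Monday through Friday
# job.schedule=0 0/15 * ? * MON-FRI
```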
One Time Jobs
Some Gobblin jobs may only need to be run once. A job without a cron schedule in the job configuration is considered a run-once job and will not be scheduled but run immediately after being loaded. A job with a cron schedule but also the property job.runonce=true specified in the job configuration is also treated as a run-once job and will only be run the first time the cron schedule is triggered.
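The two run-once variants described above can be sketched as follows. The job names are hypothetical, and the two fragments belong in separate job configuration files.

```properties
# Variant 1: no job.schedule, so the job runs once, immediately after it is loaded
job.name=ExampleImmediateJob

# Variant 2 (separate file): scheduled, but runs only the first time the trigger fires
# job.name=ExampleDeferredJob
# job.schedule=0 0/2 * * * ?
# job.runonce=true
```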
Disabled Jobs
A Gobblin job can be disabled by setting the property job.disabled to true. A disabled job will be neither loaded nor scheduled to run.