Table of Contents
Using Gobblin as a Library
A Gobblin ingestion flow can be embedded into a java application using the EmbeddedGobblin
class.
The following code will run a Hello-World Gobblin job as an embedded application using a template. This will simply print "Hello World \<i>!" to stdout a few times.
EmbeddedGobblin embeddedGobblin = new EmbeddedGobblin("TestJob")
.setTemplate(ResourceBasedJobTemplate.forResourcePath("templates/hello-world.template"));
JobExecutionResult result = embeddedGobblin.run();
Note: EmbeddedGobblin
starts and destroys an embedded Gobblin instance every time run()
is called. If an application needs to run a large number of Gobblin jobs, it should instantiate and manage its own Gobblin driver.
Creating an Embedded Gobblin instance
The code snippet above creates an EmbeddedGobblin
instance. This instance can run arbitrary Gobblin ingestion jobs, and allows the use of templates. However, the user needs to configure the job by using the exact key needed for each feature.
An alternative is to use a subclass of EmbeddedGobblin
which provides methods to more easily configure the job. For example, an easier way to run a Gobblin distcp job is to use EmbeddedGobblinDistcp
:
EmbeddedGobblinDistcp distcp = new EmbeddedGobblinDistcp(sourcePath, targetPath).delete();
distcp.run();
This subclass automatically knows which template to use, the required configurations for the job (which are included as constructor parameters), and also provides convenience methods for the most common configurations (in the case above, the method delete()
instructs the job to delete files that exist in the target but not the source).
The following is a non-extensive list of available subclasses of EmbeddedGobblin
:
EmbeddedGobblinDistcp
: distributed copy between Hadoop compatible file systems.
EmbeddedWikipediaExample
: a getting started example that pulls page updated from Wikipedia.
Configuring Embedded Gobblin
EmbeddedGobblin
allows any configuration that a standalone Gobblin job would allow. EmbeddedGobblin
itself provides a few convenience methods to alter the behavior of the Gobblin framework. Other methods allow users to set a job template to use or set job level configurations.
Method | Parameters | Description |
---|---|---|
mrMode |
N/A | Gobblin should run on MR mode. |
setTemplate |
Template object to use | Use a job template. |
useStateStore |
State store directory | By default, embedded Gobblin is stateless and disables state store. This method enables the state store at the indicated location allowing using watermarks from previous jobs. |
distributeJar |
Path to jar in local fs | Indicates that a specific jar is needed by Gobblin workers when running in distributed mode (e.g. MR mode). Gobblin will automatically add this jar to the classpath of the workers. |
setConfiguration |
key - value pair | Sets a job level configuration. |
setJobTimeout |
timeout and time unit, or ISO period | Sets the timeout for the Gobblin job. run() will throw a TimeoutException if the job is not done after this period. (Default: 10 days) |
setLaunchTimeout |
timeout and time unit, or ISO period | Sets the timeout for launching Gobblin job. run() will throw a TimeoutException if the job has not started after this period. (Default: 10 seconds) |
setShutdownTimeout |
timeout and time unit, or ISO period | Sets the timeout for shutting down embedded Gobblin after the job has finished. run() will throw a TimeoutException if the method has not returned within the timeout after the job finishes. Note that a TimeoutException may indicate that Gobblin could not release JVM resources, including threads. |
Additional to the above, subclasses of EmbeddedGobblin
might offer their own convenience methods.
Running Embedded Gobblin
After EmbeddedGobblin
has been configured it can be run with one of two methods:
run()
: blocking call. Returns a JobExecutionResult
after the job finishes and Gobblin shuts down.
runAsync()
: asynchronous call. Returns a JobExecutionDriver
, which implements Future<JobExecutionResult>
.
Extending Embedded Gobblin
Developers can extend EmbeddedGobblin
to provide users with easier ways to launch a particular type of job. For an example see EmbeddedGobblinDistcp
.
Best practices:
Generally, a subclass of EmbeddedGobblin
is based on a template. The template should be automatically loaded on construction and the constructor should call setTemplate(myTemplate)
.
All required configurations for a job should be parsed from the constructor arguments. User should be able to run new MyEmbeddedGobblinExtension(params...).run()
and get a sensible job run.
* Convenience methods should be added for the most common configurations users would want to change. In general a convenience method will call a few other methods transparently to the user. For example:
public EmbeddedGobblinDistcp simulate() {
this.setConfiguration(CopySource.SIMULATE, Boolean.toString(true));
return this;
}
- If the job requires additional jars in the workers that are not part of the minimal Gobblin ingestion classpath (see
EmbeddedGobblin#getCoreGobblinJars
for this list), then the constructor should calldistributeJar(myJar)
for the additional jars.