AWS Step Functions are hosted state-machines defined according to the Amazon States Language. To execute a Step function, you send it JSON data, which is given to an initial state to process; that state’s output is then passed to another state. States are processed until a success or failure state is reached.
How a state processes its input and selects the next state depends on its Type. For example, a Task state can use a Lambda function to process the input, and a Choice state can select which state to go to next based on its input.
Step functions are awesome because they:
Explicitly define the order of execution, including all conditional paths, in a simple-to-understand model.
Perform common tasks, like calling Lambda functions, removing a ton of boilerplate code.
Handle errors and retries in response to failure, increasing reliability without sacrificing understandability.
Here is a small example where a state-machine calls out to a Lambda function and makes a choice based on its output:
{
  "StartAt": "CallLambda",
  "States": {
    "CallLambda": {
      "Type": "Task",
      "Resource": "<lambda_arn>",
      "Next": "Worked?",
      "Retry": [{ "ErrorEquals": ["KnownError"] }],
      "Catch": [{
        "ErrorEquals": ["States.ALL"],
        "Next": "Failure"
      }]
    },
    "Worked?": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.Worked",
          "BooleanEquals": true,
          "Next": "Success"
        }
      ],
      "Default": "Failure"
    },
    "Success": {
      "Type": "Succeed"
    },
    "Failure": {
      "Type": "Fail"
    }
  }
}
This state-machine looks like (generated with step dot --states <state_machine>):
StartAt defines the initial state CallLambda, which executes the Lambda at <lambda_arn>. The Lambda’s output is then sent to Worked?, which goes to Success if its $.Worked attribute is true; otherwise it goes to Failure. If CallLambda returns a KnownError, it will Retry. For other errors it will go to Failure, as States.ALL is a catch-all for any error.
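To make that flow concrete, here is a minimal sketch of an interpreter for this one machine. It is not how AWS or the Step framework executes states; the `state`, `data`, `machine` and `run` names are hypothetical, the payload is reduced to a map of booleans, and the retry count of three attempts is an assumption for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errKnown stands in for the "KnownError" named in the Retry clause.
var errKnown = errors.New("KnownError")

// data is a hypothetical, simplified payload: a map of boolean attributes.
type data map[string]bool

// state is a simplified stand-in for an Amazon States Language state.
type state struct {
	kind        string                   // "Task", "Choice", "Succeed" or "Fail"
	task        func(data) (data, error) // Task: the work to run
	next        string                   // Task: next state; Choice: state when Worked is true
	maxAttempts int                      // Task: total attempts while errKnown is returned
	catchTo     string                   // Task: state for errors left after retrying
	fallback    string                   // Choice: the Default state
}

// machine builds the example above around the given task function.
func machine(task func(data) (data, error)) map[string]state {
	return map[string]state{
		"CallLambda": {kind: "Task", task: task, next: "Worked?", maxAttempts: 3, catchTo: "Failure"},
		"Worked?":    {kind: "Choice", next: "Success", fallback: "Failure"},
		"Success":    {kind: "Succeed"},
		"Failure":    {kind: "Fail"},
	}
}

// run walks the states from startAt and returns the names it visited.
func run(states map[string]state, startAt string, input data) []string {
	var path []string
	name, current := startAt, input
	for {
		path = append(path, name)
		s := states[name]
		switch s.kind {
		case "Task":
			var out data
			var err error
			for attempt := 0; attempt < s.maxAttempts; attempt++ {
				out, err = s.task(current)
				if !errors.Is(err, errKnown) {
					break // success, or an error that is not retried
				}
			}
			if err != nil {
				name = s.catchTo // the States.ALL catch
				continue
			}
			current, name = out, s.next
		case "Choice":
			if current["Worked"] {
				name = s.next
			} else {
				name = s.fallback
			}
		default: // Succeed, Fail, or an unknown state ends the run
			return path
		}
	}
}

func main() {
	ok := func(data) (data, error) { return data{"Worked": true}, nil }
	fmt.Println(run(machine(ok), "CallLambda", data{})) // [CallLambda Worked? Success]
}
```

The task function never names the next state; it only returns data or an error, and the surrounding loop decides where to go.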
Lambda code and Step functions are separated from one another in AWS and can be developed independently. This can make them difficult to test and validate, as a change in one can cause a bug in the other. To make it easier to develop and test Step functions and Lambdas together, we built the Step framework.
Here is an example of a state-machine using the Step framework:
func StateMachine() *machine.StateMachine {
	// The parse error is ignored here for brevity; real code should check it
	stateMachine, _ := machine.FromJSON([]byte(`{
		"StartAt": "CallLambda",
		"States": {
			"CallLambda": {
				"Type": "TaskFn",
				"Next": "Worked?",
				"Retry": [{ "ErrorEquals": ["KnownError"] }],
				"Catch": [{
					"ErrorEquals": ["States.ALL"],
					"Next": "Failure"
				}]
			},
			"Worked?": {
				"Type": "Choice",
				"Choices": [
					{
						"Variable": "$.Worked",
						"BooleanEquals": true,
						"Next": "Success"
					}
				],
				"Default": "Failure"
			},
			"Success": {
				"Type": "Succeed"
			},
			"Failure": {
				"Type": "Fail"
			}
		}
	}`))
	// Route the CallLambda state to its Go handler
	stateMachine.SetResourceFunction("CallLambda", LambdaHandler)
	return stateMachine
}
The type TaskFn is an extension of the spec to tell the Lambda which Task is calling it so it can route to the correct handler.
LambdaHandler is the function that is called when the Task state CallLambda is reached:
type Input struct{}

type Result struct {
	Worked bool
}

func LambdaHandler(_ context.Context, _ *Input) (Result, error) {
	return Result{true}, nil
}
Handlers contain the logic. The path is controlled by the state-machine. A state-machine can change the path based on a handler’s output, but a handler cannot decide which state to jump to.
With Step a state-machine can be executed by calling StateMachine().Execute("{}"). This sends {} as an input into the machine and returns:
The final output.
The “path” of the states that were visited.
Errors encountered by the process.
This is used by tests:
func Test_Machine(t *testing.T) {
	exec, err := StateMachine().Execute("{}")
	assert.NoError(t, err)
	assert.Equal(t, `{"Worked": true}`, exec.OutputJSON)
	assert.Equal(t, []string{
		"CallLambda",
		"Worked?",
		"Success",
	}, exec.Path())
}
Fuzz tests are also very useful to help build reliable state-machines. The gofuzz library will randomly generate input to make sure no unhandled errors are returned:
func Test_With_Fuzz(t *testing.T) {
	for i := 0; i < 50; i++ {
		var input Input
		fuzz.New().Fuzz(&input)
		_, err := StateMachine().Execute(input)
		if err != nil {
			assert.NotRegexp(t, "Panic", err.Error())
		}
		// Other assertions like final states
	}
}
The ultimate goal is to deploy the Step function and Lambda to AWS. For this we need an executable binary; let’s call it hello. Executed without any arguments, hello must start a Lambda with run.Lambda(StateMachine()). Executed as hello json, it should print the state-machine with run.JSON(StateMachine()).
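The argument handling for such a binary could be sketched as follows. To keep the sketch self-contained, the calls to run.Lambda and run.JSON are replaced with print statements, and the `mode` helper and usage message are hypothetical, not part of Step.

```go
package main

import (
	"fmt"
	"os"
)

// mode decides what the binary should do from its arguments
// (excluding the program name). It is a hypothetical helper.
func mode(args []string) string {
	if len(args) == 0 {
		return "lambda"
	}
	if args[0] == "json" {
		return "json"
	}
	return "unknown"
}

func main() {
	switch mode(os.Args[1:]) {
	case "lambda":
		// In the real binary this would be run.Lambda(StateMachine())
		fmt.Println("would start the Lambda handler")
	case "json":
		// In the real binary this would be run.JSON(StateMachine())
		fmt.Println("would print the state-machine JSON")
	default:
		fmt.Fprintln(os.Stderr, "usage: hello [json]")
		os.Exit(1)
	}
}
```

Keeping the decision in a small pure function makes the dispatch testable without starting a Lambda.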
The step binary can bootstrap (directly upload) hello to AWS. To install step:
go get github.com/coinbase/step
cd $GOPATH/src/github.com/coinbase/step
go build && go install
Then build and bootstrap hello:
# Build your code for the Lambda's Linux environment
GOOS=linux go build -o lambda
zip lambda.zip lambda
# export AWS creds using https://github.com/coinbase/assume-role
assume-role account user
# Use step to upload your code and state-machine to AWS
step bootstrap \
-lambda "hello-lambda" \
-step "hello-step-function" \
-states "$(hello json)"
Step does not create the Lambda, IAM, or Step function resources. These must be created first with a tool like terraform or geoengineer.
Here are a few good practices to follow using Step:
Handle All Errors: Every TaskFn should have a catch for States.ALL errors. This will ensure the state-machine ends in a proper state.
Fail Quickly: The faster a state-machine fails the less cleanup is needed. Fail if unknown JSON parameters are sent, if referenced resources don’t exist, or if other pre-conditions are not met.
Fuzz Input: As described above, using gofuzz can save you a lot of time, as it highlights errors caused by invalid input.
Comment: Use the Comment attribute on states. The ultimate goal is to be able to fully understand the state-machine without looking at the code.
Design defensively: Step functions should behave predictably, especially when failing. Alert if a Step function execution finishes in an unexpected state.
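As one illustration of the Fail Quickly practice, Go’s standard encoding/json decoder can reject unknown JSON parameters via DisallowUnknownFields. This is a sketch only; the `Input` fields and the `parseStrict` helper are hypothetical, and the post does not say how Step itself performs this check.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// Input is a hypothetical handler input with a single known field.
type Input struct {
	ProjectName string `json:"project_name"`
}

// parseStrict decodes raw JSON and fails if it contains
// any field that Input does not declare.
func parseStrict(raw string) (*Input, error) {
	dec := json.NewDecoder(strings.NewReader(raw))
	dec.DisallowUnknownFields()
	var in Input
	if err := dec.Decode(&in); err != nil {
		return nil, err
	}
	return &in, nil
}

func main() {
	// The misspelled key is rejected instead of being silently dropped.
	_, err := parseStrict(`{"project_name": "hello", "projct_name": "typo"}`)
	fmt.Println(err != nil) // true
}
```

Failing at parse time means a misspelled parameter stops the execution before any resource has been touched.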
Bridges are safer than swimming — San Francisco Golden Gate Bridge, photographed by Graham Jenson
While making deployers as Step functions, a set of conventions emerged which I am calling Bifrost. It is named after the mythical bridge because taking a bridge is easier (and safer) than swimming.
Deployers, at their core, productionize developed assets, for example by starting a server, pushing code to a Lambda, or uploading a new version of a package or container. Given that this is the step in the development process that shows your hard work to the world, it should be very reliable.
Bifrost helps to build reliable deployers. By grouping together common concepts, a deployer’s code can focus on its core functionality.
The core of all deployers is the bifrost.Release struct:
type Release struct {
AwsAccountID *string `json:"aws_account_id,omitempty"`
AwsRegion *string `json:"aws_region,omitempty"`
ReleaseSHA256 string `json:"-"`
UUID *string `json:"uuid,omitempty"`
ReleaseID *string `json:"release_id,omitempty"`
ProjectName *string `json:"project_name,omitempty"`
ConfigName *string `json:"config_name,omitempty"`
Bucket *string `json:"bucket,omitempty"`
CreatedAt *time.Time `json:"created_at,omitempty"`
Timeout *int `json:"timeout,omitempty"`
Error *ReleaseError `json:"error,omitempty"`
Success *bool `json:"success,omitempty"`
}
To extend the release:
type DeployerRelease struct {
bifrost.Release
... // The attributes for your release
}
This model stores the information needed to deploy, e.g. the list of services, paths to assets, and SHAs for validation. The release is:
The input and output for every state handler. This means each state has immediate access to all necessary information about the release.
Not secure. The state history log is persisted forever, so be careful with what you put in it.
Always validated. The Validate method on the release ensures everything is correct. This can be overridden, but should always call the original.
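The “override, but call the original” rule maps naturally onto Go struct embedding. The sketch below uses simplified stand-in types: the real Validate shown later in this post takes an S3 client, and the `Services` field and error messages here are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// Release is a simplified stand-in for bifrost.Release.
type Release struct {
	ProjectName string
}

// Validate checks the attributes every release must have.
func (r *Release) Validate() error {
	if r.ProjectName == "" {
		return errors.New("project_name must be set")
	}
	return nil
}

// DeployerRelease extends Release with a hypothetical attribute.
type DeployerRelease struct {
	Release
	Services []string
}

// Validate overrides the embedded method but calls the original first.
func (r *DeployerRelease) Validate() error {
	if err := r.Release.Validate(); err != nil {
		return err
	}
	if len(r.Services) == 0 {
		return errors.New("at least one service is required")
	}
	return nil
}

func main() {
	r := &DeployerRelease{Services: []string{"web"}}
	fmt.Println(r.Validate()) // project_name must be set
}
```

Calling the embedded method first keeps the base checks in force no matter what the extended release adds.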
How a deployer’s state-machine is organized will depend on the asset being deployed. However, the end state should always be one of:
"Success": { "Type": "Succeed" }: Deploy succeeded and everything is good.
"FailureClean": { "Type": "Fail" }: Failed to deploy, but successfully cleaned up so a retry can be attempted.
"FailureDirty": { "Type": "Fail" }: Something went really bad, and you should alert someone to have a look.
This means that a Step execution can fail in a Clean expected way, or a very bad and Dirty way. If a state-machine execution ends in a FailureDirty state (or any state not Success or FailureClean) then someone needs to be alerted.
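The alerting rule can be written down in a few lines. This is a sketch; `needsAlert` is a hypothetical helper, and how the final state name is obtained and where the alert is sent are left out.

```go
package main

import "fmt"

// needsAlert reports whether an execution that ended in finalState
// should page a human. Only Success and FailureClean are expected ends.
func needsAlert(finalState string) bool {
	switch finalState {
	case "Success", "FailureClean":
		return false
	default:
		// FailureDirty, or any state the machine should never end in
		return true
	}
}

func main() {
	for _, s := range []string{"Success", "FailureClean", "FailureDirty", "Deploy"} {
		fmt.Println(s, needsAlert(s))
	}
}
```

Listing the expected states and alerting on everything else means a newly added or misnamed end state alerts by default.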
The state-machine’s TaskFn handlers should be thin, with fat models. Handlers should be very obvious in their implementation and push the complexities to the Release model (which, as stated above, is the input and output of each handler). The goal is to make it easy to understand the impact a handler will have on a state-machine.
The first handler in your state-machine should look like this:
func Validate(awsc aws.Clients) DeployHandler {
	return func(ctx context.Context, release *models.Release) (*models.Release, error) {
		// Assign the release its SHA before anything alters it
		release.ReleaseSHA256 = to.SHA256Struct(release)

		// Extracts the region and account the Lambda is running in
		// This is used to set defaults for release attributes
		region, account := to.AwsRegionAccountFromContext(ctx)
		release.SetDefaults(region, account, "coinbase-odin-")

		if err := release.Validate(awsc.S3(nil, nil, nil)); err != nil {
			return nil, &errors.BadReleaseError{err.Error()}
		}

		return release, nil
	}
}
This function returns a handler that:
Calculates the input release’s SHA.
Sets the defaults of the release, including Region, Account and Bucket.
Validates the input release SHA against one uploaded to S3.
The reason this function returns a handler is so that the aws.Clients struct is persisted across calls. aws.Clients manages the AWS clients; passing nil for all three arguments, as in the S3 call above, creates an S3 client without assuming a role. This pattern is further described here.
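The value of returning a handler from a function can be shown with a small closure sketch. The `clients`, `fakeS3` and `handler` types are hypothetical stand-ins, not the aws.Clients API; the point is only that the captured struct survives between invocations, so the client is built once.

```go
package main

import "fmt"

// fakeS3 is a hypothetical stand-in for an S3 client.
type fakeS3 struct{}

// clients lazily creates and caches clients, counting creations.
type clients struct {
	s3      *fakeS3
	created int
}

// S3 returns the cached client, creating it on first use.
func (c *clients) S3() *fakeS3 {
	if c.s3 == nil {
		c.s3 = &fakeS3{}
		c.created++
	}
	return c.s3
}

// handler is a simplified handler signature for this sketch.
type handler func(input string) (string, error)

// validate returns a handler that closes over c, so every call
// of the handler reuses the same clients struct.
func validate(c *clients) handler {
	return func(input string) (string, error) {
		_ = c.S3() // reused after the first call
		return input, nil
	}
}

func main() {
	c := &clients{}
	h := validate(c)
	for i := 0; i < 3; i++ {
		h("{}")
	}
	fmt.Println(c.created) // 1
}
```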
It is neither secure nor practical to put all information into the release. For this and other functions we use S3. Each release has a bucket where:
/<account_id>/<project_name>/<config_name> is the root dir release.RootDir().
/<root_dir>/<release_id> is the release dir release.ReleaseDir().
These directories are useful as an audit trail, sending signals like Halt to the step function or release instances, and asset storage for things like Lambda zip files.
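A minimal sketch of how the two directories could be derived from a release, following the layout listed above. The real bifrost.Release uses pointer fields and its method signatures may differ; the plain string fields and example values here are assumptions.

```go
package main

import (
	"fmt"
	"path"
)

// Release holds only the fields needed to build the paths.
// Plain strings are used here instead of the pointer fields above.
type Release struct {
	AwsAccountID string
	ProjectName  string
	ConfigName   string
	ReleaseID    string
}

// RootDir is /<account_id>/<project_name>/<config_name>.
func (r Release) RootDir() string {
	return path.Join("/", r.AwsAccountID, r.ProjectName, r.ConfigName)
}

// ReleaseDir is /<root_dir>/<release_id>.
func (r Release) ReleaseDir() string {
	return path.Join(r.RootDir(), r.ReleaseID)
}

func main() {
	r := Release{"000000000000", "hello", "development", "release-1"}
	fmt.Println(r.RootDir())    // /000000000000/hello/development
	fmt.Println(r.ReleaseDir()) // /000000000000/hello/development/release-1
}
```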
To assist in building Bifrost deployers, we built a “paved path” archetypal implementation: a basic EC2 deployer. It is structured like this:
./
├── .circleci/ # example CI setup
├── aws/
│ ├── ec2/ # example EC2 client
│ ├── mocks/ # mock AWS clients
│ └── aws.go # setup for multi-account AWS clients
├── client/
│ └── client.go # example client code
├── deployer/
│ ├── fuzz_test.go # fuzz test example
│ ├── integration_test.go # tests for the Deployer
│ ├── machine.go # state-machine definition
│ ├── handlers.go # handler functions for tasks
│ └── release.go # bifrost release struct
├── releases/
│ └── release.json # example release
├── scripts/
│ └── bootstrap_depolyer # bootstrapping script
├── bifrost.go # executable code
├── Gopkg.toml # Go dependencies
└── Dockerfile # Build bifrost for deploy
To use this to start building your own deployer run:
export ORG=<your-org>
export DEPLOYER=<your_deployer>
git clone git@github.com:coinbase/bifrost.git $DEPLOYER
cd $DEPLOYER
scripts/rename
This will rename the folder and the references in the files to your deployer, creating an easy starting place.
The deployer’s state-machine looks like:
Bifrost EC2 Example
It validates the input, locks the release, validates that resources exist, deploys, waits a bit, and checks if the deploy is healthy. It then succeeds if healthy, fails on an error, or waits to retry and check again later. Although the exact details of this deployer are not obvious, the overall flow of the state-machine is understandable.
Odin deploys 12 Factor applications into Auto-Scaling groups. To demonstrate how Bifrost’s conventions impact Odin’s implementation, let’s look at how Odin works.
Odin’s Release looks like:
type Release struct {
bifrost.Release
Services map[string]*Service `json:"services,omitempty"`
userdata *string // Not serialized
UserDataSHA256 *string `json:"user_data_sha256,omitempty"`
Healthy *bool `json:"healthy,omitempty"`
... // ignored LifecycleHooks, Subnets, Image
}
This Release struct at the center of Odin contains:
The list of services to be deployed.
userdata, which might be sensitive and so is not persisted; instead it is uploaded to S3 and validated against the SHA in UserDataSHA256.
A Healthy flag recording whether all of its services are healthy.
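The userdata handling relies on a standard Go behavior: encoding/json skips unexported struct fields. The sketch below shows that, plus a SHA check of content fetched from elsewhere. It uses plain strings rather than the pointer fields above, and `SetUserData`, `ValidateUserData` and `toJSON` are hypothetical helpers, not Odin’s API.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// Release keeps userdata unexported so it never reaches the JSON output.
type Release struct {
	userdata       string // not serialized: encoding/json ignores unexported fields
	UserDataSHA256 string `json:"user_data_sha256,omitempty"`
}

// sha256Hex returns the hex-encoded SHA-256 of s.
func sha256Hex(s string) string {
	sum := sha256.Sum256([]byte(s))
	return hex.EncodeToString(sum[:])
}

// SetUserData stores the data in memory and records its SHA.
func (r *Release) SetUserData(data string) {
	r.userdata = data
	r.UserDataSHA256 = sha256Hex(data)
}

// ValidateUserData checks content fetched from storage against the SHA.
func (r *Release) ValidateUserData(fetched string) bool {
	return sha256Hex(fetched) == r.UserDataSHA256
}

// toJSON serializes the release as it would appear in state output.
func toJSON(r *Release) string {
	b, _ := json.Marshal(r) // cannot fail for this struct
	return string(b)
}

func main() {
	r := &Release{}
	r.SetUserData("export SECRET=abc")
	fmt.Println(toJSON(r)) // only user_data_sha256 appears
	fmt.Println(r.ValidateUserData("export SECRET=abc"))
}
```

Only the SHA travels through the state-machine, so the execution history never contains the userdata itself.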
Odin’s state-machine looks like:
The “Happy path” is Validate, Lock, ValidateResources, Deploy, CheckHealthy, Healthy?, CleanUpSuccess, Success. As seen in the diagram, an error might occur at any point, and the state will either retry or catch the error and clean up. This is very similar to the Bifrost example, with a few extra paths to recover from failure.
The goal of this post was to give an introduction to Step, Step functions, and how to build a deployer with Bifrost. Our goal is to automate as many different deployers as we can to reduce toil, increase security and make processes easier to understand.
For more discussion on the above topics see Baking Bread with Step, Open sourcing Odin, and Hitchhiker’s Guide to AWS Step Functions.
Unless otherwise indicated, all images provided herein are by Coinbase.
Dec 4, 2024