HashiCode Ep. 1: Terraform Remote State Backend Locking

June 17, 2019

This is the first installment of HashiCode, a blog post series where I go through the source code of HashiCorp tools to learn more about what happens behind the scenes when you interact with tools as a user.

Disclaimer: this episode references code from the Terraform codebase as of commit 43a7548. Because Terraform is a constantly evolving open source tool, the code is subject to change. However, the ideas expressed will largely remain the same.

One of the issues that teams using Terraform to provision infrastructure run into quickly is managing who is changing what, and when. Terraform introduces the concept of remote state, which allows users to interact with the same existing infrastructure resources. While this is powerful, it also creates more opportunities for conflicts. One of the most common solutions is locking.

If you have ever interacted with a database, you may be familiar with the concept of locking. It solves the problem of data races when two different sessions are attempting to modify the same value at the same time. A common example is banking: imagine you have $100 in your bank account. If you deposited $20 and another user on your bank account deposited $50 at the same time, there is potential that one transaction overwrites the other. In short, if the second transaction begins before the first transaction ends, your bank balance will only be $150 instead of $170 when the second transaction ends. This is because when the second transaction started the bank account had $100 in it, and it increased that value by $50 and committed it.

Locking solves this by locking writes to the database when a transaction starts. So going back to the bank example, when the first transaction begins it would take out a lock on the account, and when the second transaction wanted to begin, it would be told to wait until the first finished. The first would finish, bringing the balance to $120, then release the lock. The second would then take out a lock, and since the first transaction has completed and been committed, would then increase the balance to $150, before releasing the lock. This is a simple example, and there are multiple locking strategies that can be implemented to address certain types of collisions, but I will leave you to research more about that on your own. One of my favorite summaries is here, and if you are interested in how locking is actually implemented in common database management systems take a look here.

You may have already begun to see how this could come up in modifying remote state, and how a .tfstate file is much like a database. Just like collisions may occur when multiple parties are writing to a database, collisions may also occur when modifying infrastructure. For instance, if I want to add a target group to an application load balancer on AWS, and you want to delete that load balancer at the same time, what happens? Similarly, if I want to reference outputs from the state of another Terraform configuration, what happens if the state of that configuration is being changed while I try to read it? As previously mentioned, if we are sharing state remotely, we are interacting with the same source of truth.

S3 is a common place to store shared state files for Terraform. Terraform allows for the use of multiple types of backends, and S3 has been one of the most popular since it was implemented as a remote state wrapper by Gruntwork’s Terragrunt prior to officially being fully implemented within Terraform itself. It works with DynamoDB to allow for full backend functionality, which includes storage, versioning, encryption, and locking. As we look through the actual Terraform source code to see how S3 is implemented as a backend, we can gain a greater understanding of what happens when we use it and how we can leverage it within an organization.

A Background on Backends

If you have ever used Terraform you are probably familiar with the concept of .tfstate files. Whether you know exactly how they are constructed, or you just know that they are a remnant of running most Terraform operations, they handle a big part of the functionality of Terraform. They keep track of what has already been provisioned, which allows Terraform to know what exists, how it can be updated or deleted, and how new resources can interact with existing ones. If you were to delete a .tfstate file, then run terraform apply, Terraform would recreate all the resources defined in your configuration, and you would no longer be able to manage the existing resources that you previously deployed.

Backends come in two flavors in Terraform: standard and enhanced. The difference between the two is pretty straightforward. Most backends are standard, which means they basically just manage state. Enhanced backends do this as well, but also can execute remote operations so that you can initiate Terraform commands from your local machine, but then go about your day as the provisioning process is offloaded to an enhanced backend. You can read more about the difference between types of backends here, but it is useful to think about backends by talking about the one that you are probably already familiar with: local.

Local itself is an enhanced backend because it both handles state and can execute operations. Backends are basically just engines, giving Terraform the compute and storage resources (and some light logic to interact with them) that it needs to work. Because we are primarily interested in the topic of remote state locking, we will focus only on the subset of functionality that is encompassed by a standard backend. So how does the local backend handle state? We already alluded to it earlier: it uses your local file system and creates .tfstate files. But it actually doesn’t have to! While there are two types of backends, we are actually always using an enhanced backend, because to ever interact with a standard backend we must go through an enhanced one (namely local, more on this later).

The local backend, which is of type backend.Enhanced is actually used to execute all of the backends of type backend.Backend.

To truly understand this, we must look into the source code. HashiCorp tools are mostly written in Go (Vagrant, the oldest tool, is written in Ruby). A common design pattern in Go, and most programming languages for that matter, is to define an interface which describes the functionality of a broad type, and then have multiple implementations that adhere to its specification. HashiCorp utilizes this pattern quite heavily, and it allows the tools to be built using a “plugin” architecture, which has been a significant factor in their rapid adoption and growth (we will certainly be diving deeper into this plugin architecture in later episodes of HashiCode). Importantly, both the “internal” (or built-in) and extendable parts of Terraform are constructed in this manner, and backends, which would be considered internal in this case, are no exception.

The general Backend interface is defined in terraform/backend/backend.go. Here is a snippet of it in which I have removed code comments for brevity:

// Backend is the minimal interface that must be implemented to enable Terraform.
type Backend interface {
	ConfigSchema() *configschema.Block
	PrepareConfig(cty.Value) (cty.Value, tfdiags.Diagnostics)
	Configure(cty.Value) tfdiags.Diagnostics
	StateMgr(workspace string) (statemgr.Full, error)
	DeleteWorkspace(name string) error
	Workspaces() ([]string, error)
}

As the comment on the exported Backend interface suggests, this is the “minimal interface that must be implemented”. If we go back and think about our two types of backends (standard and enhanced) and remember that the functionality of an enhanced backend is a superset of that of the standard backend, then we can reach the conclusion that a standard backend must implement these six methods, while an enhanced backend must implement these six plus some more. And the code reflects this! If we look just below the Backend interface, we see the Enhanced interface (comments preserved this time):

// Enhanced implements additional behavior on top of a normal backend.
//
// Enhanced backends allow customizing the behavior of Terraform operations.
// This allows Terraform to potentially run operations remotely, load
// configurations from external sources, etc.
type Enhanced interface {
	Backend

	// Operation performs a Terraform operation such as refresh, plan, apply.
	// It is up to the implementation to determine what "performing" means.
	// This DOES NOT BLOCK. The context returned as part of RunningOperation
	// should be used to block for completion.
	// If the state used in the operation can be locked, it is the
	// responsibility of the Backend to lock the state for the duration of the
	// running operation.
	Operation(context.Context, *Operation) (*RunningOperation, error)
}

Here we see another powerful property of Go: the ability for one interface to wrap another. This concept is called composition. Most programming languages allow for composition, but users frequently confuse composition with inheritance. Go does not provide classes or inheritance; instead it achieves polymorphism through composition and struct embedding. In this way, Go eliminates the fundamental problem of the fragile base class in object oriented programming. Composing one interface into another essentially says “I will do everything this interface does, plus some more”. So here, Enhanced is saying that it will do everything that Backend does, but also implement the ability to execute Operation(). This is exactly what we stated earlier about standard and enhanced backends: enhanced backends can do everything standard ones do, but can also execute remote operations. Do not take the clarity of this code for granted. You will see many codebases that are nowhere near as readable and understandable as this, which is a nod both to the Go language and to Mitchell Hashimoto, one of the founders of HashiCorp and the writer of this very block of code (you can actually check out the commit message where this was added here; the fact that it has not been touched for two years is a pretty solid indication of the effectiveness of the initial implementation).

So back to our original question: why do we have to go through an enhanced backend to interact with a standard one? If we consider that Terraform must always be able to execute operations and must always be able to handle state, and we also acknowledge that both of these things are always handled by a backend, then we reach the conclusion that we can never be using only a standard backend, because we would be unable to execute any operations. If this still feels confusing, consider that there are only two types of enhanced backends: local and remote (i.e. Terraform Enterprise). Remote performs operations (i.e. Operation()) and manages state (i.e. Backend) in Terraform Enterprise. Local, on the other hand, performs operations locally, but can use either local state storage or a remote-state Backend to manage state.

This is a good time to look at the structure of the terraform/backend directory.

backend/
    atlas/          # Legacy backend Atlas support
    init/           # Provides initialization methods for each backend type to Terraform
    local/          # Local backend that implements Enhanced
    remote/         # Remote (TF Enterprise) backend that implements Enhanced
    remote-state/   # Remote state backends that implement Backend
    backend.go      # Backend and Enhanced interfaces
    ...

As you can see, the directory structure closely reflects the separation of functionality for each backend. If we look into terraform/backend/init/init.go we can see how each backend implementation is exposed to Terraform:

// Init initializes the backends map with all our hardcoded backends.
func Init(services *disco.Disco) {
	backendsLock.Lock()
	defer backendsLock.Unlock()

	backends = map[string]backend.InitFn{
		// Enhanced backends.
		"local":  func() backend.Backend { return backendLocal.New() },
		"remote": func() backend.Backend { return backendRemote.New(services) },

		// Remote State backends.
		"artifactory": func() backend.Backend { return backendArtifactory.New() },
		"atlas":       func() backend.Backend { return backendAtlas.New() },
		"azurerm":     func() backend.Backend { return backendAzure.New() },
		"consul":      func() backend.Backend { return backendConsul.New() },
		"etcd":        func() backend.Backend { return backendEtcdv2.New() },
		"etcdv3":      func() backend.Backend { return backendEtcdv3.New() },
		"gcs":         func() backend.Backend { return backendGCS.New() },
		"http":        func() backend.Backend { return backendHTTP.New() },
		"inmem":       func() backend.Backend { return backendInmem.New() },
		"manta":       func() backend.Backend { return backendManta.New() },
		"oss":         func() backend.Backend { return backendOSS.New() },
		"pg":          func() backend.Backend { return backendPg.New() },
		"s3":          func() backend.Backend { return backendS3.New() },
		"swift":       func() backend.Backend { return backendSwift.New() },

		// Deprecated backends.
		"azure": func() backend.Backend {
			return deprecateBackend(
				backendAzure.New(),
				`Warning: "azure" name is deprecated, please use "azurerm"`,
			)
		},
	}
}

Now let’s venture into terraform/backend/local/backend.go. Luckily for us, the comments are verbose and do a great job of explaining exactly what we are looking at:

// Local is an implementation of EnhancedBackend that performs all operations
// locally. This is the "default" backend and implements normal Terraform
// behavior as it is well known.
type Local struct {
	// CLI and Colorize control the CLI output. If CLI is nil then no CLI
	// output will be done. If CLIColor is nil then no coloring will be done.
	CLI      cli.Ui
	CLIColor *colorstring.Colorize

	// ShowDiagnostics prints diagnostic messages to the UI.
	ShowDiagnostics func(vals ...interface{})

	// The State* paths are set from the backend config, and may be left blank
	// to use the defaults. If the actual paths for the local backend state are
	// needed, use the StatePaths method.
	//
	// StatePath is the local path where state is read from.
	//
	// StateOutPath is the local path where the state will be written.
	// If this is empty, it will default to StatePath.
	//
	// StateBackupPath is the local path where a backup file will be written.
	// Set this to "-" to disable state backup.
	//
	// StateWorkspaceDir is the path to the folder containing data for
	// non-default workspaces. This defaults to DefaultWorkspaceDir if not set.
	StatePath         string
	StateOutPath      string
	StateBackupPath   string
	StateWorkspaceDir string

	// The OverrideState* paths are set based on per-operation CLI arguments
	// and will override what'd be built from the State* fields if non-empty.
	// While the interpretation of the State* fields depends on the active
	// workspace, the OverrideState* fields are always used literally.
	OverrideStatePath       string
	OverrideStateOutPath    string
	OverrideStateBackupPath string

	// We only want to create a single instance of a local state, so store them
	// here as they're loaded.
	states map[string]statemgr.Full

	// Terraform context. Many of these will be overridden or merged by
	// Operation. See Operation for more details.
	ContextOpts *terraform.ContextOpts

	// OpInput will ask for necessary input prior to performing any operations.
	//
	// OpValidation will perform validation prior to running an operation. The
	// variable naming doesn't match the style of others since we have a func
	// Validate.
	OpInput      bool
	OpValidation bool

	// Backend, if non-nil, will use this backend for non-enhanced behavior.
	// This allows local behavior with remote state storage. It is a way to
	// "upgrade" a non-enhanced backend to an enhanced backend with typical
	// behavior.
	//
	// If this is nil, local performs normal state loading and storage.
	Backend backend.Backend

	// RunningInAutomation indicates that commands are being run by an
	// automated system rather than directly at a command prompt.
	//
	// This is a hint not to produce messages that expect that a user can
	// run a follow-up command, perhaps because Terraform is running in
	// some sort of workflow automation tool that abstracts away the
	// exact commands that are being run.
	RunningInAutomation bool

	// opLock locks operations
	opLock sync.Mutex
}

var _ backend.Backend = (*Local)(nil)

// New returns a new initialized local backend.
func New() *Local {
	return NewWithBackend(nil)
}

// NewWithBackend returns a new local backend initialized with a
// dedicated backend for non-enhanced behavior.
func NewWithBackend(backend backend.Backend) *Local {
	return &Local{
		Backend: backend,
	}
}

The first thing you may notice is the Local struct, which has comments that conveniently tell us that it implements EnhancedBackend (i.e. backend.Enhanced). As we have said many times, an enhanced backend (backend.Enhanced) must implement everything that a standard (backend.Backend) does, plus be able to perform operations. If you look farther down in this file, you will see the following methods implemented:

func (b *Local) ConfigSchema() *configschema.Block
func (b *Local) PrepareConfig(obj cty.Value) (cty.Value, tfdiags.Diagnostics)
func (b *Local) Configure(obj cty.Value) tfdiags.Diagnostics
func (b *Local) Workspaces() ([]string, error)
func (b *Local) DeleteWorkspace(name string) error
func (b *Local) StateMgr(name string) (statemgr.Full, error)

We recognize these as implementations of the six required methods found in backend.Backend. So we can think of these as the standard backend methods that are used to handle state. You can also see in the Local struct that there is a field Backend of type backend.Backend. This reminds us a lot of how the backend.Enhanced interface (which Local is implementing here) wrapped backend.Backend! However, I said earlier that local may manage state itself, or substitute in one of the remote-state backends to do so. How does that work? Well let’s look back up to New() and NewWithBackend() just after the Local struct. If you look closely, you can see that New() simply calls NewWithBackend(nil) which sets the Backend field of Local to nil. Then, if we look at each of the six methods defined to implement backend.Backend, we notice that they all start with the following code block:

	if b.Backend != nil {
		return b.Backend.ConfigSchema() // This is the example from ConfigSchema()
	}

This basically says: if we have substituted in another backend.Backend (i.e. b.Backend != nil), then turn over the implementation of this method to that backend. Otherwise, the method as defined in Local will execute. This is the behavior that you may be familiar with. If you look through these methods and their child methods, you will notice the functionality defined to write local .tfstate files and manage all state in your filesystem. However, the Operation() method, which is also defined farther down in this file, does not contain the same code block to check which backend is configured. This is because Operation() is part of backend.Enhanced; the substituted backend.Backend is standard and therefore cannot execute operations.

In short, Local must manage operations, but may or may not manage state.

The S3 Remote State Backend

So let’s take a look at one of the backend.Backend implementations that Local might push off handling state to. These are kept in terraform/backend/remote-state. Upon navigating to the directory you will see a multitude of implementations ranging from Consul to Postgres. Each of the implementation directories contains at least the following three files:

backend.go          # Creation of the given remote state backend (implements backend.Backend)
backend_state.go    # Implementation of required backend.Backend methods
client.go           # Implementation of methods used to interact with the remote state provider

Upon further examination of each of the remote state implementations, you will notice that their backend.go files all follow a similar pattern. Namely, they all embed *schema.Backend. We briefly mentioned earlier that another way that Go achieves polymorphism is through struct embedding, and that is exactly what is happening here. The schema package lives in the helper/ directory, which is one that you may want to become familiar with if you plan on contributing to Terraform yourself. It contains many helpful libraries for abstracting away complexity that is common across a certain class of Terraform components and does not need to be reimplemented. In fact, that is exactly what it does here for our remote state backends. Let’s take a look at terraform/backend/remote-state/s3/backend.go. Skip over the New() function for a moment and check out the Backend struct:

type Backend struct {
	*schema.Backend

	// The fields below are set from configure
	s3Client  *s3.S3
	dynClient *dynamodb.DynamoDB

	bucketName           string
	keyName              string
	serverSideEncryption bool
	acl                  string
	kmsKeyID             string
	ddbTable             string
	workspaceKeyPrefix   string
}

At the top, you’ll see the aforementioned embedding of *schema.Backend, as well as a number of other fields specifically related to S3 as a remote state provider. Now if we look back at the New() function (excluded here for brevity), we will notice the creation of a *schema.Backend, which is a pointer to a struct, configured once again with S3-specific components. The schema is then embedded into a struct of type s3.Backend before being returned. You’ll remember from where we looked at Init() above that this is exactly how an S3 remote state backend gets initialized.

So why does each of the remote state providers embed *schema.Backend? To understand, it is helpful to remember that the backend.Backend interface (or as we like to call it, the standard backend interface) requires the implementation of six methods. Unsurprisingly, all of the remote state backends implement a few of those methods in very similar ways. ConfigSchema(), PrepareConfig(), and Configure() share many of the same steps for each, so instead of reimplementing them every time we add a new remote state provider, we instead choose for each remote state backend to adhere to a contract with *schema.Backend saying that they will provide the necessary information for *schema.Backend to execute those methods for them (notice that they do cheat on this somewhat by setting the ConfigureFunc for *schema.Backend to call if additional custom configuration is needed for the provider). If you look at the Backend implementation in terraform/helper/schema/backend.go, the comments reinforce what we have just discovered:

// Backend represents a partial backend.Backend implementation and simplifies
// the creation of configuration loading and validation.
//
// Unlike other schema structs such as Provider, this struct is meant to be
// embedded within your actual implementation. It provides implementations
// only for Input and Configure and gives you a method for accessing the
// configuration in the form of a ResourceData that you're expected to call
// from the other implementation funcs.
type Backend struct {
	// Schema is the schema for the configuration of this backend. If this
	// Backend has no configuration this can be omitted.
	Schema map[string]*Schema

	// ConfigureFunc is called to configure the backend. Use the
	// FromContext* methods to extract information from the context.
	// This can be nil, in which case nothing will be called but the
	// config will still be stored.
	ConfigureFunc func(context.Context) error

	config *ResourceData
}

Now let’s look at S3’s backend_state.go file. This is where the remaining three backend.Backend methods are implemented: Workspaces(), DeleteWorkspace(), and StateMgr(). You’ll also notice an unexported method (Go fields and functions are always exported if capitalized, always unexported if lowercase; read more here) named remoteClient():

// get a remote client configured for this state
func (b *Backend) remoteClient(name string) (*RemoteClient, error) {
	if name == "" {
		return nil, errors.New("missing state name")
	}

	client := &RemoteClient{
		s3Client:             b.s3Client,
		dynClient:            b.dynClient,
		bucketName:           b.bucketName,
		path:                 b.path(name),
		serverSideEncryption: b.serverSideEncryption,
		acl:                  b.acl,
		kmsKeyID:             b.kmsKeyID,
		ddbTable:             b.ddbTable,
	}

	return client, nil
}

What you see here is all of that S3-specific config data being passed to a RemoteClient struct. Then if you look a little further down, in the StateMgr() function, you’ll notice that remoteClient() is the first thing that is called. Once again, this is a common pattern that you will see across the remote state backends: create a client and then use it to interact with the state provider. In the case of S3, we are passing in things such as the AWS S3 and DynamoDB Go clients, the bucket name, the path for the .tfstate file, and various other values. All of these are used to write state files to an S3 bucket, as well as handle locking.

So what about this locking business?

We briefly defined what locking is at the beginning, but let’s revisit now that we have a little more context. Locking solves the problem of simultaneous access to state. If I am updating the state, I don’t want you updating it or reading it until my changes are complete and committed. As we dive even further into the S3 remote state backend, we will see one implementation of how locking can be applied to remote state in Terraform.

Going back to the StateMgr() function, let’s examine what it does after creating a remote client:

func (b *Backend) StateMgr(name string) (state.State, error) {
	client, err := b.remoteClient(name)
	if err != nil {
		return nil, err
	}

	stateMgr := &remote.State{Client: client}
	// Check to see if this state already exists.
	// If we're trying to force-unlock a state, we can't take the lock before
	// fetching the state. If the state doesn't exist, we have to assume this
	// is a normal create operation, and take the lock at that point.
	//
	// If we need to force-unlock, but for some reason the state no longer
	// exists, the user will have to use aws tools to manually fix the
	// situation.
	existing, err := b.Workspaces()
	if err != nil {
		return nil, err
	}

	exists := false
	for _, s := range existing {
		if s == name {
			exists = true
			break
		}
	}

	// We need to create the object so it's listed by States.
	if !exists {
		// take a lock on this state while we write it
		lockInfo := state.NewLockInfo()
		lockInfo.Operation = "init"
		lockId, err := client.Lock(lockInfo)
		if err != nil {
			return nil, fmt.Errorf("failed to lock s3 state: %s", err)
		}

		// Local helper function so we can call it multiple places
		lockUnlock := func(parent error) error {
			if err := stateMgr.Unlock(lockId); err != nil {
				return fmt.Errorf(strings.TrimSpace(errStateUnlock), lockId, err)
			}
			return parent
		}

		// Grab the value
		// This is to ensure that no one beat us to writing a state between
		// the `exists` check and taking the lock.
		if err := stateMgr.RefreshState(); err != nil {
			err = lockUnlock(err)
			return nil, err
		}

		// If we have no state, we have to create an empty state
		if v := stateMgr.State(); v == nil {
			if err := stateMgr.WriteState(states.NewState()); err != nil {
				err = lockUnlock(err)
				return nil, err
			}
			if err := stateMgr.PersistState(); err != nil {
				err = lockUnlock(err)
				return nil, err
			}
		}

		// Unlock, the state should now be initialized
		if err := lockUnlock(nil); err != nil {
			return nil, err
		}

	}

	return stateMgr, nil
}

The first thing that jumps out is that we are wrapping the client in a struct of type remote.State. If we jump over to terraform/state/remote/state.go we can take a look at exactly what remote.State encompasses and implements:

// State implements the State interfaces in the state package to handle
// reading and writing the remote state. This State on its own does no
// local caching so every persist will go to the remote storage and local
// writes will go to memory.
type State struct {
	mu sync.Mutex

	Client Client

	lineage          string
	serial           uint64
	state, readState *states.State
	disableLocks     bool
}

Methods implemented by remote.State:

func (s *State) State() *states.State
func (s *State) StateForMigration() *statefile.File
func (s *State) WriteState(state *states.State) error
func (s *State) WriteStateForMigration(f *statefile.File, force bool) error
func (s *State) RefreshState() error
func (s *State) refreshState() error
func (s *State) PersistState() error
func (s *State) Lock(info *state.LockInfo) (string, error)
func (s *State) Unlock(id string) error
func (s *State) DisableLocks()
func (s *State) StateSnapshotMeta() statemgr.SnapshotMeta

Now we are really getting somewhere! These look like methods that might be called by commands with which we are familiar. We can think about what is actually happening when we plan, apply, or destroy. Here we are able to write state, refresh state, lock, unlock, and much more. These methods are all defined as interfaces in the terraform/states/statemgr package, which I would encourage you to look through, along with the places they are actually called, like terraform/command/meta_backend.go. You can also wait until the last section of the episode, where I will quickly tie everything together.

Before we move on to the S3 client, check out the Client interface in the same directory: terraform/state/remote/remote.go. Here we define the methods we need to implement for our S3 client to be a compatible remote state backend:

// Client is the interface that must be implemented for a remote state
// driver. It supports dumb put/get/delete, and the higher level structs
// handle persisting the state properly here.
type Client interface {
	Get() (*Payload, error)
	Put([]byte) error
	Delete() error
}

// ClientLocker is an optional interface that allows a remote state
// backend to enable state lock/unlock.
type ClientLocker interface {
	Client
	state.Locker
}

Since we are using a remote state backend that does in fact implement locking, we will be using the ClientLocker interface, which requires everything that Client does plus everything that state.Locker does (hint: state.Locker is actually just an alias for statemgr.Locker, which only requires the Lock and Unlock methods). If we look at our S3 client in terraform/backend/remote-state/s3/client.go, we see that it does in fact implement Get(), Put(), Delete(), Lock(), and Unlock(). These methods are called throughout the methods implemented in remote.State, but since we are focused on Lock() and Unlock(), which are used before and after each operation respectively, we can just look back at our StateMgr() function for an example.

Essentially what this function is doing is taking a lock, initializing the state in S3 (i.e. writing an empty .tfstate file in the specified path), then unlocking. These actions are well defined because of strong naming conventions and verbose comments, but we have yet to look into what literally happens when we take a lock. Some of it happens in the StateMgr() function:

		// take a lock on this state while we write it
		lockInfo := state.NewLockInfo()
		lockInfo.Operation = "init"
		lockId, err := client.Lock(lockInfo)
		if err != nil {
			return nil, fmt.Errorf("failed to lock s3 state: %s", err)
		}

We use some helpful utilities to generate our lockInfo, but then we hand it off to the S3 client. In the client, Lock() looks like this:

func (c *RemoteClient) Lock(info *state.LockInfo) (string, error) {
	if c.ddbTable == "" {
		return "", nil
	}

	info.Path = c.lockPath()

	if info.ID == "" {
		lockID, err := uuid.GenerateUUID()
		if err != nil {
			return "", err
		}

		info.ID = lockID
	}

	putParams := &dynamodb.PutItemInput{
		Item: map[string]*dynamodb.AttributeValue{
			"LockID": {S: aws.String(c.lockPath())},
			"Info":   {S: aws.String(string(info.Marshal()))},
		},
		TableName:           aws.String(c.ddbTable),
		ConditionExpression: aws.String("attribute_not_exists(LockID)"),
	}
	_, err := c.dynClient.PutItem(putParams)

	if err != nil {
		lockInfo, infoErr := c.getLockInfo()
		if infoErr != nil {
			err = multierror.Append(err, infoErr)
		}

		lockErr := &state.LockError{
			Err:  err,
			Info: lockInfo,
		}
		return "", lockErr
	}

	return info.ID, nil
}

We breezed by DynamoDB earlier, but if you are not familiar, it is a NoSQL document database that is fully managed by AWS. If you have ever used MongoDB, it is very similar. What we want to do with a lock is tell everyone else that they can’t access the state right now, while also checking that no one else is already accessing it. Going sequentially through the function implementation:

1. We check to make sure a DynamoDB table name was defined in the backend configuration
2. We check whether the lock ID that was passed in is empty, and generate a UUID for it if it is
3. We write the lock info to the DynamoDB table
4. If there is already a lock at the same state path, we fail and return an error
5. Otherwise, we return info.ID, which is now part of the Info document at our state path in DynamoDB

Then, back in StateMgr(), after initializing our state in S3 (feel free to look at the WriteState() and PersistState() functions for a deeper understanding), all of our changes to the .tfstate file are complete and we can Unlock() by means of the lockUnlock() helper function. Unlock() is implemented in the S3 client as follows:

func (c *RemoteClient) Unlock(id string) error {
	if c.ddbTable == "" {
		return nil
	}

	lockErr := &state.LockError{}

	// TODO: store the path and lock ID in separate fields, and have proper
	// projection expression only delete the lock if both match, rather than
	// checking the ID from the info field first.
	lockInfo, err := c.getLockInfo()
	if err != nil {
		lockErr.Err = fmt.Errorf("failed to retrieve lock info: %s", err)
		return lockErr
	}
	lockErr.Info = lockInfo

	if lockInfo.ID != id {
		lockErr.Err = fmt.Errorf("lock id %q does not match existing lock", id)
		return lockErr
	}

	params := &dynamodb.DeleteItemInput{
		Key: map[string]*dynamodb.AttributeValue{
			"LockID": {S: aws.String(c.lockPath())},
		},
		TableName: aws.String(c.ddbTable),
	}
	_, err = c.dynClient.DeleteItem(params)

	if err != nil {
		lockErr.Err = err
		return lockErr
	}
	return nil
}

Once again going sequentially:

1. We check to make sure a DynamoDB table name was defined in the backend configuration
2. We get the current lock info at our state path in DynamoDB, and return an error if we are unable to or it does not exist
3. We make sure the lock ID retrieved matches the one we put there when we called Lock(), and return an error if not
4. We delete our lock from DynamoDB and return a nil error (lock deleted, everything okay)
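
The ID check in step 3 is what prevents one session from releasing another session's lock. A hypothetical sketch of that unlock logic, using a plain map in place of the DynamoDB table:

```go
package main

import "fmt"

// locks maps a state path to the ID of the current lock holder,
// standing in for the DynamoDB table.
var locks = map[string]string{}

// Unlock mirrors the S3 client's logic: the delete only happens
// if the stored lock ID matches the one the caller holds.
func Unlock(path, id string) error {
	current, ok := locks[path]
	if !ok {
		return fmt.Errorf("no lock found at %s", path)
	}
	if current != id {
		return fmt.Errorf("lock id %q does not match existing lock", id)
	}
	delete(locks, path)
	return nil
}

func main() {
	locks["env/terraform.tfstate"] = "uuid-1"
	// Wrong ID: the lock stays in place.
	fmt.Println(Unlock("env/terraform.tfstate", "uuid-2"))
	// Matching ID: the lock is deleted.
	fmt.Println(Unlock("env/terraform.tfstate", "uuid-1"))
}
```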

This is pretty simple, but very powerful. These two simple locking and unlocking functions allow us to make sure that no one else in our organization can access state as it is being modified, and we likewise cannot get a lock on the state if someone else has one.

Command Line to Code

Wow, we have jumped around a lot! It is okay if it feels overwhelming. To bring it all together, let’s start as a user and go all the way to the depths of the code that we reached. So where to begin?

Locking doesn’t only take place when we initialize state. When it is enabled, it is used every time state is modified or accessed in any way. The nice thing about the way Terraform is architected is that if we want to add a new command to the CLI, we don’t have to worry about locking. We know that a remote state backend that claims to have locking enabled must be implementing it (although unless you look at the source code you are trusting that it implements it correctly). Let’s look at a simple example: reading the current state for a given configuration. The Terraform entrypoint is terraform/main.go, which initializes all of the available commands in terraform/commands.go. For the actual implementation of the various commands, look in the terraform/command directory. The command we need to read the current state is terraform state list, and its implementation lives in terraform/command/state_list.go. The primary functionality is in Run():

func (c *StateListCommand) Run(args []string) int {
	args, err := c.Meta.process(args, true)
	if err != nil {
		return 1
	}

	var statePath string
	cmdFlags := c.Meta.defaultFlagSet("state list")
	cmdFlags.StringVar(&statePath, "state", "", "path")
	lookupId := cmdFlags.String("id", "", "Restrict output to paths with a resource having the specified ID.")
	if err := cmdFlags.Parse(args); err != nil {
		return cli.RunResultHelp
	}
	args = cmdFlags.Args()

	if statePath != "" {
		c.Meta.statePath = statePath
	}

	// Load the backend
	b, backendDiags := c.Backend(nil)
	if backendDiags.HasErrors() {
		c.showDiagnostics(backendDiags)
		return 1
	}

	// Get the state
	env := c.Workspace()
	stateMgr, err := b.StateMgr(env)
	if err != nil {
		c.Ui.Error(fmt.Sprintf(errStateLoadingState, err))
		return 1
	}
	if err := stateMgr.RefreshState(); err != nil {
		c.Ui.Error(fmt.Sprintf("Failed to load state: %s", err))
		return 1
	}

	state := stateMgr.State()
	if state == nil {
		c.Ui.Error(fmt.Sprintf(errStateNotFound))
		return 1
	}

	var addrs []addrs.AbsResourceInstance
	var diags tfdiags.Diagnostics
	if len(args) == 0 {
		addrs, diags = c.lookupAllResourceInstanceAddrs(state)
	} else {
		addrs, diags = c.lookupResourceInstanceAddrs(state, args...)
	}
	if diags.HasErrors() {
		c.showDiagnostics(diags)
		return 1
	}

	for _, addr := range addrs {
		if is := state.ResourceInstance(addr); is != nil {
			if *lookupId == "" || *lookupId == states.LegacyInstanceObjectID(is.Current) {
				c.Ui.Output(addr.String())
			}
		}
	}

	c.showDiagnostics(diags)

	return 0
}

It is fairly large, so let’s narrow it down to the part we really care about, which is sandwiched in the middle:

	// Load the backend
	b, backendDiags := c.Backend(nil)
	if backendDiags.HasErrors() {
		c.showDiagnostics(backendDiags)
		return 1
	}

	// Get the state
	env := c.Workspace()
	stateMgr, err := b.StateMgr(env)
	if err != nil {
		c.Ui.Error(fmt.Sprintf(errStateLoadingState, err))
		return 1
	}
	if err := stateMgr.RefreshState(); err != nil {
		c.Ui.Error(fmt.Sprintf("Failed to load state: %s", err))
		return 1
	}

	state := stateMgr.State()
	if state == nil {
		c.Ui.Error(fmt.Sprintf(errStateNotFound))
		return 1
	}

Some of these calls should look familiar. The first thing we want to do is get the backend that has been configured for this session. c.Backend(nil) calls the Backend() function in terraform/command/meta_backend.go. This is where all of the setup for initializing the backend happens. I won’t go through it here, but if you take a look you will see it checking whether a remote state backend is set, and if so, taking a local backend (enhanced) and injecting the remote state backend (standard) into it, just like we discussed early on in this post. Then, once we have set up the backend, we call that StateMgr() function that we spent so much time in and just examined for how it handles locking. This is the power of this type of construction. The state list command does not know or care which remote state backend is behind the scenes. It may use locking or it may not, but all the command cares about is that it satisfies its contract as specified by the backend.Backend (standard backend) interface.
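
That contract-based design can be sketched with a toy example: a command written against a minimal, hypothetical Backend interface works with any backend that satisfies it, locking or not (the names here are illustrative, not Terraform's real signatures):

```go
package main

import "fmt"

// Backend is a toy version of backend.Backend: the command only
// needs a way to get state for a workspace.
type Backend interface {
	StateMgr(workspace string) (string, error)
}

// Two hypothetical backends; the command below cannot tell them apart.
type localBackend struct{}

func (localBackend) StateMgr(ws string) (string, error) {
	return "local state for " + ws, nil
}

type s3Backend struct{}

func (s3Backend) StateMgr(ws string) (string, error) {
	// A real S3 backend would lock, read from S3, and unlock here;
	// the caller never sees any of that.
	return "s3 state for " + ws, nil
}

// stateList stands in for StateListCommand.Run: it depends only on
// the Backend contract, never on a concrete implementation.
func stateList(b Backend, workspace string) (string, error) {
	return b.StateMgr(workspace)
}

func main() {
	for _, b := range []Backend{localBackend{}, s3Backend{}} {
		out, _ := stateList(b, "default")
		fmt.Println(out)
	}
}
```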

Likewise, the RefreshState() and State() functions that we saw implemented in remote.State are hidden from the command, just like the Get(), Put(), Lock(), and Unlock() functions that are implemented in the s3.Client and called in RefreshState() and State() are hidden from remote.State. This separation of responsibilities lends itself to projects that can scale quickly, be extended by the community, and be worked on in a distributed manner. It is no surprise that you see those very traits across the HashiCorp company.

Final Thoughts

I find the most effective way to learn a new code base is to interact with the binary as a user, think of a question you have about how it works, then trace the command you issued to the question you had. We started out this episode wanting to learn how the S3 remote state backend does locking, but we walked away understanding more about how HashiCorp architects its tools and what polymorphism looks like in Go, and with new confidence in contributing to open source projects that seem large and daunting.

I love open source because it is about people. Good projects survive because of empathy, not because of engineering. This is the first episode of HashiCode, and I hope that it is another step towards caring for people in the community.

As always, please feel free to reach out with questions or comments on Twitter by tagging me or directly messaging me at @hasheddan!