(state-snapshots)=
State management and snapshots
State is the metadata that Pulumi stores about the infrastructure it manages.
Among other things, it is what enables Pulumi to work out when to create,
update, replace and delete resources. A snapshot is a view of Pulumi state
at a particular point in time. State is stored in a backend that can
be configured on a per-project basis. For storage, state is typically
serialized to JSON; this is also the format used by the `stack export`
and `stack import` CLI commands.
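As a rough illustration of working with serialized state, the following Go sketch decodes a tiny, hand-written document shaped like typical `stack export` output and lists the resources it contains. The struct types, the `parseExport` helper, the sample URNs, and the handful of JSON fields read here (`version`, `deployment.resources`, `urn`, `type`, `parent`, `dependencies`) are assumptions made for this example; real exports carry many more fields, and this is not the engine's own deserialization code.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// exportedResource holds the few per-resource fields this sketch reads.
// The field names are assumptions based on commonly seen export output.
type exportedResource struct {
	URN          string   `json:"urn"`
	Type         string   `json:"type"`
	Parent       string   `json:"parent,omitempty"`
	Dependencies []string `json:"dependencies,omitempty"`
}

// exportedStack is a hypothetical, minimal view of an exported stack.
type exportedStack struct {
	Version    int `json:"version"`
	Deployment struct {
		Resources []exportedResource `json:"resources"`
	} `json:"deployment"`
}

// parseExport decodes the JSON document into the minimal view above.
// Unknown fields are ignored by encoding/json, so extra data is harmless.
func parseExport(data []byte) (exportedStack, error) {
	var s exportedStack
	err := json.Unmarshal(data, &s)
	return s, err
}

// sampleExport is hand-written for illustration, not real CLI output.
const sampleExport = `{
  "version": 3,
  "deployment": {
    "resources": [
      {"urn": "urn:pulumi:dev::proj::pulumi:pulumi:Stack::proj-dev",
       "type": "pulumi:pulumi:Stack"},
      {"urn": "urn:pulumi:dev::proj::aws:s3/bucket:Bucket::my-bucket",
       "type": "aws:s3/bucket:Bucket",
       "parent": "urn:pulumi:dev::proj::pulumi:pulumi:Stack::proj-dev"}
    ]
  }
}`

func main() {
	s, err := parseExport([]byte(sampleExport))
	if err != nil {
		panic(err)
	}
	for _, r := range s.Deployment.Resources {
		fmt.Println(r.Type, r.URN)
	}
}
```

Reading only the fields of interest keeps a sketch like this tolerant of format changes, at the cost of silently ignoring anything it does not model.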
(backends)=
Backends
A backend is an API and storage endpoint that the Pulumi CLI uses to coordinate updates and to read and write stack state as needed.
(diy)=
DIY backends
A DIY (do it yourself) backend is one in which a state JSON file is
persisted to a medium controlled and managed by the Pulumi user. Under the hood,
Pulumi uses the Go Cloud Development Kit (specifically, its `blob` package) to
support a number of
storage implementations, from local files to cloud storage services such as AWS
S3, Google Cloud Storage, and Azure Blob Storage.
(httpstate)=
HTTP state backends
An HTTP state backend is one in which the state is managed by API calls to a remote HTTP service, which is responsible for managing the underlying state. Pulumi Cloud is the primary example of this.
(snapshot-integrity)=
Snapshot integrity
Integrity is a property of a snapshot that ensures that the snapshot is
consistent and can be safely operated upon. The `Snapshot.VerifyIntegrity`
method is responsible for performing these checks. When a snapshot has an
integrity error, the Pulumi CLI will refuse to operate on it.[^1] Note that the
Pulumi CLI will not refuse to write a snapshot with integrity errors, since
snapshots are often the only way of recording what actions the engine has
already taken (and e.g. which of those succeeded and which failed), and that
record is vital should the user need to recover from a failure.
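To make the idea of a consistency check concrete, here is a minimal Go sketch of one kind of check: that every resource in an ordered snapshot refers only to resources that appear before it. This is not the `Snapshot.VerifyIntegrity` implementation; the `resourceState` type, its field names, and the `references` and `checkOrder` helpers are hypothetical, and the sketch treats every reference as a plain URN for simplicity.

```go
package main

import "fmt"

// resourceState is a hypothetical, pared-down stand-in for the engine's
// resource state; the field names are assumptions made for this sketch.
type resourceState struct {
	URN                  string
	Provider             string              // URN of the provider resource, if any
	Parent               string              // URN of the parent resource, if any
	Dependencies         []string            // URNs this resource depends on
	PropertyDependencies map[string][]string // property name -> URNs it depends on
	DeletedWith          string              // URN whose deletion also deletes this one
}

// references collects every URN that r refers to, across all relationship kinds.
func references(r resourceState) []string {
	var refs []string
	for _, single := range []string{r.Provider, r.Parent, r.DeletedWith} {
		if single != "" {
			refs = append(refs, single)
		}
	}
	refs = append(refs, r.Dependencies...)
	for _, deps := range r.PropertyDependencies {
		refs = append(refs, deps...)
	}
	return refs
}

// checkOrder walks the snapshot front to back and returns an error for the
// first reference to a resource that has not been seen yet, which means the
// referenced resource is either missing or appears later in the snapshot.
func checkOrder(resources []resourceState) error {
	seen := make(map[string]bool, len(resources))
	for _, r := range resources {
		for _, ref := range references(r) {
			if !seen[ref] {
				return fmt.Errorf("%s refers to %s, which is missing or appears later", r.URN, ref)
			}
		}
		seen[r.URN] = true
	}
	return nil
}

func main() {
	ordered := []resourceState{
		{URN: "provider"},
		{URN: "network", Provider: "provider"},
		{URN: "server", Provider: "provider", Parent: "network", Dependencies: []string{"network"}},
	}
	fmt.Println("ordered:", checkOrder(ordered))

	// Removing "network" without updating "server" leaves dangling references.
	broken := []resourceState{ordered[0], ordered[2]}
	fmt.Println("broken:", checkOrder(broken))
}
```

A single forward pass with a "seen" set is enough here because a valid ordering never needs to look ahead; any reference that cannot be resolved from what came before is reported as an error.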
If you find yourself debugging a snapshot integrity issue, or if you are keen to avoid introducing one when writing new code, the following guidelines and general principles may be useful:
- Reproduce or simulate potential issues with one or more lifecycle tests. Snapshot integrity issues are the result of the deployment engine mismanaging state. While bugs may manifest due to unexpected behaviour in resource providers or language hosts, for example, it is the engine's job to handle these cases correctly and preserve the integrity of its resource state. Lifecycle tests allow mocking providers and specifying programs directly without an intermediate language host, and provide the best means to consistently reproduce an issue or specify a desired behaviour. The lifecycle test suite's fuzzing capabilities may help when tracking down hard-to-find issues.
- Avoid realising deletions until the end of an operation. Many snapshot integrity issues arise from resources ending up in state with missing dependencies, or with dependencies that appear later in the snapshot than the resources that depend on them (snapshots are expected to be topologically sorted). Deleting a resource from the state mid-deployment is almost guaranteed to result in these issues at some point. This is especially likely if a later operation fails and causes the deployment to terminate early, leaving later resources that you may have intended to update following the deletion in a broken state. Instead of outright removing a resource from the state, consider marking it as pending or needing deletion later on (this is how `deleteBeforeReplace` works, for example). That way, you can remove the resource at the end of the operation when you know that all of its dependencies have been processed (in the case of `deleteBeforeReplace`, for instance, it is the final `CreateReplacementStep` that actually removes the old resource from the state).
- Consider all forms of dependencies. Providers, parents, dependencies, property dependencies, and deleted-with relationships are all forms of resource dependency that must be respected by any code being written or examined. If a resource is moved, renamed or deleted, and its dependencies are not updated, for instance, an integrity error is likely to occur.
- Think about how code behaves when only specific resources are targeted. Targeted operations can violate many assumptions that are otherwise safe to make, such as having processed a resource's dependencies before the resource itself is visited. When debugging, ascertaining whether a snapshot integrity issue has been triggered by a targeted operation is often an excellent first step, since it can massively narrow down the code paths that need to be examined.
- Many operations are non-atomic and nearly all of them can fail. Don't assume that processing a resource will always proceed smoothly. If the snapshot is to be modified before or after making a provider call, consider that the provider call could fail. Does the code account for this and work correctly even if it is resumed following a failure?
- The program may change between operations. If you are debugging or attempting to reproduce an issue, consider that it may take multiple operations to trigger the issue and that the program being run may change between these operations. For instance, a resource may be removed from the program -- in these cases, there will be an operation where the resource is in the state but the engine does not receive a registration (this may behave even more interestingly if that resource is or is not targeted in a targeted operation -- see for an example of these kinds of interactions).
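The guideline above about deferring deletions can be sketched in a few lines of Go. The `entry` and `snapshot` types and the `markForDeletion`, `sweep`, and `dangling` helpers are hypothetical names invented for this example; they only illustrate the "mark now, remove at the end" pattern and do not reflect how the engine's steps are actually implemented.

```go
package main

import "fmt"

// entry is a hypothetical snapshot entry. PendingDelete models the idea of
// flagging a resource for later removal instead of removing it immediately.
type entry struct {
	URN           string
	Dependencies  []string
	PendingDelete bool
}

// snapshot is a hypothetical ordered list of entries.
type snapshot struct {
	entries []entry
}

// markForDeletion flags the entry with the given URN rather than removing it,
// and reports whether such an entry was found.
func (s *snapshot) markForDeletion(urn string) bool {
	for i := range s.entries {
		if s.entries[i].URN == urn {
			s.entries[i].PendingDelete = true
			return true
		}
	}
	return false
}

// sweep removes all flagged entries. It is meant to run once, at the end of
// an operation, after every dependent resource has been processed.
func (s *snapshot) sweep() {
	kept := s.entries[:0]
	for _, e := range s.entries {
		if !e.PendingDelete {
			kept = append(kept, e)
		}
	}
	s.entries = kept
}

// dangling returns the URNs of entries that depend on something that is no
// longer present in the snapshot.
func (s *snapshot) dangling() []string {
	present := make(map[string]bool, len(s.entries))
	for _, e := range s.entries {
		present[e.URN] = true
	}
	var out []string
	for _, e := range s.entries {
		for _, d := range e.Dependencies {
			if !present[d] {
				out = append(out, e.URN)
				break
			}
		}
	}
	return out
}

func main() {
	s := &snapshot{entries: []entry{
		{URN: "db"},
		{URN: "app", Dependencies: []string{"db"}},
	}}

	s.markForDeletion("db")
	// If the operation failed here, "app" would still find "db" in the snapshot.
	fmt.Println("dangling after marking:", s.dangling())

	// Later in the same operation, "app" is updated and stops depending on "db".
	s.entries[1].Dependencies = nil
	s.sweep()
	fmt.Println("remaining entries:", len(s.entries), "dangling:", s.dangling())
}
```

The point of the design is that a snapshot written between `markForDeletion` and `sweep` still contains everything its remaining resources refer to, so an early termination does not leave dangling dependencies behind.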
The following fixes for snapshot integrity issues illustrate how the above principles can be applied and how such issues can be tracked down:
- Fix snapshot integrity on pending replacement
- Propagate deleted parents of untargeted resources
- Better handle property dependencies and `deletedWith`
- Rewrite `DeletedWith` properties when renaming stacks
[^1]: Snapshot integrity issues are generally "P1" issues, meaning that they are picked up as soon as possible in the development process.