June 3, 2021
March 18, 2021
5 min read

Amundsen deployment best practices

by Dorian Johnson, Co-founder and CTO of Stemma

Almost any organization using Amundsen will need to make custom changes to their install. Unfortunately, customizing an install has been a long-standing pain point for the community (see this Slack thread). This post is the first in a step-by-step guide to getting a fully customized enterprise deployment of Amundsen¹, based on how Stemma deploys Amundsen for its customers.

This guide is intended for engineers at organizations that plan to install Amundsen. If you’re an individual evaluating Amundsen, follow the installation guide instead; it will be much simpler.

To zoom out a bit, the aspects of a fully-managed Amundsen include:

  1. Source control and build: setting up a git repository that contains your configurations and code modifications, and building them into Docker containers.
  2. Service deployment: deploying your containers onto ECS, Kubernetes, or docker-compose.
  3. Ingestion: configuring databuilder to connect to other parts of your infrastructure and ingest metadata.
  4. Security: enabling authentication and firewalling services.
  5. Ops: backup and restore, monitoring, and upgrades.

This post will focus on #1. We’ll follow up with guides for the rest in the future!

We want to hear how well this works for you in practice! Find us on the Amundsen community Slack (we’re @Verdan (Stemma) and @dorian (Stemma)), or email us at hello@stemma.ai.

Quick Start Guide

We’ve packaged this entire guide as a git repo that you can fork and immediately get to work². If you’re just starting out, or you’re not happy with your current Amundsen deployment setup, we recommend forking that repo and skipping the rest of these instructions.

Please star the repo and report any bugs you run into! We’re hoping to get some initial feedback from users, and if it’s well received, we’ll work to bring this upstream into the Amundsen project.

The rest of this post digs into the rationale behind that repository and the decisions we made, so that you can mix and match pieces if you’re modifying an existing Amundsen deployment or debugging the repo.

Detailed instructions

Our overall goal here is to create a structure that produces deployable Docker images and supports the full range of configurations Amundsen allows. For many services, you could use the provided Docker images and configure them via environment variables. Unfortunately, that’s not possible with Amundsen today, for two reasons:

  1. The front-end configuration is done via a TypeScript configuration file that must be built into the output JavaScript bundle, meaning you cannot use the pre-built Docker container for the front-end.
  2. For each microservice, some configuration is exposed via environment variables, but other settings require Python source files to be included in the Docker image.

This situation has pros and cons. It means that even a simple Amundsen installation will require at least some custom code. However, it also means that even a highly customized install typically does not require outright forking, unless you add substantial new functionality.

We’ve observed that in most installs, users will need to customize the front-end to even get Amundsen ready for a proof of concept. Customizations to the metadata service or search service aren’t always immediately needed. However, we’ve found that it’s easier to set up this architecture up-front in one go, rather than making the change as-needed.

Microservices

For each service, you’ll need:

  • A custom config.py for configuration changes, under a configs directory. Some options can be set as env vars at deploy time, but many cannot. This can start as an empty config, but eventually this file will let you add endpoints, add or update request headers across the microservices, or change authentication.
  • A requirements.txt for additional requirements (e.g. to use OIDC or LDAP). This can also start empty.
  • A reference to the upstream package. It’s desirable to be able to point to a specific git hash,
  • If you need to make deeper changes to upstream, e.g. developing wholesale new features, this is where you fork the upstream repo and get your changes integrated into your build.
  • A custom wsgi.py to tie all of the above together. This can be copied from the upstream repo and is light enough that it shouldn't be a burden to have forked.
  • A Dockerfile to build all of the above into an image. This is best copied out of our example repo, but there’s nothing advanced happening here.
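
As a rough illustration of how a wsgi.py can tie a custom config into a service, the sketch below resolves a config class from a dotted "module.Class" string, optionally taken from an environment variable. This shows the general pattern only: the helper names and the EXAMPLE_CONFIG_CLASS variable are hypothetical and are not taken from the Amundsen code base, so check the upstream wsgi.py for the mechanism each service actually uses.

```python
import importlib
import os


def load_config_class(dotted_path: str) -> type:
    """Resolve a 'package.module.ClassName' string to the class it names."""
    module_name, _, class_name = dotted_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.Class', got {dotted_path!r}")
    module = importlib.import_module(module_name)
    return getattr(module, class_name)


def config_class_from_env(env_var: str, default: str) -> type:
    """Use the class named in env_var, falling back to `default` if unset or empty.

    `env_var` is a hypothetical variable name chosen by the caller.
    """
    return load_config_class(os.environ.get(env_var) or default)
```

With a helper like this, the image can ship a custom configs/config.py and the deploy environment can choose which class in it to load, without rebuilding the image for each environment.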

There are three microservices: frontend, metadata, and search. Once you have those pieces in place, you’ll have something like this:
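
A minimal sketch of the per-service layout described above, assuming one top-level directory per microservice that holds the four files from the list. The directory names follow the service names in this post; the exact paths are an assumption and may differ from the example repo. The scaffold only creates empty placeholders: config.py and requirements.txt can start empty, but wsgi.py and the Dockerfile still need real contents copied from upstream or the example repo.

```python
import tempfile
from pathlib import Path
from typing import List

# Service names come from the post; the file paths are an assumed layout.
SERVICES = ("frontend", "metadata", "search")
EXPECTED_FILES = ("configs/config.py", "requirements.txt", "wsgi.py", "Dockerfile")


def scaffold(repo_root: Path) -> None:
    """Create empty placeholder files for every service."""
    for service in SERVICES:
        for relative in EXPECTED_FILES:
            target = repo_root / service / relative
            target.parent.mkdir(parents=True, exist_ok=True)
            target.touch(exist_ok=True)


def missing_files(repo_root: Path) -> List[str]:
    """Return the expected files that do not exist yet, as 'service/path' strings."""
    missing = []
    for service in SERVICES:
        for relative in EXPECTED_FILES:
            if not (repo_root / service / relative).is_file():
                missing.append(f"{service}/{relative}")
    return missing


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        scaffold(root)
        print(missing_files(root))
```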


Front-end customization

For the front-end, you’ll need a TypeScript config file named config-custom.ts that’s copied into the build directory during TypeScript compilation. You may also want to modify specific files for custom branding or other minor changes without doing a wholesale fork. We've got you covered: there's a supported script named static_build.sh that makes this relatively painless (thanks to Daniel Won for contributing that script). It copies the upstream source into a new temporary folder and merges in all files from a folder of your own before running the TypeScript compilation. This means if you arrange the directory structure identically to the upstream structure, you can replace any js or css file at will:
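
The copy-then-overlay idea can be sketched in a few lines of Python. This is not the static_build.sh script itself, only an illustration of the merge behaviour described above; the function name and directory arguments are hypothetical, and `dirs_exist_ok` requires Python 3.8 or later.

```python
import shutil
import tempfile
from pathlib import Path


def build_merged_tree(upstream: Path, overrides: Path) -> Path:
    """Copy `upstream` into a fresh temp folder, then overlay `overrides` on top.

    Any file in `overrides` whose relative path matches an upstream file
    replaces it; everything else from upstream is kept unchanged.
    """
    merged = Path(tempfile.mkdtemp(prefix="frontend-build-"))
    shutil.copytree(upstream, merged, dirs_exist_ok=True)
    shutil.copytree(overrides, merged, dirs_exist_ok=True)
    return merged
```

Because the merge happens in a temporary folder, the upstream checkout stays untouched, and the compilation step would run against the merged tree.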


That gives you the ability to overwrite any file wholesale. It doesn’t currently support patch files, so this can get annoying at scale. If you need to make major changes, fork the upstream repo and deal with the occasional merge conflicts. (Even better: contribute your changes upstream to the open source project. The community is very welcoming of patches 😄)

Deploy

This setup produces Docker containers that you can deploy anywhere you normally deploy containers. The workflow is compatible with most CI/CD pipelines; we use GitHub Actions to rebuild all of these containers on each commit, then push them to ECR. For Kubernetes, the Amundsen repo contains a Helm chart that lets you swap in your prebuilt images and takes care of the rest. We’ll be back soon with info on how to use that!
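
To make the rebuild-on-each-commit step concrete, here is a small sketch that generates the docker build and docker push commands a CI job might run for each service. The registry value, the `amundsen-<service>` image naming, and the assumption that each service directory contains its own Dockerfile are all hypothetical; adapt them to your registry and repo layout.

```python
from typing import List

SERVICES = ("frontend", "metadata", "search")


def image_tag(registry: str, service: str, commit: str) -> str:
    """Name an image after its service and tag it with the commit hash."""
    return f"{registry}/amundsen-{service}:{commit}"


def build_and_push_commands(registry: str, commit: str) -> List[List[str]]:
    """Return argv lists for building and pushing one image per service.

    Each service directory is used as the build context, so Docker picks up
    the Dockerfile inside it.
    """
    commands = []
    for service in SERVICES:
        tag = image_tag(registry, service, commit)
        commands.append(["docker", "build", "--tag", tag, service])
        commands.append(["docker", "push", tag])
    return commands
```

Tagging by commit hash means every image can be traced back to the exact configuration that produced it. A CI job could pass each list to `subprocess.run(cmd, check=True)` after authenticating with the registry.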

  1. This guide aims to cover 99% of installations, with customizations to the data integrations, front-end, and authorization. It does not attempt to cover the 1% of installs that require forking existing files inside Amundsen. Generally, we consider cases where that is necessary to be unsupported. If you try to add functionality and can’t do it using the existing hooks, please open an issue! Even if your modifications aren’t appropriate for mainline, we’re happy to consider adding an insertion point that will allow your modification to work without forking.
  2. The Amundsen community is awesome and provides a lot of help if you get stuck. But if you don’t want to think about it at all, Stemma provides managed Amundsen. We’ll take care of all of this for you, and more. If you’re interested, get in touch at hello@stemma.ai.