Building PyNest for data organization
We decided to build PyNest — a new Python “monorepo” for the data organization at Coinbase. It is not our intention for PyNest to be use as a monorepo for the entire company, but rather that the repository is used for projects within the data organization.
Building a company-wide monorepo requires a team of elites. We do not have enough crew to reproduce the success stories of monorepos at Facebook, Twitter, and Google.
Python is primarily used within the data org in the company. It is important to set the right scope so that we can focus on data priorities without being distracted by ad hoc requirements. The PyNest build infrastructure can be reused by other teams to expedite their Python repositories.
It is desirable to consolidate mutually dependent projects (see the dependency graph for ML platform projects) into a single repository to prevent inadvertent cyclic dependencies.
Figure 1. Dependency graph for machine learning platform (MLP) projects.
Although monorepo promised a new world of productivity, it has been proven not to be a long term solution for Coinbase. The Golang monorepo is a lesson, where problems emerged after a year of usage such as sprawling codebase, failed IDE integrations, slow CI/CD, out-of-date dependencies, etc.
Open source projects should be kept in individual repositories.
The graph below shows the repository architecture at Coinbase, where the green blocks indicate the new Python ecosystem we have built. Inter-repository operability is achieved by serving layers including the code artifacts and schema registry.
Figure 2. Repository architecture at Coinbase
Figure 3. Pynest repository structure
The following is a list of the major elements of the repository and their explanations.
1. 3rdparty
Third party dependencies are placed under this folder. Pants will parse the requirements.txt files and automatically generate the “python_requirement” target for each of the dependencies. Multiple versions of the same dependency are supported by the multiple lockfiles feature of Pants. This feature makes it possible for projects to have conflicts in either direct or transitive dependencies. Pants generates lockfiles to pin every dependency and ensure a reproducible build. More explanations of the pants multiple lock is in the dependency management section.
2. Lib
Shared libraries accessible to all the projects. Projects within PyNest can directly import the source code. For projects outside PyNest, the libraries can be accessed via pip installing the wheel files from an internal PyPI server.
3. Project folders
Individual projects live in this folder. The folder path is formatted as “{project_name}/{src or test}/python/{namespace}”. The source root is configured as “src/python” or “test/python”, and the underneath namespace is used to isolate the modules.
4. Code owner files
Code owner files (OWNERS) are added to the folders to define the individuals or teams that are responsible for the code in the folder tree. The CI workflow invokes a script to compile all the OWNERS files into a CODEOWNERS file under “.github/”. Code owner approval rule requires all pull requests to have at least one approval from the group of code owners before they can be merged.
5. Tools
Tools folder contains the configuration files for the code quality tools, e.g. flake8, black, isort, mypy, etc. These files are referenced by Pants to configure the linters.
6. Buildkite workflow
Coinbase uses Buildkite as the CI platform. The Buildkite workflow and the hook definitions are defined in this folder. The CI workflow defines the steps such as
Check whether dependency lockfiles need updating.
Execute lints and code quality tools.
Build source code and docker images.
Runs unit and integration tests.
Generates reports of code coverages.
7. Dockerfiles
Dockerfiles are defined in this folder. The docker images are built by the CI workflow and deployed by Codeflow — an internal deployment platform at Coinbase.
8. Pants libraries
This folder contains the Pants script and the configuration files (pants.toml, pants.ci.toml).
This article describes how we build PyNest using the Pants build system. In our next blog post, we will explain dependency management and CI/CD.