Third-party dependencies are organized into a requirements.txt per project, and Pants automatically generates a “python_requirement” target for each dependency. It also infers the dependencies a Python file requires from the modules it imports, such as “import grpc” for the grpcio package, which helps minimize the boilerplate in the BUILD scripts. The lockfile support in Pants 2.11 is an especially useful tool for managing conflicting dependencies in the repository.
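As a rough illustration of this setup, a BUILD file next to a project's requirements.txt might look like the fragment below. This is a sketch rather than our actual configuration: the target name “reqs” is an assumption chosen for the example.

```python
# BUILD file placed next to a project's requirements.txt (location is illustrative).
# The python_requirements target generator creates one python_requirement
# target for each line of requirements.txt.
python_requirements(
    name="reqs",  # hypothetical target name
    source="requirements.txt",
)
```

With this in place, a source file that imports a module provided by one of the listed packages does not need to list that package in its own BUILD entry, because Pants infers the dependency from the import statement.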
Challenges of conflicting dependencies
As more projects are onboarded, conflicts in dependencies occur in the repository. These conflicts cause halts or delays on project progress and lower productivity by preventing developers from reusing code previously written.
There are usually two types of conflicts:
Direct dependency conflicts: two components of the code have directly conflicting dependencies. An example is
Project Star requires Dep < 2.0.0
Project Moon requires Dep >= 2.0.0
Transitive dependency conflicts: the conflicts are caused by a third party package’s transitive dependencies. The conflicts can be infeasible or prohibitively expensive to resolve directly, since we do not control how the third-party packages are specifying their dependencies. An example is
Project Star requires Foo >= 3.0.0 which requires Dep < 2.0.0
Project Moon requires Bar >= 2.0.0 which requires Dep >= 2.0.0
When such conflicts occur, it is impossible to resolve the dependencies for both Project Star and Project Moon and it is impossible for any other projects to depend on either project’s code.
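The direct conflict above can be checked mechanically. The sketch below is illustrative only: it compares (major, minor, patch) tuples instead of running a real resolver, and the list of released versions of Dep is made up for the example.

```python
def parse(version: str) -> tuple:
    """Turn '1.9.0' into (1, 9, 0) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


# Hypothetical released versions of Dep.
CANDIDATES = ["1.8.0", "1.9.0", "2.0.0", "2.1.0"]


def star_ok(version: str) -> bool:
    """Project Star requires Dep < 2.0.0."""
    return parse(version) < parse("2.0.0")


def moon_ok(version: str) -> bool:
    """Project Moon requires Dep >= 2.0.0."""
    return parse(version) >= parse("2.0.0")


star_versions = [v for v in CANDIDATES if star_ok(v)]
moon_versions = [v for v in CANDIDATES if moon_ok(v)]
shared_versions = [v for v in CANDIDATES if star_ok(v) and moon_ok(v)]

print(star_versions)    # ['1.8.0', '1.9.0']
print(moon_versions)    # ['2.0.0', '2.1.0']
print(shared_versions)  # []
```

The two ranges are complementary, so the shared list stays empty no matter which versions of Dep are released. That is why a single lockfile cannot serve both projects without changing one of the constraints.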
Solutions to conflicting dependencies
We investigated a few options to solve for conflicting dependencies and share the benefits and drawbacks of each option below.
Option 1: Single-lock and force resolving conflicts
There is a single lockfile for the whole repository. Whenever new code introduces a conflicting dependency, the conflict must be addressed before the code can be merged.
Pros:
Easy code sharing as the repository has no conflicts at any time.
A single source of truth provides the lowest complexity in dependency management.
Cons:
More development time required to resolve the conflicts.
Uncertain time estimate to resolve the conflicts.
Negative user experience during code migration. Users will not be excited to update the code to resolve the conflicts.
Ongoing cost, as a team is not able to use the latest libraries without upgrading projects owned by other teams.
The approaches to achieve this option include:
Try a different version of the conflicting package or loosen up constraints until the conflict goes away.
Update the code so that all the components with conflicting dependencies can rest on the same version.
Fork a copy of the third-party package into the repository, then apply the two steps above to the fork. This is needed for transitive dependency conflicts, where the two steps above cannot be applied directly. Forking a copy of the third-party package ensures code in the repository always depends on the head of the package.
Option 2: Single-lock and leave out conflicting dependencies
This solution must be applied with caution, and the tradeoffs can be serious. Conflicting dependencies are removed from the lockfile and, if applicable, from the requirements.txt. The just-deleted dependencies are then added to the BUILD scripts of the modules that depend on them.
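A minimal sketch of what such a BUILD script could look like is shown here. It is an assumption about the general shape, not a copy of our configuration: the target names “dep_v1” and “lib” and the reuse of the Dep constraint from the earlier example are hypothetical.

```python
# BUILD file of a module that needs the conflicting dependency.
# All names below are hypothetical.
python_requirement(
    name="dep_v1",
    requirements=["Dep<2.0.0"],  # the version left out of the shared lockfile
)

python_sources(
    name="lib",
    dependencies=[":dep_v1"],  # depend on the locally declared requirement
)
```

Because the requirement is declared locally, its version is only visible to readers of this BUILD file, which is one source of the visibility problem listed in the cons.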
Pros:
Easiest way to unblock an engineer when they introduce a conflicting dependency to the repository.
Cons:
Degradation of engineering excellence and the introduction of future tech debt.
Code reuse becomes harder as conflicts exist and grow.
More complexity in the repository’s global dependency graph.
Makes builds slower if they involve dependencies that are not in the lockfile.
Lack of general visibility into the versions in use and their corresponding scope.
It is important to note that we actively recommend against this solution due to the severity of the cons list.
Option 3: Unconstrained multiple lockfiles
New lockfiles can be added if applicable.
Pros:
Can quickly unblock an engineer who introduces a new conflicting dependency.
Cons:
Prevents code reuse whenever two code components are not on the same lockfile.
Difficult to educate users on how multiple lockfiles work and how to use them.
High maintenance cost due to the complexity of multiple lockfiles. Code that cannot be shared across lockfiles has to be kept in multiple copies.
Option 4: Constrained multiple lockfiles
We keep a small number (<= 3) of lockfiles. The choice to add a new lockfile should be made consciously and cautiously, and requires approval from the repository admins.
Pros:
Mitigates the cons of unconstrained multiple lockfiles in option 3.
Allows a small number of “parallel universes” to form in the repository, in the case where it is prohibitively expensive to merge them.
Cons:
The scopes governed by different lockfiles are parallel universes, and code sharing between them requires copying code.
No quick way to unblock an engineer who introduces a new dependency if the dependency conflicts with all the existing lockfiles.
After consulting with , we decided to go for a hybrid of option 2 and 4 for simplicity and flexibility. We keep only two lockfiles in the repository — one that is dedicated for the Airflow projects and the other one is a default lockfile for the other projects. We keep the total number of lockfiles to <= 3 to minimize the maintenance complexity.
Our strategy to handle conflicting dependencies is as follows:
Dependencies for the Airflow projects should be compatible and use the “airflow” lockfile.
Dependencies for the non-Airflow projects will use the default lockfile.
Common libraries that are used by both Airflow and non-Airflow projects should use both the default and the “airflow” lockfiles.
If a dependency still cannot be resolved with the above steps, it can be left out of the lockfile system, subject to approval by the repository admins.
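In Pants, the lockfile a target uses is selected with the “resolve” field, and a target can be built against several lockfiles with “parametrize”. The fragment below sketches how the first three rules could be expressed. It assumes that named resolves are enabled in the Pants configuration and that the default lockfile is registered under the name “python-default”; the target names are hypothetical.

```python
# BUILD file of an Airflow project: uses the "airflow" lockfile only.
python_sources(
    name="dags",  # hypothetical target name
    resolve="airflow",
)

# BUILD file of a common library: built once per lockfile so that both
# Airflow and non-Airflow projects can depend on it.
python_sources(
    name="common_lib",  # hypothetical target name
    resolve=parametrize("python-default", "airflow"),
)
```

Targets that do not set the field fall back to the default resolve, so non-Airflow projects need no extra configuration under these assumptions.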
The graph below shows the parallel universes for the “airflow” (blue) and the default (yellow) lockfiles in the repository. The Databricks client works with the “airflow” and default lockfiles and hence is shareable to both the Airflow and non-Airflow projects. The Airflow common SDK uses the “airflow” lockfile, so it is limited to the Airflow projects only. It is impossible for a yellow module to import a blue module such as the Airflow common SDK because they do not use the same lockfile.
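The import rule described above can be modeled in a few lines. This is a simplified sketch, not how Pants implements it: a module may import another only if the imported module is available under every lockfile the importer is built against. The module names are illustrative, and “default” stands for the default lockfile.

```python
# Lockfiles each module is built against (names are illustrative).
MODULE_LOCKFILES = {
    "databricks_client": {"default", "airflow"},
    "airflow_common_sdk": {"airflow"},
    "airflow_dag": {"airflow"},
    "ml_service": {"default"},
}


def can_import(importer: str, imported: str) -> bool:
    """Return True if `imported` exists in every lockfile `importer` uses."""
    return MODULE_LOCKFILES[importer] <= MODULE_LOCKFILES[imported]


print(can_import("airflow_dag", "airflow_common_sdk"))  # True
print(can_import("ml_service", "databricks_client"))    # True
print(can_import("ml_service", "airflow_common_sdk"))   # False
```

The same check shows why a library shared by both universes cannot itself depend on an Airflow-only module: `can_import("databricks_client", "airflow_common_sdk")` is also False.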
CI/CD stands for continuous integration and continuous delivery. It automates the processes of building code, checking code quality, running tests, and releasing build artifacts to production. CI/CD is a best practice for DevOps and agile development. Because of its many benefits, CI/CD has been adopted by many companies and organizations. At Coinbase, we use the open source Buildkite to orchestrate CI pipelines and have built an in-house Codeflow for CD.
A Buildkite pipeline consists of steps for building code, checking quality, doing security scans and running tests. We made multiple efforts to optimize the CI pipeline. For example, steps are parallelized as much as possible, tests that are irrelevant to the code changes are skipped, and third-party dependencies are pre-installed on the CI Agent host to cut down the build time. These improvements significantly reduced the execution time and the resource usage of the CI pool.
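One way to skip tests that are unrelated to a change is to let Pants select only the targets affected by it. The sketch below builds such a command line without running it. It assumes the Pants options `--changed-since` and `--changed-dependees` (the latter was renamed in later Pants releases); the function name and the base ref are hypothetical, and this is not necessarily the mechanism used in our pipeline.

```python
def changed_tests_cmd(base_ref: str) -> list:
    """Build a Pants command that tests only targets affected since base_ref."""
    return [
        "./pants",
        f"--changed-since={base_ref}",     # targets changed since this git ref
        "--changed-dependees=transitive",  # plus everything that depends on them
        "test",
    ]


print(" ".join(changed_tests_cmd("origin/master")))
# ./pants --changed-since=origin/master --changed-dependees=transitive test
```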
Many projects we developed have special requirements on the test environment. For example, some tests are memory-bound, ML tests may require special hardware like GPUs, and integration tests may require access to external systems like a DB. To solve these problems, we categorized tests into different groups. We use build tags in the test BUILD script to indicate which category a test belongs to, so that the test can be run on the appropriate CI Agent Docker container with the required dependencies and configurations.
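A test BUILD script with tags could look like the following fragment. The “tags” field is a standard Pants field; the tag values, target names, and source globs are assumptions made for this sketch.

```python
# BUILD file in a test directory (all names and globs are hypothetical).
python_tests(
    name="integration_tests",
    sources=["integration/test_*.py"],
    tags=["integration"],  # needs access to external systems
)

python_tests(
    name="gpu_tests",
    sources=["gpu/test_*.py"],
    tags=["gpu"],  # needs a CI agent with GPU hardware
)
```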
In the CI pipeline, we add a step for each of the test categories. Each CI step uses the Pants filter command to select the test targets with the designated tag, for example the integration test targets, for execution.
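The selection step can be sketched as a small helper that builds the Pants command lines for a given tag. It only constructs the argument lists and does not invoke Pants. The helper names are hypothetical, and the flags assume the global `--tag` option together with the `filter` goal as available around Pants 2.11.

```python
def list_tagged_tests_cmd(tag: str) -> list:
    """Command that lists the test targets carrying the given tag."""
    return ["./pants", f"--tag={tag}", "filter", "--target-type=python_test", "::"]


def run_tagged_tests_cmd(tag: str) -> list:
    """Command that runs only the tests carrying the given tag."""
    return ["./pants", f"--tag={tag}", "test", "::"]


print(" ".join(list_tagged_tests_cmd("integration")))
# ./pants --tag=integration filter --target-type=python_test ::
print(" ".join(run_tagged_tests_cmd("integration")))
# ./pants --tag=integration test ::
```

A CI step for another category would pass that category's tag instead.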
For each commit merged to the master branch, Codeflow schedules a Buildkite pipeline to generate a build for the commit and publish the outcome to the code artifactory. The code artifacts uploaded include the following types:
Compressed folder of Airflow DAGs. The DAG source code is compressed into TAR files which will be synchronized to the Airflow clusters for execution.
Compressed job folders that will be fetched by the job management layer and submitted to Databricks or Ray clusters for execution.
Wheel files are built for PySpark, Ray, and multiple other applications.
PEX files are self-contained with all the dependencies and used by multiple applications, including ML applications, services, Ray jobs, etc.
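For the last two artifact types, Pants provides the “pex_binary” and “python_distribution” targets. The fragment below is a sketch of how they could be declared, assuming a small library with a main.py entry point; every name and the version number are hypothetical.

```python
# BUILD file of a hypothetical project.
python_sources(name="lib")

# Self-contained PEX file that bundles the code with all its dependencies.
pex_binary(
    name="service",
    entry_point="main.py",
)

# Wheel built from the library sources.
python_distribution(
    name="wheel",
    dependencies=[":lib"],
    provides=python_artifact(name="star-lib", version="0.1.0"),
)
```

Under these assumptions, running the Pants `package` goal on these targets produces the PEX and wheel files that a build pipeline can then upload.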
Success stories and ongoing work
It took less than one year for PyNest to become a model repository in the company. At the time of writing, there are 400+ builds kicked off on the master branch every week. The P95 CI execution time is < 15 min during the peak hours. The reliability as measured by the ratio of successful builds on the master branch is > 97%. The usage numbers are still growing at an accelerated rate, and the CI/CD infrastructure we built scales well to the demanding load.
As people are able to work productively in PyNest, we are happy to see many impactful projects launched in the past year. A few selected highlights are:
ML training and serving pipelines and the associated applications such as risk predictions, feed recommendations, web3 models, and more.
An in-house launch of the Airflow 2 platform running hundreds of pipelines covering ETL, crypto analytics, market data calculations, NFT indexing, and more.
A Spark platform that powers the streaming/batch computations for applications such as ML feature store, NFT / web3 data processing, and more.
We also set up a long-term collaboration with a team from Toolchain Labs to improve PyNest even further. A few ongoing projects include:
Set up a remote build cache to expedite the build speed.
Integrate with Toolchain Labs’ BuildSense web console for diagnosis and monitoring of the build system.
Support JVM build to meet the demanding requirements for Java / Scala projects.
We chose Pants as the underlying build system for the Python ecosystem which greatly expedited the development speed. Without a good Python ecosystem, it would have been impossible to launch so many critical projects within a year.
We would like to thank our friends Benjy Weinberger, Stu Hood, and Eric Arellano from Toolchain Labs for their help, suggestions, and tech talks with our engineers. We would also like to thank Michael Li, Leo Liang, and Eric Sun for sponsoring the project internally. There are many engineers and data scientists who generously provided feedback and made important contributions. The project would not have been successful without your help.