Pinning transitive R dependencies for fun and reproducible builds

August 9, 2019

Transitive dependencies (n)

  1. Packages that your dependencies depend on, and packages those packages depend on and so on and so on and so on and so on and so on
  2. The packages which are most likely to break your builds and are the most difficult to manage

Like many teams that work with large amounts of external software, we run into issues with our transitive dependencies. In general, transitive dependencies are a hard problem to solve. Many package managers (e.g. Cargo, NPM) and helper tools (e.g. pip-compile, Bundler) now generate a “lockfile” to guarantee reproducibility in installing transitive dependencies. The first time you install all of your dependencies, the tool runs its dependency resolution algorithm to determine which packages to install at what versions, taking into account the version requirements all packages have of other packages. The output of the dependency resolution is then stored in a lockfile (Gemfile.lock, cargo.lock, package-lock.json, etc.), which is stored alongside the file specifying dependencies. With subsequent installations, the same exact version of all dependencies are installed.

Our problem

While most of our business logic is written in Python, with its dependencies managed by “pip-compile”, we rely on some external R libraries for processing biological data. In the name of stability, we pinned our R dependencies using devtools to specify a version during installation. R’s default package installation scheme, however, doesn’t deal well with pinned versions of packages. Because there is no equivalent of a lockfile, devtools will install the specified pinned versions of packages, but the most recent versions of all other packages. This behaviour subverts the stability and reproducibility of pinning package versions, since transitive dependencies can update and break our builds. This is a recent example of an R build failing for exactly that reason:

After tons of digging, I tracked down the source of our build failure. It turns out, the latest vctrs update required rlang version 0.4.0 or greater, which caused R to skip installing vctrs since we pin rlang at 0.3.1. vctrs is 3 layers down the dependency graph from a package we explicitly install, but its update still broke our tests and prevented us from deploying. Eventually, I decided that we had to do something to prevent these errors moving forward.

Our solution

In thinking about the problem, we recognized that there were two distinct steps that a dependency-locking tool needs to accomplish: resolving version requirements and calculating an installation order.

Resolving version requirements

As I alluded to earlier, the approach that most package managers and tools take to resolve version requirements is to apply a dependency resolution algorithm. Dependency resolution algorithms are able to parse version requirements between packages and resolve which version of each package to install. Given how complicated the description is, dependency resolution algorithms can be complex. For that reason, we decided on a simpler approach: using package versions from an already-built docker container. Essentially, we are manually resolving dependencies, and then storing the output of that resolution for future use.

To get all of the installed R package versions on a system, you can use this R one-liner:

print(as.data.frame(installed.packages()[, c(1, 3:4)])[1:2])

It looks fairly complicated, but all that it does is to convert specific parts of “installed.packages” (a structure holding information about all installed R packages) to a data frame, then it does more filtering on the data frame and prints it out. What is nice about this is that the output format is easily machine readable, so we can convert it into a TSV file for use by other scripts. This is a simplified version of the script we use to run the R script on a specified docker container and convert the output to TSV:

docker run -it "$docker_image" Rscript -e "$(cat ./list_dependencies.R)" \
   | tr -d '\r'
   | tail -n+2
   | awk '{ print $1 "\t" $3 }'

We run the above R script inside $docker_image, then use “tr” to remove the windows line endings and “tail” to remove the column headers, then use “awk” to print out two of the columns with a tab in between. Although this requires a bit more manual input than a dependency resolution algorithm, it was much simpler to implement and was perfectly sufficient for our problem.

Resolving installation order

The other main problem that we had to address is installation order. Since our solution uses “devtools::install_version” to install packages at specific versions, we had to pay attention to the order in which we installed packages. Even if we install every single transient dependency manually, we can still run into issues if we install a package before its dependencies. While installing a package, R checks to see if its dependencies are installed, and if they aren’t already installed it automatically installs the latest version of that dependency. Which is a problem, because R automatically picking the latest version of dependencies is the exact problem we are trying to avoid in the first place.

To better understand the problem and our solution, it is helpful to think of this problem in terms of a dependency graph:

The dependency graph for the CRAN package “callr”

A dependency graph is a way of representing a package’s transitive dependencies based on which packages depend on each other. In the very simple example above, “callr” depends on “processx” and “R6”, “R6” doesn’t depend on anything, “processx” depends on “R6” and “ps”, and “ps” doesn’t depend on anything.

Now that we understand what a dependency graph is, it is easier to see what we mean by installing packages in the proper order. To install packages in the correct order, we look at the dependency graph and install packages “from the bottom up”. That is, we only want to install a package once all packages pointed to by that package have already been installed. This type of ordering is called a “topological ordering” in graph theory.

So that we can perform a topological sort on the graph to get a topological ordering, we have to ensure that the graph has two general properties. The first property is that the graph is “directed”, meaning that connections between nodes encode a direction, and are usually depicted as arrows. For dependency graphs, the arrows’ directions encode which package on each edge depends on the other. The second is that it is acyclic, meaning that there are no cycles in the dependency graph, since packages would fail to install if cycles existed because each package in the cycle would be waiting on another package in the cycle before it could install. Since both of these conditions hold in dependency graphs, we can perform a topological sort on the graphs to easily determine an installation order.

Once we can transform a dependency graph into an installation order, the rest of the code is easy. The general outline of what our code for installation order resolution does is the following:

  1. Scrape CRAN’s website for each package to find out what it “depends” on and what it “imports” (since there are two ways of specifying what we think of as a “dependency” in R).
  2. Use the information from CRAN to build up a dependency tree. This is different from a dependency graph, because each node only has one arrow pointing into it and packages are duplicated if multiple packages depend on them.
  3. Convert the dependency tree into a dependency graph.
  4. Apply a topological sort to the graph to get the installation order, using depth first search.

From there, we combine the installation order information with the version information to generate an R script like the following:

# Generated from cranlock
options(warn=2)
options(Ncpus=parallel::detectCores())
options(repos=structure(c(CRAN="https://cran.revolutionanalytics.com")))
devtools::install_version('codetools', version='0.2-15')
devtools::install_version('iterators', version='1.0.10')
devtools::install_version('foreach', version='1.4.4')
devtools::install_version('doParallel', version='1.0.11')
devtools::install_version('data.table', version='1.11.0')
devtools::install_version('getopt', version='1.20.3')
devtools::install_version('optparse', version='1.4.4')
devtools::install_version('lazyeval', version='0.2.1')

Conclusion

As long as you do it properly, pinning dependencies can save you a ton of time by drastically reducing surprise build failures.

Although we couldn’t find any, there could still be a more fully fleshed-out method of pinning R transitive dependencies out there. If there is, we would love to hear about it.

If you are interested in reading the full code, it is hosted on Github here: https://github.com/AlexsLemonade/cranlock

Transitive dependencies (n)

  1. Packages that your dependencies depend on, and packages those packages depend on and so on and so on and so on and so on and so on
  2. The packages which are most likely to break your builds and are the most difficult to manage

Like many teams that work with large amounts of external software, we run into issues with our transitive dependencies. In general, transitive dependencies are a hard problem to solve. Many package managers (e.g. Cargo, NPM) and helper tools (e.g. pip-compile, Bundler) now generate a “lockfile” to guarantee reproducibility in installing transitive dependencies. The first time you install all of your dependencies, the tool runs its dependency resolution algorithm to determine which packages to install at what versions, taking into account the version requirements all packages have of other packages. The output of the dependency resolution is then stored in a lockfile (Gemfile.lock, cargo.lock, package-lock.json, etc.), which is stored alongside the file specifying dependencies. With subsequent installations, the same exact version of all dependencies are installed.

Our problem

While most of our business logic is written in Python, with its dependencies managed by “pip-compile”, we rely on some external R libraries for processing biological data. In the name of stability, we pinned our R dependencies using devtools to specify a version during installation. R’s default package installation scheme, however, doesn’t deal well with pinned versions of packages. Because there is no equivalent of a lockfile, devtools will install the specified pinned versions of packages, but the most recent versions of all other packages. This behaviour subverts the stability and reproducibility of pinning package versions, since transitive dependencies can update and break our builds. This is a recent example of an R build failing for exactly that reason:

After tons of digging, I tracked down the source of our build failure. It turns out, the latest vctrs update required rlang version 0.4.0 or greater, which caused R to skip installing vctrs since we pin rlang at 0.3.1. vctrs is 3 layers down the dependency graph from a package we explicitly install, but its update still broke our tests and prevented us from deploying. Eventually, I decided that we had to do something to prevent these errors moving forward.

Our solution

In thinking about the problem, we recognized that there were two distinct steps that a dependency-locking tool needs to accomplish: resolving version requirements and calculating an installation order.

Resolving version requirements

As I alluded to earlier, the approach that most package managers and tools take to resolve version requirements is to apply a dependency resolution algorithm. Dependency resolution algorithms are able to parse version requirements between packages and resolve which version of each package to install. Given how complicated the description is, dependency resolution algorithms can be complex. For that reason, we decided on a simpler approach: using package versions from an already-built docker container. Essentially, we are manually resolving dependencies, and then storing the output of that resolution for future use.

To get all of the installed R package versions on a system, you can use this R one-liner:

print(as.data.frame(installed.packages()[, c(1, 3:4)])[1:2])

It looks fairly complicated, but all that it does is to convert specific parts of “installed.packages” (a structure holding information about all installed R packages) to a data frame, then it does more filtering on the data frame and prints it out. What is nice about this is that the output format is easily machine readable, so we can convert it into a TSV file for use by other scripts. This is a simplified version of the script we use to run the R script on a specified docker container and convert the output to TSV:

docker run -it "$docker_image" Rscript -e "$(cat ./list_dependencies.R)" \
   | tr -d '\r'
   | tail -n+2
   | awk '{ print $1 "\t" $3 }'

We run the above R script inside $docker_image, then use “tr” to remove the windows line endings and “tail” to remove the column headers, then use “awk” to print out two of the columns with a tab in between. Although this requires a bit more manual input than a dependency resolution algorithm, it was much simpler to implement and was perfectly sufficient for our problem.

Resolving installation order

The other main problem that we had to address is installation order. Since our solution uses “devtools::install_version” to install packages at specific versions, we had to pay attention to the order in which we installed packages. Even if we install every single transient dependency manually, we can still run into issues if we install a package before its dependencies. While installing a package, R checks to see if its dependencies are installed, and if they aren’t already installed it automatically installs the latest version of that dependency. Which is a problem, because R automatically picking the latest version of dependencies is the exact problem we are trying to avoid in the first place.

To better understand the problem and our solution, it is helpful to think of this problem in terms of a dependency graph:

The dependency graph for the CRAN package “callr”

A dependency graph is a way of representing a package’s transitive dependencies based on which packages depend on each other. In the very simple example above, “callr” depends on “processx” and “R6”, “R6” doesn’t depend on anything, “processx” depends on “R6” and “ps”, and “ps” doesn’t depend on anything.

Now that we understand what a dependency graph is, it is easier to see what we mean by installing packages in the proper order. To install packages in the correct order, we look at the dependency graph and install packages “from the bottom up”. That is, we only want to install a package once all packages pointed to by that package have already been installed. This type of ordering is called a “topological ordering” in graph theory.

So that we can perform a topological sort on the graph to get a topological ordering, we have to ensure that the graph has two general properties. The first property is that the graph is “directed”, meaning that connections between nodes encode a direction, and are usually depicted as arrows. For dependency graphs, the arrows’ directions encode which package on each edge depends on the other. The second is that it is acyclic, meaning that there are no cycles in the dependency graph, since packages would fail to install if cycles existed because each package in the cycle would be waiting on another package in the cycle before it could install. Since both of these conditions hold in dependency graphs, we can perform a topological sort on the graphs to easily determine an installation order.

Once we can transform a dependency graph into an installation order, the rest of the code is easy. The general outline of what our code for installation order resolution does is the following:

  1. Scrape CRAN’s website for each package to find out what it “depends” on and what it “imports” (since there are two ways of specifying what we think of as a “dependency” in R).
  2. Use the information from CRAN to build up a dependency tree. This is different from a dependency graph, because each node only has one arrow pointing into it and packages are duplicated if multiple packages depend on them.
  3. Convert the dependency tree into a dependency graph.
  4. Apply a topological sort to the graph to get the installation order, using depth first search.

From there, we combine the installation order information with the version information to generate an R script like the following:

# Generated from cranlock
options(warn=2)
options(Ncpus=parallel::detectCores())
options(repos=structure(c(CRAN="https://cran.revolutionanalytics.com")))
devtools::install_version('codetools', version='0.2-15')
devtools::install_version('iterators', version='1.0.10')
devtools::install_version('foreach', version='1.4.4')
devtools::install_version('doParallel', version='1.0.11')
devtools::install_version('data.table', version='1.11.0')
devtools::install_version('getopt', version='1.20.3')
devtools::install_version('optparse', version='1.4.4')
devtools::install_version('lazyeval', version='0.2.1')

Conclusion

As long as you do it properly, pinning dependencies can save you a ton of time by drastically reducing surprise build failures.

Although we couldn’t find any, there could still be a more fully fleshed-out method of pinning R transitive dependencies out there. If there is, we would love to hear about it.

If you are interested in reading the full code, it is hosted on Github here: https://github.com/AlexsLemonade/cranlock

Back To Blog