pkgr Details


pkgr is a rethinking of the way packages are managed in R. Namely, it embraces the declarative philosophy of defining the ideal state of the entire system and working towards achieving that state. Furthermore, pkgr is built with a focus on reproducibility and auditability, a vital requirement for the pharmaceutical sciences and enterprises.

For usage documentation, see the pkgr User Manual. For details about the motivation for pkgr, its design, and specific usage examples, read on.

Why pkgr?

install.packages and friends such as remotes::install_github have a subtle weakness: they are not good at controlling the desired global state. There are some knobs that can be turned, but overall their APIs are not what the user actually needs. Rather, they are the mechanism by which the user can strive towards their needs, in a forcibly iterative fashion.

With pkgr, you can, in a parallel-processed manner, do things like:

  • Install a number of packages from various repositories, when specific packages must be pulled from specific repositories
  • Install Suggested packages only for a subset of all packages you'd like to install
  • Customize the installation behavior of a single package in a documentable and reproducible way

    • Set custom Makevars for a package that persist across system installations
    • Install source versions of some packages but binaries for others
  • Understand how your R environment will be changed before performing an installation or action.

Today, packages are highly interwoven. Best practices have pushed towards small, well-scoped packages that each do a few things well. For example, rather than just having plyr, we now use dplyr + purrr to cover the same set of responsibilities (dealing with data frames, and dealing with other list/vector objects in an iterative way). As a result, it is becoming increasingly difficult to manage the set of packages in a transparent and robust way.

pkgr in action

[asciicast: terminal recording of pkgr in action]

Getting started

pkgr is a command line utility with several top level commands. The two primary commands are:

pkgr plan # show what would happen if install is run
pkgr install # install the packages specified in pkgr.yml

The actions are controlled by a configuration file that specifies the desired global state, namely, by defining the top level packages a user cares about, as well as specific configuration customizations.

For example, a pkgr configuration file might look like:

Version: 1
# top level packages
Packages:
  - rmarkdown
  - bitops
  - caTools
  - knitr
  - tidyverse
  - shiny
  - logrrr

# any repositories, order matters
Repos:
  - MPN: "https://mpn.metworx.com/snapshots/stable/2020-12-21"

# path to install packages to
Library: "<path/to/install/library>"

# package specific customizations
Customizations:
  Packages:
    - shiny:
        Suggests: true

When you run pkgr install with this as your pkgr.yml file, pkgr will download and install the packages listed in the Packages array, and any dependencies that those packages require.

If you want to see everything that pkgr is going to install before actually installing, simply run pkgr plan and take a look.
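To make "any dependencies that those packages require" concrete, here is a minimal Python sketch of how a short list of top-level packages expands into the full set that has to be installed. It is not pkgr's implementation; the function name `required_packages` and the small dependency map are hypothetical and only illustrate the idea.

```python
from collections import deque


def required_packages(top_level, deps):
    """Return every package needed to satisfy the top-level list.

    top_level: package names the user listed under Packages.
    deps: mapping of package name -> names it depends on (toy data here,
          not a real repository index).
    """
    seen = set()
    queue = deque(top_level)
    while queue:
        pkg = queue.popleft()
        if pkg in seen:
            continue  # already accounted for via another dependency path
        seen.add(pkg)
        queue.extend(deps.get(pkg, ()))
    return sorted(seen)


# Illustrative, incomplete dependency map (an assumption for this sketch).
toy_deps = {
    "shiny": ["httpuv", "htmltools"],
    "httpuv": ["Rcpp", "later"],
    "later": ["Rcpp"],
    "knitr": ["xfun"],
}

print(required_packages(["shiny", "knitr"], toy_deps))
```

Two top-level packages already pull in several more, which is why reviewing the output of pkgr plan before installing is worthwhile.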

How about a more complex example? One such situation is the need to install from multiple repositories.

Here is a configuration that also pulls from Bioconductor, which is split across several CRAN-like repositories:

Version: 1
# top level packages
Packages:
  - magrittr
  - rlang
  - ggplot2
  - dplyr
  - tidyr
  - plotly
  - VennDiagram
  - aws.s3
  - data.table
  - forcats
  - preprocessCore
  - loomR
  - ggthemes
  - reshape

# any repositories, order matters
Repos:
  - MPN: "https://mpn.metworx.com/snapshots/stable/2020-12-21"
  - BioCsoft: "https://bioconductor.org/packages/3.12/bioc"
  - BioCann: "https://bioconductor.org/packages/3.12/data/annotation"
  - BioCexp: "https://bioconductor.org/packages/3.12/data/experiment"
  - BioCworkflows: "https://bioconductor.org/packages/3.12/workflows"

# path to install packages to
Library: pkgs

Cache: pkgcache
Logging:
  all: pkgr-log.log
  install: install-only-log.log
  overwrite: true

The default behavior of pkgr is to find the first repository that contains the given package and use that. You can use Customizations to control that behavior at the Repos and Packages level.

For example, in the following configuration dplyr is available in both repositories, so by default it would be installed from MPN. Setting Repo in the package customization forces dplyr to be installed from CRAN instead.

Version: 1
# top level packages
Packages:
  - dplyr
  - ggplot2

Repos:
  - MPN: "https://mpn.metworx.com/snapshots/stable/2020-12-21"
  - CRAN: "https://cran.rstudio.com"

Library: "test-library"

Customizations:
  Packages:
    - dplyr:
        Repo: CRAN

You can confirm this behavior by inspecting the debug output of the plan:

pkgr plan --loglevel=debug                                                      
INFO[0000] Installation would launch 16 workers
INFO[0000] R Version 3.6.3
INFO[0000] OS Platform x86_64-apple-darwin15.6.0
INFO[0000] Package Library will be created               path=test-library
INFO[0000] Default package installation type:  binary
INFO[0000] 1072:1073 (binary:source) packages available in for MPN from https://mpn.metworx.com/snapshots/stable/2020-12-21
INFO[0000] 16593:16772 (binary:source) packages available in for CRAN from https://cran.rstudio.com
INFO[0000] Package installation cache directory:  /Users/devinp/Library/Caches/pkgr
INFO[0000] Database cache directory:  /Users/devinp/Library/Caches/pkgr/r_packagedb_caches
DEBU[0000] package repository set                        pkg=dplyr relationship="user package" repo=CRAN type=binary version=1.0.2
DEBU[0000] package repository set                        pkg=ggplot2 relationship="user package" repo=MPN type=binary version=3.3.2
DEBU[0000] package repository set                        pkg=labeling relationship=dependency repo=MPN type=binary version=0.4.2
DEBU[0000] package repository set                        pkg=rematch2 relationship=dependency repo=MPN type=binary version=2.1.2
DEBU[0000] package repository set                        pkg=isoband relationship=dependency repo=MPN type=binary version=0.2.3
DEBU[0000] package repository set                        pkg=lifecycle relationship=dependency repo=MPN type=binary version=0.2.0
DEBU[0000] package repository set                        pkg=mgcv relationship=dependency repo=MPN type=binary version=1.8-33
.... TRUNCATED FOR WEBSITE
INFO[0000] package installation status                   installed=0 not_from_pkgr=0 outdated=0 total_packages_required=53
INFO[0000] package installation sources                  CRAN=1 MPN=52 tarballs=0
INFO[0000] package installation plan                     to_install=53 to_update=0
INFO[0000] Library path to install packages: test-library
INFO[0000] resolution time 223.09698ms

Notice that dplyr resolves to CRAN, while every other package resolves to MPN.
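The resolution rule described above (first repository in order wins, unless a package-level Repo customization says otherwise) can be sketched in a few lines of Python. This is only an illustration of the rule, not pkgr's code; `resolve_repo` and its arguments are hypothetical names.

```python
def resolve_repo(pkg, repo_order, available, overrides=None):
    """Pick the repository a package would be pulled from.

    repo_order: repository names in the order they appear under Repos.
    available:  mapping of repository name -> set of package names it hosts.
    overrides:  mapping of package name -> repository name, mirroring a
                per-package Repo customization.
    """
    overrides = overrides or {}
    if pkg in overrides:
        repo = overrides[pkg]
        if pkg not in available.get(repo, set()):
            raise LookupError(f"{pkg} is not available in {repo}")
        return repo
    # Default: the first repository, in declared order, that has the package.
    for repo in repo_order:
        if pkg in available.get(repo, set()):
            return repo
    raise LookupError(f"{pkg} was not found in any configured repository")


repo_order = ["MPN", "CRAN"]
available = {
    "MPN": {"dplyr", "ggplot2"},
    "CRAN": {"dplyr", "ggplot2"},
}

print(resolve_repo("ggplot2", repo_order, available))                   # MPN
print(resolve_repo("dplyr", repo_order, available, {"dplyr": "CRAN"}))  # CRAN
```

This is also why the order of entries under Repos matters: swapping MPN and CRAN changes the default source of every package that both repositories host.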

Once a package has been installed, pkgr will not touch it unless you explicitly request an update with the --update flag. Therefore, if you change your configuration after a package is already installed (for example, by changing its repository), pkgr will not overwrite the installed version, even if the new plan resolves to a different version, unless --update is passed.

pkgr install --update

Be careful about leveraging this "feature" to manually build up a combination of package versions. It is much better to be explicit about your intent, namely by adjusting the Customizations to reflect the environment you want to maintain, so that others can reproduce it.
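As a rough mental model of the behavior described above, the following Python sketch separates packages that are missing from packages that are installed at a different version, and only schedules the latter when an update is requested. It is a simplification under stated assumptions: `plan_actions` is a hypothetical name, and treating any version difference as "outdated" is an assumption made for the sketch, not a description of pkgr's internals.

```python
def plan_actions(desired, installed, update=False):
    """Summarize what an install run would do.

    desired:   mapping of package name -> version the plan resolves to.
    installed: mapping of package name -> version currently in the library.
    update:    stands in for passing --update on the command line.
    """
    to_install = sorted(p for p in desired if p not in installed)
    # Assumption: any version mismatch counts as outdated.
    outdated = sorted(
        p for p in desired if p in installed and installed[p] != desired[p]
    )
    return {
        "to_install": to_install,
        "outdated": outdated,
        # Installed packages are left alone unless an update was requested.
        "to_update": outdated if update else [],
    }


desired = {"dplyr": "1.0.2", "ggplot2": "3.3.2"}
installed = {"dplyr": "1.0.0"}

print(plan_actions(desired, installed))
print(plan_actions(desired, installed, update=True))
```

Without the update request, the mismatched dplyr is reported as outdated but left in place, which is how a library can silently drift away from what the configuration describes.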

There are many other controls for pkgr, which can be seen in the pkgr User Manual.

Installing stand-alone packages

pkgr can also install single packages that are not attached to a repository. This is convenient when you have a package that is internal to you or your company and is not hosted anywhere: list its tarball in the configuration and pkgr will install it. pkgr will also automatically reconcile the dependencies needed to install it.

Tarballs:
- path/to/pkg.tar.gz
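An R package declares its dependencies in the Depends, Imports, and LinkingTo fields of its DESCRIPTION file, so a tarball carries enough information for its dependencies to be worked out. The Python sketch below shows one way those fields could be read; it is an illustration only, not how pkgr does it, and `parse_description_deps` and the sample DESCRIPTION text are hypothetical.

```python
import re


def parse_description_deps(text, fields=("Depends", "Imports", "LinkingTo")):
    """Extract dependency names from the text of a DESCRIPTION file."""
    records = {}
    current = None
    for line in text.splitlines():
        if line[:1] in (" ", "\t") and current:
            # Continuation lines start with whitespace and extend the field.
            records[current] += " " + line.strip()
        elif ":" in line:
            current, value = line.split(":", 1)
            records[current] = value.strip()
    deps = []
    for field in fields:
        for item in records.get(field, "").split(","):
            # Drop version constraints such as "(>= 1.0.0)".
            name = re.sub(r"\(.*?\)", "", item).strip()
            # "R" itself is a requirement on the interpreter, not a package.
            if name and name != "R" and name not in deps:
                deps.append(name)
    return deps


sample = """Package: mypkg
Version: 0.1.0
Depends: R (>= 3.5.0)
Imports:
    dplyr (>= 1.0.0),
    rlang
LinkingTo: Rcpp
"""

print(parse_description_deps(sample))  # ['dplyr', 'rlang', 'Rcpp']
```

The names recovered this way would then be resolved against the configured repositories like any other package.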

pkgr and packrat and renv and pak

How pkgr compares with pak is covered in a separate write-up.

pkgr is not a replacement for packrat/renv; it is complementary to them.

packrat/renv are tools to capture the state of your R environment and isolate it from outside modification. Where packrat often falls short, however, is in the restoration of said environment. Running packrat::restore() restores packages in an iterative fashion, which is a time-consuming process that doesn't always play nicely with packages hosted outside of CRAN (such as packages hosted on GitHub). Additionally, since renv uses install.packages under the hood, each call to install.packages is still treated as an isolated procedure rather than as part of a holistic effort. This means that the installation process does not stop and inform the user when a package fails to install properly. In this situation, packrat/renv continues to install whatever packages it can, without regard for how this might affect the package ecosystem when those individual installation failures are later resolved.

pkgr solves these issues by:

  • Installing packages quickly in a parallelized graph (determined by the dependency tree)
  • Allowing users to control things like what repo a given package is retrieved from and what Makevars it is built with
  • Showing users a holistic view of their R Environment (pkgr inspect --deps --tree) and how that environment would be changed on another install (pkgr plan)
  • Providing timely error messages and halting the installation process immediately when something goes wrong during the installation process (such as a package not being available, a repository being unreachable, etc.)
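The first and last points above can be illustrated with a small Python sketch: packages are grouped into "waves" whose members have all of their dependencies satisfied and could therefore be installed in parallel, and the whole run halts at the first failure. This is a conceptual sketch using the standard library's graphlib (Python 3.9+), not pkgr's implementation; `install_waves`, `install_all`, and the toy dependency map are hypothetical.

```python
from graphlib import TopologicalSorter


def install_waves(deps):
    """Group packages into waves that could each be installed in parallel.

    deps: mapping of package name -> set of packages it depends on.
    """
    sorter = TopologicalSorter(deps)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # everything installable right now
        waves.append(ready)
        sorter.done(*ready)
    return waves


def install_all(deps, install_one):
    """Install wave by wave, halting immediately on the first failure.

    install_one: callable taking a package name and returning True on success.
    """
    installed = []
    for wave in install_waves(deps):
        # A real tool would run the members of a wave concurrently;
        # they are processed one by one here to keep the sketch short.
        for pkg in wave:
            if not install_one(pkg):
                raise RuntimeError(f"installation of {pkg} failed; halting")
            installed.append(pkg)
    return installed


# Toy dependency map (an assumption for this sketch).
toy_graph = {
    "ggplot2": {"rlang", "scales"},
    "scales": {"rlang"},
    "dplyr": {"rlang"},
    "rlang": set(),
}

print(install_waves(toy_graph))
```

Halting as soon as one package fails keeps the library from ending up in a partially installed state that only surfaces later.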