Pipenv vs Conda (for Data Scientists)
A comparison of pipenv and conda as of Jan 2021 based on various “data science-ish” criteria
Introduction
Python has many tools available for distributing code to developers and does not adhere to “There should be one — and preferably only one — obvious way to do it”. For example, conda+Anaconda is recommended by scipy.org, which manages the ubiquitous scipy stack, whilst pipenv+PyPI is recommended by PyPA, the Python Packaging Authority. This can leave data scientists in a bit of a quandary. This article compares pipenv and conda as of Jan 2021 using the following criteria, some of which are more relevant to data scientists than others:
Package Availability
Dependency resolution
Python version
Dependency Specification
Disk Space
Security
Longevity
Customisation
Packaging
Miscellaneous
The article does not recommend one tool over another but should help the reader make a decision based on their needs. The article assumes the reader is already familiar with the python packaging ecosystem, pipenv and conda. For those less familiar, I have also included a list of useful resources at the end of the article.
Package Availability
Are packages available in the appropriate format?
As stated by Anaconda, “over 1500 packages are available in the Anaconda repository, including the most popular data science, machine learning, and AI frameworks. These, along with thousands of additional packages available on Anaconda cloud from channels including conda-forge and bioconda, can be installed using conda.” Despite this large collection, it is still small compared to the 150,000 packages available on PyPI. On the other hand, not all packages on PyPI are available as wheels, which is especially problematic for data science libraries, which usually contain C/C++/Fortran code. Whilst it is possible to install PyPI packages using pip inside conda environments, pip will also pull in all the sub-dependencies from PyPI, which can cause headaches, so it is not recommended. There is also usually a delay between a release appearing on PyPI and it becoming available in the Anaconda main channel; for pandas the delay seems to be a few weeks.
I wanted to check whether pipenv+PyPI and conda+Anaconda could provision a data scientist’s basic tool set: pandas, scikit-learn, sqlalchemy, jupyter, matplotlib and networkx. I used python 3.8 because 3.9 had only recently been released.
$ pipenv install pandas scikit-learn sqlalchemy jupyter matplotlib networkx --python 3.8
$ conda create --name env_ds pandas scikit-learn sqlalchemy jupyter matplotlib networkx python=3.8
Both environments were successfully created in about 3 minutes. Note that I am using Ubuntu WSL1, different platforms might not be as successful in creating the environments.
Dependency resolution
Resolving direct and indirect dependencies
Conda
To test this criterion I used pandas, which depends on numpy. I first attempted to install numpy 1.15.3 and pandas using conda, so that the environment has a direct dependency on both numpy and pandas, as well as an indirect dependency on numpy via pandas:
$ conda create --name env_a numpy==1.15.3 pandas python=3.7
Conda successfully creates the environment and installs pandas 1.0.5, which is the last pandas version to support numpy 1.15.3.
If a package in an existing environment requires upgrading or downgrading:
$ conda create --name env_b pandas python=3.7
$ conda activate env_b
$ conda install numpy==1.15.3
Conda will ask you before updating the environment:
The following packages will be DOWNGRADED:
numpy 1.19.2-py37h54aff64_0 → 1.15.3-py37h99e49ec_0
numpy-base 1.19.2-py37hfa32c7d_0 → 1.15.3-py37h2f8d375_0
pandas 1.2.0-py37ha9443f7_0 → 1.0.5-py37h0573a6f_0

Proceed ([y]/n)?
Note that it is recommended to specify all packages at the same time to help Conda resolve dependencies.
Pipenv
I then attempted to install the same packages with pipenv:
$ pipenv install numpy==1.15.3 pandas --python 3.7
Pipenv creates an environment using numpy 1.19.1, which does not meet my specification. Pipenv then determines that there are conflicts, is unable to create a Pipfile.lock and prints the following useful message:
✘ Locking Failed!
There are incompatible versions in the resolved dependencies:
numpy==1.15.3 (from -r /tmp/pipenvzq7o52yjrequirements/pipenv-5bf3v15e-constraints.txt (line 3))
numpy>=1.16.5 (from pandas==1.2.0->-r /tmp/pipenvzq7o52yjrequirements/pipenv-5bf3v15e-constraints.txt (line 2))
Pipenv also has a graph command, with a --reverse flag, which prints the dependency graph and allows users to trace how packages depend on each other, helping to resolve conflicts.
$ pipenv graph
pandas==1.2.0
  - numpy [required: >=1.16.5, installed: 1.19.5]
  - python-dateutil [required: >=2.7.3, installed: 2.8.1]
    - six [required: >=1.5, installed: 1.15.0]
  - pytz [required: >=2017.3, installed: 2020.5]
Note that the pip dependency resolver is going through changes. I used the latest version (20.3.1) but the outcome might vary depending on the pip version.
Python version
Managing different python versions
Conda
Conda treats the python distribution like a package and will automatically install any python version that you directly specify. Moreover, when creating a new environment, conda will determine the best python version if one is not specified. For example:
$ conda create --name env_a pandas
creates an environment with python 3.8.5 and pandas 1.1.5, but
$ conda create --name env_c pandas==0.25.0
creates an environment with python 3.7.9, which is the last python version to support pandas 0.25.0.
The install will fail if it requires upgrading/downgrading the python version of an existing environment:
$ conda create --name env_d python=3.8
$ conda activate env_d
$ conda install pandas==0.25.0
but the error message is very helpful:
UnsatisfiableError: The following specifications were found
to be incompatible with the existing python installation in your environment:
Specifications:
- pandas==0.25.0 -> python[version='>=3.6,<3.7.0a0|>=3.7,<3.8.0a0']
Pipenv
Pipenv does not natively install different python versions. It will use the system python (usually found in /usr/bin) or the base python (usually found in ~/miniconda3/bin if miniconda is installed) to create new environments. However, pipenv can use pyenv to install other python versions, if pyenv is installed. You can use pyenv to pre-install python versions, or pipenv will offer to install a python version if it is not already available locally:
https://towardsdatascience.com/python-environment-101-1d68bda3094d
Unfortunately pipenv+pyenv cannot resolve the best python version, even when creating an environment from scratch. For example:
$ pipenv install pandas
creates an environment with python 3.8.5 and pandas 1.2.0. Attempting to install pandas 0.25.0 where the default pyenv python version is 3.8 stalls:
$ pipenv install pandas==0.25.0
Note that the stalling is probably due to how the requirements for pandas 0.25.0 were configured. pip relies on the python_requires attribute to determine whether the python version is suitable, and this attribute is a relatively recent addition. Attempting to install more recent packages whose python_requires attribute is not met usually fails with a “distribution not found” error. Note also that pipenv will attempt to install the latest version of a package if no version is specified, regardless of the python version. For example, attempting to install pandas in a python 3.5 environment:
$ pipenv install pandas --python 3.5
will fail with the following error message:
[pipenv.exceptions.InstallError]: ERROR: Could not find a version that satisfies the requirement pandas==1.1.5
[pipenv.exceptions.InstallError]: ERROR: No matching distribution found for pandas==1.1.5
This message is not very helpful and has been raised as an issue with pip.
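The python_requires check described above can be sketched in a few lines. This is not pip’s actual implementation; the helper below is a hypothetical illustration that only handles simple comma-separated specifiers such as >=3.6.1,<3.9.

```python
def satisfies_python_requires(version, python_requires):
    """Hypothetical sketch of a python_requires check.

    Handles comma-separated >=, <=, ==, >, < specifiers,
    e.g. version="3.5", python_requires=">=3.6.1,<3.9".
    """
    def as_tuple(v):
        return tuple(int(part) for part in v.split("."))

    for clause in python_requires.split(","):
        clause = clause.strip()
        # check two-character operators before one-character ones
        for op in (">=", "<=", "==", ">", "<"):
            if clause.startswith(op):
                bound = as_tuple(clause[len(op):].strip())
                v = as_tuple(version)
                # pad the shorter tuple with zeros so "3.5" == "3.5.0"
                n = max(len(v), len(bound))
                v += (0,) * (n - len(v))
                bound += (0,) * (n - len(bound))
                ok = {
                    ">=": v >= bound, "<=": v <= bound,
                    "==": v == bound, ">": v > bound, "<": v < bound,
                }[op]
                if not ok:
                    return False
                break
    return True
```

With pandas 1.1.5 declaring python_requires=">=3.6.1", this sketch would reject a python 3.5 environment, which is exactly the situation pip reports above.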
Dependency Specification
Ensuring a reproducible build that is upgradable
Pipenv uses two files to specify dependencies: Pipfile for direct dependencies and Pipfile.lock for both direct and indirect dependencies. Creating an environment from the Pipfile.lock ensures that exactly the same packages will be installed, down to the hash of each package. Creating an environment from the Pipfile gives pipenv the flexibility to upgrade indirect dependencies if required. Pipenv hopes that the Pipfiles will replace requirements.txt in the future (see https://github.com/pypa/pipfile).
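For illustration, a minimal Pipfile for the numpy/pandas example from earlier might look like the following sketch (the versions are just examples); the corresponding Pipfile.lock then pins every direct and indirect dependency together with its hash:

```toml
[[source]]
name = "pypi"
url = "https://pypi.org/simple"
verify_ssl = true

[packages]
numpy = "==1.15.3"
pandas = "*"

[requires]
python_version = "3.7"
```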
Conda uses an environment.yaml file to specify both direct and indirect dependencies, leaving users to rely on trial and error when updating their environments. There is a conda-lock library which replicates the Pipfile.lock functionality, but it is not currently supported by Anaconda.
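For comparison, an equivalent environment.yaml might look like the following sketch. The pip: subsection shows how PyPI-only packages are typically declared (the package name there is a placeholder):

```yaml
name: env_b
channels:
  - defaults
dependencies:
  - python=3.7
  - numpy=1.15.3
  - pandas
  - pip
  - pip:
      - some-pypi-only-package  # placeholder
```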
Disk Space
How much space do environments take up? Can sharing help?
Python environments used by data scientists tend to be large, especially conda environments. For example, a conda environment with jupyter and pandas takes up 1.7GB, whilst an equivalent pipenv environment takes up 208MB. Whilst not relevant to most development environments, this may become more important in production, for example when using containers:
https://towardsdatascience.com/how-to-shrink-numpy-scipy-pandas-and-matplotlib-for-your-data-product-4ec8d7e86ee4
Because of their large size, data scientists often use a conda environment across multiple exploratory projects, or even across multiple production projects which are part of the same solution:
https://stackoverflow.com/questions/55892572/keeping-the-same-shared-virtualenvs-when-switching-from-pyenv-virtualenv-to-pip
The conda environment can be created, activated and used from any location.
A pipenv environment is tied to a project repository. Once created, pipenv saves the Pipfiles to the root of the repository. The installed packages are saved to ~/.local/share/virtualenvs/ by default, where pipenv ensures one environment per repository by creating a new directory and appending a hash of the project path to its name (e.g. my_project-a3de50). The user must cd to the root of the project repository to activate the environment, although the shell remains activated even after leaving the directory. It is possible to share an environment across multiple projects by storing the Pipfiles in a separate directory, but the user must then remember to cd to that directory to activate and update the environment.
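The per-project naming scheme described above can be sketched as follows. The exact hashing scheme is a pipenv internal; the sha256/base64 combination and 8-character suffix below are assumptions for illustration and will not reproduce pipenv’s actual suffixes.

```python
import base64
import hashlib


def virtualenv_name(project_path, project_name):
    """Illustrative sketch of pipenv-style per-project env naming.

    The suffix is derived from a hash of the project's absolute path,
    so two projects with the same name but different locations get
    different environments. (Hypothetical scheme, not pipenv's own.)
    """
    digest = hashlib.sha256(project_path.encode()).digest()
    suffix = base64.urlsafe_b64encode(digest)[:8].decode()
    return f"{project_name}-{suffix}"
```

Because the suffix depends on the path, renaming or moving the project directory changes the derived name, which is why pipenv creates a fresh environment in that case (see Miscellaneous below).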
Security
How safe are packages to install?
The Anaconda main channel (https://anaconda.org/anaconda/) is maintained by Anaconda employees, and packages go through a strict security check before being uploaded. Pipenv, on the other hand, uses PyPI, where anyone can upload any package, and nefarious packages have been found in the past (see https://www.zdnet.com/article/twelve-malicious-python-libraries-found-and-removed-from-pypi/). The same goes for conda-forge, although they are developing a process to validate artifacts before they are uploaded to the repository.
Work-arounds include:
- Perform security checks using tools like JFrog Xray (https://jfrog.com/xray/)
- Only install packages which are at least a month old to give enough time for issues to be found and resolved
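The second work-around can be sketched as a small filter. In practice the upload timestamps would come from PyPI’s JSON API (https://pypi.org/pypi/<package>/json); the helper below is a hypothetical illustration that works on an already-fetched mapping of versions to upload times.

```python
from datetime import datetime, timedelta


def versions_older_than(releases, min_age_days=30, now=None):
    """Return versions uploaded at least `min_age_days` ago.

    `releases` maps version strings to upload datetimes. This is a
    hypothetical helper illustrating the "at least a month old" rule;
    it does not query PyPI itself.
    """
    now = now or datetime.utcnow()
    cutoff = now - timedelta(days=min_age_days)
    return sorted(v for v, uploaded in releases.items() if uploaded <= cutoff)
```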
Longevity
Is conda/pipenv here to stay? How mature is it? Who supports it?
Pipenv was first introduced in 2017 by the creator of the popular requests library. Pipenv did not release any new code between Nov 2018 and May 2020, which raised concerns about its future:
https://medium.com/telnyx-engineering/rip-pipenv-tried-too-hard-do-what-you-need-with-pip-tools-d500edc161d4
https://chriswarrick.com/blog/2018/07/17/pipenv-promises-a-lot-delivers-very-little/
Pipenv has now been picked up by new developers and is being updated more regularly with monthly releases since May 2020.
Conda/Anaconda was created in 2012 by the same team behind scipy.org, which manages the scipy stack. Conda is an open-source tool, but the anaconda repository is hosted by Anaconda Inc., a for-profit organisation. Whilst this means conda/anaconda is unlikely to disappear anytime soon, it has raised concern that Anaconda Inc. might start charging users. They have recently changed their terms and conditions to charge heavy or commercial users, which includes those mirroring the anaconda repository. Note that the new terms do not apply to the conda-forge channel.
Customisation
What advantages does a custom package manager bring?
Conda/Anaconda was created by the python scientific community to solve problems specific to their community, such as non-python dependencies:
http://technicaldiscovery.blogspot.com/2013/12/why-i-promote-conda.html
This gives it the flexibility and impetus to create products geared for Data Scientists.
Conda can distribute non-Python build requirements, such as gcc, which greatly streamlines the process of building other packages on top of the pre-compiled binaries it distributes. Conda can also install R packages. Anaconda developed MKL-powered binary versions of some of the most popular numerical/scientific Python libraries, which have been shown to deliver significant performance improvements. Whilst MKL optimisations are no longer in production, Anaconda could still develop tools that are only compatible with a conda environment.
Packaging
How is code packaged up?
Both conda and pipenv rely on additional tools for packaging code. Both also rely on following “recipes” depending on whether the code contains non-python code and the target platform.
Conda-build is used to create conda packages:
https://docs.conda.io/projects/conda-build/en/latest/
PyPA recommends using setuptools to build wheels that can be installed using pipenv. Below is a great overview:
https://realpython.com/python-wheels/
Note that python packaging is expected to change a lot in the future with the introduction of the pyproject.toml file and PEP 518:
https://grassfedcode.medium.com/pep-517-and-518-in-plain-english-47208ca8b7a6
Miscellaneous
Any other factors to consider?
- Conda resolves and prints what packages will be installed before installing them, giving users the opportunity to proceed or reconsider before going through the lengthy installation procedure
- Changing the name/path of the project directory breaks the pipenv environment, and a new environment is automatically created (see https://github.com/pypa/pipenv/issues/796)
- Conda does not automatically create/update the environment.yaml file, unlike pipenv which updates the Pipfile. Hence it is possible for your environment and environment.yaml file to get out of sync if you forget to update the environment.yaml file
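One way to reduce the drift between a conda environment and its environment.yaml file is to regenerate the file from the live environment. The --from-history flag (available in conda 4.7.12+, to my knowledge) keeps only the packages you explicitly requested rather than the full pinned list:

```
$ conda env export > environment.yaml
$ conda env export --from-history > environment.yaml
```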
Useful Resources
A review of the python packaging ecosystem
https://packaging.python.org/overview/
https://towardsdatascience.com/packaging-in-python-tools-and-formats-743ead5f39ee
A guide to pipenv
https://realpython.com/pipenv-guide/
A guide to conda/Anaconda for data scientists
(Whilst geared for Windows, the theory is relevant to any OS)
https://realpython.com/python-windows-machine-learning-setup/
A comparison of conda and pip
https://jakevdp.github.io/blog/2016/08/25/conda-myths-and-misconceptions/
https://www.anaconda.com/blog/understanding-conda-and-pip
Ensuring a reproducible build while still being able to quickly change your dependencies
https://pythonspeed.com/articles/conda-dependency-management/
Options for packaging your Python code
https://pythonspeed.com/articles/distributing-software/