# Structure
This page gives a guide to where things belong within the cookiecutter structure.
A direct tree representation of the folder hierarchy is also given at the bottom.
Note: In the following sections we use `src/` to denote the project name, to avoid awkward `<project_name>` placeholders.
## Project configuration
Once you have spun up your project, you will need to configure it for use. This will require setting up some of the tools described below. While the cookiecutter sets up the project structure and initialises the git repository in the project folder, nothing else will be activated automatically. You will need to:

- Set up your virtual environment using your selected tool, such as `uv`, `poetry`, or `conda`.
- Set up pre-commit using `pre-commit install --install-hooks`.
- Activate `direnv`, if you need environment variables in your shell, using `direnv allow` (a consolidated sketch of these steps is shown below).
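Putting those steps together, a minimal sequence looks something like this (assuming you chose `uv` and use `direnv`; adapt the commands to your chosen tools):

```bash
uv sync                              # create .venv and install the locked dependencies
source .venv/bin/activate            # activate the environment
pre-commit install --install-hooks   # set up the git hooks
direnv allow                         # only needed if you use environment variables via direnv
```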
All of these steps are required after the cookiecutter is set up, but critically, they are also the first steps we should take when working in any repository, whether it's a new project or a clone of a colleague's!
### Git hooks
We use pre-commit to check the integrity of git commits before they happen. You should set this up with your virtual environment. With `uv`, it's as simple as adding it to the environment, sourcing the environment, and running `pre-commit install --install-hooks`.
```bash
uv add pre-commit
source .venv/bin/activate
pre-commit install --install-hooks
```
The hooks are specified in `.pre-commit-config.yaml`. A core set of pre-commit hooks performs some basic checks on the repository (ensuring no large files are committed and fixing trailing whitespace). More critically, we use ruff, an all-in-one alternative to black, flake8, and isort, to format and lint the repository.
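For orientation, a config of this kind typically looks something like the following sketch (the hook versions and exact selection here are illustrative, not the cookiecutter's actual file):

```yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.6.0   # illustrative version
    hooks:
      - id: check-added-large-files
      - id: trailing-whitespace
  - repo: https://github.com/astral-sh/ruff-pre-commit
    rev: v0.6.9   # illustrative version
    hooks:
      - id: ruff          # lint
      - id: ruff-format   # format
```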
Now, every time you run `git commit`, these checks should run automatically.
Warning: You need to run `git commit` with your virtual environment activated. This is because, by default, the packages used by pre-commit are installed into your project's environment. (Note: `pre-commit install --install-hooks` installs the hooks using the currently active environment, which is why we source the environment before running it in the example above.)
## Reproducible environment
The first step in reproducing someone else’s analysis is to reproduce the computational environment it was run in. You need the same tools, the same libraries, and the same versions to make everything play nicely together.
By listing all of your requirements in the repository you can easily track the packages needed to recreate the analysis. We recommend the use of uv to manage the requirements in your project, but you are free to choose other tools such as poetry, Python's built-in venv, or conda.
Whilst popular for scientific computing and data science, conda poses problems for collaboration and packaging, so we recommend moving away from its use if at all possible.
### Files
The files used to track dependencies in a virtual environment will vary based on the tool you use. For `uv` and `poetry`, the entire project will be tracked using the `pyproject.toml` file. The `pyproject.toml` file has already been pre-filled in the cookiecutter, but you should learn how it works so you can make edits and changes as necessary within your projects.
You can add or remove dependencies directly from the command line, using `uv add pandas` or `uv remove numpy` to automatically edit the `pyproject.toml` file. You can add development dependencies with the `--dev` flag, like so: `uv add --dev ipykernel`. You can substitute `poetry` for `uv` depending on the tool you're using.
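For orientation, a trimmed-down `pyproject.toml` might look something like the sketch below. The package names and versions are illustrative only, and the table used for development dependencies can vary with your uv version (older releases use a `[tool.uv]` table instead of `[dependency-groups]`), so treat the cookiecutter's own file as the reference:

```toml
[project]
name = "my_project"          # illustrative project name
version = "0.1.0"
requires-python = ">=3.10"
dependencies = [
    "pandas>=2.0",
]

[dependency-groups]
dev = [
    "ipykernel",
    "pre-commit",
]
```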
If using `conda` or Python's built-in environment manager, you may manually manage requirements in text files, like `requirements.txt` and `requirements_dev.txt`. `conda` will also use an `environment.yaml` file to define the base conda environment and any dependencies that are not "pip-installable".
### Commands
You will need to create your virtual environment, install the packages, and activate it using the tool you've selected. Previously, `make install` helped with this setup and updating. Using `uv`, you can run the following:
- `uv venv` to create a virtual environment in the `.venv` folder
- `uv add package` to add dependencies
- `uv sync` to ensure your virtual environment is synced with the `uv` lockfile, especially useful when entering new projects
- `uv pip install -e .` to install the project to your environment in an editable format
- `source .venv/bin/activate` to activate the virtual environment, or `uv run script.py` to directly run any Python script in the environment
These are just some of the commands you might want to run if you chose `uv`. You should learn and explore whatever tool you choose so you can confidently manage your project and its dependencies.
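As a concrete illustration, when picking up an existing `uv`-managed project you might run something like the following (the script path is illustrative):

```bash
uv sync                              # create .venv (if needed) and install the locked dependencies
source .venv/bin/activate            # activate the environment
pre-commit install --install-hooks   # set up the git hooks
uv run python src/analysis/some_script.py   # or run scripts directly via uv
```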
## Secrets and configuration - `.env.*` and `src/config/*`
You really don't want to leak your AWS secret key or database username and password on GitHub. To avoid this you can:
### Store your secrets in a special file
Create a `.env` file in the project root folder. Thanks to the `.gitignore`, this file should never get committed into the version control repository.
Here's an example:
```bash
# example .env file
DATABASE_URL=postgres://username:password@localhost:5432/dbname
OTHER_VARIABLE=something
```
We also have `.envrc`, which contains non-secret project configuration shared across users, such as the bucket that our input data is stored in. `direnv` automatically loads `.envrc` (which itself loads `.env`), making our configuration available. Add all environment variables directly to `.env` so they are available in Python through `dotenv` as well as in your terminal through `direnv`. You will need to activate `direnv` in the repository by running `direnv allow`.
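For orientation, a `.envrc` of this kind might look like the sketch below (the `dotenv` directive is part of direnv's standard library; the bucket path is purely illustrative):

```bash
# example .envrc (non-secret, shared via git)
dotenv                                        # also load the git-ignored .env file
export S3_INPUT_PATH=s3://my-bucket/inputs    # illustrative bucket path
```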
### Store data science configuration in `src/config/`
If there are certain variables that are needed throughout a codebase, it is useful to store them in a single place rather than having to define them throughout the project.
`src/config/base.yaml` provides a place to document these global variables.
For example, if you were fuzzy-matching the PATSTAT patent database to the Companies House database and wanted to only merge above a certain match score, you may add a section to the configuration like the following:
```yaml
patstat_companies_house:
  match_threshold: 90
```
and load that value into your code with:

```python
from src import config

config["patstat_companies_house"]["match_threshold"]
```
This centralisation provides a clearer log of decisions and decreases the chance that a different match threshold gets incorrectly used somewhere else in the codebase.
Config files are also useful for storing model parameters. Storing model parameters in a config makes it much easier to test different model configurations and document and reproduce your model once it’s been trained. You can easily reference your config file to make changes and write your final documentation rather than having to dig through code. Depending on the complexity of your repository, it may make sense to create separate config files for each of your models.
For example, if training an SVM classifier you may want to test different values of the regularisation parameter `C`. You could create a file called `src/config/svm_classifier.yaml` to store the parameter values in the same way as before.
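Such a file might look something like this (the parameter names and values are illustrative only):

```yaml
# src/config/svm_classifier.yaml
svm_classifier:
  C: 10
  kernel: rbf
```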
Note - as well as avoiding hard-coding parameters into our code, we should never hard-code full file paths, e.g. `/home/Projects/my_fantastic_data_project/outputs/data/foo.json`, as this will never work on anything other than your machine.

Instead, use relative paths and make use of `src.PROJECT_DIR`, which will return the path to your project's base directory. This means you could specify the above path as `f"{src.PROJECT_DIR}/outputs/data/foo.json"` and have it work on everyone's machine!
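Putting that together, a small sketch (assuming, as above, that the package exposes `PROJECT_DIR`):

```python
from pathlib import Path

from src import PROJECT_DIR  # the project's base directory

# build paths relative to the project root so they work on any machine
output_path = Path(PROJECT_DIR) / "outputs" / "data" / "foo.json"
```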
## Data - `inputs/data`, `outputs/data`, `outputs/.cache`
Generally, don't version control data (inputs or outputs) in git; it is best to use S3 (directly or through metaflow) to manage your data.
### `inputs/data`
Put any data dependencies of your project that your code doesn't fetch here (e.g. if someone emailed you a spreadsheet with the results of a randomised control trial).
Don't ever edit this raw data, especially not manually or in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable.
Ideally, you should store it in AWS S3. You can then use the ds-utils package, which has a neat way of pulling data into a dataframe. Alternatively, if you set the `S3_INPUT_PATH` environment variable (e.g. in `.env`) then you can use `make inputs-pull` to pull data from the configured S3 bucket.
### `outputs/.cache/`
This folder is for ephemeral data and any pipeline/analysis step should be runnable following the deletion of this folder's contents.
For example, this folder could be used as a file-level cache (careful about cache invalidation!); to download a file from the web before immediately reading, transforming, and saving it as a clean file in `outputs/data`; or to hold temporary data when prototyping.
### `outputs/data`
This folder should contain transformed/processed data that is to be used in the final analysis or is a data output of the final analysis. Again, if this data is sensitive, it is always best to save on S3 instead!
Try to order this folder logically. For example, you may want subfolders organised by dataset, sections of analysis, or some other hierarchy that better captures your project.
## Fetching/loading data - `src/getters`
This folder should contain modules and functions which load our data. Anywhere in the codebase that we need to load data, we should do so by importing and calling a getter (except when prototyping in notebooks).
This means that lots of calls like `pd.read_csv("path/to/file", sep="\t", ...)` throughout the codebase can be avoided.
Following this approach means:
- If the format of `path/to/file` changes then we only have to make the change in one place
- We avoid inconsistencies, such as forgetting to read a column in as a `str` instead of an `int` and thus missing leading zeros
- If we want to see what data is available, we have a folder in the project to go to, and we let the code speak for itself as much as possible - e.g. the following is a lot more informative than an inline call to `pd.read_csv` like we had above
Here are two examples:
```python
# File: getters/companies_house.py
"""Data getters for the companies house data.

Data source: https://download.companieshouse.gov.uk/en_output.html
"""
import pandas as pd


def get_sector() -> pd.DataFrame:
    """Load Companies House sector labels.

    Returns:
        Sector information for ...
    """
    return pd.read_csv("path/to/file", sep="\t", dtype={"sic_code": str})
```
or using ds-utils:
```python
# File: getters/asq_data.py
"""Data getters for the ASQ data."""
import pandas as pd
from nesta_ds_utils.loading_saving import S3


def get_asq_data() -> pd.DataFrame:
    """Load ASQ data for assessments taken in 2022.

    Returns:
        Dataframe of the ASQ data at individual level including information on …
    """
    return S3.download_obj(
        bucket="bucket_name",
        path_from="data/raw/data_asq.csv",
        download_as="dataframe",
        kwargs_reading={"engine": "python"},
    )
```
## Pipeline components - `src/pipeline`
This folder contains pipeline components. Put as much data science as possible here.
We recommend the use of metaflow to write these pipeline components.
Using metaflow:
- Gives us lightweight version control of data and models
- Gives us easy access to AWS batch computing (including GPU machines)
- Makes it easy to take data-science code into production
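For orientation, a minimal metaflow flow might look like the sketch below (the file name and step logic are purely illustrative):

```python
# File: src/pipeline/example_flow.py (illustrative)
from metaflow import FlowSpec, step


class ExampleFlow(FlowSpec):
    """A toy pipeline component."""

    @step
    def start(self):
        # artifacts assigned to self are versioned by metaflow
        self.numbers = [1, 2, 3]
        self.next(self.square)

    @step
    def square(self):
        self.squares = [n**2 for n in self.numbers]
        self.next(self.end)

    @step
    def end(self):
        print(self.squares)


if __name__ == "__main__":
    ExampleFlow()
```

You would run it with `python src/pipeline/example_flow.py run`.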
## Shared utilities - `src/utils`
This is a place to put utility functions needed across different parts of the codebase.
For example, this could be functions shared across different pieces of analysis or different pipelines.
## Analysis - `src/analysis`
Functionality in this folder takes the pipeline components (possibly combining them) and generates the plots/statistics to feed into reports.
It is easier to say when something shouldn't be in `analysis` than when something should: if one part of `analysis` depends on another, then that suggests that the thing in common is likely either a pipeline component or a shared utility (i.e. sections of `analysis` should be completely independent).
It is important that plots are saved in `outputs/` rather than in different areas of the repository.
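For example, a piece of analysis might look something like the sketch below. The module name, plot, and column are illustrative, and it reuses the illustrative `get_sector` getter from the earlier example:

```python
# File: src/analysis/sector_counts.py (illustrative)
from pathlib import Path

import matplotlib.pyplot as plt

from src import PROJECT_DIR
from src.getters.companies_house import get_sector

FIGURES_DIR = Path(PROJECT_DIR) / "outputs" / "figures"


def plot_sector_counts() -> None:
    """Plot the ten most common SIC codes and save the figure to outputs/figures."""
    sectors = get_sector()
    ax = sectors["sic_code"].value_counts().head(10).plot(kind="bar")
    ax.set_ylabel("Number of companies")
    FIGURES_DIR.mkdir(parents=True, exist_ok=True)
    plt.savefig(FIGURES_DIR / "sector_counts.png", bbox_inches="tight")
```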
## Notebooks - `src/analysis/notebooks`
Notebook packages like Quarto are effective tools for exploratory data analysis, fast prototyping, and communicating results; however, between prototyping and communicating results, code should be factored out into proper Python modules. Compared with Jupyter notebooks, Quarto `.qmd` files are Markdown-based text files that render outputs to separate files, making it easier to code review the raw text and ensure sensitive outputs are not version controlled.
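For orientation, a minimal `.qmd` file might look like the sketch below (the title and cell contents are purely illustrative):

````markdown
---
title: "Sentence transformer prototyping"   # illustrative title
format: html
---

A short narrative about the experiment goes here.

```{python}
# prototype code lives in cells like this; refactor anything reusable into src/
import pandas as pd
```
````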
We have a notebooks folder for all your notebook needs! For example, if you are prototyping a "sentence transformer" you can place the notebooks for prototyping this feature in notebooks, e.g. `analysis/notebooks/sentence_transformer/` or `analysis/notebooks/pipeline/sentence_transformer/`.
Please try to keep all notebooks within this folder and, in general, off GitHub, especially once code has been refactored, as it will then live elsewhere, e.g. in the pipeline. However, for collaboration, sharing, and QA of analysis, you are welcome to push notebooks to GitHub. Make sure that you do not save rendered files (e.g. HTML) directly to GitHub.
### Refactoring
Everybody likes to work differently. Some like to eagerly refactor, keeping as little in notebooks as possible (or even eschewing notebooks entirely); whereas others prefer to keep everything in notebooks until the last minute.
You are welcome to work in whatever way you’d like, but try to always submit a pull request (PR) for your feature with everything refactored into python modules.
We often find it easiest to refactor frequently, otherwise you might end up with duplicates of functions across the codebase. For example: if it's a data preprocessing task, put it in the pipeline at `src/pipelines/<descriptive name for task>`; if it's useful utility code, refactor it to `src/utils/`; if it's loading data, refactor it to `src/getters`.
### Tips
Add the following to your notebook (or IPython REPL):
```python
%load_ext autoreload
%autoreload 2
```
Now when you save code in a Python module, the notebook will automatically load in the latest changes without you having to restart the kernel, re-import the module, etc.
### Share with gists
As rendered outputs are kept out of the git repository, another way is needed to share the outputs of quick and dirty analysis.
We suggest using Gists, particularly the Gist it notebook extension, which adds a button to your notebook that will instantly turn the notebook into a gist and upload it to your GitHub (as public or private).
This requires `jupyter_nbextensions_configurator` and a GitHub personal access token.
### Don't install `jupyter`/`jupyterlab` in your environment, use `ipykernel`
You should avoid `jupyter`/`jupyterlab` as a dependency in the project environment.
Instead, add `ipykernel` as a dependency. This is a lightweight dependency that allows `jupyter`/`jupyterlab` installed elsewhere (e.g. your main conda environment or system installation) to run the code in your project.
Run `python -m ipykernel install --user --name=<project environment name>` from within your project environment to allow jupyter to use your project's virtual environment.
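For example (the kernel name here is illustrative):

```bash
# run from within the activated project environment
python -m ipykernel install --user --name=my_project

# optionally confirm the kernel is registered, using a jupyter installed elsewhere
jupyter kernelspec list
```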
The advantages of this are:
- You only have to configure `jupyter`/`jupyterlab` once
- You will save disk space
- Faster install
- Colleagues using other editors don't have to install heavy dependencies they don't use
Note: `ipykernel` is also listed in `requirements_dev.txt` so you do not need to add it.
## Report - `outputs/reports`
You can write reports in markdown and put them in `outputs/reports`, and reference plots in `outputs/figures`.
## Tree
```
├── <REPO NAME>               | PYTHON PACKAGE
│   ├── __init__.py           |
│   ├── analysis/             | Analysis
│   ├── config                | Configuration
│   │   ├── logging.yaml      | logging configuration
│   │   ├── base.yaml         | global configuration (e.g. for tracking hyper-parameters)
│   │   └── pipeline/         | pipeline configuration files
│   ├── getters/              | Data getters
│   ├── notebooks/            | Notebooks
│   ├── pipeline/             | Pipeline components
│   └── utils/                | Utilities
├── docs/                     | DOCUMENTATION
├── pyproject.toml            | PROJECT METADATA AND CONFIGURATION
├── environment.yaml          | CONDA ENVIRONMENT SPECIFICATION (optional component)
├── inputs/                   | INPUTS (should be immutable)
├── LICENSE                   |
├── outputs/                  | OUTPUTS PRODUCED FROM THE PROJECT
├── Makefile                  | TASKS TO COORDINATE PROJECT (`make` shows available commands)
├── README.md                 |
├── .pre-commit-config.yaml   | DEFINES CHECKS THAT MUST PASS BEFORE git commit SUCCEEDS
├── .gitignore                | TELLS git WHAT FILES WE DON'T WANT TO COMMIT
├── .github/                  | GITHUB CONFIGURATION
├── .env                      | SECRETS (never commit to git!)
├── .envrc                    | SHARED PROJECT CONFIGURATION VARIABLES
└── .cookiecutter             | COOKIECUTTER SETUP & CONFIGURATION (user can safely ignore)
```
## The Makefile
Where did the Makefile go? Previously, our cookiecutter used a Makefile to help manage the environment and its dependencies. However, due to its growing size and complexity, it was becoming difficult to maintain. The Makefile commands were also obscuring the underlying tools they used, preventing users from growing their experience and confidence in directly managing their project's environment.
We decided to do away with the Makefile, and hope to support everyone in their journey of managing their projects and dependencies directly.