Structure¶
This page gives a guide to where things belong within the cookiecutter structure.
A direct tree representation of the folder hierarchy is also given at the bottom.
Note: In the following sections we use src/ to denote the project name to avoid awkward <project_name> placeholders.
Project configuration - Makefile¶
We use make to manage tasks relating to project setup, configuration, and recurring chores. make is one of the simplest ways to manage steps that depend on each other, such as project configuration, and it is a common tool on Unix-based platforms.
Running make from the project base directory will list the available commands along with a short description of each.
You should run make install when you start working on a cookiecutter project. This will create a conda environment with the name of your project and set up the git hooks as well.
For more information on the Makefile, see The Makefile section at the bottom of this page.
Git hooks¶
We use pre-commit to check the integrity of git commits before they happen.
The steps are specified in .pre-commit-config.yaml.
Currently, the steps taken are:
- Run the black code auto-formatter - this gives us a consistent code style across projects when collaborating with other members of the team
- Check that no large files were accidentally committed
- Check that there are no merge conflict strings (e.g. >>>>>) lingering in files
- Fix the end of files to work across operating systems
- Trim trailing whitespace in files
- Check TOML files are well formed
- Check YAML files are well formed
- Check we are not committing directly to dev, master, or main
- Run the prettier formatter (covers files such as Markdown/JSON/YAML/HTML)
Warning: You need to run git commit with your conda environment activated. This is because, by default, the packages used by pre-commit are installed into your project's conda environment. (Note: pre-commit install --install-hooks will install the pre-commit hooks in the currently active environment.)
Reproducible environment¶
The first step in reproducing someone else’s analysis is to reproduce the computational environment it was run in. You need the same tools, the same libraries, and the same versions to make everything play nicely together.
By listing all of your requirements in the repository you can easily track the packages needed to recreate the analysis.
Whilst popular for scientific computing and data science, conda poses problems for collaboration and packaging:
- It is hard to reproduce a conda environment across operating systems
- It is hard to make your environment "pip-installable" if it is fully specified by conda
Files¶
Due to these difficulties, we recommend only using conda to create a virtual environment and to list dependencies that are not available through pip install (one example of this is graph-tool).
- environment.yaml - Defines the base conda environment and any dependencies that are not "pip-installable".
- requirements.txt - Defines the dependencies required to run the code. If you need to add a dependency, chances are it goes here!
- requirements_dev.txt - Defines development dependencies - those needed during development but not to run the core code, for example packages to build documentation, run tests, and ipykernel to run code in jupyter. (It's likely that you never need to think about this file.)
Commands¶
- make conda-update - Update an existing conda environment (created by make install) from environment.yaml and run make pip-install.
- make conda-remove - Remove an existing conda environment, tidying up the cookiecutter's internal state.
- make pip-install - Install our package and requirements in editable mode (including development dependencies).
Secrets and configuration - .env.* and src/config/*¶
You really don't want to leak your AWS secret key or database username and password on GitHub. To avoid this you can:
Store your secrets in a special file¶
Create a .env file in the project root folder. Thanks to the .gitignore, this file should never get committed into the version control repository.
Here's an example:
# example .env file
DATABASE_URL=postgres://username:password@localhost:5432/dbname
OTHER_VARIABLE=something
We also have .envrc, which contains non-secret project configuration shared across users, such as the bucket that our input data is stored in. direnv automatically loads .envrc (which itself loads .env), making our configuration available.
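Once direnv has loaded .envrc and .env, your code can pick the values up from the environment. A minimal sketch (using only the variable names from the example .env above):

```python
import os

# Values exported by direnv from .env / .envrc become plain environment variables.
database_url = os.environ["DATABASE_URL"]  # secret, from .env; raises KeyError if not loaded
other_variable = os.environ.get("OTHER_VARIABLE", "some-default")  # optional, with a fallback
```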
Store data science configuration in src/config/¶
If there are certain variables that are used throughout a codebase, it is better to define them in a single place rather than repeating them throughout the project. src/config/base.yaml provides a place to document these global variables.
For example, if you were fuzzy-matching the PATSTAT patent database to the Companies House database and wanted to only merge above a certain match score, you might add a section to the configuration like the following,
patstat_companies_house:
match_threshold: 90
and load that value into your code with,
from src import config
config["patstat_companies_house"]["match_threshold"]
This centralisation provides a clearer log of decisions and decreases the chance that a different match threshold gets incorrectly used somewhere else in the codebase.
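The config object itself is provided by the cookiecutter; conceptually it is just the YAML under src/config/ read into a dictionary. A minimal sketch of that idea (assuming pyyaml is installed; the cookiecutter's real implementation may differ):

```python
from pathlib import Path

import yaml  # assumption: pyyaml is available in the environment

# Hypothetical loader, for illustration only.
_BASE_CONFIG = Path(__file__).parent / "config" / "base.yaml"

with open(_BASE_CONFIG) as f:
    config = yaml.safe_load(f)  # e.g. {"patstat_companies_house": {"match_threshold": 90}}
```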
Config files are also useful for storing model parameters. Storing model parameters in a config makes it much easier to test different model configurations and document and reproduce your model once it’s been trained. You can easily reference your config file to make changes and write your final documentation rather than having to dig through code. Depending on the complexity of your repository, it may make sense to create separate config files for each of your models.
For example, if training an SVM classifier you may want to test different values of the regularisation parameter 'C'. You could create a file called src/config/svm_classifier.yaml to store the parameter values in the same way as before.
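As a hedged sketch of how that might look in practice (the section name, keys, and values below are hypothetical, and scikit-learn is assumed to be a project dependency):

```python
# Hypothetical contents of src/config/svm_classifier.yaml:
#
#   svm_classifier:
#     C: 10
#     kernel: rbf

from sklearn.svm import SVC  # assumption: scikit-learn is in requirements.txt

from src import config

params = config["svm_classifier"]  # assumes svm_classifier.yaml is picked up by the config object
model = SVC(C=params["C"], kernel=params["kernel"])
```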
Note: as well as avoiding hard-coding parameters into our code, we should never hard-code full file paths, e.g. /home/Projects/my_fantastic_data_project/outputs/data/foo.json, as this will never work on anything other than your machine.
Instead, use relative paths and make use of src.PROJECT_DIR, which returns the path to your project's base directory. This means you could specify the above path as f"{src.PROJECT_DIR}/outputs/data/foo.json" and have it work on everyone's machine!
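For example (a small sketch; it assumes PROJECT_DIR behaves like a path, and foo.json is just the placeholder file name from above):

```python
from pathlib import Path

from src import PROJECT_DIR  # the project's base directory

# Build the path relative to the project root so it works on any machine.
output_path = Path(PROJECT_DIR) / "outputs" / "data" / "foo.json"
output_path.parent.mkdir(parents=True, exist_ok=True)
output_path.write_text('{"example": true}')
```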
Data - inputs/data, outputs/data, outputs/.cache¶
Generally, don't version control data (inputs or outputs) in git; it is best to use S3 (directly or through metaflow) to manage your data.
inputs/data¶
Put any data dependencies of your project that your code doesn't fetch here (e.g. if someone emailed you a spreadsheet with the results of a randomised control trial).
Don't ever edit this raw data, especially not manually or in Excel. Don't overwrite your raw data. Don't save multiple versions of the raw data. Treat the data (and its format) as immutable.
Ideally, you should store it in AWS S3. You can then use the ds-utils package, which has a neat way of pulling data into a dataframe. Alternatively, if you set the S3_INPUT_PATH environment variable (e.g. in .envrc) then you can use make inputs-pull to pull data from the configured S3 bucket.
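For example, once the data is in inputs/data (whether it was emailed to you or pulled with make inputs-pull), a getter might read it like this (the file and function names below are hypothetical):

```python
from pathlib import Path

import pandas as pd

from src import PROJECT_DIR


def get_rct_results() -> pd.DataFrame:
    """Load the (hypothetical) randomised control trial spreadsheet from inputs/data."""
    return pd.read_csv(Path(PROJECT_DIR) / "inputs" / "data" / "rct_results.csv")
```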
outputs/.cache/¶
This folder is for ephemeral data and any pipeline/analysis step should be runnable following the deletion of this folder's contents.
For example, this folder could be used as a file-level cache (careful about cache invalidation!); to download a file from the web before immediately reading, transforming, and saving it as a clean file in outputs/data; or to hold temporary data when prototyping.
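A minimal sketch of the file-level cache pattern (the helper name and URL handling are illustrative; requests is assumed to be available):

```python
from pathlib import Path

import requests  # assumption: requests is a project dependency

from src import PROJECT_DIR

CACHE_DIR = Path(PROJECT_DIR) / "outputs" / ".cache"


def fetch_raw_file(url: str, filename: str) -> Path:
    """Download a file into outputs/.cache/ unless it is already there."""
    CACHE_DIR.mkdir(parents=True, exist_ok=True)
    path = CACHE_DIR / filename
    if not path.exists():  # careful: nothing invalidates the cache if the remote file changes
        path.write_bytes(requests.get(url, timeout=30).content)
    return path
```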
outputs/data¶
This folder should contain transformed/processed data that is to be used in the final analysis or is a data output of the final analysis. Again, if this data is sensitive, it is always best to save on S3 instead!
Try to order this folder logically. For example, you may want subfolders organised by dataset, sections of analysis, or some other hierarchy that better captures your project.
Fetching/loading data - src/getters¶
This folder should contain modules and functions which load our data. Anywhere in the codebase that we need to load data, we should do so by importing and calling a getter (the exception being prototyping in notebooks).
This means that lots of calls like pd.read_csv("path/to/file", sep="\t", ...) scattered throughout the codebase can be avoided.
Following this approach means:
- If the format of path/to/file changes then we only have to make the change in one place
- We avoid inconsistencies, such as forgetting to read a column in as a str instead of an int and thus missing leading zeros
- If we want to see what data is available, we have a folder in the project to go to, and we let the code speak for itself as much as possible - e.g. the following is a lot more informative than an inline call to pd.read_csv like we had above
Here are two examples:
# File: getters/companies_house.py
"""Data getters for the companies house data.
Data source: https://download.companieshouse.gov.uk/en_output.html
"""
import pandas as pd
def get_sector() -> pd.DataFrame:
"""Load Companies House sector labels.
Returns:
Sector information for ...
"""
return pd.read_csv("path/to/file", sep="\t", dtype={"sic_code": str})
or using ds-utils:
# File: getters/asq_data.py
"""Data getters for the ASQ data.
"""
import pandas as pd
from nesta_ds_utils.loading_saving import S3
def get_asq_data() -> pd.DataFrame:
"""Load ASQ data for assessments taken in 2022.
Returns: Dataframe of the ASQ data at individual level including information on …
"""
return S3.download_obj(
bucket="bucket_name",
path_from="data/raw/data_asq.csv",
download_as="dataframe",
kwargs_reading={"engine": "python"},
)
Pipeline components - src/pipeline¶
This folder contains pipeline components. Put as much data science as possible here.
We recommend the use of metaflow to write these pipeline components.
Using metaflow:
- Gives us lightweight version control of data and models
- Gives us easy access to AWS batch computing (including GPU machines)
- Makes it easy to take data-science code into production
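As a rough sketch of what a component might look like as a metaflow flow (the flow, file name, and steps are purely illustrative):

```python
# Hypothetical file: src/pipeline/sentence_lengths/flow.py
from metaflow import FlowSpec, step


class SentenceLengthFlow(FlowSpec):
    """Toy flow: take some sentences and compute their lengths."""

    @step
    def start(self):
        self.sentences = ["A short one.", "A slightly longer sentence."]
        self.next(self.measure)

    @step
    def measure(self):
        # Anything assigned to self is stored as a versioned metaflow artifact.
        self.lengths = [len(s.split()) for s in self.sentences]
        self.next(self.end)

    @step
    def end(self):
        print(f"Sentence lengths: {self.lengths}")


if __name__ == "__main__":
    SentenceLengthFlow()
```

You would run such a flow with python src/pipeline/sentence_lengths/flow.py run.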
Shared utilities - src/utils¶
This is a place to put utility functions needed across different parts of the codebase.
For example, this could be functions shared across different pieces of analysis or different pipelines.
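For example, a small helper shared by several getters or analysis scripts might live here (the module and function below are hypothetical):

```python
# Hypothetical file: src/utils/text.py
import re


def clean_whitespace(text: str) -> str:
    """Collapse runs of whitespace into single spaces and strip the ends."""
    return re.sub(r"\s+", " ", text).strip()
```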
Analysis - src/analysis¶
Functionality in this folder takes the pipeline components (possibly combining them) and generates the plots/statistics to feed into reports.
It is easier to say when something shouldn't be in analysis than when something should: if one part of analysis depends on another, then that suggests the thing in common is likely either a pipeline component or a shared utility (i.e. sections of analysis should be completely independent).
It is important that plots are saved in outputs/ rather than in different areas of the repository.
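A short sketch of what a piece of analysis might look like (the module, figure name, and plotting choices are illustrative; matplotlib is assumed to be available, and the getter is the Companies House example from above):

```python
# Hypothetical file: src/analysis/sector_counts.py
from pathlib import Path

import matplotlib.pyplot as plt  # assumption: matplotlib is a project dependency

from src import PROJECT_DIR
from src.getters.companies_house import get_sector

FIGURES_DIR = Path(PROJECT_DIR) / "outputs" / "figures"


def plot_sector_counts() -> None:
    """Plot the ten most common SIC codes and save the figure under outputs/figures."""
    FIGURES_DIR.mkdir(parents=True, exist_ok=True)
    counts = get_sector()["sic_code"].value_counts().head(10)
    ax = counts.plot(kind="bar")
    ax.set_xlabel("SIC code")
    ax.set_ylabel("Number of companies")
    ax.figure.savefig(FIGURES_DIR / "sector_counts.png", bbox_inches="tight")
    plt.close(ax.figure)
```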
Notebooks - src/notebooks¶
Notebook packages like Jupyter Notebook are effective tools for exploratory data analysis, fast prototyping, and communicating results; however, between prototyping and communicating results, code should be factored out into proper Python modules.
We have a notebooks folder for all your notebook needs! For example, if you are prototyping a "sentence transformer" you can place the notebooks for prototyping this feature in notebooks, e.g. notebooks/sentence_transformer/ or notebooks/pipeline/sentence_transformer/.
Please try to keep all notebooks within this folder, and avoid pushing them to GitHub once their code has been refactored elsewhere (e.g. into the pipeline). However, for collaboration, sharing, and QA of analysis, you are welcome to push notebooks to GitHub.
Refactoring¶
Everybody likes to work differently. Some like to eagerly refactor, keeping as little in notebooks as possible (or even eschewing notebooks entirely); whereas others prefer to keep everything in notebooks until the last minute.
You are welcome to work in whatever way you’d like, but try to always submit a pull request (PR) for your feature with everything refactored into python modules.
We often find it easiest to refactor frequently, otherwise you might end up with duplicates of functions across the codebase. For example: if it's a data preprocessing task, put it in the pipeline at src/pipeline/<descriptive name for task>; if it's useful utility code, refactor it into src/utils/; if it's loading data, refactor it into src/getters.
Tips¶
Add the following to your notebook (or IPython REPL):
%load_ext autoreload
%autoreload 2
Now when you save code in a python module, the notebook will automatically load in the latest changes without you having to restart the kernel, re-import the module etc.
Share with gists¶
As the git filter mentioned above stops the sharing of outputs directly via the git repository, another way is needed to share the outputs of quick and dirty analysis.
We suggest using Gists, particularly the Gist-it notebook extension, which adds a button to your notebook that instantly turns the notebook into a gist and uploads it to your GitHub account (as public or private). This requires jupyter_nbextensions_configurator and a GitHub personal access token.
Don't install jupyter/jupyterlab in your environment, use ipykernel¶
You should avoid jupyter/jupyterlab as a dependency in the project environment. Instead, add ipykernel as a dependency. This is a lightweight dependency that allows a jupyter/jupyterlab installed elsewhere (e.g. your main conda environment or system installation) to run the code in your project.
Run python -m ipykernel install --user --name=<project environment name> from within your project environment to allow jupyter to use your project's virtual environment.
The advantages of this are:
- You only have to configure jupyter/jupyterlab once
- You will save disk space
- Faster installs
- Colleagues using other editors don't have to install heavy dependencies they don't use
Note: ipykernel is also listed in requirements_dev.txt so you do not need to add it.
Report - outputs/reports¶
You can write reports in markdown, put them in outputs/reports, and reference plots in outputs/figures.
Tree¶
├── <REPO NAME> | PYTHON PACKAGE
│ ├── __init__.py |
│ ├── analysis/ | Analysis
│ ├── config | Configuration
│ │ ├── logging.yaml | logging configuration
│ │ ├── base.yaml | global configuration (e.g. for tracking hyper-parameters)
│ │ └── pipeline/ | pipeline configuration files
│ ├── getters/ | Data getters
│ ├── notebooks/ | Notebooks
│ ├── pipeline/ | Pipeline components
│ └── utils/ | Utilities
├── docs/ | DOCUMENTATION
├── environment.yaml | CONDA ENVIRONMENT SPECIFICATION (optional component)
├── requirements.txt | PYTHON DEPENDENCIES NEEDED TO RUN THE CODE
├── requirements_dev.txt | PYTHON DEV DEPENDENCIES (e.g. building docs/running tests)
├── inputs/ | INPUTS (should be immutable)
├── jupytext.toml | JUPYTEXT CONFIGURATION
├── LICENSE |
├── outputs/ | OUTPUTS PRODUCED FROM THE PROJECT
├── Makefile | TASKS TO COORDINATE PROJECT (`make` shows available commands)
├── README.md |
├── setup.py | ALLOWS US TO PIP INSTALL src/
├── setup.cfg | ADDITIONAL PROJECT CONFIGURATION, e.g. flake8
├── .pre-commit-config.yaml | DEFINES CHECKS THAT MUST PASS BEFORE git commit SUCCEEDS
├── .gitignore | TELLS git WHAT FILES WE DON'T WANT TO COMMIT
├── .github/ | GITHUB CONFIGURATION
├── .env | SECRETS (never commit to git!)
├── .envrc | SHARED PROJECT CONFIGURATION VARIABLES
├── .cookiecutter | COOKIECUTTER SETUP & CONFIGURATION (user can safely ignore)
The Makefile¶
make is a build automation tool commonly used in software development projects. A Makefile is the text file that contains its rules and instructions for building, compiling, and managing the project. The primary role of a Makefile is to automate the build process and make it easier for developers to compile and run their code.
Here are some key points to understand about the role of a Makefile in a codebase:
- Build Automation: A Makefile defines a set of rules that specify how to build the project. It includes instructions for compiling source code, linking libraries, and generating executable files or other artifacts. By using a Makefile, developers can automate the build process and ensure that all necessary steps are executed in the correct order.
- Dependency Management: Makefiles allow developers to define dependencies between different files or components of the project. This ensures that only the necessary parts of the code are rebuilt when changes are made, saving time and resources. Makefiles can track dependencies based on file timestamps or by explicitly specifying the relationships between files.
- Consistency and Reproducibility: With a Makefile, the build process becomes standardised and reproducible across different environments. Developers can share the Makefile with others, ensuring that everyone follows the same build steps and settings. This helps maintain consistency and reduces the chances of errors or inconsistencies in the build process.
- Customization and Extensibility: Makefiles are highly customizable and allow developers to define their own build targets and actions. This flexibility enables the integration of additional tools, such as code formatters, linters, or test runners, into the build process. Developers can easily extend the functionality of the Makefile to suit the specific needs of their project.
- Integration with Version Control: Makefiles are often included in the codebase and tracked by version control systems. This ensures that the build process is documented and can be easily reproduced by other team members. Makefiles can also be integrated into continuous integration (CI) pipelines, allowing for automated builds and tests whenever changes are pushed to the repository.
As part of the cookiecutter, we have a Makefile that can perform some useful administrative tasks for us:
Available rules:
clean Delete all compiled Python files
conda-update Update the conda-environment based on changes to `environment.yaml`
conda-remove Remove the conda-environment cleanly
docs Build the API documentation
docs-clean Clean the built API documentation
docs-open Open the docs in the browser
inputs-pull Pull `inputs/` from S3
install Install a project: create conda env; install local package; setup git hooks
pip-install Install our package and requirements in editable mode (including development dependencies)
By far the most commonly used command is make install, so don't worry too much about the rest!