# Constructing an industrial taxonomy using business website descriptions
**Example under construction**

This example project is (loosely) based on ongoing work in nestauk/industrial-taxonomy, and the structure below is based on lessons learned in that project.

The project is split into four high-level tasks (Matching of Glass to Companies House, SIC classifier, Hierarchical topic modelling, and Build a data-driven taxonomy) which we walk through in turn. Elements have been simplified in places to keep the emphasis on the project structure rather than the project itself.

You can skip ahead to the project tree if you want a bird's-eye view.
## Matching of Glass to Companies House
**Method**

By fuzzy-matching data about UK business websites to Companies House based on company names, we obtain a link between the text on business websites (describing businesses' activities) and the SIC codes (official industry codes) of each company.

This work was performed in a separate project, with the results stored and versioned in S3 by Metaflow. They can easily be retrieved using the Metaflow client API (see the sketch after the list below).
- `getters/inputs/{glass,companies_house,glass_house}.py` - Fetch all the data via Metaflow's client API
- `analysis/eda` - Exploratory data analysis of these data sources
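For illustration, such a getter might look like the sketch below; the flow name `GlassHouseMatch` and the artifact name `matches` are hypothetical stand-ins, not this project's actual identifiers.

```python
# Hypothetical getter fetching the matched dataset via Metaflow's client API.
# "GlassHouseMatch" and "matches" are illustrative names only.
from metaflow import Flow, Run


def get_run() -> Run:
    """Return the last successful run of the (separately run) matching flow."""
    return Flow("GlassHouseMatch").latest_successful_run


def get_glass_house():
    """Return the fuzzy-match results linking Glass websites to Companies House."""
    return get_run().data.matches
```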
## SIC classifier
**Method**

Using the matched "glass-house" dataset, train a classifier to predict SIC codes (this is developed as a general industry classifier that is agnostic to the SIC taxonomy, as it is used elsewhere in the project).

We can then conduct a meta-analysis looking at SIC codes that are under-represented in the classifier's predictions on validation data when compared to their "true" labels obtained from the "glass-house" matching.
- `pipeline/industry_classifier/{log_reg_model.py,transformer_model.py}` - Two competing models (not specific to the SIC taxonomy)
- `config/pipeline/industry_classifier/sic/*.yaml` - Parameterisation of the model flows
**What is the extra `sic` folder doing in the filepath?**

There's an extra folder, `sic`, in the `config/` path that isn't present in the `pipeline/` path because this project uses the industry classifier models across different taxonomies. In this case, `sic` denotes the fact that we are applying the classifier to the SIC taxonomy, and gives us a namespace within the `config/` directory to isolate multiple uses of the same flow.

**What's with all the `config/pipeline/**/*.yaml`?**

In this example structure, we have individual YAML files for each pipeline component. Whilst you are free to mimic this structure now, it is currently more convenient to nest the `config/pipeline/**` structure within `base.yaml`, as it is easily importable as a Python `dict` (`from src import config`).

Furthermore, because metaflows are run from the command line and we want to parameterise them with YAML from `config/pipeline/**`, each `flow.py` file currently needs an accompanying `run.py` file to:

- Load and parse the YAML config needed for the pipeline
- Form the pipeline arguments into a command to run the metaflow on the command line
- Run the metaflow from within `run.py` using Python's `subprocess` library
- Update a config file with the successful metaflow run ID (so that getters know which version of the data to fetch)

This is a lot of leg-work and increases the surface area for bugs; you may be better off hard-coding values into shell scripts or a Makefile.
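For concreteness, a minimal sketch of what such a `run.py` could look like is shown below; the config path, the parameter handling, and the run-ID bookkeeping are all illustrative assumptions rather than this project's actual code.

```python
# Hypothetical run.py accompanying a flow.py; a sketch, not the project's code.
import json
import subprocess
import sys
from pathlib import Path

import yaml

CONFIG = Path("config/pipeline/industry_classifier/sic/log_reg_model.yaml")  # assumed path
FLOW = Path(__file__).parent / "flow.py"


def main() -> None:
    # 1. Load and parse the YAML config needed for the pipeline
    params: dict = yaml.safe_load(CONFIG.read_text())

    # 2. Form the pipeline arguments into a command-line invocation,
    #    e.g. {"epochs": 10} -> ["--epochs", "10"]
    args = []
    for key, value in params.items():
        args += [f"--{key}", value if isinstance(value, str) else json.dumps(value)]

    # 3. Run the metaflow via subprocess
    subprocess.run([sys.executable, str(FLOW), "run", *args], check=True)

    # 4. The real helper would also record the successful run ID here,
    #    so that getters know which version of the data to fetch.


if __name__ == "__main__":
    main()
```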
- `getters/outputs/sic_classifier.py` - Load the trained model, giving access to its predict functionality
**Why separate inputs and outputs in `getters/`?**

Separating inputs and outputs in `getters` is useful when reading the code: it allows us to differentiate between what is produced in this project and what we depend on from elsewhere.

This is less useful when writing code, as the import `from src.getters.inputs.glass import get_sector` is very long. To provide a shorter import we can do the following:

```python
# File: src/getters/__init__.py
from .inputs import glass

# File: src/analysis/example.py
from src.getters.glass import get_sector
```

Your directory structure doesn't always have to reflect the user-facing API!
**Avoid `import`ing from `src/pipeline/`**

If you find yourself importing functions from `src/pipeline` in `src/analysis`, then that functionality likely belongs in `src/utils`. Furthermore, `src/analysis` should get the results of `src/pipeline` via functions in `src/getters`.

One exception that may occasionally arise when working with Metaflow is needing to import a flow object itself, e.g. to access a static method or class method it defines (Metaflow doesn't permit storing functions as data artifacts).
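As a hypothetical illustration of that exception (the flow class and method names below are invented):

```python
# The rare exception: importing the flow class itself to reuse a staticmethod
# it defines. "SicClassifierFlow" and "preprocess" are hypothetical names.
from src.pipeline.industry_classifier.flow import SicClassifierFlow

tokens = SicClassifierFlow.preprocess("We manufacture precision widgets")
```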
- `analysis/sic_classifier/` - Analysis of industry classifier models applied to the SIC taxonomy
  - `model_selection.py` - Evaluate the competing models and pick the best
  - `sic_meta_analysis.py` - Meta-analysis looking at SIC codes that are under-represented in predictions and for which the model is over/under-confident (informs which parts of the SIC taxonomy could be improved)
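To make the meta-analysis concrete, a sketch of the under-representation check is shown below; the validation DataFrame and its `true_sic`/`predicted_sic` columns are hypothetical.

```python
# Illustrative sketch of the meta-analysis, not the project's actual code.
import pandas as pd


def representation_ratio(validation: pd.DataFrame) -> pd.Series:
    """Share of each SIC code in predictions divided by its share in true labels.

    Values well below 1 flag SIC codes that are under-represented in the
    classifier's predictions. Column names here are assumptions.
    """
    pred_share = validation["predicted_sic"].value_counts(normalize=True)
    true_share = validation["true_sic"].value_counts(normalize=True)
    return (pred_share / true_share).sort_values()
```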
## Hierarchical topic modelling
**Method**

By training a TopSBM hierarchical topic model on the business website descriptions, we can combine the topics and clusters generated by the model with the SIC code labels to generate measures of sector similarity (how similar any two SIC codes are, based on their cluster membership probabilities) and sector homogeneity (how homogeneous the topic distribution of descriptions aggregated by a SIC code is).
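For concreteness, one way these measures could be computed from the model's membership probabilities is sketched below; the use of mean membership vectors, cosine similarity, and entropy is an illustrative assumption rather than the project's exact definitions.

```python
# Illustrative definitions of sector similarity and homogeneity.
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist, squareform


def sector_similarity(cluster_probs: pd.DataFrame, sic: pd.Series) -> pd.DataFrame:
    """Pairwise cosine similarity of SIC codes' mean cluster-membership vectors.

    cluster_probs: one row per description, one column per cluster;
    sic: the SIC code of each description (same index).
    """
    vectors = cluster_probs.groupby(sic).mean()
    similarity = 1 - squareform(pdist(vectors.values, metric="cosine"))
    return pd.DataFrame(similarity, index=vectors.index, columns=vectors.index)


def sector_homogeneity(topic_probs: pd.DataFrame, sic: pd.Series) -> pd.Series:
    """Mean entropy of descriptions' topic distributions within each SIC code.

    Lower entropy = more homogeneous sector.
    """
    p = topic_probs.values
    with np.errstate(divide="ignore"):
        log_p = np.where(p > 0, np.log(p), 0.0)
    doc_entropy = pd.Series(-(p * log_p).sum(axis=1), index=topic_probs.index)
    return doc_entropy.groupby(sic).mean()
```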
### Pre-processing
The first step is to process the raw business website descriptions into a clean, tokenised, n-grammed representation that can be passed to the topic model (this is re-used in the data-driven taxonomy task).
- `pipeline/glass_description_ngrams/{flow,utils}.py` - Metaflow to run spaCy entity recognition, convert to tokens, and generate n-grams based on co-occurrence (a sketch of this step follows the list below)

**`flow.py` and `utils.py`**

We have lots of `flow.py` and `utils.py` files. Some might see this as bad because the names are not super-informative; however, as long as the parent folder has an informative name then it's good enough.

- `config/pipeline/glass_description_ngrams.yaml` - Parameterisation of the above metaflow
- `getters/outputs/glass_house.py` - Getter to fetch the tokenised n-grams of business website descriptions
- `pipeline/glass_description_ngrams/notebooks/` - Sanity-checking of output results. Not a part of `analysis/` because it is not directly analysing/presenting these results
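A minimal sketch of the kind of processing this flow performs is shown below, using spaCy for tokenisation and gensim's `Phrases` for co-occurrence-based n-grams; both the library choices and the thresholds are assumptions for illustration.

```python
# Illustrative pre-processing: tokenise descriptions and join frequently
# co-occurring tokens into n-grams. Thresholds here are arbitrary examples.
import spacy
from gensim.models.phrases import Phrases

nlp = spacy.load("en_core_web_sm")  # assumes this spaCy model is installed


def tokenise(descriptions: list[str]) -> list[list[str]]:
    """Lowercased lemmas with stopwords and punctuation removed."""
    return [
        [tok.lemma_.lower() for tok in doc if not (tok.is_stop or tok.is_punct)]
        for doc in nlp.pipe(descriptions)
    ]


def ngram(token_lists: list[list[str]]) -> list[list[str]]:
    """Merge frequent pairs into bigrams, e.g. "machine", "learning" -> "machine_learning"."""
    bigrams = Phrases(token_lists, min_count=5, threshold=10)
    return [bigrams[tokens] for tokens in token_lists]
```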
### Modelling
Now the topic model itself can be run.
- `config/pipeline/topsbm.yaml` - No corresponding flow in `pipeline/`! The flow is imported from a different library (e.g. `ds-utils`) and used here
- `getters/outputs/topsbm.py` - Fetch the fitted model instance containing our inferred topics and clusters
- `analysis/topsbm/`
  - `model_metadata.py` - Output a summary table of model fit, plus other metadata such as the topic hierarchy, top words, etc.
  - `sector_similarity.py` - Pair-wise similarity of SIC codes calculated using TopSBM model outputs
  - `sector_homogeneity.py` - Homogeneity of SIC codes calculated using TopSBM model outputs
  - `utils.py`
**Hang on... why is `sector_*.py` not a pipeline component?**

It would be equally valid to place these in `pipeline/` (because they are computing transformations of data), but `analysis/` is also fine (and possibly better) because:

- Nothing else depends on these (so we aren't forced to refactor outside `analysis/`)
- The transformations done by these scripts are relatively quick (so there's no need to refactor them into a metaflow in `pipeline/` to save others from having to recompute a long-running analysis)
- These scripts would need to exist anyway to visualise and summarise the results for reporting
**Hang on... doesn't `utils.py` imply shared functionality in `analysis/`?**

If we had a flatter `analysis/` folder - e.g. everything in `analysis/topsbm/` was moved into `analysis/` - this would be unacceptable; however, it's just about okay to have a short utils file here.

If we weren't happy with this, we could put it in `utils/topsbm.py`:

- For: other pieces of analysis or pipeline components may need to use these functions in the future; now they can without refactoring
- Against: in this case it's only one common function, which we're pretty sure is only needed here, and it now lives further away from where it's used
## Build a data-driven taxonomy
**Method**

1. Identify relevant terms within business website descriptions
2. Build a co-occurrence network of terms and prune the network
3. Decompose the co-occurrence network into communities
4. Label companies with their communities to add a new level onto the existing SIC taxonomy
5. Apply the industry classifier to perform a similar meta-analysis to that done in the SIC classifier task
### Keyword extraction
Use keyword extraction methods to tag business descriptions with items from the UNSPSC (a products and services taxonomy). If this is successful, co-occurrence networks of products and services can be used to build a taxonomy; if it is unsuccessful, fall back on the n-gramming pipeline produced in the hierarchical topic modelling task.
- `inputs/data/UNSPSC_English_v230701.xlsx` - New dataset for this project, provided by the supplier as an Excel spreadsheet
- `analysis/eda/unspsc.py` - Explore the UNSPSC dataset
- `getters/inputs/unspsc.py` - Function to load the UNSPSC data
  - Note: it's structured and clean enough that we don't need to do any preprocessing on it
- `pipeline/keyword_extraction/*.py` - Metaflows to extract keywords from text using various methods and filter them based on presence in the UNSPSC (one such method is sketched below)
- `config/pipeline/keyword_extraction/*.yaml` - Parameterise the above pipelines
- `analysis/keyword_extraction/notebooks/` - Notebooks exploring the results of the keyword extraction methods and comparing their effectiveness

**None of the keyword extraction approaches worked out**

There is no need to refactor notebooks into scripts if materials are not produced for final reporting.
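As one example of the kind of method tried, the sketch below tags a description with UNSPSC items by matching spaCy noun chunks against the taxonomy's terms; this is a simplified stand-in for the various methods the pipeline actually explored.

```python
# Simplified keyword extraction: keep noun chunks that appear in the UNSPSC
# vocabulary. A stand-in for the pipeline's methods, not the actual code.
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_keywords(description: str, unspsc_terms: set[str]) -> set[str]:
    """Return noun chunks of the description that are UNSPSC terms."""
    chunks = {chunk.text.lower().strip() for chunk in nlp(description).noun_chunks}
    return chunks & unspsc_terms
```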
### N-gramming pipeline

- `getters/outputs/glass_house.py` - Use the same getter as produced in the pre-processing step of the hierarchical topic modelling task
### Constructing a term co-occurrence network and generating communities
- `pipeline/kw_cooccurrence_taxonomy/*.py` - Metaflow and utils to construct a co-occurrence network of terms, decompose it into communities, and label companies with their community labels (a sketch of this idea follows the list below)
- `config/pipeline/kw_cooccurrence_taxonomy.yaml` - Parameterise the above flow
- `analysis/kw_cooccurrence_taxonomy/visualise_structure.py` - Visualise the structure of the new taxonomy (and its underlying communities)
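A minimal sketch of the co-occurrence-network-to-communities idea, assuming networkx; the pruning threshold and the choice of greedy modularity are illustrative, not the project's actual parameters.

```python
# Illustrative co-occurrence network construction and community detection.
from collections import Counter
from itertools import combinations

import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities


def cooccurrence_graph(term_lists: list[list[str]], min_weight: int = 5) -> nx.Graph:
    """Weighted term co-occurrence network, pruned of rare edges."""
    counts: Counter = Counter()
    for terms in term_lists:  # one list of extracted terms per company
        counts.update(combinations(sorted(set(terms)), 2))
    graph = nx.Graph()
    graph.add_weighted_edges_from(
        (a, b, weight) for (a, b), weight in counts.items() if weight >= min_weight
    )
    return graph


def term_communities(graph: nx.Graph) -> dict[str, int]:
    """Map each term to a community label (the new taxonomy level's categories)."""
    communities = greedy_modularity_communities(graph, weight="weight")
    return {term: i for i, members in enumerate(communities) for term in members}
```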
### Applying the industry classifier to the new taxonomy level
- `config/pipeline/industry_classifier/kw_cooccurrence_taxonomy.yaml` - Apply the industry classifier to our new taxonomy's labels
- `analysis/kw_cooccurrence_taxonomy/industry_classifier.py` - Perform a meta-analysis for our new taxonomy, as was done in the SIC classifier task for the SIC taxonomy
## Project tree
```
├── inputs
│   └── data
└── src
    ├── analysis
    │   ├── eda
    │   │   └── notebooks
    │   ├── keyword_extraction
    │   ├── kw_cooccurrence_taxonomy
    │   │   └── notebooks
    │   ├── sic_classifier
    │   │   └── notebooks
    │   └── topsbm
    ├── config
    │   └── pipeline
    │       └── industry_classifier
    │           ├── kw_cooccurrence_taxonomy
    │           └── sic
    ├── getters
    │   ├── inputs
    │   └── outputs
    ├── pipeline
    │   ├── glass_description_ngrams
    │   ├── industry_classifier
    │   │   └── notebooks
    │   ├── keyword_extraction
    │   │   └── notebooks
    │   ├── kw_cooccurrence_taxonomy
    │   │   └── notebooks
    │   └── sic_taxonomy_data
    └── utils
        ├── altair
        └── metaflow
```