# Metaflow Guide

## Recap: What can Metaflow do for you?

### Data versioning
Each execution of a Flow has a unique run ID under which its data artifacts are stored.
The artifacts and metadata of each run can be fetched using the Metaflow client API.
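For example (a sketch; `MyFlow` and the `model` artifact are placeholders for your own flow and artifact names):

```python
from metaflow import Flow

run = Flow("MyFlow").latest_successful_run  # or Run("MyFlow/1234") for a specific run
print(run.id)           # the unique run ID
model = run.data.model  # the data artifact assigned to `self.model` in the flow
```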
### Scalable computing
Metaflow abstracts infrastructure, making horizontally and vertically scalable cloud computing easily available to data scientists.
In some (not all) cases it's as easy as adding `--with batch` to the end of the command that runs a flow!
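For example (where `my_flow.py` stands in for your own flow file):

```bash
# Run every step of the flow on AWS Batch rather than locally
python my_flow.py run --with batch
```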
### Sharing results
Data artifacts of runs can be stored and versioned not only on your machine but also in AWS S3, meaning you can share your results easily.
### Get you out of dependency hell
Ever run into a situation where one part of a project needs a package that conflicts with another?
Metaflow allows you to specify dependencies at the flow level; it automatically builds a Conda environment and runs your flow in it.
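A minimal sketch of flow-level dependencies (the pinned versions are illustrative):

```python
from metaflow import FlowSpec, conda_base, step


# Pin the Python version and libraries for every step of the flow;
# run with `python dependency_example.py --environment=conda run`.
@conda_base(python="3.8.10", libraries={"pandas": "1.3.5"})
class DependencyExample(FlowSpec):
    @step
    def start(self):
        import pandas as pd  # resolved from the Conda environment above

        print(pd.__version__)
        self.next(self.end)

    @step
    def end(self):
        pass


if __name__ == "__main__":
    DependencyExample()
```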
### Make your pipelines easier to understand
Expressing a pipeline as a DAG of steps, rather than one long script, makes it a lot easier to understand.
Metaflow also has commands such as `show` and `output-dot` which visualise this structure.
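For example (with `my_flow.py` standing in for your flow file; rendering the dot output requires graphviz):

```bash
# Print a textual summary of the flow's steps and structure
python my_flow.py show

# Render the DAG as an image via graphviz
python my_flow.py output-dot | dot -Tpng -o my_flow.png
```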
### Checkpointing
It's annoying when your code has gotten halfway through a pipeline only to fail.
Metaflow automatically checkpoints the steps of your flow and lets you resume a run after fixing any issues, saving you having to re-run the whole pipeline or maintain your own checkpointing logic (that may be fraught with conflicts between different run parameters).
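For example (the step name is hypothetical):

```bash
# Re-run the latest failed run from the step that failed, reusing earlier results
python my_flow.py resume

# Or resume execution from a specific step
python my_flow.py resume train_model
```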
## Suggested practices

### Writing flows
- Naming a flow
    - No need to add `Flow` as a suffix, the directory structure should speak to that
    - Don't make names too generic, e.g. if the flow trains a topic model on Arxiv paper abstracts then call it something like `ArxivAbstractTopicModel` rather than just `TopicModel`
- Add a test mode

    If your flow takes more than a minute or two to run then you should always add a parameter, `test_mode = Parameter("test-mode", ...)` by convention, which runs your flow in a test mode that finishes quickly, e.g. by fetching/running only a subset of the data. This facilitates both efficient code review (a reviewer can check the test mode runs successfully without having to wait hours/days for the full flow to run) and automated testing. (A sketch combining this and several other conventions appears after this list.)

    We recommend that by default `self.test_mode` is either:

    - The logical not of `current.is_production` (`default=lambda _: not current.is_production`)
    - `True`

    If a flow is run with `--production` (`current.is_production` is `True`) then test mode functionality should not be activated! Whilst often the logical not of one another, `self.test_mode` and `current.is_production` are two different concepts - `self.test_mode` and `current.is_production` may both be `False` when a user wants to run a full flow without affecting other users.
- The only class/function in a flow file should be the Flow itself
- Flows should use the `@project` decorator (see workflow) using a consistent `name` parameter throughout the project
- Flow steps should be as minimal as possible, containing:
    - Setup such as parsing and logging parameters or declaring context managers
    - At most a few high-level function calls
    - Any flow-level logic - e.g. merging artifacts
    - Consider using type-annotations for the data artifacts
- Imports used within a step should always go within that step (at the top of it)

    Because Metaflow abstracts the infrastructure, different steps may run on different machines with different environments. For example, `from kuebiko.pipeline.sic.utils import *` at the top of the file wouldn't work in a step - `kuebiko` isn't installed/packaged in the new Conda environment.
- Avoid saving data artifacts as dataframes (or other library-specific classes), favouring standard Python data-structures instead

    Python data-structures do not impose the Pandas dependency on the downstream consumer of the data, who may be working in an environment where Pandas isn't available (e.g. AWS Lambda) or may have a different version of Pandas under which your artifact may subtly differ or fail to load. If dataframes are persisted as regular Python data-structures, the downstream consumer can still generate a dataframe if they want.
- When data is too large to be saved as a data artifact (not enough RAM or pickling/unpickling is inefficient) ... TODO
- Docstrings for steps:
    - Always for `start` and `end` - their names cannot convey any information about what the step does
    - For any step whose name cannot convey the essence of the step, or for complex steps (steps that are more than one or two well documented function calls)
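The following sketch pulls several of these conventions together - the test mode parameter, imports inside steps, plain data-structure artifacts, and step docstrings. The `get_abstracts` helper and its module path are hypothetical stand-ins for your project's own code:

```python
"""A sketch of the flow-writing conventions above (helpers are hypothetical)."""
from typing import Dict, List

from metaflow import FlowSpec, Parameter, current, project, step


@project(name="kuebiko")
class ArxivAbstractTopicModel(FlowSpec):
    """Train a topic model on Arxiv paper abstracts."""

    test_mode = Parameter(
        "test-mode",
        help="Run quickly on a small subset of the data.",
        type=bool,
        default=lambda _: not current.is_production,
    )

    @step
    def start(self):
        """Fetch abstracts (only a small subset in test mode)."""
        # Import inside the step: this step may run on another machine in a
        # fresh Conda environment where module-level imports would fail.
        from kuebiko.pipeline.topics.utils import get_abstracts  # hypothetical

        nrows = 1_000 if self.test_mode else None
        # Persist a plain Python data-structure rather than a DataFrame
        self.abstracts: List[Dict[str, str]] = get_abstracts(nrows=nrows)
        self.next(self.end)

    @step
    def end(self):
        """Report how much data the run processed."""
        print(f"Fetched {len(self.abstracts)} abstracts")


if __name__ == "__main__":
    ArxivAbstractTopicModel()
```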
### Workflow
- Use the `local` datastore during development

    Either pass `--datastore=local` when running a flow or add `METAFLOW_DEFAULT_DATASTORE=local` to `.envrc` to change the default datastore. This reduces the overhead of running a flow but means the results are local to your machine (the metadata of the run is still stored in an AWS RDS instance).

- Use `--datastore=s3` with `--production` to run your flow such that the artifacts can be seen by another user.
- Use the namespace relating to your `@project` name, e.g. if you're using `@project(name="kuebiko")` for your flows, use `metaflow.namespace("project:kuebiko")`
    - Setting that in `kuebiko/__init__.py` means it gets set whenever you use a getter (see the sketch after this list)
- Using `--metadata service` (the default) means that your flow metadata will be visible via the Metaflow UI at https://dap-tools.uk (more information will be available if using the S3 datastore too)!
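For example, the namespace could be pinned once at package import time (a sketch, assuming the package is called `kuebiko` as above):

```python
# kuebiko/__init__.py
import metaflow

# Pin the namespace so that any client-API getter used via this package
# sees runs from the kuebiko project.
metaflow.namespace("project:kuebiko")
```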
## Things to watch out for

### Gotchas
- You cannot save functions as data artifacts; they get ignored. You can however save data artifacts with functions inside, e.g. `self.artifacts = {"key": lambda _: "value"}`
### Bugs

Here are a few bugs you may encounter:
- Using `resume` on a failed flow that uses `metaflow.S3` may fail/be inconsistent
    - https://github.com/Netflix/metaflow/issues/664
- Metaflow's Conda decorators don't work with Python 3.9
    - https://github.com/Netflix/metaflow/issues/399
### Poor errors

Here are errors you may encounter which are uninformative:

- `TypeError: Can't mix strings and bytes in path components`
    - Use of `metaflow.S3` with an incorrect Metaflow profile