If you struggle to explain why your data science project is difficult, the reason is often that the data is not “ready to go”. But how do you explain that to people who have never worked with data themselves?
A great idea to help explain “what’s taking so long” to get results out of data is the concept of data readiness. It allows you to state exactly how far the data is from where it needs to be before an analysis can be performed.
Data readiness consists of three bands, each of which we can divide further into more specific data readiness levels. The best data is A1, and the worst might be C4.
Let’s start from the worst and work our way up to the best!
Data in the C-band is not accessible to the data scientist in her tools. Hard to run a model or do an analysis on that kind of data…
C4: Hearsay: We believe the data exists but are unsure if, where or how it can be accessed in the given context.
C3: Remote: We know where the data is sitting, but for technical, financial, ethical or political reasons it cannot be collected and loaded or transferred into the necessary data processing environments.
C2: Unstructured: The data is unstructured and would first need to be organized (e.g. handwritten text, PDF reports, e-mails, voice recordings, messy directory structures, inconsistent naming/indexes, old machine formats, etc.).
C1: Loadable: Not yet quite there, but in principle ready to be loaded into analysis software (e.g. Python or MS Access) from a central, high-performance data store (e.g. a data warehouse), or at least available in a central, well-structured space (server, file share, etc.).
Data in the B-band is unclear and must be preprocessed and explored before going into the inference or modelling stage. In other words: it’s not yet in a good representation or format.
B3: Raw: The data as it is directly after loading from the central data source. There might be missing values, corrupted formatting, strangely encoded values or other things that require cleaning up.
B2: Explored: The scope of the data is clear (e.g. distributions of values including outliers, missing values, etc.) and it’s well described. We know what’s there and roughly what can be expected from the data.
B1: Cleaned: The data is free of obviously wrong entries, encoded values have been mapped, missing values have been imputed where applicable, etc. It’s clear what the data contains and what its inherent limitations are (i.e. what can’t be fixed).
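To make the step from B3 to B1 more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption made for illustration: the records, the -1 marker for a missing age, the SEX_CODES code book, and the choice of the median for imputation. Real projects will have their own rules.

```python
from statistics import median

# Hypothetical raw records (B3): coded "sex" values, -1 marks a missing age.
raw = [
    {"id": 1, "sex": "M", "age": 34},
    {"id": 2, "sex": "F", "age": -1},
    {"id": 3, "sex": "?", "age": 51},
    {"id": 4, "sex": "F", "age": 29},
]

SEX_CODES = {"M": "male", "F": "female"}  # assumed code book


def explore(records):
    """B2: describe what is there before changing anything."""
    ages = [r["age"] for r in records if r["age"] >= 0]
    return {
        "rows": len(records),
        "missing_age": sum(r["age"] < 0 for r in records),
        "unknown_sex": sum(r["sex"] not in SEX_CODES for r in records),
        "age_range": (min(ages), max(ages)),
    }


def clean(records):
    """B1: map encoded values and impute missing ages with the median."""
    fill = median(r["age"] for r in records if r["age"] >= 0)
    return [
        {
            "id": r["id"],
            # Unknown codes become None: a limitation that can't be fixed here.
            "sex": SEX_CODES.get(r["sex"]),
            "age": r["age"] if r["age"] >= 0 else fill,
        }
        for r in records
    ]


summary = explore(raw)
cleaned = clean(raw)
```

The summary from explore is what you would write down at B2, and the None values left behind by clean are the kind of inherent limitation that should be documented at B1.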
Data in the A-band is ready for analysis with respect to a given hypothesis, question or context.
A3: Relevant: The data seems to contain signal, i.e. relevant information for the task at hand. But maybe it’s not yet in the right format (e.g. text not yet tokenized for an RNN) or it needs to be combined with other data.
A2: Compliant: After understanding the data and its content and consulting with legal & compliance (or your local regulation), we are allowed to use the data for the given context.
A1: Ready: The data can now be fed directly to the model/analysis.
It can happen that data is in A1 for a given task but at the same time never reaches the A-band for another task (e.g. because it’s not relevant there).
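Because readiness depends on the task, it can help to record it per pair of dataset and task rather than per dataset. The following is only a sketch of that idea: the Readiness enum, the dataset and task names, and the is_ready helper are hypothetical and not part of the original framework.

```python
from enum import IntEnum


class Readiness(IntEnum):
    """Readiness levels, ordered from worst (C4) to best (A1)."""
    C4 = 1   # Hearsay
    C3 = 2   # Remote
    C2 = 3   # Unstructured
    C1 = 4   # Loadable
    B3 = 5   # Raw
    B2 = 6   # Explored
    B1 = 7   # Cleaned
    A3 = 8   # Relevant
    A2 = 9   # Compliant
    A1 = 10  # Ready

    @property
    def band(self):
        """The band is the first letter of the level name."""
        return self.name[0]


# Hypothetical assessment: the same dataset, judged for two different tasks.
readiness = {
    ("customer_emails", "churn_prediction"): Readiness.A1,
    ("customer_emails", "demand_forecast"): Readiness.B1,
}


def is_ready(dataset, task):
    """True only if this dataset has reached A1 for this particular task."""
    return readiness.get((dataset, task)) == Readiness.A1
```

Using an ordered enum means levels can be compared directly, for example Readiness.B1 < Readiness.A3, which makes it easy to report how far a dataset still has to go for a given task.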
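Let’s bring it together. The data readiness levels are:
C-band (not accessible): C4 Hearsay, C3 Remote, C2 Unstructured, C1 Loadable
B-band (unclear, needs preprocessing and exploration): B3 Raw, B2 Explored, B1 Cleaned
A-band (ready for analysis in a given context): A3 Relevant, A2 Compliant, A1 Ready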
Disclaimer: Data readiness was originally introduced by Lawrence. I added the more specific levels based on experience from my daily work. This is v1 (a first draft).
Lawrence, N.D. (2017). Data Readiness Levels. Available from http://inverseprobability.com/publications/data-readiness-levels.html.