[ZK Lab] #1: Concept - Data Centric AI

Data Centric AI

Data-centric AI is the discipline of systematically engineering the data used to build an AI system.

A place to share cutting-edge techniques and best practices for using data-centric AI methods to build successful machine learning systems.

Data Augmentation

Data limitations:
- Domain gaps: The data you train your model on differs substantially from the data it must make predictions on in the real world.
- Data bias: When the data you collect has imbalances due to societal bias, how can you design methods that overcome them?
- Data noise: Noise can come from a variety of sources, including labels that are ambiguous, cluttered, or otherwise corrupted.
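
As a rough illustration of the domain-gap point above, one simple check is to compare a feature's mean in the training data with its mean in deployment data, scaled by the training spread. This is only a sketch: the function name, the brightness feature, and the sample values are all hypothetical, and a real audit would look at many features and use a proper statistical test.

```python
from statistics import mean, pstdev

def standardized_mean_gap(train_values, deploy_values):
    """Absolute difference in means, scaled by the training standard deviation."""
    spread = pstdev(train_values)
    if spread == 0:
        raise ValueError("training feature has zero variance")
    return abs(mean(train_values) - mean(deploy_values)) / spread

# Hypothetical feature: image brightness seen in training vs. in deployment.
train_brightness = [0.50, 0.55, 0.45, 0.52, 0.48]
deploy_brightness = [0.20, 0.25, 0.22, 0.18, 0.24]

gap = standardized_mean_gap(train_brightness, deploy_brightness)
print(f"standardized gap: {gap:.2f}")
```

A large value suggests the deployment data sits far outside what the model saw during training; where to set the alarm threshold is a judgment call for each project.
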
Definition of Data Augmentation:
- Self-Supervision: When you have limited labeled data, you can try combining it with unlabeled data, for example by applying transformations such as:
  - rotation
  - cropping
- Synthetic Data: While synthetic data is still in its infancy, there have been ongoing advances in generative models, and it is likely to become hugely important for testing systems such as autonomous driving or robot learning.
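
To make the rotation and cropping items more concrete, here is a minimal sketch of how unlabeled images can produce their own training signal: rotate an image by a random number of quarter turns and use that number as the label. The images are plain nested lists, and every name here (`rotate90`, `crop`, `rotation_pretext_example`) is a hypothetical helper written for illustration, not part of any library.

```python
import random

def rotate90(image):
    """Rotate a 2D grid (list of rows) 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]

def crop(image, top, left, height, width):
    """Return the height x width window whose top-left corner is (top, left)."""
    return [row[left:left + width] for row in image[top:top + height]]

def rotation_pretext_example(image, rng):
    """Build one (rotated image, label) pair; the label is the number of quarter turns."""
    turns = rng.randrange(4)
    rotated = image
    for _ in range(turns):
        rotated = rotate90(rotated)
    return rotated, turns

# A tiny stand-in for an unlabeled image.
unlabeled = [[1, 2, 3],
             [4, 5, 6],
             [7, 8, 9]]

rng = random.Random(0)
rotated, label = rotation_pretext_example(unlabeled, rng)
patch = crop(unlabeled, 0, 1, 2, 2)  # top-right 2x2 window
```

A model trained to predict `label` from `rotated` needs no human annotation, which is the appeal when labeled data is scarce.
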
Data-centric principles in data augmentation:
- Core: the balance of positive and negative examples.
- Also: combine self-supervision with weak supervision.
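
One simple way to act on the balance principle is random oversampling: duplicate minority-class examples until both classes are the same size. The notes do not prescribe this technique; it is swapped in here as an assumption for illustration, and the function name and toy data are hypothetical.

```python
import random
from collections import Counter

def oversample_minority(examples, rng):
    """Duplicate randomly chosen minority-class examples until both classes match in size.

    `examples` is a list of (features, label) pairs with binary labels 0 and 1.
    """
    by_label = {0: [], 1: []}
    for example in examples:
        by_label[example[1]].append(example)
    minority, majority = sorted(by_label.values(), key=len)
    if not minority:
        raise ValueError("need at least one example of each class")
    extra = [rng.choice(minority) for _ in range(len(majority) - len(minority))]
    return examples + extra

# Toy data: one positive example and four negative examples.
data = [("a", 1), ("b", 0), ("c", 0), ("d", 0), ("e", 0)]
balanced = oversample_minority(data, random.Random(0))
counts = Counter(label for _, label in balanced)
print(counts)
```

Oversampling only repeats what is already there, so it is often paired with augmentation to add variety to the duplicated examples.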

Labeling and Crowdsourcing

Technical debt in machine learning:
- The concept of technical debt originally comes from software engineering, where pushing to develop software very quickly has often been found to create long-term maintenance costs. These costs must be paid back later and, if left unaddressed, can compound over time.
- ML code – the bit that we tend to think of as the cool part – is actually a small component of the overall system.
- In many settings, getting more data is not actually statistically useful.
Three places to start:
- Audit and monitor data quality.
- Create datasheets for datasets.
- Create and apply stress tests using data.
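
As a starting point for the first item, a data-quality audit can be as small as counting missing values, exact duplicate rows, and labels. The sketch below assumes each record is a flat dictionary with hashable values; the `audit` function and the sample rows are hypothetical.

```python
from collections import Counter

def audit(rows, label_key):
    """Summarize missing values, exact duplicate rows, and label counts."""
    missing = Counter()
    seen = set()
    duplicates = 0
    for row in rows:
        for key, value in row.items():
            if value is None or value == "":
                missing[key] += 1
        # Sorting by key gives the same fingerprint regardless of key order.
        fingerprint = tuple(sorted(row.items()))
        if fingerprint in seen:
            duplicates += 1
        seen.add(fingerprint)
    labels = Counter(row.get(label_key) for row in rows)
    return {"rows": len(rows), "missing": dict(missing),
            "duplicates": duplicates, "labels": dict(labels)}

# Hypothetical records with one duplicate, one empty text, and one missing label.
rows = [
    {"text": "good product", "label": "pos"},
    {"text": "good product", "label": "pos"},
    {"text": "", "label": "neg"},
    {"text": "broke in a day", "label": None},
]
report = audit(rows, "label")
print(report)
```

Running a report like this on a schedule, and comparing it with earlier reports, is one way to turn a one-off audit into monitoring.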