We are often wrong, and so are algorithms, even when they are perfectly programmed and hit the goals we set for them. Our complex reality is full of unintended consequences: video recommendations surface more extremist and less scientific content, and "likes" keep us on certain pages longer than we actually want to stay...
Deep learning is a powerful tool, and with great power comes great responsibility. Before we jump into the detailed approaches and tools to make deep learning more ethical, it is important to make a disclaimer:
- The fact that deep learning may not be perfect, or may have unintended consequences even when it is, is no reason to stop using it or to blame the entire field for flawed uses.
- Both the developer and the owner of the development are responsible for ensuring that single-minded metrics are avoided and that the model setup considers the consequences of success and failure.
- We need regulation to ensure that the implementation of safety and ethical guardrails is not left up to big tech companies, but is instead the result of an open and rigorous debate involving experts from other fields, such as lawyers, economists, biologists...
Some examples illustrate issues in Deep Learning:
Bugs and Recourse
Software, with or without deep learning under the hood, has errors. Some errors cause a mere inconvenience (a cat pops up in an online footwear store), while others can cost lives (a wrong treatment assignment or an incorrect prediction of required medical coverage).
Feedback Loops
Applications that can influence the very KPI they are optimizing (watch time on YouTube, sales in e-commerce) can easily create a negative loop, promoting undesirable content (such as conspiracy-theory videos) or superfluous consumption (selling things by exploiting compulsive behaviours), outcomes that are far from the mission statements of those applications.
Bias
Our societal, ethnic, gender and other forms of discrimination are to some extent reflected in our data. Patterns such as biased criminal records or low numbers of women applying to tech jobs can be perpetuated if the data and the algorithms do not adjust for them.
The Means cannot be more important than the ends
Every deep learning project should start with a clear statement of why we are doing it and what the impacts of optimizing a certain metric will be. Brainstorming the known and potentially unknown consequences of success and failure is key to building programs that are resilient and serve the common good.
The world is full of examples of flaws in human and artificial systems triggered by narrow-minded metrics. The pursuit of GDP growth, for example, is exhausting our life-support systems and eroding social ties, fuelling individualism and inequality. The obsession with technology designed to keep us engaged as long as possible is making us addicted and keeping us online for unintended purposes, turning deep work and real social interaction into rare events. We should choose our metrics very carefully and have a very clear end in mind. Why are we doing what we are doing? Is this making our products or services better? Our consumers happier? Our world a more prosperous and resilient place? From the junior programmer to the CEO, the private sector has a responsibility.
I do not think every application deserves the same level of attention in terms of ethics, as the impacts diverge massively between a data-cleansing system for internal operations and one assessing medical treatments. We need to understand the amount of risk inherent in the decisions taken by the application.
In recent years I have listened to dozens of podcasts on our field and heard, intentionally or not, very naive or even escapist reasoning about the impact of these applications on society, for example:
- some applications being very vocal about their mission ("connecting people") and very quiet about their business model (selling people's data to serve customized ads)
- some applications assuming that you spend more time in the app because you freely choose to, rather than asking whether we are making people addicted to it
- ignoring that ads influence our purchase behaviour, leading us to buy things we do not actually need
- ignoring that increases in conversion can be achieved by targeting consumers who are more susceptible to compulsive buying or who have some sort of disability
These are very common issues in the most successful tech platforms as we speak. While we are all inspired by and benefit from YouTube, Facebook and Google, their business models, and most importantly the single-metric focus of some of their algorithms, are, to say the least, problematic. We should all contribute to ensuring that businesses thrive using deep learning with a positive net social outcome. Let's dive into the key deep learning application issues next:
Accountability and Quality
Deep learning practitioners normally live in a black-or-white environment. Either they are over-trusted, in the sense that any outcome from the model is believed to be perfect, unbiased and optimal, or, at the other end of the spectrum, their models are rejected because they are not 100% perfect. I am afraid that wise use of deep learning lies somewhere between those two extremes.
Data, like our experience of the world, is biased. On top of that, programs, even very well tested ones, can have gaps. Code and data are not perfect, and neither are our algorithms. Even if all of them were, our limited ability to integrate every relevant metric into a loss function and to account for all the complex consequences makes 100% success a chimera. This is no reason to despair.
Machine learning / deep learning applications should be used as long as:
- they provide a consistent improvement over the status quo
- they are audited and regularly tested and challenged
That means that even if our carbon footprint calculator or credit scoring model is not 100% accurate, we should still use it if it is more reliable than what we had before. The problem, most of the time, is that we DO NOT have a BASELINE.
A baseline is the current performance on the metrics we care about, especially those that can be compared with the new application's performance. If our current calculations are 65% correct and the new algorithm is 72% correct, we should use it. It is actually very likely that using the algorithm together with humans in the loop will quickly get you to something greater than 72%.
My main message here is: do not blindly trust these applications, but please base your concerns not on anecdotes but on rigorous baselines. It will expose your mistakes, but it will make your team better. Another note: even if you can review everything manually, leverage this technology to challenge you and increase the reliability of your data and decisions. Two pairs of eyes are better than one, and a human and a machine together are better than either on their own.
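As a minimal, hypothetical sketch of what this baseline check can look like (all numbers and predictions below are made up), the idea is simply to score the status quo and the candidate model on the same labelled sample, adopt the model only if it beats the baseline, and keep humans in the loop on the cases where the two disagree:

```python
import numpy as np

def accuracy(preds, truth):
    """Share of predictions that match the ground truth."""
    preds, truth = np.asarray(preds), np.asarray(truth)
    return (preds == truth).mean()

# Hypothetical labelled sample scored by today's process and by the candidate model.
truth            = np.array([1, 0, 1, 1, 0, 1, 0, 0, 1, 1])
status_quo_preds = np.array([1, 0, 0, 1, 0, 0, 0, 1, 1, 1])  # current rules / manual process
model_preds      = np.array([1, 0, 1, 1, 0, 1, 0, 1, 1, 1])  # candidate model

baseline  = accuracy(status_quo_preds, truth)
candidate = accuracy(model_preds, truth)
print(f"baseline: {baseline:.0%}, candidate: {candidate:.0%}")

# Adopt the model only if it beats the baseline on the metric we care about,
# and route the cases where model and status quo disagree to human review.
if candidate > baseline:
    disagreements = np.flatnonzero(model_preds != status_quo_preds)
    print("adopt the model; send these cases to human review:", disagreements)
```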
Feedback Loops
The more we can influence the metric we optimize (conversion, time spent online, sales...), the more likely we are to enter the bumpy world of feedback loops. Imagine the following scenario: you sell shoes, and someone powerful in the organisation pushes for a very specific type of shoe to be everywhere and produced massively. The product was not selling well, just as the demand forecasting team and the product experts expected, but after a lot of expensive discounting and marketing we end up selling more of this product than of others. The next season the algorithm notices that this type of product sold in large volumes, so it predicts high demand again, and so we produce a lot again...
We should never forget that for a product or service to be sold it first has to be visible, rightly priced and available. Forgetting that some products or services enjoy a more privileged presence than others will perpetuate bad decision making and reinforce poor decisions in the future.
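A toy simulation of that loop, with entirely made-up numbers: the forecast is trained on last season's sales, but those sales were partly manufactured by discounting and extra visibility, so the naive forecast keeps reproducing the initial push instead of the market's true demand.

```python
# Toy simulation of a demand-forecasting feedback loop (all numbers are made up).
true_demand = 100              # units the market actually wants each season
forecast = 100
executive_push = 150           # someone powerful forces extra production in season 1

for season in range(1, 6):
    production = executive_push if season == 1 else forecast
    # Assume heavy discounting and marketing clear whatever we over-produce,
    # so observed sales track production rather than true demand.
    observed_sales = production
    forecast = observed_sales  # naive model: next season's forecast = last season's sales
    print(f"season {season}: produced {production}, sold {observed_sales}, "
          f"true demand {true_demand}, next forecast {forecast}")

# The forecast never returns to true demand: the model keeps reinforcing
# a decision that was pushed into the system, not learned from the market.
```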
Bias
In this section we talk about bias as the historical, social, measurement, aggregation, evaluation and other imbalances that prevent our data from truly representing our reality. The following paper explains seven such sources of bias in detail and how to deal with them:
Historical Bias
Historical bias can simply mean that past years have little to do with current behaviour (critical for forecasting), or that the social misrepresentation of certain groups gets perpetuated (low female participation in tech jobs in the past affecting profile searches for future positions).
One thing we should be aware of is that some open or benchmark datasets do not represent countries in a properly distributed fashion: for example, ImageNet contains almost 60% of its data from Canada, the US and the UK, while these countries represent less than 10% of the world's population.
Computer vision is not the only affected field: language models are heavily biased toward historical patterns ("he is a doctor" and "she is a nurse" rather than the other way around).
Measurement Bias
If we measure the wrong thing, we will probably not get where we want. The best-known example is GDP, which was designed to manage military spending, not to measure welfare, but, given the lack of a clear substitute, is used over and over to declare policies successful whenever economies grow. There are many other examples of poorly defined KPIs, such as:
- engagement in social media
- conversion in e-commerce
- net sales in retail
- rankings in sports
If they are optimized in isolation, we will likely end up where we do not want to be: addicted users, huge return rates, wrong pricing signals, athletes engaging in fragmented competitions...
Before we start developing a program, we need to make sure our target is well measured, or at least that its limitations as a KPI are clearly stated and, ideally, that more than one metric is tracked to validate the program's success.
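As a minimal, hypothetical sketch of that idea (metric names, values and tolerances are invented for illustration): optimise one primary KPI, but evaluate every release against guardrail metrics and refuse to call it a success if any guardrail deteriorates beyond its tolerance.

```python
# Hypothetical example: a release counts as a "success" only if the primary KPI
# improves AND no guardrail metric deteriorates beyond its allowed tolerance.
baseline  = {"conversion": 0.031, "return_rate": 0.18, "avg_session_min": 12.0}
candidate = {"conversion": 0.036, "return_rate": 0.24, "avg_session_min": 19.0}

primary = "conversion"
# Guardrails: metric -> maximum tolerated relative increase (higher is worse here).
guardrails = {"return_rate": 0.05, "avg_session_min": 0.10}

def evaluate(baseline, candidate, primary, guardrails):
    improved = candidate[primary] > baseline[primary]
    breaches = [
        metric for metric, tolerance in guardrails.items()
        if (candidate[metric] - baseline[metric]) / baseline[metric] > tolerance
    ]
    return improved and not breaches, breaches

success, breaches = evaluate(baseline, candidate, primary, guardrails)
print("success:", success, "| guardrail breaches:", breaches)
# Conversion improved, but return_rate (+33%) and session length (+58%) blew past
# their tolerances, so this release should not be declared a success.
```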
Aggregation bias
The following plot shows how easy it is to mess up by aggregating data wrongly, lacking good domain knowledge, and letting poorly aggregated data speak for itself:
Do you think that exercise increases cholesterol or not? If you did not consider age groups, your dataset would look like the one on the right and you would infer a positive correlation between exercise and cholesterol. The reason is that, without adjusting for age group, you would not see the reality, which is that within a given age group those who exercise more have lower cholesterol levels. As much as I like learning from and being surprised by data, I suggest defining research-backed theories before we run around reporting strong correlations. The Book of Why by Judea Pearl is a gold mine for anyone who wants to do good data science.
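A minimal sketch of this aggregation effect (Simpson's paradox) with synthetic, made-up data: within each age group cholesterol falls as exercise rises, but older people both exercise more and have higher baseline cholesterol, so the pooled correlation flips to positive.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, made-up data: three age groups; older groups exercise more
# and have a higher baseline cholesterol level.
ages       = [30, 50, 70]
base_chol  = [170, 200, 230]   # baseline cholesterol per age group
base_hours = [2, 4, 6]         # typical weekly exercise hours per age group

hours, chol, group = [], [], []
for age, b_chol, b_hours in zip(ages, base_chol, base_hours):
    h = b_hours + rng.normal(0, 1, 200)
    # Within each age group, more exercise => LOWER cholesterol.
    c = b_chol - 5 * (h - b_hours) + rng.normal(0, 5, 200)
    hours.append(h); chol.append(c); group.append(np.full(200, age))

hours, chol, group = map(np.concatenate, (hours, chol, group))

print("pooled correlation (exercise vs cholesterol):",
      round(np.corrcoef(hours, chol)[0, 1], 2))           # positive
for age in ages:
    mask = group == age
    print(f"within age {age}:",
          round(np.corrcoef(hours[mask], chol[mask])[0, 1], 2))  # negative
```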
All in all, the point here is that most data and KPIs are biased, and there is no shortcut for good thinking before we jump into modelling. Domain knowledge and robust theories are key for successful applications. I love the quote "there is nothing more practical than a good theory", and I could not agree more. Listen first, make hypotheses, gather data, model, and be ready for bias.
Addressing bias
There is no single bullet-proof way to address bias, but a good checklist should be part of any decision engine:
- The source data should contain clear documentation on how it has been collected
- Variables that encode ethnicity, gender or other social groupings should be avoided, and the algorithm should not be able to infer them from the remaining features (see the proxy check sketched after this list)
- There has to be proper metric setting and tracking of program behaviour to detect unintended feedback loops or biases. Testing should go beyond software engineering and cover business and social dimensions.
- Audit for bugs in code and data
- Audit the algorithm's methodology and validate it against the literature and domain knowledge, particularly when insights contradict years of research
- Ensure teams building such programs are as diverse as possible in backgrounds and social groups
- Use ethics tools to analyse the application's compliance: Ethical Toolkit - Markkula Center for Applied Ethics (scu.edu)
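On the second item in the list, a minimal, hypothetical sketch of a proxy check: try to predict the protected attribute from the remaining features, and if a simple model does much better than chance, those features act as proxies and removing the explicit column was not enough. The feature names and the scikit-learn setup here are illustrative assumptions, not a prescribed method.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(42)

# Hypothetical data: "neutral" features for 1,000 individuals, plus the protected
# attribute (e.g. gender) that was deliberately excluded from the model inputs.
n = 1000
postcode_income = rng.normal(0, 1, n)
browsing_hours  = rng.normal(0, 1, n)
protected = (postcode_income + rng.normal(0, 1, n) > 0).astype(int)  # correlated proxy

X = np.column_stack([postcode_income, browsing_hours])

# If this AUC is far above 0.5 (chance for a balanced attribute), the remaining
# features leak the protected attribute.
scores = cross_val_score(LogisticRegression(), X, protected, cv=5, scoring="roc_auc")
print("proxy-prediction AUC:", round(scores.mean(), 2))
```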
The role of regulators
One probably cannot have a law for every potential ethical breach, but regulation should set guardrails quickly enough to ensure every company works within the rules and ethical standards of the society in which it operates. This is particularly important when behaving less ethically can mean a competitive advantage (being less costly, or more profitable in general).
Penalties are key to creating a business case for ethical ML/AI. This one is a good example:
Kids and other vulnerable groups are hit by AI/ML-driven apps, and just as we put limits on TV exposure, the same needs to happen with such apps and digital channels.
Clean air and clean drinking water are public goods which are nearly impossible to protect through individual market decisions, but rather require coordinated regulatory action. Similarly, many of the harms resulting from unintended consequences of misuses of technology involve public goods, such as a polluted information environment or deteriorated ambient privacy. Too often privacy is framed as an individual right, yet there are societal impacts to widespread surveillance (which would still be the case even if it was possible for a few individuals to opt out).
Many of the issues we are seeing in tech are actually human rights issues, such as when a biased algorithm recommends that Black defendants have longer prison sentences, when particular job ads are only shown to young people, or when police use facial recognition to identify protesters. The appropriate venue to address human rights issues is typically through the law.
We need both regulatory and legal changes, and the ethical behavior of individuals. Individual behavior change can’t address misaligned profit incentives, externalities (where corporations reap large profits while offloading their costs and harms to the broader society), or systemic failures. However, the law will never cover all edge cases, and it is important that individual software developers and data scientists are equipped to make ethical decisions in practice.
Concluding remarks
Those of us who work on and read about the actual developments in deep learning are not concerned about Terminator-like AI or people falling in love with Alexa; instead, we are concerned about the amount of usage built on a poor understanding of the negative impacts of current deep learning.
Humans and data have biases, but this is no reason to stop working with humans and machines to achieve our goals. Good usage of deep learning requires:
- Understanding of how the data has been gathered
- Domain knowledge of the field of focus
- A sufficiently concrete but holistic set of metrics to define success
- Audit and checks on data, code and impacts of application (on social, environmental and business dimensions)
- Regulations that direct innovation toward the common good
- Diverse teams in backgrounds (studies, academia, industry...) and societal groups
- A scientific and moral mindset: be rigorous, and don't be evil
With all of the above, we are much more likely to use this powerful and versatile tool wisely; it can help make a better world, if we design and use it correctly.