Introduction
One of the key aspects of AI relates to how it can be used (or abused) with sensitive data and privacy in general. In this article, we'll look at some key points on this topic. This article is not meant to be comprehensive but I'll do my best to keep it comprehensible.
Masking Personally Identifiable Information (PII)
Personally identifiable information, or PII for short, is central to privacy. It covers all those fields in a dataset that can help someone track down the particular individuals behind its records. In many cases, exposing it is not that catastrophic, but when it comes to medical and financial data, for example, people tend to be very protective of their PII. Masking PII is key to maintaining those people's privacy, and it's a common practice in data science. What most people do these days is substitute the sensitive values with something incomprehensible to the hackers, or omit those fields altogether, as this is a fairly easy and scalable way to keep the useful information in a dataset. However, masking PII properly is not always that easy.
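To make the substitute-or-omit approach concrete, here is a minimal sketch in Python. It assumes a record is a plain dictionary. The field names, the SECRET_KEY value and the helper names pseudonymize and mask_record are all hypothetical, chosen for illustration only. Substitution is done with a keyed hash (HMAC-SHA256), so the same input always maps to the same token without exposing the original value.

```python
import hashlib
import hmac

# Hypothetical secret key; in practice it would live outside the dataset.
SECRET_KEY = b"replace-with-a-real-secret"


def pseudonymize(value: str, key: bytes = SECRET_KEY) -> str:
    """Replace a PII value with a keyed hash, truncated for readability."""
    digest = hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return digest[:16]


def mask_record(record: dict, pii_fields: set, drop_fields: set = frozenset()) -> dict:
    """Return a copy of record with pii_fields substituted and drop_fields omitted."""
    masked = {}
    for field, value in record.items():
        if field in drop_fields:
            continue  # omit the field altogether
        masked[field] = pseudonymize(str(value)) if field in pii_fields else value
    return masked


# Made-up example record.
record = {"name": "Jane Doe", "phone": "555-0100", "zip": "98101", "balance": 1250.0}
masked = mask_record(record, pii_fields={"name"}, drop_fields={"phone"})
```

Note that the zip code and balance pass through untouched here, which is exactly the kind of leftover information the next section is concerned with.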
AI systems can dig up PII
So, where does AI enter the picture? Well, AI systems are quite sneaky these days. When used for this purpose, they can dig up PII in a dataset even if the sensitive data is not readily available. In other words, the data scientist may do her best to conceal this private data, but an AI system may manage to infer it anyway from the fields that remain. The reason is simple: AIs are exceptionally good at prediction, and a prediction doesn't need to be 100% accurate to do some real damage when it comes to privacy.
PII needs to remain obscure even if AI is applied
That’s why special care needs to be taken so that PII remains obscure, even if someone were to apply AI to that dataset. In other words, just like a cybersecurity expert will try to break his code before releasing it to the world as a secure solution, a data scientist ought to try to break her own dataset in terms of PII protection. At the very least, this can help the organization she works for avoid potential lawsuits or heavy fines.
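One simple way to run such a self-test, without any AI at all, is a linkage attack: try to match the masked records against an outside dataset using the fields that were left in. The sketch below is a non-AI stand-in for the idea, not a full audit. The function name linkage_attack, the field names and the sample rows are all made up for illustration.

```python
def linkage_attack(masked_rows, public_rows, quasi_identifiers):
    """Return (masked_row, name) pairs where a masked row matches exactly one public row."""
    # Index the public dataset by the values of the shared fields.
    index = {}
    for row in public_rows:
        key = tuple(row[q] for q in quasi_identifiers)
        index.setdefault(key, []).append(row["name"])

    hits = []
    for row in masked_rows:
        key = tuple(row[q] for q in quasi_identifiers)
        names = index.get(key, [])
        if len(names) == 1:  # an unambiguous match means re-identification
            hits.append((row, names[0]))
    return hits


# Masked dataset: names removed, other fields kept (made-up data).
masked_rows = [
    {"zip": "98101", "birth_year": 1984, "gender": "F", "diagnosis": "asthma"},
    {"zip": "98102", "birth_year": 1990, "gender": "M", "diagnosis": "diabetes"},
]
# Hypothetical outside dataset an attacker might hold.
public_rows = [
    {"name": "Jane Doe", "zip": "98101", "birth_year": 1984, "gender": "F"},
    {"name": "John Roe", "zip": "98102", "birth_year": 1990, "gender": "M"},
    {"name": "Jim Poe", "zip": "98102", "birth_year": 1990, "gender": "M"},
]

hits = linkage_attack(masked_rows, public_rows, ["zip", "birth_year", "gender"])
```

In this toy data the first masked record is re-identified, while the second one is protected only because two people happen to share the same zip code, birth year and gender.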
AI can be used to generate synthetic data containing no PII
Fortunately, AI can be used for good too, something most sci-fi films forget to tell us! Namely, certain AIs are adept at generating synthetic data, which is made up of artificial records rather than real people's records and so, when generated carefully, contains no PII. This data can then be used instead of the original PII-rich data. Also, if that (synthetic) dataset is leaked, it matters far less, since it should yield little of value to the hackers. AI systems that generate synthetic data typically belong to either the Variational Autoencoders (VAEs) or the Generative Adversarial Networks (GANs) family of models.
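A VAE or GAN is too much for a short snippet, so the sketch below uses a deliberately crude stand-in to show the workflow only: learn something about the real data, then sample brand-new records from what was learned. Here the "model" is just a mean and standard deviation per numeric column, sampled independently, which ignores correlations between columns and offers no formal privacy guarantee. The function names, column names and data are hypothetical.

```python
import random
import statistics


def fit_columns(rows, columns):
    """Estimate (mean, standard deviation) for each numeric column."""
    params = {}
    for column in columns:
        values = [row[column] for row in rows]
        params[column] = (statistics.mean(values), statistics.stdev(values))
    return params


def sample_synthetic(params, n, seed=0):
    """Draw n artificial records, sampling each column independently."""
    rng = random.Random(seed)
    return [
        {column: rng.gauss(mu, sigma) for column, (mu, sigma) in params.items()}
        for _ in range(n)
    ]


# Made-up "real" data; only the fitted statistics are used for sampling.
real_rows = [
    {"age": 34, "balance": 1250.0},
    {"age": 45, "balance": 3100.0},
    {"age": 29, "balance": 800.0},
    {"age": 52, "balance": 5400.0},
]

params = fit_columns(real_rows, ["age", "balance"])
synthetic_rows = sample_synthetic(params, n=5)
```

A real generative model would replace fit_columns and sample_synthetic, but the shape of the pipeline (fit on the sensitive data once, then work with samples) stays the same.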
AI can be used to identify potential PII-related fields
In addition to all that, AI can be used to identify potential PII-related fields in a dataset. These are fields that may elude a data scientist because they are more subtle than “name”, “address” and “phone number”, which are obvious PII fields. Sometimes, PII is contained in combinations of fields, something that may take a lot of effort and discernment for someone to figure out. Fortunately, that’s one of the areas AIs excel at.
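To illustrate how a combination of fields can act as PII, the sketch below measures how often each combination of fields pins down a single record. It is a brute-force heuristic rather than an AI model, and the function names, threshold and sample rows are assumptions made for this example.

```python
from itertools import combinations


def uniqueness(rows, fields):
    """Fraction of rows whose values on the given fields appear exactly once."""
    counts = {}
    for row in rows:
        key = tuple(row[f] for f in fields)
        counts[key] = counts.get(key, 0) + 1
    unique = sum(1 for row in rows if counts[tuple(row[f] for f in fields)] == 1)
    return unique / len(rows)


def risky_combinations(rows, fields, max_size=2, threshold=0.5):
    """List field combinations that single out at least `threshold` of the rows."""
    risky = []
    for size in range(1, max_size + 1):
        for combo in combinations(fields, size):
            score = uniqueness(rows, combo)
            if score >= threshold:
                risky.append((combo, score))
    return sorted(risky, key=lambda item: -item[1])


# Made-up data: no single field is unique, but every pair of fields is.
rows = [
    {"zip": "98101", "birth_year": 1984, "gender": "F"},
    {"zip": "98101", "birth_year": 1990, "gender": "M"},
    {"zip": "98102", "birth_year": 1984, "gender": "M"},
    {"zip": "98102", "birth_year": 1990, "gender": "F"},
]

risky = risky_combinations(rows, ["zip", "birth_year", "gender"])
```

In this toy dataset none of the three fields identifies anyone on its own, yet every pair of them singles out each record, which is the kind of subtlety that is easy to miss by eye.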
Pipelines dealing with highly sensitive data
Also, if the data scientist is considered to be a liability in a particular pipeline, an AI system can be used end-to-end in that process, mitigating the risk. Of course, such a system would need some supervision, but the person undertaking this task won’t have any contact with the sensitive data, so it won’t be as easy to take a peek. Naturally, this drastic approach would make sense for pipelines dealing with highly sensitive data that could tempt even an ethical data scientist. Such data may be characterized by high velocity, making the use of synthetic data an impractical option.
Learning more about AI in a data science setting
This article was based on just a couple of slides from the deck I've prepared for my next webinar. So, if you are so inclined you can register for it in 3 weeks (5/18) at 10 am EDT / 7 am PDT. You can learn more about it here. Cheers!