Personally Identifiable Information (PII) is defined in Wikipedia as “Information that can be used on its own or with other information to identify, contact, or locate a single person, or to identify an individual in context.” So we are talking about names, addresses, phone numbers, e-mail addresses, and so on: any data that can be used to identify an individual, or that can tie an individual to other records that reveal personal information about them. Show this data to the wrong person in your organization and you may be breaking the law, and be subject to fines and the bad publicity of a privacy breach. This is why more and more hospitals, government agencies, and other organizations dealing with sensitive customer data are turning to Scribe to help solve their data quality and data privacy problems. I will share some of the insider tricks for modifying the data in such a way as to keep all the critical data stores clear of private and regulated data.

In the context of our work we are often tasked with removing the PII from a particular set of data. This might be because an organization needs to set up a test, development, or training environment that accurately represents the production data set without using the actual data, because the actual data has PII in it. PII is fine if the user has a legitimate need to access the information. However, if a developer, for example, were working on a hospital’s systems, we wouldn’t want that developer to see all the e-mail addresses and phone numbers of the patients.

I will show you some of the methods and tricks that can be used to maintain the integrity of the data without revealing the PII. This is primarily an exercise in formulas. In the end, what we want to do is convert our PII into new field values that still allow you to perform lookups and other application functions on the data.
Additionally, I have often found that once we fix information like names and e-mail addresses, the remaining data is no longer considered PII: with no connection left between names and addresses, the address data can remain untouched. Most organizations that have very strict PII rules have Data Stewards who can help you define the specific information that needs to be scrubbed.

The challenge in a project like this is to maintain the value of the data for testing the target application's functions. One cannot just randomize all the data and then expect look-ups to work properly, or the contact list in the target app to sort by last name. So we need to strike a balance between the two needs: anonymity and quality.

So let's pretend that we are going to clean my contact information of PII. We are starting with:

S1: First Name: Pierre
S2: Last Name: Hulsebus
S3: Company Name: Scribe Software
S4: Address: 1570 Elm St
S5: Phone Number: (603)488-6528
S6: Postal code: 03104

First, I suggest taking the left 3 characters and upper-casing them, which looks like this:

Upper(LEFT(S1,3))

which would return:

S1: First Name: PIE
S2: Last Name: HUL

In a large data set this may result in some duplicates, so I suggest appending the row number to the data. The ROWNUM() formula counts which row of the data we are on and returns that value. For example, if this name was on the 39th row, it would return the number 39. So the resulting formula would now look like this:

Upper(LEFT( S1,3 ))&ROWNUM()

which would return:

S1: First Name: PIE39
S2: Last Name: HUL39

This same methodology could be applied to other PII in the source, like the company name and address:

S3: Company Name: Scribe = SCR39
S4: Address 1: 1570 Elm St = 15739

Phone numbers and e-mail addresses will need similar treatment with some minor modification. I suggest reusing that same row number and replacing digits in the string with its digits.
This most likely would add enough pseudo-random data to make the data non-identifiable while maintaining the format. I would start with a formula like this:

FORMAT( LEFT( STRIP( ROWNUM()&S5, "N" ),10 ), "(###)###-####" )

S5: Telephone 1: (603)488-6528 = (396)034-8865

Applied to a Zip code with the same idea:

FORMAT( LEFT( STRIP( ROWNUM()&S6, "N" ),5 ), "#####" )

S6: Postalcode: 03104 = 39031

So in the target system, the data with PII removed would look like this:

First: PIE39
Last: HUL39
Company Name: SCR39
Address: 15739
City: MAN39
Postal Code: 39031
Phone: (396)034-8865

I have used this methodology to successfully strip entire data sets of their PII while still allowing developers, outside contractors, and customer employees to access the system safely, without compromising the high standards the customers required.
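
For readers who want to try the idea outside of Scribe, here is a minimal Python sketch that approximates the formulas above. It is not Scribe code: the function names mask_text and mask_digits are hypothetical, the row number is passed in by hand instead of coming from ROWNUM(), and the behavior of STRIP(..., "N") (keep only the digits) is an assumption inferred from the example output in this post.

```python
import re


def mask_text(value: str, row: int, keep: int = 3) -> str:
    """Approximates Upper(LEFT(value, 3)) & ROWNUM().

    Keeps the first `keep` characters, upper-cases them, and appends
    the row number so that values stay unique within the data set.
    """
    return value[:keep].upper() + str(row)


def mask_digits(value: str, row: int, pattern: str) -> str:
    """Approximates FORMAT(LEFT(STRIP(ROWNUM() & value, "N"), n), pattern).

    Prefixes the row number, keeps only the digits (assumed meaning of
    STRIP with "N"), then fills each '#' in `pattern` with the next digit.
    """
    digits = re.sub(r"\D", "", f"{row}{value}")
    needed = pattern.count("#")
    if len(digits) < needed:
        raise ValueError("not enough digits to fill the pattern")
    remaining = iter(digits[:needed])
    return "".join(next(remaining) if ch == "#" else ch for ch in pattern)


if __name__ == "__main__":
    row = 39  # stand-in for ROWNUM()
    print(mask_text("Pierre", row))                             # PIE39
    print(mask_text("Hulsebus", row))                           # HUL39
    print(mask_digits("(603)488-6528", row, "(###)###-####"))   # (396)034-8865
    print(mask_digits("03104", row, "#####"))                   # 39031
```

Because the row number drives every masked value, the same source row always produces the same output, which is what keeps lookups between related fields working in the scrubbed data set.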