Ethical user data collection and machine learning

Today on Cyber Work, Ché Wijesinghe of Cape Privacy talks about the safe and ethical collection of user data when creating machine learning or predictive models. When your bank weighs whether to give you a loan, it can make a better choice the more info it has about you. But how secure is that contextual data? Hint: not as secure as Wijesinghe would like!

– Get your FREE cybersecurity training resources: https://www.infosecinstitute.com/free
– View Cyber Work Podcast transcripts and additional episodes: https://www.infosecinstitute.com/podcast

  • 0:00 – Machine learning and data collection
  • 2:37 – Getting started in cybersecurity
  • 3:15 – Being drawn to big data
  • 4:35 – What data is driving decision-making?
  • 9:04 – How is data collection regulated?
  • 15:02 – Closing the encryption gap
  • 16:50 – Careers in data privacy
  • 19:07 – Where can you move from data privacy?
  • 21:20 – Ethics of data collection
  • 23:25 – Learn more about Wijesinghe
  • 23:55 – Outro

  • Transcript
    • [00:00:00] Chris Sienko: Cyber Work listeners, I have important news before we dive into today’s episode. I want to make sure you all know that we have a lot more than weekly interviews about cybersecurity careers to offer you. You can actually learn cybersecurity for free on our InfoSec skills platform. If you go to infosecinstitute.com/free and create an account, you can start learning right now.

      We have 10 free cybersecurity foundation courses from podcast guest Keatron Evans, 6 cybersecurity leadership courses from fellow podcast guest Cicero Chimbonda, 11 courses on digital forensics, 11 courses on incident response, 7 courses on security architecture, plus courses on DevSecOps, Python for cybersecurity, JavaScript security, ICS and SCADA security fundamentals and more. Just go to infosecinstitute.com/free and start learning today. Got it? Then let’s begin today’s episode.

      Today on Cyber Work, my guest is Ché Wijesinghe of Cape Privacy. Ché and I have a great talk about the safe and ethical collection of user data when creating machine learning or predictive models. When your bank is weighing whether or not to give you a loan, they can make a better choice the more info they know about you. But you have to ask, how secure is that contextual data? Hint, it’s not as secure as Ché would like. So, get the inside data privacy scoop today on Cyber Work.

      [00:01:30] CS: Welcome to this week’s episode of the Cyber Work with Infosec Podcast. Each week, we talk with a different industry thought leader about cybersecurity trends, the way those trends affect the work of Infosec professionals, and offer tips for breaking in or moving up the ladder in the cybersecurity industry.

      Ché Wijesinghe has over 25 years of experience in the enterprise software industry as a senior executive and as an entrepreneur. Ché has a proven track record of building teams that deliver significant revenue growth and demonstrable business benefit. He was previously SVP of OmniSci, Global Head of Data Analytics Sales at Cisco, and EVP at Composite Software. So, with more and more regulations around data privacy and data collection going into place, the field of data collection for the purpose of machine learning models, by comparison, still feels kind of like the Wild West. So, based on the stream of data that is being accessed by those technologies and companies, Ché believes there’s an opportunity to improve how and where data is used. So, these types of fine-grained and ethical considerations are our bread and butter on Cyber Work, so I’m looking forward to getting into it.

      Ché, thanks for joining me today. Welcome to the show.

      [00:02:35] CW: Hi, Chris. Nice to be here.

      [00:02:36] CS: First question, just to get a little background on you. Where did you first get interested in computers and tech and what got you first excited about cybersecurity? What was the initial draw?

      [00:02:48] CW: Sure. I was given my first computer in 1981 by my mom. As a small child, I spent hours learning how to code and play video games. That paved the way for my future undergraduate studies and career in the software industry. While I was at Cisco, I became interested and involved in cybersecurity software. The business and technical challenges around securing and protecting data are massive, and I truly appreciate that it’s a fast-moving and fascinating field.

      [00:03:12] CS: Oh, yeah. Now, what was the initial appeal of big data and data analytics? What were some of the hurdles and projects and milestones along the way that got you to where you are now? Because I mean, obviously, one day you look and you’re like big data, this is for me. What were some of the first projects that you did in that space?

      [00:03:29] CW: Yeah, I first started working on data and analytics projects while I was at Deloitte Consulting. I was involved with large enterprise data warehouse implementations. Those projects were resource and time intensive. The implementations took months, sometimes spanning over a year, to complete complex solutions that involved both IT and the business aligning on a transformation initiative. I also spent a number of years at Composite Software, where we tackled difficult big data and data integration challenges for many global companies. I think it was there that I truly began to appreciate the growing problems associated with data acquisition and data management.

      [00:04:08] CS: Sorry, what year did you say that was roughly?

      [00:04:12] CW: So, 1997 was when I was at Deloitte and then subsequently, 2006 at Composite Software.

      [00:04:21] CS: Gotcha. So, let’s start at the beginning here when we – because a lot of our listeners are kind of new to the field or just trying to dip their toe in and see what areas of cyber are interesting to them. Let’s just start at the beginning. When we speak about, as we discussed before the show, user data as it regards machine learning, analytics and making technical decisions, what type of data are we speaking about? Can you give me a concrete example of a type of data that’s regularly used to drive decision making through modeling?

      [00:04:51] CW: Sure, Chris. If we use like a really common example everyone can understand, getting credit from a bank. 20 years ago, this process was done manually using paper. 10 to 15 years ago, it moved to electronic submissions of documents through email, and rules-based processes to help automate the application. Today, banks can access customer financial data much more readily based upon data access, such as credit scores, that are so much more easily available. Regardless, all of these decisions are based on a past history.

      Now, imagine if you could predict in a privacy preserving manner which customers would be better longer-term customers based on data that today is less accessible due to its sensitive nature?

      [00:05:32] CS: Could you give me some examples of these less accessible types of data?

      [00:05:37] CW: Yeah, I mean, it’s really anything that’s confidential. So, your customer information, all of your credit history, your purchases, where you shop, where you have loans and mortgage, outstanding credit.

      [00:05:52] CS: So, you’re also sort of tracking things like, “Oh, my credit score got really good just in time to buy a house.” But at the same time, if you look at the payment history, that was a, that was a recent development or something like that.

      [00:06:04] CW: Yeah, a great example, one that I often use, is you have a married couple, and let’s say that the wife applies for the loan, but the husband actually has terrible credit. And so, it would literally be about being able to link the wife’s credit with the husband’s credit, in order to be able to make that credit decision. These decisions, I mean, they’ll trawl through mountains of data sometimes. So, it’s about being able to have all that correlation of all that confidential data across family members, where there might be loans outstanding or other debt, and making sure that the person that’s applying for that loan, or that credit line, is actually someone that’s going to be a good customer.

      [00:06:50] CS: Yeah. Is above board, can at least answer your detailed questions. To that end, you said that it would be nice if these types of data points were available. But it sounds like that means that they’re not right now. Can you tell me about sort of where this is? If there are sticking points, or if there’s resistance, or sort of where are we at this very moment?

      [00:07:14] CW: I would argue, Chris, that the data is actually available, but it’s not being protected. So, we have situations where this data is in the cloud. You’ll use different cloud services, whether it’s iCloud or things like Alexa or Siri, where data is being transferred, but not necessarily privacy preserved. So, in the example of the bank loan, that information could be shared unencrypted, between different providers, between different partners. And so, people’s personal information is actually at risk of being breached.

      [00:07:49] CS: Gotcha. So now, is there a big difference between how data is protected via compliance regulations like CCPA or GDPR, versus how it’s used for things like machine learning models described here? Are there different regulations? It almost sounds like there’s not enough regulation to your taste in the way it’s done right now with the cloud and stuff. So, how are these types of use and levels of protection potentially different?

      [00:08:13] CW: Yeah, great question. I think regulations like GDPR and CCPA are just, frankly, not specific enough to include machine learning applications today. So, there is some good regulation, some good precedent, but we’ve still got a long way to go, big tech, as everybody knows.

      [00:08:30] CS: Yeah, because it’s two completely different applications of whether you’re storing it, or whether you’re using it. I use it for modeling and stuff.

      [00:08:37] CW: Yeah, I mean, there really needs to be clear regulation to preserve the privacy and confidentiality of data used for artificial intelligence and machine learning, especially when you consider the volumes of data being collected every day, Alice, Siri, Alexa, et cetera, right?

      [00:08:53] CS: Yeah. Okay, go ahead, sorry.

      [00:08:54] CW: I was just going to say, but regulations are only part of the solution. We can also implement better technology to protect the privacy of the data.

      [00:09:02] CS: Yeah, well, that’s great, because that was my next question here. I wanted to ask if you could sort of speak about how today’s computer devices and enterprise platforms are being regulated against collections of user data, especially regarding the use of artificial intelligence and machine learning for data privacy? Are there things from a device level or from a cloud level or platform level that you would do better if you could?

      [00:09:25] CW: I think as many of us are aware, there is a data exhaust that’s being created by a myriad of connected devices. As an example, it could be petabytes of voice data. I mentioned platforms like Amazon Alexa, Google Assistant, or Apple’s Siri. Today, much of that data is potentially sent to the cloud unencrypted and transcribed into plain text, which could be used by machine learning and artificial intelligence applications for a whole host of applications and services. That could be for product recommendations and advertising, right?

      Again, one day, we’re talking to Alexa, and the next day, we’re getting pitched new products. So, at a minimum, from sort of our perspective, that data should be fully encrypted at rest, in motion and then, most notably, when it’s actually being processed or in use. If regulation and security were increased to enforce that level of data privacy, I think it would give consumers more confidence to share their data.

      [00:10:24] CS: Yeah, now, I feel like a lot of the sort of data privacy that’s happened in the sort of storage and collection space, that GDPR has – whether it’s been – you can argue whether it’s been well implemented, or whether it goes far enough or too far or whatever. But it sort of forced the hand in terms of like creating solutions, whether – I mean, at this point, everyone’s clicking accept cookies at all times and stuff. But do you think that a similar type of forcing the hand action would bring the tech into space, into the space here? Or does it need to sort of happen more organically?

      [00:11:10] CW: I think regulation absolutely could be a forcing hand. I think people should be doing it more because it’s the right thing to do, versus having regulation force it. But we saw this in 2008, with Dodd-Frank, where suddenly all the financial institutions had to provide the transparency and the regulatory controls were put in place. So, it could be very similar with data privacy.

      [00:11:38] CS: Okay. Do you get a sense that there’s maybe concern or fear about misusing user data and that is preventing some organizations from taking full advantage of data collection, machine learning modeling? Do you think there’s a sense that rather than do something wrong and get in trouble, rather just not dip our toe in the water at all? Or is it the other way where we’re all doing it and no one’s really thinking about the consequences, and then – yeah.

      [00:12:04] CW: I think there’s definitely a general concern amongst most companies about how to handle data privacy. Some are going ahead within the limitations of the current laws and regulations, whereas others have legacy systems, which make this much more difficult. I think while the law is not all encompassing, there’s strong sentiment, and an ongoing ethical debate, about the use of private data. As I mentioned before, it shouldn’t be a question of asking for permission versus begging for forgiveness. We need to develop a framework to protect everyone’s privacy by default.

      [00:12:34] CS: Yeah, that makes sense. Now, from a technical consideration, you mentioned just sort of end to end encryption. But can you walk me through what a technical base – once the “regulations” have come through, and we’ve decided to move forward with intentionality here, what would the technical considerations be to sort of secure all this data better? Can you build like an example model for me in your head?

      [00:13:04] CW: Sure. I mean, if we talk about security of data in the cloud, as an example, that’s probably one of the most important considerations, especially speaking with a lot of CISOs, Chief Information Security Officers, and they’re implementing things like tokenization technologies to protect sensitive information. While that solution actually does mostly secure the data, it also makes the data completely unusable unless you decrypt it, or detokenize it. So obviously, that creates security risks. I think what we’re hearing is that, despite the chance of a data breach, which could have both reputational and other significant consequences for any firm, companies continue to make this mistake, which is something that needs to be addressed.
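
      [Illustrative sketch, not part of the conversation. The trade-off described above, where tokenized data is protected but cannot be used until it is detokenized, might look roughly like this in Python. The `TokenVault` class, the `tok_` prefix and the sample record are all hypothetical names chosen for this example; production tokenization services rely on hardened, access-controlled vaults rather than an in-memory dictionary.]

```python
import secrets


class TokenVault:
    """Toy in-memory token vault (hypothetical, for illustration only)."""

    def __init__(self):
        self._token_to_value = {}
        self._value_to_token = {}

    def tokenize(self, value: str) -> str:
        # Reuse an existing token so equal values always map to the same token.
        if value in self._value_to_token:
            return self._value_to_token[value]
        token = "tok_" + secrets.token_hex(8)  # random, carries no information
        self._token_to_value[token] = value
        self._value_to_token[value] = token
        return token

    def detokenize(self, token: str) -> str:
        # Anyone who can call this sees the plaintext again.
        return self._token_to_value[token]


vault = TokenVault()
record = {"name": "Jane Doe", "credit_score": "712"}  # made-up sample data
protected = {field: vault.tokenize(value) for field, value in record.items()}

# The tokenized score is an opaque string, so it cannot be compared or averaged.
# To compute with it, the value must be detokenized, which re-exposes it.
recovered_score = int(vault.detokenize(protected["credit_score"]))
```

      [In this sketch, `protected` is safe to store or share, but any analytics step has to go back through `detokenize`, which is the exposure the speaker is pointing at.]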

      [00:13:53] CS: You think there’s any kind of blank as a service type situation here where – for people who might benefit from doing this kind of data modeling and so forth, but, I said before, are sort of squeamish about doing it. Do you have any advice for people who are jittery about these types of big projects to jump in and feel like they have sort of like some protection in mind or anything like that?

      [00:14:19] CW: I think companies that are not taking advantage of their data are likely losing competitive advantage. Companies that have better data are in a position to make more informed product recommendations, reduce customer churn, and better protect against financial fraud, as three great examples. There are many ways to implement data solutions with varying levels of security and data privacy. My personal recommendation for early adopters would be to start with a smaller, lower-impact dataset that can be piloted. Once that workflow has been proven, the process can be scaled up and over to higher-value data to derive broader and deeper business insights.

      [00:14:54] CS: Gotcha. So, this might be sort of – I think, this is kind of a Venn Diagram that we can discuss, but I want to sort of ask you from another angle here. So, you mentioned in our pre-interview context that you wanted to discuss some of the latest revelations around data privacy and why closing the encryption gap is so important for business. Can you speak about that term, when we talked about this, the hiring gap and the skills gap, and so forth, with the encryption gap, and what some of the primary issues with encryption or lack thereof are in today’s cybersecurity landscape?

      [00:15:24] CW: Sure. The encryption gap pertains to the fact that much of the data in the cloud is unencrypted. That means that, as I was mentioning earlier, if the data is breached, it could potentially be exposed as plaintext and human readable, which could obviously create significant consequences for anyone.

      Fundamentally, one of the primary issues with encryption is that when the data is encrypted, it becomes unusable and has little to no utility. Imagine a world where all of the data in the cloud is encrypted to protect privacy by default, and we could actually still derive value from the encrypted data. We’re not there yet, Chris. However, we need to continue innovating, especially on encryption technologies which allow the utility of the data for important functions like running AI predictions and other analytics.
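
      [Illustrative sketch, not part of the conversation. One long-established way to get some utility out of encrypted data is additively homomorphic encryption. The toy Paillier example below, in Python, adds two numbers while both stay encrypted. It is an assumption chosen for illustration and is not a description of Cape Privacy’s technology; the primes are far too small for real use, and the function names and sample figures are made up.]

```python
import math
import secrets

# Toy key material: two well-known primes, far too small for real security.
P, Q = 65537, 2147483647
N = P * Q                      # public key
N_SQUARED = N * N
LAMBDA = (P - 1) * (Q - 1) // math.gcd(P - 1, Q - 1)  # private: lcm(p-1, q-1)
MU = pow(LAMBDA, -1, N)                               # private: LAMBDA^-1 mod N


def encrypt(message: int) -> int:
    """Encrypt 0 <= message < N using generator g = N + 1."""
    while True:
        r = secrets.randbelow(N - 1) + 1  # random blinding factor in [1, N-1]
        if math.gcd(r, N) == 1:
            break
    # (N + 1)^message mod N^2 equals 1 + message * N.
    return (1 + message * N) * pow(r, N, N_SQUARED) % N_SQUARED


def decrypt(ciphertext: int) -> int:
    u = pow(ciphertext, LAMBDA, N_SQUARED)
    return (u - 1) // N * MU % N


def add_encrypted(c1: int, c2: int) -> int:
    """Multiplying two ciphertexts yields an encryption of the plaintext sum."""
    return c1 * c2 % N_SQUARED


enc_salary = encrypt(52_000)
enc_bonus = encrypt(4_500)
enc_total = add_encrypted(enc_salary, enc_bonus)  # no decryption happens here
total = decrypt(enc_total)                        # only the key holder does this
```

      [Here the party that computes `enc_total` never sees either input. Running full machine learning predictions on encrypted data needs considerably more machinery than a single addition, which fits the speaker’s point that the field is still maturing.]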

      [00:16:11] CS: I mean, when you say we’re not there yet, does this need to be sort of like a tech or an innovation jump? Or does it just need to be a lot more people sort of like buying in?

      [00:16:21] CW: I think it’s both, right? I mean, organizational change with change management, in and of itself, has fundamentally been one of the biggest blockers to implementing technology. And then even from a technology perspective, Cape Privacy deals with privacy-preserving machine learning capabilities, and we’re still very much in the early innings there while we’re proving the capabilities. Now, we’re scaling the capabilities so that they can be used by large companies and small companies as well, really, for everybody.

      [00:16:49] CS: Gotcha. Now, I want to sort of pivot over to the work side of the Cyber Work podcast and talk about careers in this space. So, from a work standpoint, do you have any tips or advice for students or cybersecurity career aspirants who want to work in the realm of data privacy? Are there some experiences in this day and age or self-initiated projects that they should be engaging in now to make themselves more desirable to potential employers?

      [00:17:14] CW: Yeah, I mean, one of the things that attracted me to the space was really the fact that cybersecurity is a fast-moving sector. Our successful hires have experience in building security, privacy, machine learning, and cloud technology platforms. I would say being familiar with the latest trends in technologies, in the areas of the business for both personal and professional development will be a great starting point, especially to be more desirable to potential employers.

      As an example, we recently re-platformed our entire software on Rust. So, that’s a great sort of technology that’s high in demand for new developers. So, learning Rust would be a great sort of angle there. Our team also has a lot of open source experience. If you’re new to the game, contributing to projects, as part of an open source community is a great way to get involved.

      [00:18:05] CS: Okay, so I’m guessing you probably do some hiring personally yourself. What are some things you’d like to see on a resume? Or how do you like to find out about potential people who could work for Cape Privacy? What are the things you have to see on a resume? Or what are the things that indicate that this person has the sort of passion or interest or can learn the tech as long as they have the interest or excitement about it?

      [00:18:29] CW: I think that’s the key, right? It’s the passion and the interest to learn, to show that they have the aptitude to adapt and change, especially for an early stage company like ours, to be able to pivot and to be able to turn on a dime. So, certainly in terms of what I look for in resumes, it’s going to be obviously fed. Their undergraduate, they’ve shown achievements during school, as I said, if they’ve contributed to open source projects. If they’ve taught themselves things like Rust or Go, Python, these sorts of things, taking that self-initiative would certainly be an indicator of a good hire.

      [00:19:06] CS: How about with regards to sort of moving up the ladder? Once you’re in the door, what are things that a data privacy person does to kind of level themselves up and take on more responsibility and sort of higher titles and so forth? What are some things you recommend in that regard?

      [00:19:23] CW: I think, again, our space is fairly new, but having experience obviously, with cloud technologies, and machine learning, artificial intelligence, the privacy segment is obviously something that’s developing rapidly. So again, if you’re sort of middle management and moving up, et cetera, someone that’s had years of experience with cloud infrastructure, with machine learning technologies, artificial intelligence, I would say, those are obviously some of the key things that we look for as well.

      [00:19:53] CS: Gotcha. Okay, so this has been great. As we wrap up today, Ché, can you tell us about your company Cape Privacy, the services you offer your clients, and some of the big updates or projects you’re looking forward to working on and unveiling in 2022?

      [00:20:07] CW: Sure. So, Cape provides a self-service cloud platform for running AI predictions on encrypted data, most specifically, without decryption. We’ve recently launched a new product that enables Snowflake users to do this securely. Cape’s getting great traction within the financial services industry, due to its highly regulated nature and the confidentiality of much of its data. However, we anticipate this growing into other industries. We’re getting a lot of interest from healthcare, telco, and most recently from the US Federal Government.

      [00:20:37] CS: Yeah, so I guess big data in this regard is going to be sort of across all of the platforms in the future here?

      [00:20:46] CW: Yeah, absolutely. I think every industry is dealing with the data exhaust, as we mentioned earlier, and I think that everyone has to address data privacy, and encryption, specifically. I mean, one of the things that we are trying to do is, again, adapt and evolve our technologies to make them more developer first, so that every developer, every engineer on the planet, I think there’s something like 20 million engineers, would be able to leverage our encryption technology as part of their infrastructure and their code base.

      [00:21:18] CS: Gotcha. Now, yeah, I mean, do you have any final thoughts in regards to just the sort of ethics of it and how you see these things going forward, and how you hope they are and what you’re afraid might happen with regards to going towards a universal adoption of the encryption gap and big data and so forth?

      [00:21:40] CW: I’ve said this already, but I’ll say it again. I think that, in a perfect world, all of our data in the cloud will be encrypted. So, in order to get some sort of utility from that data, that’s bulletproof, you’re going to need services like Cape’s technology in order to be able to run analytics, to run machine learning and AI. That’s really the transformation and pivot that we need to make. It’s a paradigm shift, frankly, Chris.

      We often use the example of the electric car when we’re talking about innovation. And 20 years ago, if Elon Musk or someone had taken the Tesla to the designers or the product team at Ford, they would have probably been politely asked to leave. Whereas now, everyone’s moving to the electric car. I think, fundamentally, that’s what we are now building with our technology. It’s innovation. No one’s looking for encrypted data in use as a fundamental capability, as an engineering team. So, everyone’s looking for a horse or a faster horse, and we’ve got the equivalent of the electric car now. It’s a paradigm shift for people to get their heads around doing this, because it is possible to be able to do machine learning on encrypted data and to have that as widespread and pervasive across the engineering base.

      [00:23:08] CS: Okay, that’s good to know. So, for people who are also looking to enter the space, you have to realize that you need to both have the technical know-how, and also have a bit of PR in you to sort of change hearts and minds here.

      [00:23:20] CW: Exactly, exactly. There’s a little bit of an evangelist.

      [00:23:23] CS: Absolutely. One last question for all of our listeners who want to learn more about Ché Wijesinghe, where should they go online?

      [00:23:32] CW: I don’t have a huge social media presence. I am on LinkedIn and I tweet rarely, but I’m also on Twitter. But feel free to connect with me on LinkedIn.

      [00:23:44] CS: And Cape Privacy is at?

      [00:23:45] CW: Capeprivacy.com.

      [00:23:47] CS: Okay. Well, Ché, thank you so much for joining me today. This was really fun.

      [00:23:50] CW: Absolutely, Chris. Thank you so much for the time.

      [OUTRO]

      [00:23:53] CS: As always, thank you to everyone who is listening to and supporting Cyber Work. New episodes of the Cyber Work podcast are available every Monday at 1 PM Central both on video on our YouTube page, and on audio wherever fine podcasts are downloaded.

      I want to make sure that you all know that we have a lot more than weekly interviews about cybersecurity careers to offer you. You can actually learn cybersecurity for free on our Infosec skills platform. So, please go to infosecinstitute.com/free, and if you create an account, you can start learning now. We’ve got free cybersecurity foundation courses, cybersecurity leadership courses, digital forensics, incident response, security architecture, DevSecOps, Python for cybersecurity, JavaScript security, ICS and SCADA security fundamentals and more. Again, go to infosecinstitute.com/free and start learning today.

      Thank you so much once again to Ché Wijesinghe and to Cape Privacy, and thank you all so much for watching and listening. We’ll speak to you next week.
