MENA 2024 Recordings
Big Data and How to Use It
Video Transcription
AI Hub at the American University of Beirut in Lebanon. I was truly looking forward to participating in person, but unfortunately, due to the ongoing conflict in Lebanon and the travel uncertainties, I was not able to join you this year. I hope that next year I will be able to attend in person. Meanwhile, let's dive into the fascinating world of big data. In this talk I will take you on a tour of some data topics that are dear to my heart. I will try to be as brief as possible, because each topic listed here could be the subject of a standalone workshop. I will begin by defining data and its key characteristics, then explore how we can unlock the potential in data, and conclude by addressing the unique challenges and critical considerations for working with data in medicine.

So what is big data? Perhaps we should start with what a data point is. Simply put, a data point is a single measurable unit of information within a dataset. It represents a specific attribute or variable and serves as a fundamental building block for analysis and interpretation. Data points are critical in fields like statistics, machine learning, and data science because, collectively, they enable the modeling, prediction, and understanding of complex systems. Each point contributes to the broader picture, helping to identify patterns, validate hypotheses, and inform decision-making. In the context of healthcare, we can name many potential benefits and applications; among those shown here are personalized medicine, predictive analytics, precision medicine, drug discovery, population health management, clinical decision-making, clinical research, disease prediction, and outbreak control.

If we look more closely at the amount of data generated globally in 2023, the healthcare industry is indeed a big contributor. In 2023 alone, the world created about 120 zettabytes; a zettabyte is one sextillion (10^21) bytes, which is a huge number. Daily, we have been producing around 337,000 petabytes, individual sources can contribute on the order of 16 terabytes a day (terabytes being a computing unit most of us have heard of before), and hospitals have generated around 50 petabytes, yet another enormous amount of data. As a fun exercise, imagine storing the data produced in 2023 on iPads with a capacity of 2 terabytes each and a thickness of roughly 5.9 millimeters. How tall do you think the stack would be? Stacked together, the iPads would form a tower about 354,000 kilometers high, just short of the actual Earth-to-Moon distance of about 384,000 kilometers. This enormous tower symbolizes the vast potential of this data: how much it can impact medical care and how profoundly it can benefit humanity if it is handled well.

So let's get started with the characteristics of big data. The first one that comes to mind, since we were just discussing the 2023 volume, is sheer size: when we talk about big data, we are referring to very large datasets. Medical devices, health trackers, and genomic sequencing, for example, are all producing biological datasets at an unprecedented scale.
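As a rough back-of-the-envelope check of the iPad-tower figure above, here is a small calculation sketch. The 5.9 mm per-device thickness is an assumption on my part (roughly the thickness of a recent iPad); the storage figures are the ones quoted in the talk.

```python
# Back-of-the-envelope check of the "iPad tower" figure quoted above.
# Assumptions: 120 zettabytes of data, 2 TB per iPad, ~5.9 mm per device.

data_2023_bytes = 120e21          # 120 zettabytes (1 ZB = 10**21 bytes)
ipad_capacity_bytes = 2e12        # 2 terabytes per iPad
ipad_thickness_mm = 5.9           # assumed thickness of one iPad, in millimetres

ipads_needed = data_2023_bytes / ipad_capacity_bytes
tower_height_km = ipads_needed * ipad_thickness_mm / 1e6   # mm -> km

earth_moon_km = 384_000           # approximate average Earth-Moon distance

print(f"iPads needed: {ipads_needed:.2e}")          # ~6.0e10 devices
print(f"Tower height: {tower_height_km:,.0f} km")   # ~354,000 km
print(f"Earth-Moon distance: {earth_moon_km:,} km")
```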
Another characteristic is velocity, the speed at which the data is generated: with big data we are talking about very high rates of generation. Here is a quick example. Abbott reported 5.2 million users of its FreeStyle Libre system by June 2023, up from 4.6 million at the end of 2022, which indicates a velocity of roughly 600,000 new users in about six months.

Another characteristic of big data is variety, which refers to the nature of the data. Data can be structured, meaning it arrives in a known form expected by a particular database format; it can be semi-structured or unstructured, meaning it cannot be handled through a formal, predefined data structure; and it can come from heterogeneous sources. If we look, for example, at data generated by glucose monitoring devices, coupled with hormonal test results and electronic health records, with millions of diabetic patients globally contributing to the dataset, we are talking about very rich variety indeed.

Veracity is the fourth characteristic, and it describes the trustworthiness and quality of the data. When we are dealing with vast amounts of data, good quality is essential, because we want the models built on these datasets to be accurate and reliable enough to lead to correct decisions.

A fifth characteristic is value: the worth of data really lies in the insights, benefits, and business value it can provide. To be worth the effort of extraction, big data has to carry enough value to drive innovation and competitive advantage.

Finally, variability is a very important characteristic of big data because it reflects the inconsistency that can be present in the data. Data flows can be highly inconsistent, with periodic peaks and unusual events, and all of this makes the data load hard to manage. It is important to understand and manage variability to keep the data analysis accurate.

What we have just seen is what we call the six Vs. It is worth noting that, early on, big data used to be defined by three Vs: volume, variety, and velocity. These characteristics were later augmented by additional Vs introduced to better describe the complexity of the data we have today.

Let's do a quick, scenario-based test of your knowledge to see how we can identify the six Vs we just covered. Imagine a patient using a continuous glucose monitoring (CGM) device that generates a glucose reading every five minutes. You, as the clinician, are collecting this information along with monthly lab reports and patient-reported dietary logs. Looking at the six Vs we discussed, how would we best describe them in this specific scenario? All six are present. The volume is there: you have continuous data streaming from the CGM device. The velocity is the five-minute interval, the speed at which the data is generated. The variety is there because, as a clinician, you are using heterogeneous data types: real-time glucose readings, lab reports, and dietary logs. Veracity is also part of this scenario, because we need to ensure that the patient is recording their dietary information accurately. The value of this data is enormous for adjusting insulin therapy, and variability certainly exists as well; it is captured by the uncertainty around factors that are not recorded yet still influence the glucose measurements.
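To put some rough numbers on the volume and velocity in this CGM scenario, here is a small back-of-the-envelope sketch. The bytes-per-reading figure and the patient count are purely illustrative assumptions, not figures from the talk or from any device specification.

```python
# Rough scale of the CGM scenario above: one glucose reading every five minutes.
# The bytes-per-reading and patient-count figures are illustrative assumptions.

minutes_per_day = 24 * 60
readings_per_day = minutes_per_day // 5          # 288 readings per patient per day
readings_per_year = readings_per_day * 365       # ~105,000 readings per patient per year

bytes_per_reading = 100                          # assumed size of one stored reading
patients = 1_000_000                             # assumed patient population

yearly_bytes = readings_per_year * bytes_per_reading * patients
print(f"Readings/day per patient: {readings_per_day}")
print(f"Readings/year per patient: {readings_per_year:,}")
print(f"Raw CGM data per year for {patients:,} patients: ~{yearly_bytes / 1e12:.1f} TB")
```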
Okay, so now let's look at how we can unlock the actual value of data. We have all heard that data is the new oil; I would say that fuel (at least before the shift toward renewables) is to the car what data is to AI. So whenever we think about the value and the challenges of big data, they are closely linked to the value and challenges of an AI system.

Let me take a small detour here and define some terms before we proceed, because AI and machine learning are often used interchangeably, and as technologists it is important for us to identify the differences. AI is the simulation of human intelligence in machines, enabling them to perform tasks such as reasoning, learning, and decision-making. Machine learning is the subset of AI that uses algorithms to enable systems to learn and improve from data without being explicitly programmed. Deep learning is the subset of machine learning that leverages artificial neural networks, which to a large extent resemble human neurons; these networks have multiple layers that process and analyze complex data patterns, and that data is typically big data.

Here is another quick illustration, this time of the difference between programming and machine learning. In traditional programming, developers write the rules, the logic, for the computer to follow, so the output is dictated by the program's logic. In machine learning, we feed in the data together with the desired behavior or output, and it is up to the computer, the model, to learn from that data the associations, in effect a program, that will generalize to unseen scenarios. This is a big shift from the traditional programming environment, because it can handle complex and dynamic problems that are difficult to define with static rules.

Back from our detour, let's return to the value of data, looking at it from a market perspective and from a publication perspective. On market size, the numbers are really high: the revenue forecast for 2030 is on the order of $180 billion. The market value reached in 2023 just for the US, aggregating hardware, software, and services, was about $9.7 billion; and if we consider the global AI market in healthcare in general, lumping together machine learning, natural language processing, computer vision, and affective computing, the number rises to $19.3 billion. These are big values. Looking at scientific publications, as a technologist who has been working on AI for more than 16 years and championing much of our effort in AI in medicine, I am very pleased to see that the surge in academic and scientific publication is also significant: according to a PubMed survey and scoping review, there were about 23,000 papers in 2023, up from a total of roughly 5,000 for the whole period from 2000 to 2009.
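To make the contrast drawn above between traditional programming and machine learning concrete, here is a minimal sketch. The glucose threshold, the toy data, and the choice of scikit-learn's LogisticRegression are illustrative assumptions rather than anything from the talk.

```python
# Traditional programming vs. machine learning, on a toy glucose example.
# The rule, the data, and the model choice are illustrative assumptions only.
from sklearn.linear_model import LogisticRegression

# --- Traditional programming: the developer writes the rule explicitly ---
def flag_hyperglycemia_rule(glucose_mg_dl: float) -> bool:
    return glucose_mg_dl > 180.0    # hand-coded threshold dictates the output

# --- Machine learning: the rule is learned from example data instead ---
X = [[95], [110], [150], [190], [220], [260]]   # glucose readings (mg/dL)
y = [0, 0, 0, 1, 1, 1]                          # 0 = normal, 1 = flagged by a clinician

model = LogisticRegression().fit(X, y)          # the model infers the decision boundary

print(flag_hyperglycemia_rule(200))             # True, by the hand-written rule
print(model.predict([[200]]))                   # learned prediction for the same reading
```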
Now that we have seen the value and the volume of data, let's see how we can unlock value in big data. What tools do we have? Visualization is the first that comes to mind, because it can transform complex datasets into intuitive graphical representations, letting us grasp insights directly and identify patterns that could otherwise be missed in raw data. AI and machine learning can obviously enhance data analysis: they automate processes, uncover hidden patterns, and make predictive analysis very efficient. And of course statistical analysis, which has been the domain of medical statisticians and researchers for many years, remains a very important way to unlock value in big data, because it helps us understand the data distribution and infer conclusions about the underlying population.

While each of these three elements can unlock value on its own, the most powerful approach is to combine them. By combining visualization, AI, and statistics, the healthcare industry can really enhance decision-making: complex data can be visualized while statistical methods are integrated, giving improved efficiency and faster identification of trends and anomalies, and leading to deeper insights because hidden patterns and correlations that would probably have been missed by traditional methods are uncovered. All of this ultimately allows healthcare providers to deliver more personalized, more efficient, and more effective care, which improves patient satisfaction and health outcomes in general.

What steps are recommended when working to unlock value in big data? First, make sure the objective is very well defined: what is the goal of using the data? Is it increasing revenue, improving health outcomes, or something else? It is important that data collection follows a clear plan, covering the devices involved and the overall data management plan. Data processing and cleaning matter as well, because the data has to be prepared for analysis; even if a deep learning approach is going to be used, you still have to look into how the data was acquired and processed. The data analysis itself needs to be well designed, so that the machine learning method selected is the one best suited to the specifics of the data. And finally, decision-making should be formulated so that it informs strategies and automates processes.
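As one simplified illustration of how those steps (objective, collection, cleaning, analysis, decision-making) might look in code, here is a minimal sketch using pandas and scikit-learn on synthetic data. The column names, the model choice, and the "needs review" objective are all assumptions made for illustration, not part of the talk.

```python
# A minimal sketch of the workflow above, on synthetic data:
# define the objective, collect, clean, analyse, then support a decision.
# All column names, values, and thresholds here are illustrative assumptions.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1. Objective: predict which patients need an insulin-therapy review.
# 2. Data collection: a tiny synthetic table standing in for real sources.
df = pd.DataFrame({
    "mean_glucose": [105, 160, 210, 140, 250, 120, 195, 230],
    "hba1c":        [5.6, 7.1, 8.4, 6.5, 9.0, 5.9, 7.8, 8.8],
    "needs_review": [0,   1,   1,   0,   1,   0,   1,   1],
})

# 3. Processing and cleaning: drop incomplete records (none in this toy set).
df = df.dropna()

# 4. Analysis: fit and evaluate a simple model.
X, y = df[["mean_glucose", "hba1c"]], df["needs_review"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X_train, y_train)
print("Hold-out accuracy:", accuracy_score(y_test, model.predict(X_test)))

# 5. Decision-making: surface flagged patients for clinician follow-up.
df["review_flag"] = model.predict(X)
print(df[df["review_flag"] == 1])
```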
So what do we have at hand in terms of technologies for dealing with big data? There are many, grouped here into families or flavors, and I will not go over all of them for the sake of time. For example, if the major concern of your big data task is storage, there are many data storage technologies on the market; the same goes for analytics frameworks, data integration, big data querying (so that retrieval from the database is fast), machine learning and AI frameworks, stream processing technologies, data visualization, data governance, and monitoring and logging tools; and likewise if you are after a cloud-based big data platform or big data workload automation.

There are a lot of tools in each of these categories. I have compiled only a small list here, because I do not want to clutter your brain and your eyes with too many companies. For data storage and processing, Apache Hadoop, which you have probably heard of, and MapReduce are good solutions. Among processing and analytics frameworks there are, for example, Hadoop, Amazon Web Services, Google Cloud, Microsoft Azure, and so on. And here are some frameworks for machine learning and AI, such as TensorFlow, PyTorch, scikit-learn, Keras, and H2O.

We have been talking as if the sky were the limit for big data and AI and what they can do for healthcare, but this also comes with a lot of challenges that need to be acknowledged and worked on. Remember that data really represents society, with its inequities and cultural prejudices; we always say that data has a political, social, and economic dimension. Data is presumed to accurately represent the world, yet there is a gap: little or no data from certain communities, especially along the lines of race, economic status, and gender. This gap is what creates bias in data: the systematic error and under-representation in the datasets on which decision-making algorithms are trained, validated, and tested. This bias significantly affects the performance and fairness of AI systems and can lead to unintended and potentially harmful outcomes, so it is very important to understand its sources and implications in order to develop more ethical, equitable, and reliable AI systems.

In general, bias is not single-sourced; it can enter at every building block of the AI workflow, so at each block it is important to look for sources of bias and see how to mitigate them: from the initial data analysis and the collection of the data, to the selection of features, to the training of the model (where possible, selecting models that are less biased than others), to validation, where fairness metrics put additional pressure on the system to be fair. Then, of course, ongoing monitoring and auditing are essential because of potential data drift and the like, along with keeping the human in the loop, meaning the expert remains involved in the outcome of the AI system.

All of this needs to be done because systems built on healthcare data should enhance patient trust, mitigate risk, promote accountability, and ensure fair treatment. For that, we at least have to make sure that patient privacy is safeguarded. This can be done by removing personal identifiers from the data so it cannot be traced back to the individual, recording only the data that is necessary, and making sure the patient is aware of how his or her data will be used, with informed consent that can be withdrawn at any time.
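As a hedged illustration of the privacy safeguards just described (removing identifiers and keeping only necessary data), here is a small pandas sketch. The column names and the salted-hash pseudonymization scheme are illustrative assumptions; real de-identification must follow the applicable regulatory requirements (for example, GDPR or HIPAA) and your institution's policies.

```python
# Minimal de-identification sketch: remove direct identifiers, keep only what
# is needed, and replace the patient ID with a non-reversible pseudonym.
# Column names and the salting/hashing scheme are illustrative assumptions.
import hashlib
import pandas as pd

records = pd.DataFrame({
    "patient_id":   ["P001", "P002"],
    "name":         ["Alice Example", "Bob Example"],
    "phone":        ["+961-000000", "+961-111111"],
    "birth_year":   [1980, 1975],
    "mean_glucose": [145.0, 190.0],
})

SALT = "replace-with-a-secret-salt"   # kept separately from the shared dataset

def pseudonymize(patient_id: str) -> str:
    """Map a patient ID to a short, non-reversible pseudonym."""
    return hashlib.sha256((SALT + patient_id).encode()).hexdigest()[:12]

deidentified = (
    records
    .assign(pseudo_id=records["patient_id"].map(pseudonymize))
    .drop(columns=["patient_id", "name", "phone"])   # drop direct identifiers
    [["pseudo_id", "birth_year", "mean_glucose"]]    # keep only necessary fields
)
print(deidentified)
```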
Now let's look at another quick scenario, this time to see bias and the misdiagnosis that can result from it. Suppose you are contemplating acquiring an AI system that manages diabetic patients. The system is very well trained and reports high accuracy, which sounds amazing, but it was developed using a dataset that predominantly comes from patients of a different ethnic group than those your own clinic serves, and you are planning to buy this system and use it in a clinic whose patients are mostly from minority groups. In this scenario, where is the source of bias, what misdiagnoses could result, and how could the approach be improved? The system is biased with respect to the minority groups who will actually be using it: the misdiagnosis risk comes from the fact that it was trained on a different dataset, so when it is applied to the minority groups it may misidentify their symptoms. The best remedy is to go and gather data from a wider range of groups and demographics.

Because this is so important and so critical, there has been a lot of recent effort on governance for AI and data environments. I mention the EU here because they have probably been working on AI regulation earlier than any other group; the AI Act was enacted very recently. Also important to note are the Data Governance Act and the GDPR, which facilitate data sharing across the EU while making sure that data collected from private individuals is well maintained and processed under strict requirements. The WHO has also issued AI guidance, because it sees the value AI can add to medical practice and understands the importance of putting ethical frameworks and guardrails in place to make sure that effective and safe AI is deployed. What we are really talking about here is making healthcare providers and AI developers jointly accountable, with a clear line of responsibility for ethical considerations.

I would like to close with a quote you have probably also heard: AI won't replace you, but someone doing AI will. As a certified assessor in ethical AI and an advocate for AI governance, I want to emphasize further the idea we discussed: the human in the loop and AI together will deliver better care, and AI should be aligned with human ethics and standards. It is up to you, the practitioners, and us, the developers, to make sure that alignment is there. I really thank you for your time. You can find me by email at mariette.awad@aub.edu.lb or on LinkedIn under my profile, Mariette Awad; I would love to hear your perspective as medical specialists looking to include AI in your practice. I would also like to close by saying that I have no conflict of interest with any of the medical or commercial names and companies mentioned in my presentation. Thank you.
Video Summary
The speaker addresses the intricacies and vast potential of big data, particularly in the context of healthcare, while acknowledging the ongoing conflict in Lebanon as a barrier to their in-person participation in an event. They discuss the six Vs of big data: volume, velocity, variety, veracity, value, and variability, shedding light on how these characteristics impact data analysis and decision-making. Emphasizing big data's role in healthcare, the speaker envisions advancements such as personalized and precision medicine, predictive analytics, and improved clinical decision-making. They advocate combining visualization, AI, and statistical analysis to enhance these benefits. Notably, the speaker highlights the ethical challenges, potential biases, and the need for governance frameworks in AI and data usage to ensure fair, equitable, and accountable systems. They underscore the vital role of collaboration between AI developers and healthcare providers to ensure AI aligns with human ethics and improves patient care outcomes.
Keywords
big data
healthcare
AI ethics
predictive analytics
data governance
personalized medicine