Transform, analyze, and discover insights from unstructured healthcare data using Amazon HealthLake

Healthcare data is complex and siloed, and exists in various formats. An estimated 80% of organizations’ data is unstructured or “dark” data that is locked away in text, emails, PDFs, and scanned documents. This data is difficult to interpret or analyze programmatically, which limits how organizations can gain insights from it and serve their customers more effectively. Because data is generated so quickly, organizations that don’t invest in document automation risk being stuck with manual legacy processes that are slow, error-prone, and difficult to scale.

In this post, we propose a solution that automates the ingestion and transformation of raw PDF files and handwritten clinical notes and data. We explain how to extract information from customers’ clinical data charts using Amazon Textract and then use the extracted raw text to identify discrete data elements using Amazon Comprehend Medical. We store the final output in a Fast Healthcare Interoperability Resources (FHIR) compatible format in Amazon HealthLake, so it is available for further analysis.

Solution overview

AWS offers a variety of services and solutions for healthcare providers to unlock the value of their data. For our solution, we process a small sample of documents using Amazon Textract and upload the extracted data as appropriate FHIR resources to Amazon HealthLake. We build a custom process for FHIR conversion and test it end-to-end.

Data is first loaded into DocumentReference. Amazon HealthLake then processes the unstructured text in each DocumentReference and loads the results into system-generated Condition, MedicationStatement, and Observation resources. We identify some data fields within FHIR resources, such as patient ID, date of service, provider type, and medical facility name.
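As an illustration of the DocumentReference loading step, the following is a minimal sketch of how extracted text could be wrapped in a FHIR R4 DocumentReference with a base64-encoded attachment. The function name, the sample text, and the patient ID are hypothetical, and the actual mapping in the GitHub repository may include more fields.

```python
import base64
import json


def build_document_reference(text: str, patient_id: str) -> dict:
    """Wrap extracted text in a minimal FHIR R4 DocumentReference."""
    # FHIR attachments carry inline content as base64-encoded bytes.
    encoded = base64.b64encode(text.encode("utf-8")).decode("ascii")
    return {
        "resourceType": "DocumentReference",
        "status": "current",
        # Hypothetical patient reference; a real pipeline would point at an
        # existing Patient resource.
        "subject": {"reference": f"Patient/{patient_id}"},
        "content": [
            {"attachment": {"contentType": "text/plain", "data": encoded}}
        ],
    }


doc_ref = build_document_reference(
    "Patient reports taking aspirin daily.", "example-patient-1"
)
print(json.dumps(doc_ref, indent=2))
```

Decoding the attachment data with base64 returns the original text, which is the content the integrated NLP would read.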

A MedicationStatement is a record of a medication being consumed by a patient. It may indicate that the patient is taking the medication now, has taken the medication in the past, or will take the medication in the future. This information is commonly captured during the history-taking process in the course of a patient visit or stay. The source of medication information may be the patient’s memory, a prescription bottle, or a medication list maintained by the patient, physician, or other party.

Observations are central to healthcare. They are used to support diagnosis, monitor progress, determine baselines and patterns, and even capture demographic characteristics. Most observations are simple assertions of name/value pairs with some metadata, but some observations group other observations together logically, or could even be observations of multiple components.
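To make the name/value idea concrete, here is a minimal sketch of a FHIR R4 Observation that pairs a coded name with a numeric value. The helper name, patient ID, and reading are illustrative assumptions. Resources that Amazon HealthLake generates through its integrated NLP may be shaped differently.

```python
import json


def build_observation(patient_id, loinc_code, display, value, unit):
    """Return a minimal FHIR R4 Observation: one coded name, one numeric value."""
    return {
        "resourceType": "Observation",
        "status": "final",
        # The "name" half of the pair, expressed as a LOINC coding.
        "code": {
            "coding": [
                {"system": "http://loinc.org", "code": loinc_code, "display": display}
            ]
        },
        # Hypothetical patient reference used for illustration only.
        "subject": {"reference": f"Patient/{patient_id}"},
        # The "value" half of the pair.
        "valueQuantity": {"value": value, "unit": unit},
    }


heart_rate = build_observation(
    "example-patient-1", "8867-4", "Heart rate", 72, "beats/minute"
)
print(json.dumps(heart_rate, indent=2))
```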

The Condition resource is used to record detailed information about a condition, problem, diagnosis, or other event, situation, issue, or clinical concept that has risen to a level of concern. The condition could be a one-time diagnosis in the context of an encounter, an item on the practitioner’s problem list, or a concern that does not exist on the practitioner’s problem list.

The diagram below shows the workflow for migrating unstructured data to FHIR for AI and machine learning (ML) analytics in Amazon HealthLake.

The workflow steps are as follows:

  1. A document is uploaded to an Amazon Simple Storage Service (Amazon S3) bucket.
  2. Uploading documents to Amazon S3 triggers an AWS Lambda function.
  3. The Lambda function sends the image to Amazon Textract.
  4. Amazon Textract extracts text from the image and stores the output in a separate Amazon Textract output S3 bucket.
  5. The extracted text is uploaded to Amazon HealthLake as specific FHIR resources (as base64-encoded text in DocumentReference). Amazon HealthLake then uses its integrated Amazon Comprehend Medical capability to extract meaning from the unstructured data for easy search and querying.
  6. Users can run interactive queries on the data using Amazon Athena.
  7. Users can create visualizations, perform ad hoc analysis, and quickly gain business insights using Amazon QuickSight.
  8. Users can make predictions with health data using Amazon SageMaker ML models.
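Steps 2 through 4 can be sketched with two small helpers: one that reads the bucket and object key from the S3 event that invokes the Lambda function, and one that turns Amazon Textract LINE blocks into per-page text. This is a minimal sketch rather than the repository's implementation. The function names and sample data are hypothetical, and the page default of 1 assumes a single-page response that omits the Page field.

```python
from collections import defaultdict
from urllib.parse import unquote_plus


def s3_objects_from_event(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def page_texts_from_blocks(blocks: list) -> dict:
    """Group Textract LINE blocks by page and join each page into one string."""
    lines = defaultdict(list)
    for block in blocks:
        if block.get("BlockType") == "LINE":
            # Assumption: a missing "Page" means a single-page document.
            lines[block.get("Page", 1)].append(block["Text"])
    return {page: "\n".join(text) for page, text in sorted(lines.items())}


# Hypothetical inputs shaped like an S3 event and a Textract response.
sample_event = {
    "Records": [
        {
            "s3": {
                "bucket": {"name": "example-bucket"},
                "object": {"key": "uploads/Sample+Medical+Record.pdf"},
            }
        }
    ]
}
sample_blocks = [
    {"BlockType": "PAGE", "Page": 1},
    {"BlockType": "LINE", "Page": 1, "Text": "Patient: Jane Doe"},
    {"BlockType": "WORD", "Page": 1, "Text": "Patient:"},
    {"BlockType": "LINE", "Page": 1, "Text": "Medication: aspirin 81 mg"},
    {"BlockType": "LINE", "Page": 2, "Text": "Follow up in two weeks"},
]

objects = s3_objects_from_event(sample_event)
pages = page_texts_from_blocks(sample_blocks)
print(objects)
print(pages)
```

Grouping by page yields one block of text per page, which fits a pipeline that creates one DocumentReference per page of the source PDF.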


This post assumes familiarity with the AWS services used in the preceding workflow.

By default, Amazon Comprehend Medical’s built-in natural language processing (NLP) capability within Amazon HealthLake is disabled in your AWS account. To enable it, submit a support case with your account ID, AWS Region, and Amazon HealthLake datastore ARN. For more information, see How do I enable HealthLake’s built-in natural language processing feature?

See the GitHub repository for more deployment details.

Deploy the solution architecture

To configure the solution, follow these steps:

  1. Clone the GitHub repository, run cdk deploy PdfMapperToFhirWorkflow from your command prompt or terminal, and follow the README file. The deployment takes about 30 minutes.
  2. In the Amazon S3 console, navigate to the bucket starting with pdfmappertofhirworkflow-, which was created as part of cdk deploy.
  3. Inside the bucket, create a folder called uploads and upload the sample PDF (SampleMedicalRecord.pdf).

As soon as the document upload is successful, it will trigger the pipeline and you can start seeing data in Amazon HealthLake, which you can query using various AWS tools.
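If you prefer to script the upload instead of using the console, the following sketch locates the bucket by its prefix and uploads the sample PDF under uploads/. It assumes that boto3 is installed, that AWS credentials are configured, and that only one bucket matches the prefix. The helper names are hypothetical.

```python
def find_workflow_bucket(bucket_names, prefix="pdfmappertofhirworkflow-"):
    """Return the first bucket name that starts with the stack's prefix, or None."""
    for name in bucket_names:
        if name.startswith(prefix):
            return name
    return None


def upload_sample(pdf_path="SampleMedicalRecord.pdf"):
    """Upload the sample PDF under uploads/ in the workflow bucket."""
    # Imported lazily so find_workflow_bucket works without the AWS SDK.
    import boto3

    s3 = boto3.client("s3")
    names = [bucket["Name"] for bucket in s3.list_buckets()["Buckets"]]
    bucket = find_workflow_bucket(names)
    if bucket is None:
        raise RuntimeError("No bucket with the expected prefix was found")
    # Keep only the file name so the object lands directly under uploads/.
    key = "uploads/" + pdf_path.rsplit("/", 1)[-1]
    s3.upload_file(pdf_path, bucket, key)
    return bucket, key
```

Calling upload_sample() from a directory that contains SampleMedicalRecord.pdf would then start the pipeline in the same way as the console upload.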

Check the data

To explore your data, follow these steps:

  1. In the CloudWatch console, search for the HealthlakeTextract log group.
  2. In the log group details, note the unique ID of the document you processed.
  3. In the Amazon HealthLake console, choose Data Stores in the navigation pane.
  4. Select your data store and choose Run query.
  5. For Query type, choose Search with GET.
  6. For Resource type, choose DocumentReference.
  7. For Search parameters, enter the parameter as relatesto and the value as DocumentReference/Unique ID.
  8. Choose Run query.
  9. In the Response body section, collapse the resource sections to see only the six resources that were created for the six-page PDF document.
  10. The screenshot below shows integrated analytics with Amazon Comprehend Medical and NLP capabilities. The screenshot on the left is the source PDF; the screenshot on the right is the NLP output from Amazon HealthLake.
  11. You can also run a query with Query type set as Read and Resource type set as Condition, using the appropriate resource ID.

    The following screenshot shows the query results.
  12. In the Athena console, run the following query:
    SELECT * FROM "healthlakestore"."documentreference";

In the same way, you can query MedicationStatement, Condition, and Observation resources.
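The console search on DocumentReference in the steps above corresponds to a FHIR search request against the data store endpoint. The sketch below only builds the request URL and reads resource IDs from a search Bundle. It assumes the standard FHIR relatesto search parameter and uses a placeholder endpoint and document ID. It does not send the request, which would need to be signed with AWS Signature Version 4.

```python
from urllib.parse import urlencode


def document_reference_search_url(endpoint: str, document_id: str) -> str:
    """Build a FHIR search URL for DocumentReference resources related to one document."""
    query = urlencode({"relatesto": f"DocumentReference/{document_id}"})
    return f"{endpoint.rstrip('/')}/DocumentReference?{query}"


def resource_ids(bundle: dict) -> list:
    """Return the IDs of the resources contained in a FHIR search Bundle."""
    return [entry["resource"]["id"] for entry in bundle.get("entry", [])]


# Placeholder endpoint and ID; substitute your data store endpoint and the
# unique ID noted from the CloudWatch logs.
url = document_reference_search_url(
    "https://healthlake.us-east-1.amazonaws.com/datastore/EXAMPLE_DATASTORE_ID/r4/",
    "abc-123",
)
print(url)

# Hypothetical response body with two matching resources.
sample_bundle = {
    "resourceType": "Bundle",
    "type": "searchset",
    "entry": [
        {"resource": {"resourceType": "DocumentReference", "id": "page-1"}},
        {"resource": {"resourceType": "DocumentReference", "id": "page-2"}},
    ],
}
print(resource_ids(sample_bundle))
```

For the six-page sample document, a helper like resource_ids would be expected to return six IDs, one per page.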

Clean up

After you are done using this solution, run cdk destroy PdfMapperToFhirWorkflow to make sure you don’t incur additional charges. For more information, see AWS CDK Toolkit (cdk command).


AWS AI services and Amazon HealthLake can help store, transform, query, and analyze information from unstructured health data. Although this post only covered a clinical chart PDF, you can extend the solution to other types of PDFs, images, and handwritten healthcare notes. After the data is extracted in text form, analyzed into discrete data elements using Amazon Comprehend Medical, and stored in Amazon HealthLake, it can be further enriched by downstream systems to generate meaningful and actionable health information and, ultimately, improve patient health outcomes.

The proposed solution does not require you to deploy or maintain server infrastructure. All services are either managed by AWS or serverless. With AWS’s pay-as-you-go billing model and its depth and breadth of services, the cost and effort of initial setup and experimentation are significantly lower than with traditional on-premises alternatives.

Additional resources

To learn more about Amazon HealthLake, see the following:

About the Authors

Shravan Vurputoor is a Senior Solutions Architect at AWS. As a trusted customer advocate, he helps organizations understand best practices around advanced cloud-based architectures and advises on strategies to help drive successful business outcomes across a broad set of enterprise customers through his passion for educating, training, designing, and building cloud solutions. In his free time, he enjoys reading, spending time with his family, and cooking.

Rafael M. Koike is an AWS Principal Solutions Architect supporting enterprise customers in the Southeast and part of the Storage and Security Technical Field Community. Rafael has a passion for building and his experience in security, storage, networking and application development has been instrumental in helping customers move to the cloud securely and quickly.

Randheer Gehlot is a Principal Customer Solutions Manager at AWS. Randheer is passionate about AI/ML and its application in the HCLS industry. As an AWS builder, he works with large enterprises to rapidly design and implement strategic cloud migrations and build modern, cloud-native solutions.


