Amazon Textract is a machine learning (ML) service that automatically extracts text, handwriting, and data from any document or image. Amazon Textract has a Tables feature within the AnalyzeDocument API that offers the ability to automatically extract tabular structures from any document. In this post, we discuss the improvements made to the Tables feature and how they make it easier to extract information in tabular structures from a wide variety of documents.
Tabular structures in documents such as financial reports, pay stubs, and certificate of analysis files are usually formatted so that the information is easy to interpret. They also often include information such as a table title, table footer, section title, and summary rows within the tabular structure for better readability and organization. Prior to this enhancement, for a document like this, the Tables feature within AnalyzeDocument would have identified these elements as cells, and it did not extract titles and footers that sit outside the table boundaries. In these cases, custom post-processing logic was required to identify this information or extract it separately from the API’s JSON output. With the enhancements to the Tables feature, extracting these aspects of tabular data becomes much simpler.
In April 2023, Amazon Textract introduced the ability to automatically detect titles, footers, section titles, and summary rows present in documents using the Tables feature. In this post, we discuss these improvements and provide examples to help you understand and use them in your document processing workflows. We walk through code examples that call the API and process the response with the Amazon Textract Textractor library.
Solution overview
The image below shows that the updated model identifies not only the table in the document but also all the corresponding headers and footers. This sample financial report document contains the table title, footer, section title, and summary rows.
The Tables feature enhancement adds support for four new elements in the API response, which let you easily extract each of these table elements, and adds the ability to distinguish the type of table.
Table elements
Amazon Textract can identify various components of a table, such as table cells and merged cells. These components, known as Block objects, encapsulate details related to the component, such as bounding geometry, relationships, and confidence score. A Block represents an item recognized in a document within a group of pixels that are close to each other. The following are the new table blocks introduced in this enhancement:
- Table title – A new Block type named TABLE_TITLE, which allows you to identify the title of a particular table. Titles can be one or more lines, which are usually above a table or embedded as a cell within the table.
- Table footers – A new Block type named TABLE_FOOTER, which allows you to identify the footers associated with a given table. Footers can be one or more lines that are usually below the table or embedded as a cell within the table.
- Section title – A new Block type named TABLE_SECTION_TITLE, which allows you to identify whether the detected cell is a section title.
- Summary cells – A new Block type named TABLE_SUMMARY, which allows you to identify whether the cell is a summary cell, such as a cell for the totals on a pay stub.
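As a rough illustration of how these labels could be picked out of a raw AnalyzeDocument response, the following sketch groups text by the four new labels. It assumes a label may surface either as a block's `BlockType` or inside its `EntityTypes` list, and that text is reached through `CHILD` relationships to `WORD` blocks. The `sample_response` dictionary is hand-written for illustration and is not real API output; the helper names are hypothetical.

```python
from collections import defaultdict

# The four labels introduced by the Tables enhancement.
TABLE_LABELS = {"TABLE_TITLE", "TABLE_FOOTER", "TABLE_SECTION_TITLE", "TABLE_SUMMARY"}


def block_text(block, blocks_by_id):
    """Join the text of the blocks referenced by a block's CHILD relationships."""
    words = []
    for relationship in block.get("Relationships") or []:
        if relationship["Type"] != "CHILD":
            continue
        for child_id in relationship["Ids"]:
            child = blocks_by_id.get(child_id, {})
            if "Text" in child:
                words.append(child["Text"])
    return " ".join(words)


def collect_table_elements(response):
    """Group text by label, checking both BlockType and EntityTypes (an assumption)."""
    blocks_by_id = {block["Id"]: block for block in response["Blocks"]}
    found = defaultdict(list)
    for block in response["Blocks"]:
        labels = {block.get("BlockType")} | set(block.get("EntityTypes") or [])
        for label in sorted(labels & TABLE_LABELS):
            found[label].append(block_text(block, blocks_by_id))
    return dict(found)


# Hand-written stand-in for a response; real responses carry many more fields.
sample_response = {
    "Blocks": [
        {"Id": "w1", "BlockType": "WORD", "Text": "Quarterly"},
        {"Id": "w2", "BlockType": "WORD", "Text": "Results"},
        {"Id": "w3", "BlockType": "WORD", "Text": "Total"},
        {"Id": "t1", "BlockType": "TABLE_TITLE",
         "Relationships": [{"Type": "CHILD", "Ids": ["w1", "w2"]}]},
        {"Id": "c1", "BlockType": "CELL", "EntityTypes": ["TABLE_SUMMARY"],
         "Relationships": [{"Type": "CHILD", "Ids": ["w3"]}]},
    ]
}

print(collect_table_elements(sample_response))
```

Checking both fields keeps the sketch tolerant of where a given label appears; the API reference is the authority on which field carries which label.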
Types of tables
When Amazon Textract identifies a table in a document, it extracts all the details of the table into a top-level Block type of TABLE. Tables can come in different shapes and sizes. For example, documents often contain tables that may or may not have a discernible table header. To help distinguish between these types of tables, we’ve added two new entity types for a TABLE Block: SEMI_STRUCTURED_TABLE and STRUCTURED_TABLE. These entity types help you distinguish between a structured table and a semi-structured table.
Structured tables are tables that have clearly defined column headers. With semi-structured tables, by contrast, the data may not follow a strict structure. For example, data may appear in a tabular layout that isn’t a table with defined headers. The new entity types give you the flexibility to choose which tables to keep or drop during post-processing. The image below shows an example of STRUCTURED_TABLE and SEMI_STRUCTURED_TABLE.
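To make the keep-or-drop decision concrete, here is a minimal sketch that splits TABLE blocks by entity type. It assumes the entity type is reported in the `EntityTypes` list of each TABLE block; the function name and the sample dictionary are illustrative only.

```python
def split_tables_by_type(response):
    """Return (structured_ids, semi_structured_ids) for TABLE blocks in a response."""
    structured, semi_structured = [], []
    for block in response.get("Blocks", []):
        if block.get("BlockType") != "TABLE":
            continue
        entity_types = block.get("EntityTypes") or []
        if "STRUCTURED_TABLE" in entity_types:
            structured.append(block["Id"])
        elif "SEMI_STRUCTURED_TABLE" in entity_types:
            semi_structured.append(block["Id"])
    return structured, semi_structured


# Hand-written stand-in for an AnalyzeDocument response (not real output).
sample_response = {
    "Blocks": [
        {"Id": "table-1", "BlockType": "TABLE", "EntityTypes": ["STRUCTURED_TABLE"]},
        {"Id": "table-2", "BlockType": "TABLE", "EntityTypes": ["SEMI_STRUCTURED_TABLE"]},
        {"Id": "line-1", "BlockType": "LINE", "Text": "Notes"},
    ]
}

structured_ids, semi_structured_ids = split_tables_by_type(sample_response)
print(structured_ids, semi_structured_ids)
```

A pipeline that only trusts tables with explicit column headers could, for instance, continue with `structured_ids` and route the rest to manual review.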
Parsing the API output
In this section, we explore how you can use the Amazon Textract Textractor library to post-process the output of the AnalyzeDocument API with the Tables feature enhancements. This allows you to extract relevant information from the tables.
Textractor is a library built to work seamlessly with Amazon Textract APIs and utilities, and to convert the JSON responses returned by the APIs into programmable objects. You can also use it to visualize entities in the document and export the data in formats such as comma-separated values (CSV) files. It is intended to help Amazon Textract customers set up their post-processing pipelines.
In our examples, we use the following sample page from an SEC 10-K filing.
The code below can be found in our GitHub repository. To process this document, we make use of the Textractor library and import it so that we can post-process the API outputs and visualize the data:
The first step is to call Amazon Textract AnalyzeDocument with the Tables feature, denoted by the features=[TextractFeatures.TABLES] parameter, to extract the table information. Note that this method invokes the real-time (or synchronous) AnalyzeDocument API, which supports single-page documents. However, you can use the asynchronous StartDocumentAnalysis API to process multi-page documents (with up to 3,000 pages).
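The call described above might look like the following sketch. The Textractor imports live inside the function so the snippet can be loaded without the library installed; running `analyze_single_page` requires the `amazon-textract-textractor` package and AWS credentials. The `region_name` default, the `choose_api` helper, and both function names are assumptions made for illustration, not part of the library.

```python
def choose_api(page_count):
    """Pick the operation described above for a document with `page_count` pages."""
    if page_count < 1:
        raise ValueError("page_count must be at least 1")
    if page_count > 3000:
        raise ValueError("page_count exceeds the 3,000-page limit")
    # Single pages go to the synchronous API, everything else to the asynchronous one.
    return "AnalyzeDocument" if page_count == 1 else "StartDocumentAnalysis"


def analyze_single_page(file_source, region_name="us-east-1"):
    """Call synchronous AnalyzeDocument with the Tables feature via Textractor."""
    # Imported lazily; needs `pip install amazon-textract-textractor`.
    from textractor import Textractor
    from textractor.data.constants import TextractFeatures

    extractor = Textractor(region_name=region_name)  # region is an assumption
    return extractor.analyze_document(
        file_source=file_source,
        features=[TextractFeatures.TABLES],
        save_image=True,  # keep the page image so entities can be visualized later
    )
```

With a local image, a call such as `document = analyze_single_page("sec_filing.png")` would return the document object discussed next; the file name here is a placeholder.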
The document object contains metadata about the document that can be reviewed. Note that it recognizes a table in the document along with other entities in the document:
Now that we have the API output containing the table information, let’s view the different elements of the table using the response structure discussed above:
The Textractor library highlights the different entities in the detected table with a different color code for each element in the table. Let’s take a closer look at how we can extract each element. The following code snippet shows extracting the table title:
Similarly, we can use the following code to extract table footers. Note that table_footers is a list, which means that there can be one or more footers associated with the table. We can loop through this list to see all the footers present, and as shown in the following code snippet, the output shows three footers:
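A small helper along these lines could gather the title and footers in one place. It relies on the `title` and `footers` attributes of a Textractor table, each exposing `.text`, and guards against a table with no detected title. The function name `describe_table` is hypothetical.

```python
def describe_table(table):
    """Collect the title text and all footer texts of a Textractor table object."""
    # `title` may be None when no title was detected for the table.
    title = table.title.text if table.title is not None else None
    # `footers` is a list, so a table can carry zero, one, or several footers.
    footers = [footer.text for footer in table.footers]
    return {"title": title, "footers": footers}
```

Assuming `document` is the object returned by the analysis call, `describe_table(document.tables[0])` would summarize the first detected table.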
Generating data for subsequent ingestion
The Textractor library also helps you simplify ingesting table data into backend systems or other workflows. For example, you can export the extracted table data to a human-readable Microsoft Excel file. At the time of this writing, this is the only format that supports combined tables.
We can also convert it to a Pandas DataFrame. DataFrame is a popular choice for data manipulation, analysis, and visualization in programming languages such as Python and R.
In Python, DataFrame is a primary data structure in the Pandas library. It is flexible and powerful, and is often the first choice for data analytics professionals for various ML and data analysis tasks. The following code snippet shows how to convert the extracted table information into a DataFrame with a single line of code:
Finally, we can convert the table data into a CSV file. CSV files are often used to ingest data into relational databases or data warehouses. See the following code:
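The three export paths in this section can be sketched together as follows. This assumes a Textractor table object whose `to_excel(filepath=...)` writes a workbook, whose `to_pandas()` returns a DataFrame, and whose `to_csv()` returns the CSV content as a string; the Excel and Pandas paths likely need the corresponding optional dependencies installed. The function name, `out_dir`, and `stem` are illustrative.

```python
from pathlib import Path


def export_table(table, out_dir, stem="table"):
    """Export one table to Excel and CSV files and return a DataFrame as well."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Excel export, the format the post describes as human-readable.
    excel_path = out_dir / f"{stem}.xlsx"
    table.to_excel(filepath=str(excel_path))

    # In-memory DataFrame for further analysis.
    dataframe = table.to_pandas()

    # CSV export; assumes to_csv() returns the CSV text rather than writing a file.
    csv_path = out_dir / f"{stem}.csv"
    csv_path.write_text(table.to_csv(), encoding="utf-8")

    return excel_path, csv_path, dataframe
```

For example, `export_table(document.tables[0], "output", stem="sec_filing")` would write both files under a hypothetical `output` directory.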
Conclusion
The introduction of these new block and entity types (TABLE_TITLE, TABLE_FOOTER, STRUCTURED_TABLE, SEMI_STRUCTURED_TABLE, TABLE_SECTION_TITLE, and TABLE_SUMMARY) represents a significant advance in extracting tabular structures from documents with Amazon Textract.
These tools provide a more nuanced and flexible approach, catering to both structured and semi-structured tables and ensuring that no important data is overlooked, regardless of its location in a document.
This means we can now handle multiple data types and table structures with improved efficiency and accuracy. As we continue to embrace the power of automation in document processing workflows, these improvements will pave the way for more streamlined workflows, greater productivity, and more detailed data analysis. For more information on AnalyzeDocument and the Tables feature, see AnalyzeDocument.
About the authors
Raj Pathak is a Senior Solutions Architect and Technologist specialized in Financial Services (Insurance, Banking, Capital Markets) and Machine Learning. He specializes in Natural Language Processing (NLP), Large Language Models (LLM), and Machine Learning Infrastructure and Operations (MLOps) projects.
Anjan Biswas is a Senior AI Services Solutions Architect with a focus on AI/ML and Data Analytics. Anjan is part of the global AI services team and works with customers to help them understand and develop solutions to business problems with AI and ML. Anjan has over 14 years of experience working with global supply chain, manufacturing and retail organizations and is actively helping customers get started and scale AWS AI services.
Lalita Reddy is a Senior Technical Product Manager with the Amazon Textract team. She focuses on building machine learning-based services for AWS customers. In her spare time, Lalita enjoys playing board games and hiking.