Research Article
Artificial Intelligence in fraud detection: textual analysis of 10-K filings
Florian Ketelaar, Ana Mićković
University of Amsterdam, Amsterdam, Netherlands
Open Access

Abstract

In this paper, we investigate the potential of Artificial Intelligence (AI) in detecting fraud by analyzing linguistic indicators in 10-K filings. We analyze word frequencies (positive, negative, uncertainty, litigious), consistency, and readability in the MD&A sections. A BERT model trained on these factors to predict fraud shows significant promise compared to traditional models. The findings suggest that fraudulent filings tend to contain more positive words, less consistent language, and higher readability. This highlights AI’s practical role in improving fraud detection in financial reports.

Keywords

Fraud detection, Artificial Intelligence, financial reports, textual analysis, BERT model

Relevance to practice

This research demonstrates the practical application of AI, specifically BERT, in enhancing fraud detection in financial reports. By identifying key linguistic indicators of fraud, it provides a tool for auditors and regulators to improve accuracy and efficiency in monitoring and investigating potential financial misconduct.

1. Introduction

Technological advancements have continually reshaped industries. One of the most significant transformations currently underway is the rise of Artificial Intelligence (AI), which is becoming increasingly important across various industries and research fields. AI is already impacting many data-related jobs, and therefore it is no surprise that the audit industry is expected to be affected as well. Studies, such as Hasan (2021), suggest that AI will play a significant role in transforming audit processes.

Consulting companies have started adopting AI tools to improve their audit processes, particularly in areas like fraud detection, which is a major risk for firms due to its potential reputational and legal consequences. As fraud evolves, auditors must innovate. AI tools like HeadStart or Argus are designed to help auditors navigate through complex regulations and analyze entire datasets, rather than just samples, to identify risks, anomalies, and trends (Davenport 2016).

We aim to answer the following research question: What factors does AI detect as potentially fraudulent from specific linguistic patterns within 10-K filings?

Detecting financial fraud correctly and efficiently, particularly in corporate financial statements such as 10-K filings with the SEC (Securities and Exchange Commission), is essential. Traditional detection methods, which rely largely on human analysis and standard statistical techniques, have proven insufficient at identifying fraud. AI, by contrast, offers advanced computational and learning capabilities, making it a valuable tool for auditors. Kureljusic and Karger (2023) find that AI is highly relevant to financial accounting and show that it offers several practical uses and benefits for practitioners.

The existing literature has examined the potential of AI in several domains of financial analysis, but its use in fraud detection within 10-K filings remains under-explored (Craja et al. 2020). The research question is relevant because it tests whether AI can improve on, or even outperform, traditional detection methods. From a societal perspective, effective and automated detection of fraudulent financial statements is key to maintaining confidence in financial reporting. To answer the research question, we use the Management Discussion and Analysis (MD&A) section of 10-K SEC filings, which is well suited to this purpose because AI can analyze the text and identify patterns that may indicate fraudulent behaviour.

2. Literature review

2.1. Fraud

Financial fraud, particularly accounting fraud, is a major concern in manipulating financial statements (Campa et al. 2023). Wang et al. (2006) define fraud as “a deliberate act contrary to law, rule, or policy with intent to obtain unauthorized financial benefit.” Joyce and Biddle (1981) noted that fraud committed by higher-level management is especially hard to detect. Kieso et al. (2020) emphasized that financial fraud involves intentional misstatements or omissions of material information. Beasley et al. (2010) identified improper revenue recognition and asset overstatement as the most common methods of fraud. The Fraud Triangle, outlined by SAS No. 99, describes the three conditions under which fraud occurs: pressure, opportunity, and rationalization.

2.2. Readability of text and sentiment

The readability of financial reports, particularly in the MD&A sections, can provide insights into fraudulent behaviour. The Fog Index, developed by Gunning (1952), measures readability by scoring the complexity of a text, where a higher Fog Index indicates a more difficult-to-read text. Li (2008) found that firms with lower earnings tend to have higher Fog Index scores, suggesting a link between complex, harder-to-read reports and poorer financial performance. Martinc et al. (2021) demonstrated that AI can effectively assess the readability of documents, making it a useful tool for identifying anomalies in financial reports.
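
For reference, the Fog Index is conventionally computed as

    Fog Index = 0.4 × [ (total words / total sentences) + 100 × (complex words / total words) ],

where complex words are those with three or more syllables; the score approximates the years of formal education a reader needs to understand the text on first reading.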

In addition to readability, the sentiment expressed through positive and negative words in a text can signal underlying financial conditions. Positive words convey positive sentiment, while negative words reflect a negative tone. However, as Ghosh et al. (2015) noted, context can alter meaning, for example, sarcastic use of positive words can imply negativity, and phrases like “not bad” can turn negative meanings positive. Loughran and McDonald (2011) developed word lists to capture sentiment in financial disclosures, categorizing words into positive, negative, litigious, and uncertain. Examples include “gains” and “profitability” for positive words, “loss” and “impairment” for negative, “contracts” and “regulatory” for litigious, and “may” and “risk” for uncertain. These sentiment patterns are critical in analyzing the MD&A sections of 10-K filings.

2.3. Artificial Intelligence and its role in accounting

Artificial Intelligence (AI) encompasses advancements like Machine Learning (ML) and Natural Language Processing (NLP), along with traditional statistical methods such as classification and clustering (Sutton et al. 2016). AI models can be trained using various methods, including supervised, semi-supervised, unsupervised, and reinforcement learning (Burkov 2019). In supervised learning, the dataset contains labeled examples, where each label indicates the outcome of interest, such as whether the MD&A section of a 10-K filing is fraudulent.

In this study, NLP is employed to automatically classify companies as fraudulent or non-fraudulent. The model used is based on BERT (Bidirectional Encoder Representations from Transformers), introduced by Devlin (2018). BERT is designed to pre-train deep bidirectional representations from unlabeled text, jointly conditioning on both left and right contexts in all layers. BERT is one of the earliest Large Language Models (LLMs) and can be applied to various tasks such as question answering and language inference without significant modification.

Huang et al. (2023) noted that LLMs, due to their vast number of parameters, can learn semantic and syntactic relationships between words. However, these models are often expensive and difficult to train. Moreover, once trained, LLMs become “black boxes,” making their decision-making processes challenging to interpret (Hassija et al. 2024).

AI has also been widely implemented in accounting, transforming traditional tasks. Studies such as Zhang et al. (2020) show that Big 4 firms have integrated AI technologies to enhance their services through automated data entry, voice analysis, intelligent search engines, and predictive analytics. These innovations allow accountants to focus on higher-value activities like strategic planning and advisory services. AI-based fraud detection methods, such as those used by Brown et al. (2020) and Ikhsan et al. (2022), have demonstrated high accuracy in detecting financial misreporting, thereby improving audit quality.

Ranta et al. (2023) identified several areas where AI has impacted accounting, including changes in the profession, textual analysis of accounting data, and improved prediction methods. Tang et al. (2018) developed an ontology-based fraud detection model using decision trees, achieving 86.67% accuracy in identifying fraudulent financial activity. AI-driven accounting systems, as described by Lee and Tajudeen (2020), have greatly improved productivity, efficiency, and risk management in financial reporting.

Bhattacharya and Mićković (2024) applied BERT to detect fraudulent reporting in MD&A sections of 10-K filings between 1994 and 2013. Their models outperformed others by around 15% in detecting fraud, providing significant economic benefits to regulators, investors, and analysts.

3. Hypotheses development

In the study by Bhattacharya and Mićković (2024), AI algorithms demonstrated significant potential in identifying fraudulent companies from 1994 to 2013, outperforming other models by 15%. With the increasing reliance on AI models for processing Big Data and detecting fraud, it is essential to validate their findings and test whether they hold true for more recent company filings. A key goal of our study is to evaluate the results of Bhattacharya and Mićković (2024) regarding the ability of AI to detect fraudulent activity based on the MD&A sections of 10-K filings. The relevance of this is highlighted by Oyewole et al. (2024), who found that companies increasingly use text-generating tools for financial reporting, making it even more crucial to test the applicability of the findings of Bhattacharya and Mićković (2024) in recent years.

Hypothesis 1: A BERT model remains effective in detecting fraudulent activity in more recent years, with accuracy comparable to or exceeding that reported by Bhattacharya and Mićković (2024).

To detect fraud through word categorization, a model is needed to analyze the language and make well-substantiated conclusions. Agency theory highlights linguistic characteristics that may signal fraudulent behavior. Purda and Skillicorn (2015) demonstrated that a support vector machine (SVM) classifier, using the top 200 fraud-linked words, could effectively identify fraudulent reporting in MD&A sections of 10-K filings. Bhattacharya and Mićković (2024) also provided evidence that fraudulent firms tend to use more positive words while minimizing negative language, suggesting that they do so deliberately to conceal fraudulent activities. Their study relied on Loughran and McDonald’s (2011) word classifications to analyze the frequency of positive and negative words. Additionally, Larcker and Zakolyukina (2012) found that deceptive CEOs often use more positive emotional language, which could help detect fraud, particularly in higher-level management, as noted by Joyce and Biddle (1981).

Building on the work of Purda and Skillicorn (2015), Bochkay et al. (2023), Loughran and McDonald (2011), Larcker and Zakolyukina (2012), Joyce and Biddle (1981), and the suggestions of Bhattacharya and Mićković (2024), we propose the following hypothesis. In this hypothesis, we examine the frequency of positive, negative, litigious, and uncertainty words in fraudulent versus non-fraudulent firms’ MD&A sections.

Hypothesis 2: Firms engaged in fraudulent activities use a different frequency of positive, negative, litigious, and uncertainty words in their MD&A sections compared to non-fraudulent firms.

To further explore the linguistic factors AI detects, we analyze the frequency of specific word categories over time. Huang et al. (2023) suggest that AI can interpret both the semantics and syntactics of text, while Larcker and Zakolyukina (2012) note that deceptive CEOs tend to use more positive emotional language. This indicates that one factor AI may detect is the variance in word category frequency. Fraudulent firms may exhibit more fluctuation in their language to manipulate perceptions, while non-fraudulent firms are likely to use more consistent language over time. This variance in word usage could be a key indicator AI picks up when detecting fraudulent activities.

Hypothesis 3: The frequency of word categories in non-fraudulent filings is more consistent than in fraudulent filings.

In addition to analyzing the frequency and consistency of word categories, it is crucial to consider the readability of the MD&A sections as a potential indicator of fraudulent activity. A common measure of readability is the Fog Index mentioned in Section 2.2 (Gunning 1952). Li (2008) found that companies with lower earnings tend to have a higher Fog Index, which aligns with agency theory (Eisenhardt 1989) in suggesting that firms with lower earnings may have incentives to fraudulently inflate them. Martinc et al. (2021) demonstrated that AI can interpret the readability of documents, while Nakashima et al. (2022) found significant differences in readability between fraudulent and non-fraudulent financial statements in Japan, with fraudulent filings being more difficult to read.

Hypothesis 4: Firms engaged in fraudulent activities have a higher Fog Index in their MD&A sections compared to non-fraudulent firms.

4. Research method

We employ AI models based on machine learning techniques like Natural Language Processing (NLP) and anomaly detection to identify patterns indicative of fraud. Specifically, we use BERT (Bidirectional Encoder Representations from Transformers), designed by Google (Devlin 2018), to understand word context by examining surrounding text. To evaluate the performance of our AI model, we compare it with models from previous studies. The model is trained using the Management Discussion and Analysis (MD&A) sections from 10-K filings. This research aims to explore AI’s potential in financial fraud detection, contributing both theoretical and practical insights into AI’s role in financial oversight.

In our analysis, we use the EDGAR database from the US Securities and Exchange Commission, which contains 10-K annual reports of US firms from 1994 to 2024. These publicly available 10-K reports were downloaded via the University of Notre Dame (Loughran and McDonald 2016). Item 7, the MD&A section, is extracted from these reports for analysis, as prior research shows it provides valuable data for text analysis, fraud detection, and investor information (Bhattacharya and Mićković 2024; Goel and Uzuner 2016; Brown et al. 2020; Purda and Skillicorn 2015). The extracted text is parsed for machine learning analysis, following the method of Bhattacharya and Mićković (2024) described below, who highlight the significance of the MD&A in detecting fraud through AI-based text analysis. According to the SEC, MD&A sections provide critical insights that clarify and supplement financial statements (Cole and Jones 2004).

Our dataset is constructed from the EDGAR database, using 10-K SEC filings from 1994 to 2021, pre-parsed by the University of Notre Dame (Loughran and McDonald 2016). The AAER dataset (Dechow et al. 2011), which tracks material fraud occurrences as reported by the SEC, is updated through 2021, serving as our cutoff period. This differs from Bhattacharya and Mićković (2024), who used an earlier version of the AAER with a 2013 cutoff. The AAER database contains detailed information about fraud cases, including managers involved and the periods during which the fraud took place, though there is typically a three-year time lag due to processing. Studies, such as Bertomeu et al. (2021), show that models trained with the AAER dataset are more effective at detecting misstatements.

For our supervised learning model, we label examples as fraudulent or not by matching the Central Index Key (CIK), a unique identifier for U.S. firms, between the AAER database and the MD&A sections of 10-K filings. We extract the MD&A text from Item 7 of the 10-K annual reports. A search algorithm identifies the CIK so the MD&A text can be linked to the correct company. If the CIK is found in the AAER database, we label the firm as fraudulent (Fraud = 1); otherwise, it is labeled as non-fraudulent (Fraud = 0). We label all filings by a firm identified as fraudulent in a given year as fraudulent, under the assumption that fraudulent activity may span multiple years and may not be isolated to the filing flagged by regulators. While this may overestimate the number of fraudulent filings, it aligns with prior work suggesting that fraud is often persistent and difficult to detect in a single filing.
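
A minimal sketch of this labeling step, assuming the parsed MD&A sections and the AAER records are available as pandas DataFrames; the file and column names are illustrative, not the exact ones used in the study:

    import pandas as pd

    filings = pd.read_csv("mdna_sections.csv")   # assumed columns: cik, fiscal_year, mdna_text
    aaer = pd.read_csv("aaer_cases.csv")         # assumed column: cik, plus case details

    # A filing is labeled fraudulent (Fraud = 1) when its firm's CIK appears in the
    # AAER database; all other filings are labeled non-fraudulent (Fraud = 0).
    fraud_ciks = set(aaer["cik"])
    filings["fraud"] = filings["cik"].isin(fraud_ciks).astype(int)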

The dataset includes a total of 123,415 filings from 1994 to 2021, with 5,785 identified as fraudulent, representing approximately 4.7% of the total. Figure 1 visually represents this data, showing the non-fraudulent filings per year on the left vertical axis and fraudulent filings on the right vertical axis. For brevity, we have omitted the detailed table of numbers from the manuscript; however, it is available upon request.

Figure 1.

Fraudulent vs. non-fraudulent 10-K filings per year. Note: The blue line represents the number of non-fraudulent filings per year, shown on the left vertical axis, while the red line represents the number of fraudulent filings per year, shown on the right vertical axis.

4.1. Descriptive statistics

To understand the sample, we calculated the total number of words, sentences, and words per sentence in the MD&A sections. We found that the number of words per year in MD&A sections increased significantly over time. Before the year 2000, MD&A sections contained around 2,000 to 3,000 words per year, but after 2015 this range grew to 12,000 to 14,000 words, an increase of nearly 10,000 words over roughly 15 years. This aligns with the findings of Brown and Tucker (2011), who suggest that the increase in word count may result from managers using boilerplate disclosures, which are standardized and provide little firm-specific information, reducing the overall usefulness of the MD&A.

On average, the MD&A sections in our dataset contain 8,854 words, 498 sentences, and 17 words per sentence. We observed that both the number of sentences and the average sentence length increased over time, with the rise in total sentences likely explained by the increase in total words. However, the increase in sentence length was an unexpected finding, potentially due to the inclusion of boilerplate information, as noted by Brown and Tucker (2011).

Further, Figure 2 provides a visual comparison of the word count between fraudulent and non-fraudulent MD&A sections. Interestingly, fraudulent filings have an average of 12,482 words, exceeding the 8,926 word average for non-fraudulent filings. This contradicts Nakashima et al. (2022), who found that MD&A disclosures were insignificantly shorter for Japanese fraudulent firms. One possible explanation is that fraudsters may attempt to bury fraudulent activities in longer texts, as readers are less likely to scrutinize lengthy reports. This is supported by Hancock (2007), who suggests that liars tend to use more words, which could explain our findings.

Figure 2.

Comparison of word count in fraudulent vs. non-fraudulent MD&A sections. Note: The figure compares the average word count in MD&A sections of fraudulent (red line) and non-fraudulent (blue line) 10-K filings, with the overall average word count (green line) included for reference.

4.2. Data analysis

To validate the findings of Bhattacharya and Mićković (2024), we use the BERT-Base-uncased model from Hugging Face, a pre-trained language model. Similar to Bhattacharya and Mićković (2024), we apply the WordPiece tokenizer, which splits text into tokens by removing punctuation and identifying word roots. Following Devlin (2018) and Bhattacharya and Mićković (2024), we use a classification token (CLS) at the start and a separation token (SEP) at the end of each input, ensuring sequences do not exceed the 512-token limit. We fine-tune the BERT models using Google Colab Pro’s Nvidia Tesla A100 GPU. The fine-tuning process is done with a learning rate of 2e-5, the AdamW optimizer, over 3 epochs, and with a batch size of 16.
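
A condensed sketch of this fine-tuning setup using the Hugging Face transformers library and PyTorch; the toy inputs and the bare training loop are illustrative simplifications of the procedure described above:

    import torch
    from torch.utils.data import DataLoader, TensorDataset
    from transformers import BertTokenizerFast, BertForSequenceClassification

    # Placeholder inputs; in the study these are MD&A texts and 0/1 fraud labels.
    texts = ["first example MD&A text", "second example MD&A text"]
    labels = [0, 1]

    tokenizer = BertTokenizerFast.from_pretrained("bert-base-uncased")
    model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

    # WordPiece tokenization; [CLS] and [SEP] are added automatically and the
    # input is truncated to BERT's 512-token limit.
    enc = tokenizer(texts, truncation=True, max_length=512,
                    padding="max_length", return_tensors="pt")
    dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], torch.tensor(labels))
    loader = DataLoader(dataset, batch_size=16, shuffle=True)

    # Hyperparameters as reported: AdamW optimizer, learning rate 2e-5, 3 epochs, batch size 16.
    optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
    model.train()
    for epoch in range(3):
        for input_ids, attention_mask, y in loader:
            optimizer.zero_grad()
            out = model(input_ids=input_ids, attention_mask=attention_mask, labels=y)
            out.loss.backward()
            optimizer.step()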

Bhattacharya and Mićković (2024) approach fraud detection as a ranking task, using rank averaging to ensure that fraudulent samples are ranked higher than non-fraudulent ones, which reduces variability in predictions. They combine predictions from two models, BERTfirst and BERTlast, trained on the first and last 512 tokens of the MD&A section, respectively, to capture both introductory summaries and future outlooks. To replicate their approach, we construct two models (BERTfirst and BERTlast), train them on the first and last 512 tokens of the MD&A sections, and then average their predictions. Our final prediction is the rank average of the outputs from BERTfirst and BERTlast, where predfinal = 1/2 * rank(predfirst) + 1/2 * rank(predlast).
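
A minimal sketch of this rank-averaging step, assuming pred_first and pred_last are arrays of fraud scores produced by BERTfirst and BERTlast for the same test filings:

    import numpy as np
    from scipy.stats import rankdata

    def rank_average(pred_first, pred_last):
        # pred_final = 1/2 * rank(pred_first) + 1/2 * rank(pred_last)
        return 0.5 * rankdata(pred_first) + 0.5 * rankdata(pred_last)

    pred_final = rank_average(np.array([0.10, 0.80, 0.40]),
                              np.array([0.30, 0.60, 0.90]))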

To validate the model and test Hypothesis 1 while staying consistent with Bhattacharya and Mićković (2024), we use a rolling window of five consecutive years for training, followed by the subsequent year as the test set. The validation set includes the years 1994 to 1999, in line with their study, to optimize our model parameters. For example, we train the model on data from 2000 to 2004 and test it on 2005. This approach is applied from 1994 to 2021, giving us a 23-year test period that extends beyond their 2013 cutoff. For evaluation, we use the area under the ROC curve (AUC) as our primary metric. Given the class imbalance common in fraud detection, AUC is suitable as it measures the likelihood that a randomly selected fraud sample is ranked higher than a non-fraud sample, ensuring comparability with Bhattacharya and Mićković (2024).
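
Continuing from the labeled filings DataFrame sketched above, the rolling-window evaluation can be expressed roughly as follows; fit_and_predict is a hypothetical helper that fine-tunes BERTfirst and BERTlast on the training filings and returns their rank-averaged scores for the test filings:

    from sklearn.metrics import roc_auc_score

    auc_per_year = {}
    for test_year in range(1999, 2022):              # assumed test years up to 2021
        # Train on the five preceding years, e.g. 2000-2004, and test on 2005.
        train = filings[filings["fiscal_year"].between(test_year - 5, test_year - 1)]
        test = filings[filings["fiscal_year"] == test_year]
        scores = fit_and_predict(train, test)        # hypothetical wrapper around the BERT pipeline
        auc_per_year[test_year] = roc_auc_score(test["fraud"], scores)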

To test Hypotheses 2 and 3, we use the word list from Loughran and McDonald (2023). We first split the dataset into fraud and non-fraud instances and then filter the text based on the words from the Loughran and McDonald list, which classifies words into four categories: Positive, Negative, Litigious, and Uncertain. The dataset is organized by year, and we extract the corresponding MD&A sections for each year. We then count the occurrences of each word per year and categorize them according to Loughran and McDonald’s classification. Additionally, we calculate the total number of words per year to determine the percentage occurrence of each category annually.
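
A sketch of this counting step; the placeholder word sets below contain only the example words from Section 2.2, whereas the study uses the full Loughran and McDonald lists:

    import re

    # Placeholder category sets (the full Loughran-McDonald lists in practice).
    lm_categories = {
        "positive": {"gains", "profitability"},
        "negative": {"loss", "impairment"},
        "litigious": {"contracts", "regulatory"},
        "uncertainty": {"may", "risk"},
    }

    def category_percentages(text):
        """Percentage of MD&A words falling in each Loughran-McDonald category."""
        tokens = re.findall(r"[a-z]+", text.lower())
        total = len(tokens) or 1
        return {cat: 100 * sum(tok in words for tok in tokens) / total
                for cat, words in lm_categories.items()}

    print(category_percentages("Profitability gains may offset the impairment loss."))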

To test Hypothesis 4, we use the Fog Index developed by Gunning (1952). We split the dataset into fraudulent and non-fraudulent filings per year and calculated the average Fog Index for each year, following Gunning’s method. We use an Ordinary Least Squares (OLS) regression analysis to test whether there are significant differences in word categories and readability between fraudulent and non-fraudulent MD&A sections. We also examine if these differences have changed over time, particularly with the rise of AI-generated financial reports.
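
A simplified Fog Index computation following the formula in Section 2.2; the vowel-group syllable counter is a rough heuristic for identifying complex words (three or more syllables), not necessarily the exact procedure used in the study:

    import re

    def count_syllables(word):
        # Crude heuristic: count groups of consecutive vowels.
        return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

    def fog_index(text):
        sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
        words = re.findall(r"[A-Za-z]+", text)
        if not sentences or not words:
            return 0.0
        complex_words = [w for w in words if count_syllables(w) >= 3]
        return 0.4 * (len(words) / len(sentences)
                      + 100 * len(complex_words) / len(words))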

Separate OLS regressions are conducted for the Fog Index and the occurrence of word categories (positive, negative, uncertainty, and litigious) to test for significant differences. We compare trends between two periods, 1994–2007 and 2008–2021, to see if word category usage has shifted. The regression includes variables for baseline measures, whether the MD&A section is non-fraudulent, year, and an interaction term to assess how trends differ between fraudulent and non-fraudulent sections over time.
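
One of these regressions can be sketched with statsmodels as follows; the yearly panel below is a placeholder with illustrative values, where non_fraud equals 1 for non-fraudulent MD&A sections:

    import pandas as pd
    import statsmodels.formula.api as smf

    # Placeholder yearly panel: one row per year per group, with illustrative Fog Index values.
    yearly_panel = pd.DataFrame({
        "year":      [1994, 1994, 1995, 1995, 1996, 1996],
        "non_fraud": [1, 0, 1, 0, 1, 0],
        "fog_index": [19.1, 18.7, 19.3, 18.9, 19.5, 19.0],
    })

    # The interaction term tests whether the time trend differs between
    # fraudulent and non-fraudulent MD&A sections.
    model = smf.ols("fog_index ~ non_fraud + year + non_fraud:year", data=yearly_panel).fit()
    print(model.summary())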

5. Results

5.1. Hypothesis 1

We formulated Hypothesis 1 to assess whether a BERT model remains effective in detecting fraudulent activity in recent years, with accuracy comparable to or exceeding that of Bhattacharya and Mićković (2024). To gauge performance, we use their results as a benchmark, directly comparing our final model to their BERT models.

Table 1 displays the yearly and average AUC scores for our models from 2014 to 2021. Our BERT models, trained on MD&A sections from 10-K filings, are evaluated against the benchmark models. Our BERTfirst, BERTlast, and BERTfinal models achieved average AUC scores of 0.844, 0.661, and 0.797, respectively, compared to Bhattacharya and Mićković (2024), which reported AUC scores of 0.804, 0.816, and 0.826. While the AUC results for 2000–2013 are omitted for brevity, they are available upon request and show comparable performance between our models and Bhattacharya and Mićković (2024).

Table 1.

AUC score per BERT model per year.

AUC        2014   2015   2016   2017   2018   2019   2020   2021   Average
BERTfirst  0.738  0.892  0.846  0.878  0.824  0.878  0.823  0.869  0.844
BERTlast   0.521  0.665  0.711  0.636  0.648  0.658  0.736  0.711  0.661
BERTfinal  0.657  0.825  0.842  0.814  0.771  0.823  0.817  0.827  0.797

We tested Hypothesis 1 by comparing our BERT models to the benchmark models of Bhattacharya and Mićković (2024). The results indicate no significant differences between the models for the period 2000–2013, confirming that our BERTfirst model performs comparably. Furthermore, our BERTfirst model continues to be effective in detecting fraud from 2014–2021, with an average AUC score of 0.844, exceeding the benchmark’s 0.804. This confirms Hypothesis 1, showing that BERT remains effective in recent years.

5.2. Hypothesis 2

Hypothesis 2 postulates that firms engaged in fraudulent activities use a different frequency of positive, negative, litigious, and uncertainty words in their MD&A sections compared to non-fraudulent firms. To test this, we examine the occurrences of these specific word categories in both fraudulent and non-fraudulent MD&A sections, aiming to determine whether linguistic characteristics differ between the two groups. Figure 3 presents the average percentage of each word category—positive, negative, litigious, and uncertainty—used per year from 1994 to 2021 for both fraudulent and non-fraudulent filings.

Figure 3.

Percentage of word occurrences in MD&A sections for fraudulent vs. non-fraudulent companies. Note: The y-axis represents the average yearly percentage of each word category occurring in the MD&A sections. The word categories (Positive, Negative, Uncertainty, and Litigious) are based on the classifications by Loughran and McDonald (2023).

We also conduct Ordinary Least Squares (OLS) regression analyses to test for significant differences in word category usage between fraudulent and non-fraudulent MD&A sections over the period 1994–2021. The results show that fraudulent MD&A sections contain significantly more positive words, with the difference being statistically significant at the 99% level. This finding aligns with the suggestion of Bhattacharya and Mićković (2024), confirming that fraudulent filings use more positive language. A separate test for the 1994–2013 period also showed a significant difference at the 5% level (not included in the manuscript for brevity).

However, the analysis reveals no significant differences in the usage of negative, uncertainty, and litigious words between fraudulent and non-fraudulent firms at the 95% significance level, leading us to partially reject H2 for these categories. There is some indication, at the 90% significance level, that non-fraudulent MD&A sections contain fewer negative words.

5.3. Hypothesis 3

In Hypothesis 3, we propose that the frequency of word categories in non-fraudulent filings is more consistent over time compared to fraudulent filings. To test this, we examine whether the trend in the frequency of word categories in MD&A sections differs between non-fraudulent and fraudulent firms.

Interestingly, Figure 3 reveals a stabilizing trend in the occurrence of word categories in recent years. This aligns with the study by Oyewole et al. (2024), which suggests that the use of AI-generated financial reports may contribute to this stabilization. To investigate this, we tested whether the slope of the regression for word usage trends was less steep during the last 14 years (2008–2021) compared to the first 14 years (1994–2007). The difference in slopes is shown in Figure 4, which illustrates the trends in positive, negative, uncertainty, and litigious word usage across both periods. The figure presents scatter plots and trend lines, with the first time period T1 (1994–2007) indicated in red and the second time period T2 (2008–2021) in blue, allowing for a clear visual comparison of word usage over time for each category.

Figure 4.

Trends in word usage for positive, negative, uncertainty, and litigious words across two time periods. Note: The figure presents scatter plots and trend lines for the usage of positive, negative, uncertainty, and litigious words in MD&A sections across two periods: T1 (1994–2007) shown in red, and T2 (2008–2021) shown in blue. It provides a visual comparison of word usage trends over time for each category, illustrating the difference in slopes between the two periods.

Our regression analysis (not tabulated for brevity) examined word category trends (positive, negative, uncertainty, and litigious) between two periods (T1 and T2), as shown in Figure 4.

The results show a significant decline in the use of positive words in both fraudulent and non-fraudulent MD&A during T1. However, in T2, the decline in positive word usage was less steep for fraudulent MD&A, indicating some stabilization, while the trend remained insignificant for non-fraudulent filings.

Negative word usage was significantly higher in the first period for both groups. Interestingly, there was a significant reduction in the use of negative words in the last period, suggesting a shift in language over time.

Both fraudulent and non-fraudulent MD&A showed an increase in uncertainty words during T1, but this trend reversed to a decline in T2. Litigious word usage showed an insignificant positive trend in fraudulent MD&A during T1, which shifted to a negative trend in T2. This suggests that fraudulent filings may be using less legalistic language over time, while the results for non-fraudulent filings were not significant.

5.4. Hypothesis 4

In Hypothesis 4, we propose that firms engaged in fraudulent activities have a higher Fog Index, indicating more complex and difficult to read MD&A sections compared to non-fraudulent firms. To test this, we calculated the average Gunning Fog Index for the MD&A sections, with the results visualized in Figure 5. Contrary to our hypothesis, the results show that the average Fog Index of non-fraudulent firms is actually higher than that of fraudulent firms. This contradicts our expectation and the findings of Nakashima et al. (2022), who suggested that fraudulent filings are harder to read, and Moffitt and Burns (2009), who found that fraudulent 10-Ks tend to contain more complex language.

Figure 5.

Average Fog Index per year for fraudulent vs. non-fraudulent MD&A sections. Note: The y-axis displays the Fog Index, which measures the readability of the text. The blue solid line represents the average Fog Index for non-fraudulent MD&A sections per year, while the red solid line represents the average Fog Index for fraudulent MD&A sections per year.

We also conducted a regression analysis to further explore the relationship between the Fog Index and fraudulent vs. non-fraudulent MD&A sections over time. While Figure 5 suggests that the average Fog Index for fraudulent firms is lower in many years, the difference is not statistically significant. Therefore, we cannot conclusively state that firms engaging in fraudulent activities have a significantly lower Fog Index than non-fraudulent firms. However, the results do indicate a statistically significant increase in the Fog Index over time, suggesting that MD&A sections are becoming more difficult to read overall.

6. Discussion and conclusion

Our research explores the use of AI, specifically BERT models, to detect fraud through textual analysis of MD&A sections in 10-K filings. The study tests four hypotheses on the effectiveness of BERT models and linguistic indicators of fraudulent activity.

Hypothesis 1 confirmed that BERT models remain effective in detecting fraud, replicating the findings of Bhattacharya and Mićković (2024). This highlights the continued relevance of BERT models for fraud detection, aligning with Kureljusic and Karger (2023), who emphasize AI’s usefulness in financial accounting. Given the findings of Oyewole et al. (2024) on AI-generated reports, it is possible that the AI models themselves are influencing the linguistic patterns detected by BERT, suggesting an area for further research.

Hypothesis 2 tested whether fraudulent MD&A sections use a higher frequency of positive, litigious, and uncertainty words, and fewer negative words. The results supported the hypothesis for positive words, consistent with Bhattacharya and Mićković (2024), suggesting that fraudulent firms use positive language to mask fraud. However, contrary to expectations, fraudulent MD&A sections also contained more negative words, indicating a more complex linguistic strategy. No significant differences were found for uncertainty and litigious words.

Hypothesis 3 explored whether non-fraudulent MD&A sections show more consistent trends in word categories over time compared to fraudulent sections. While we found some evidence of a decreasing trend in litigious word usage in non-fraudulent filings, the trend lines for both fraudulent and non-fraudulent sections became more stable in recent years. This stabilization could reflect the increasing use of AI-generated reports, as suggested by Oyewole et al. (2024).

Hypothesis 4 examined whether fraudulent MD&A sections are less readable, as measured by the Fog Index. Contrary to the hypothesis, non-fraudulent MD&As had a slightly higher Fog Index, although the difference was not statistically significant. This finding contrasts with Nakashima et al. (2022), who found that fraudulent reports were harder to read. The overall increase in the Fog Index over time suggests that corporate reports are becoming more complex, possibly due to regulatory or industry changes.

A limitation of this study, noted by Larcker and Zakolyukina (2012) and Ghosh et al. (2015), is the reliance on word-count methods based on psychosocial dictionaries, which may not capture contextual meaning. Future research could improve word categorization by combining sentiment analysis with word structures, as suggested by Ghosh et al. (2015). Another limitation is our reliance on the AAER database to identify fraudulent firms, which could benefit from broader validation.

In summary, this study validates and extends the findings of Bhattacharya and Mićković (2024), confirming that BERT models are highly effective in detecting fraudulent activity. The results indicate that fraudulent MD&A sections contain more positive and negative words, though no significant differences were found for uncertainty and litigious words. The analysis of readability found no significant differences between fraudulent and non-fraudulent MD&As, though readability has decreased over time. This study offers valuable insights into the role of AI in fraud detection and provides practical implications for auditors and regulators. Future research should further investigate linguistic features associated with fraud and explore the impact of AI-generated financial reports on fraud detection.

F.H.E. Ketelaar MSc – Florian¹ is an AI enthusiast and auditor at a Big 4 audit firm.

Dr. A. Mićković – Ana is an Assistant Professor of Accounting at the University of Amsterdam.

Note

1. Florian Ketelaar is one of the winners of the MAB Thesis Award 2024. This article is based on his master’s thesis.

References

  • Beasley MS, Carcello JV, Hermanson DR, Neal TL (2010) Fraudulent Financial Reporting 1998–2007: An Analysis of US Public Companies. In: COSO Report.
  • Brown NC, Crowley RM, Elliott WB (2020) What Are You Saying? Using topic to Detect Financial Misreporting. Journal of Accounting Research 58(1): 237–291. https://doi.org/10.1111/1475-679X.12294
  • Burkov A (2019) The Hundred-Page Machine Learning Book. Andriy Burkov.
  • Campa D, Quagli A, Ramassa P (2023) The roles and interplay of enforcers and auditors in the context of accounting fraud: a review of the accounting literature. Journal of Accounting Literature 47(5): 151–183. https://doi.org/10.1108/JAL-07-2023-0134
  • Devlin J (2018) Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.
  • Goel S, Uzuner O (2016) Do sentiments matter in fraud detection? Estimating semantic orientation of annual reports. Intelligent Systems in Accounting, Finance and Management 23(3): 215–239. https://doi.org/10.1002/isaf.1392
  • Ghosh D, Guo W, Muresan S (2015) Sarcastic or not: Word embeddings to predict the literal or sarcastic meaning of words. In: Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, 1003–1012. https://doi.org/10.18653/v1/D15-1116
  • Gunning R (1952) The technique of clear writing.
  • Hancock JT (2007) Digital deception. Oxford Handbook of Internet Psychology: 289–301.
  • Hassija V, Chamola V, Mahapatra A, Singal A, Goel D, Huang K, Scardapane S, Spinelli I, Mahmud M, Hussain A (2024) Interpreting black-box models: a review on explainable artificial intelligence. Cognitive Computation 16(1): 45–74. https://doi.org/10.1007/s12559-023-10179-8
  • Huang AH, Wang H, Yang Y (2023) FinBERT: A large language model for extracting information from financial text. Contemporary Accounting Research 40(2): 806–841. https://doi.org/10.1111/1911-3846.12832
  • Joyce EJ, Biddle GC (1981) Anchoring and adjustment in probabilistic inference in auditing. Journal of Accounting Research 19(1): 120–145. https://doi.org/10.2307/2490965
  • Ikhsan WM, Ednoer EH, Kridantika WS, Firmansyah A (2022) Fraud detection automation through data analytics and artificial intelligence. Riset 4(2): 103–119. https://doi.org/10.37641/riset.v4i2.166
  • Kieso DE, Warfield TD, Weygandt JJ (2020) Intermediate accounting: IFRS edition.
  • Kureljusic M, Karger E (2023) Forecasting in financial accounting with artificial intelligence – A systematic literature review and future research agenda. Journal of Applied Accounting Research 25(1): 81–104. https://doi.org/10.1108/JAAR-06-2022-0146
  • Lee CS, Tajudeen FP (2020) Usage and impact of artificial intelligence on accounting: Evidence from Malaysian organisations. Asian Journal of Business and Accounting 13(1): 213–239. https://doi.org/10.22452/ajba.vol13no1.8
  • Martinc M, Pollak S, Robnik-Šikonja M (2021) Supervised and Unsupervised Neural Approaches to Text Readability. Computational Linguistics 47(1): 141–179. https://doi.org/10.1162/coli_a_00398
  • Moffitt K, Burns MB (2009) What does that mean? Investigating obfuscation and readability cues as indicators of deception in fraudulent financial reports. AMCIS 2009 Proceedings, 399.
  • Nakashima M, Hirose Y, Hirai H (2022) Fraud Detection by Focusing on Readability of MD&A Disclosure: Evidence from Japan. Journal of Forensic and Investigative Accounting 14(2).
  • Oyewole AT, Adeoye OB, Addy WA, Okoye CC, Ofodile OC, Ugochukwu CE (2024) Automating financial reporting with natural language processing: A review and case analysis. World Journal of Advanced Research and Reviews 21(3): 575–589. https://doi.org/10.30574/wjarr.2024.21.3.0688
  • Purda L, Skillicorn D (2015) Accounting variables, deception, and a bag of words: Assessing the tools of fraud detection. Contemporary Accounting Research 32(3): 1193–1223. https://doi.org/10.1111/1911-3846.12089
  • Ranta M, Ylinen M, Järvenpää M (2023) Machine learning in management accounting research: Literature review and pathways for the future. European Accounting Review 32(3): 607–636. https://doi.org/10.1080/09638180.2022.2137221
  • Sutton SG, Holt M, Arnold V (2016) “The reports of my death are greatly exaggerated”—Artificial intelligence research in accounting. International Journal of Accounting Information Systems 22: 60–73. https://doi.org/10.1016/j.accinf.2016.07.005
  • Tang XB, Liu GC, Yang J, Wei W (2018) Knowledge-based financial statement fraud detection system: Based on an ontology and a decision tree. Knowledge Organization 45(3): 205–219. https://doi.org/10.5771/0943-7444-2018-3-205
  • Wang JH, Liao YL, Tsai TM, Hung G (2006) Technology-based financial frauds in Taiwan: issues and approaches. IEEE International Conference on Systems, Man and Cybernetics, Vol. 2, 1120–1124. IEEE. https://doi.org/10.1109/ICSMC.2006.384550