Working Papers

Analysts' Belief Formation in Their Own Words

Abstract:

I study the formation of analysts' subjective beliefs about firms' earnings using analysts' own written text from over 1.1 million equity research reports. Text in analyst reports strongly predicts analysts' forecast revisions and forecast errors. Using a large language model, I distinguish between factual and subjective content and distill it into interpretable topics on firm fundamentals. I document three sets of novel findings regarding analysts' subjective beliefs. (1) Analysts' attention allocation varies significantly across business cycles, firms, and forecast horizons: analysts pay more attention to profitability information during booms and to financial conditions and macroeconomic information during recessions. These patterns align with a model of rational inattention. (2) I introduce a novel text-instrumented Coibion-Gorodnichenko regression to study analysts' misreaction to specific information. I find pervasive underreaction across topics in analysts' short-term earnings forecasts, while their overreaction in long-term forecasts is significant mainly for business operations, corporate management, and macroeconomic information. This pattern is consistent with a "story-statistics gap" in associative memory being an important driver of overreaction to qualitative, story-like information. (3) I find that both asymmetric information and differences of opinion contribute to disagreement in earnings forecasts. Together, these results offer new insights into the formation of subjective beliefs about firms' earnings.
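
For readers unfamiliar with it, the Coibion-Gorodnichenko (CG) regression behind point (2) projects forecast errors on forecast revisions. The notation below is mine, not the paper's; in the text-instrumented version described above, the revision term is instrumented with topic-level text measures so that misreaction can be read topic by topic:

    \underbrace{x_{i,t+h} - F_{i,t}\,x_{i,t+h}}_{\text{forecast error}}
      \;=\; \alpha \;+\; \beta\,\underbrace{\bigl(F_{i,t}\,x_{i,t+h} - F_{i,t-1}\,x_{i,t+h}\bigr)}_{\text{forecast revision}}
      \;+\; \varepsilon_{i,t+h}

Here x_{i,t+h} is firm i's realized earnings at horizon h and F_{i,t} denotes the analyst's forecast at time t; \beta > 0 indicates underreaction to new information, while \beta < 0 indicates overreaction.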

APT or “AIPT”? The Surprising Dominance of Large Factor Models  

Abstract:

We introduce artificial intelligence pricing theory (AIPT). In contrast with the APT's foundational assumption of a low-dimensional factor structure in returns, the AIPT conjectures that returns are driven by a large number of factors. We first verify this conjecture empirically and show that nonlinear models with an exorbitant number of factors (many more than the number of training observations or base assets) are far more successful in describing the out-of-sample behavior of asset returns than simpler standard models. We then theoretically characterize the behavior of large factor pricing models and show that the AIPT's "many factors" conjecture faithfully explains our empirical findings, while the APT's "few factors" conjecture is contradicted by the data.
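
For intuition, here is a toy Python sketch (my own illustration, not the authors' code) of a model with far more nonlinear random factors than training observations, estimated by ridge regression through the N x N dual system; all data and tuning values are made up:

    import numpy as np

    rng = np.random.default_rng(0)
    N_train, N_test, d, P = 240, 60, 30, 10_000   # P >> N_train by design

    S = rng.standard_normal((N_train + N_test, d))                  # toy firm signals
    R = S @ rng.standard_normal(d) * 0.01 \
        + rng.standard_normal(N_train + N_test) * 0.05              # toy returns

    W = rng.standard_normal((d, P)) / np.sqrt(d)
    F = np.sin(S @ W)                                               # random nonlinear "factors"

    F_tr, F_te = F[:N_train], F[N_train:]
    R_tr, R_te = R[:N_train], R[N_train:]

    z = 1e-3                                                        # ridge penalty
    # With P >> N, solve the N x N dual system instead of the P x P primal one.
    a = np.linalg.solve(F_tr @ F_tr.T / P + z * np.eye(N_train), R_tr)
    beta = F_tr.T @ a / P                                           # implied factor weights

    print("OOS forecast correlation:", np.corrcoef(F_te @ beta, R_te)[0, 1])

Despite having 10,000 parameters and only 240 observations, the ridge-regularized model produces a well-behaved out-of-sample forecast, which is the flavor of the "many factors" result summarized above.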

On the Testability of the Anchor Words Assumption in Topic Models

Abstract:

Topic models are a simple and popular tool for the statistical analysis of textual data. Their identification and estimation are typically enabled by assuming the existence of anchor words; that is, words that are exclusive to specific topics. In this paper we show that the existence of anchor words is statistically testable: there exists a hypothesis test with correct size that has nontrivial power. This means that the anchor-word assumption cannot be viewed simply as a convenient normalization. Central to our results is a simple characterization of when a column-stochastic matrix with known nonnegative rank admits a separable factorization. We test for the existence of anchor words in two datasets derived from transcripts of the meetings of the Federal Open Market Committee (FOMC), the body of the Federal Reserve System that sets monetary policy in the United States, and reject the null hypothesis that anchor words exist in one of them.
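
To illustrate what separability means in practice, the Python sketch below uses the Successive Projection Algorithm, a standard greedy heuristic for locating anchor words; this is only an illustration of the separable structure, not the formal hypothesis test (with size control) developed in the paper:

    import numpy as np

    def spa_anchors(A, k):
        """Greedily pick k candidate anchor rows of a nonnegative matrix A."""
        R = A.astype(float).copy()
        anchors = []
        for _ in range(k):
            j = np.argmax(np.linalg.norm(R, axis=1))   # most extreme remaining row
            anchors.append(j)
            u = R[j] / np.linalg.norm(R[j])
            R = R - np.outer(R @ u, u)                 # project out its direction
        return anchors

    rng = np.random.default_rng(1)
    k, V = 3, 50
    W = rng.dirichlet(np.ones(k), size=V)              # word-by-topic weights
    W[:k] = np.eye(k)                                  # plant anchor words in rows 0..k-1
    A = W @ rng.random((k, 200))                       # separable word-by-document matrix
    print(sorted(spa_anchors(A, k)))                   # recovers [0, 1, 2] in this noiseless case

Under exact separability the planted anchor rows are recovered; the paper's contribution is to turn the question of whether such rows exist at all into a testable hypothesis.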

The Double-edged Sword of Data Mining: Implications for Asset Pricing and Information Efficiency

Abstract:

Does data mining always increase price efficiency? Not necessarily. I incorporate data mining into a standard asset pricing model and identify a novel cost of complexity that arises endogenously from data mining. When a data miner explores alternative data, she faces a training history that is scarce relative to the number of potential predictors (increasing complexity) and increasing difficulty in extracting useful signals (decreasing returns in data efficacy). Together, the cost of complexity and the decreasing returns in data efficacy imply a finite optimal level of data mining, such that excess data mining lowers price informativeness. Empirically, I provide evidence of decreasing returns in data efficacy in the context of the "factor zoo", and I show that the release of satellite data reduces price informativeness in a difference-in-differences setting.
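
A simple face of the complexity cost can be seen in a toy Python experiment (my example, not the paper's model): with a fixed training history T, expanding the set of candidate predictors eventually hurts out-of-sample fit even though the extra predictors are harmless in population:

    import numpy as np

    rng = np.random.default_rng(2)
    T, T_oos, k_true = 120, 1000, 5                    # fixed history, few true signals

    for P in [5, 20, 60, 110]:                         # growing predictor count
        X = rng.standard_normal((T + T_oos, P))
        b = np.zeros(P); b[:k_true] = 0.5              # only a few predictors matter
        y = X @ b + rng.standard_normal(T + T_oos) * 2.0
        bh = np.linalg.lstsq(X[:T], y[:T], rcond=None)[0]   # OLS on the short history
        e = y[T:] - X[T:] @ bh
        print(P, "predictors -> OOS R^2:", round(1 - e.var() / y[T:].var(), 3))

As P approaches T, the out-of-sample R^2 deteriorates: the scarcer the history relative to the predictor set, the noisier the fitted signal.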

What Drives Trading in Financial Markets? A Big Data Perspective

Abstract:

We use deep Bayesian neural networks to investigate the determinants of trading activity in a large sample of institutional equity portfolios. Our methodology allows us to evaluate hundreds of potentially relevant explanatory variables, estimate arbitrary nonlinear interactions among them, and aggregate them into interpretable categories. The deep learning models predict trading decisions with up to 86% out-of-sample accuracy, with market liquidity and macroeconomic conditions together accounting for most (66-91%) of the explained variance. Stock fundamentals, firm-specific corporate news, and analyst forecasts have comparatively low explanatory power. Our results suggest that market microstructure considerations and macroeconomic risk are central to understanding trading patterns in financial markets.
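
One simple way to attribute predictive power to groups of variables, in the spirit of the aggregation step above, is group permutation importance. The Python sketch below is illustrative only (it uses a random forest stand-in and invented group labels, not the paper's deep Bayesian networks): permute each group jointly and measure the drop in out-of-sample accuracy.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    rng = np.random.default_rng(3)
    n = 5000
    groups = {"liquidity": [0, 1], "macro": [2, 3], "fundamentals": [4, 5]}
    X = rng.standard_normal((n, 6))
    y = (X[:, 0] + 0.8 * X[:, 2]
         + 0.3 * rng.standard_normal(n) > 0).astype(int)   # toy trade/no-trade label

    model = RandomForestClassifier(n_estimators=200, random_state=0).fit(X[:4000], y[:4000])
    base = model.score(X[4000:], y[4000:])
    for name, cols in groups.items():
        Xp = X[4000:].copy()
        Xp[:, cols] = rng.permutation(Xp[:, cols])          # break the group's link to y
        print(name, "accuracy drop:", round(base - model.score(Xp, y[4000:]), 3))

In this toy setup only the "liquidity" and "macro" groups drive the label, so permuting them produces large accuracy drops while permuting "fundamentals" does not.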


The Social Welfare of Stock Market Mispricing

Abstract:

This paper studies the social value of eliminating mispricing in the US stock market. Characterizing a model in which active managers extract abnormal value from trading against mispricing, I show that the mispricing of a stock, relative to a benchmark asset pricing model, exactly equals the marginal social value of trading against that mispricing. Combining conditional mispricing estimates from a novel instrumented factor model with a calibrated price impact function, I find that mispricing relative to the CAPM translates into a welfare cost of about 3.1% of annual US nominal GDP, rising to more than 8% during the Tech Bubble, the Global Financial Crisis, and the recent Covid-19 pandemic. These results suggest a large potential welfare gain from active management that eliminates stock mispricing.
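
As a minimal stand-in for what "mispricing relative to the CAPM" means operationally (this is not the paper's instrumented factor model, and all numbers are invented), one can track a stock's rolling CAPM alpha in Python:

    import numpy as np

    rng = np.random.default_rng(4)
    T = 600
    mkt = rng.standard_normal(T) * 0.04
    r = 0.002 + 1.2 * mkt + rng.standard_normal(T) * 0.03   # toy stock with positive alpha

    window = 120
    alphas = []
    for t in range(window, T):
        x = np.column_stack([np.ones(window), mkt[t - window:t]])
        coef, *_ = np.linalg.lstsq(x, r[t - window:t], rcond=None)
        alphas.append(coef[0])                               # conditional CAPM alpha
    print("mean rolling alpha:", round(float(np.mean(alphas)), 4))

A persistently nonzero conditional alpha is the benchmark-relative mispricing whose marginal social value the model characterizes.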

Publication

Robust Machine Learning Algorithms for Text Analysis 

Abstract:

We study the Latent Dirichlet Allocation model, a popular Bayesian algorithm for text analysis. Our starting point is the generic lack of identification of the model’s parameters, which suggests that the choice of prior matters. We then characterize by how much the posterior mean of a given functional of the model’s parameters varies in response to a change in the prior, and we suggest two algorithms to approximate this range. Both algorithms rely on obtaining multiple nonnegative matrix factorizations of either the posterior draws of the corpus’ population term-document frequency matrix or of its sample analogue. The key idea is to maximize/minimize the functional of interest over all these nonnegative matrix factorizations. To illustrate the applicability of our results, we revisit recent work on the effects of increased transparency on discussions regarding monetary policy decisions in the United States.
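
The Python sketch below is a stripped-down stand-in for the range-approximation idea described above (not the authors' code): compute many nonnegative matrix factorizations of a term-document frequency matrix from different random starts, evaluate the functional of interest at each, and report the min/max as an approximate range.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(5)
    A = rng.random((100, 40))                     # stand-in term-document frequencies
    A /= A.sum(axis=0, keepdims=True)             # column-stochastic, as in a topic model

    def functional(B):                            # e.g., weight of word 0 in topic 0
        return (B / B.sum(axis=0, keepdims=True))[0, 0]

    vals = []
    for seed in range(50):
        nmf = NMF(n_components=3, init="random", random_state=seed, max_iter=500)
        B = nmf.fit_transform(A)                  # word-topic factor from this start
        vals.append(functional(B))
    print("approximate range:", round(min(vals), 3), round(max(vals), 3))

The spread between the minimum and maximum across factorizations is a crude numerical proxy for how much the lack of identification lets the functional move; the paper's algorithms make this idea precise for posterior quantities.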