Published on Nov. 3, 2023
Updated on Dec. 19, 2023
In recent years, large language models (LLMs) such as OpenAI’s ChatGPT have attracted considerable attention, not only for their raw capabilities but also for their potential as surrogate knowledge graphs. Trained on data sources ranging from peer-reviewed research articles to large portions of the web, these models offer promising prospects for tasks that have traditionally been difficult, such as biological pathway extraction. Pathways, which are schematic representations of intermolecular interactions, are pivotal for understanding complex biological processes, elucidating disease mechanisms, and guiding drug development.
However, LLMs also come with challenges, such as model ‘hallucinations’, a consequence of their exposure to data of varying quality. This research uses LLMs, specifically OpenAI’s GPT-3.5-turbo, to mine the gene interactions needed for pathway extraction, using an iterative prompt refinement technique. Benchmarking against the KEGG Pathway Database, we experimented with several prompting strategies targeting gene interactions such as activation, inhibition, and phosphorylation. Preliminary results with direct questioning produced varied F1 scores, which prompted the adoption of role and few-shot prompting techniques. These techniques improved metrics over null prompts, and the iterative refinement algorithm, run with GPT-4, reached peak performance after three iterations. The refined prompt, which combined a specialized role with explanatory text, yielded significant improvements in precision, recall, and F1 score. Beyond single interactions, this research also examined complex gene relationships, such as the relations between EGFR and ERK, with the aim of reconstructing complete gene pathways relevant to diseases such as non-small cell lung cancer. Direct approaches showed limited success, but “least-to-most” prompting showed significant potential for producing a more complete picture of gene interactions. Through these methods, our work illuminates a pathway (pun intended) for advancing the use of LLMs in biomedical research, specifically in pathway extraction. The research underscores both the potential and the intricacies of using LLMs in bioinformatics, and it points toward using these models to significantly reduce the labor-intensive process of pathway mapping, offering a foundation for subsequent scientific work.
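To make the evaluation metrics concrete, the following is a minimal sketch of how precision, recall, and F1 could be computed when LLM-extracted interactions are compared against a reference set such as one derived from KEGG. It is an illustration under stated assumptions, not the scoring code used in this research: the `Interaction` triple, the `normalize` and `score_interactions` helpers, the exact-match criterion, and the placeholder gene names are all hypothetical.

```python
from typing import Iterable, NamedTuple, Tuple


class Interaction(NamedTuple):
    """Hypothetical representation of one directed gene interaction."""
    source: str
    relation: str  # e.g. "activation", "inhibition", "phosphorylation"
    target: str


def normalize(item: Interaction) -> Interaction:
    """Assumed normalization: trim whitespace, upper-case gene symbols,
    lower-case the relation label."""
    return Interaction(
        item.source.strip().upper(),
        item.relation.strip().lower(),
        item.target.strip().upper(),
    )


def score_interactions(
    predicted: Iterable[Interaction],
    reference: Iterable[Interaction],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) using exact matching of triples."""
    pred = {normalize(p) for p in predicted}
    ref = {normalize(r) for r in reference}
    true_positives = len(pred & ref)
    precision = true_positives / len(pred) if pred else 0.0
    recall = true_positives / len(ref) if ref else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Placeholder data for illustration only; these are not real KEGG entries.
reference = [
    Interaction("GENE_A", "activation", "GENE_B"),
    Interaction("GENE_B", "phosphorylation", "GENE_C"),
    Interaction("GENE_C", "inhibition", "GENE_D"),
]
predicted = [
    Interaction("gene_a", "Activation", "gene_b"),       # matches after normalization
    Interaction("GENE_B", "phosphorylation", "GENE_C"),  # exact match
    Interaction("GENE_A", "inhibition", "GENE_D"),       # not in the reference
    Interaction("GENE_D", "activation", "GENE_A"),       # not in the reference
]

precision, recall, f1 = score_interactions(predicted, reference)
print(f"precision={precision:.3f} recall={recall:.3f} F1={f1:.3f}")
```

Exact matching on (source, relation, target) is the strictest reasonable criterion. A looser variant could ignore the relation type or the direction of the edge, which would raise all three scores, so the matching rule should be reported alongside the numbers.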