Ancestral Legacies of an Insular Archipelago A genomic exploration of prehistoric Japan and mainland Asia Niall Cooke A thesis submitted for the degree of Doctor of Philosophy School of Medicine, Trinity College Dublin March 2022 i Declaration I declare that this thesis has not been submitted as an exercise for a degree at this or any other university and it is entirely my own work. I agree to deposit this thesis in the University’s open access institutional repository or allow the Library to do so on my behalf, subject to Irish Copyright Legislation and Trinity College Library conditions of use and acknowledgement. I consent to the examiner retaining a copy of the thesis beyond the examining period, should they so wish (EU GDPR May 2018). _____________________________ Niall Pádraig Cooke March 2022 ii Abbreviations ABC - Approximate Bayesian Computation BAM - Binary Alignment Map aBF - approximate Bayes factor ADH - alcohol dehydrogenase aDNA - ancient DNA ALDH - aldehyde dehydrogenase aML - approximated marginal likelihood ANE - Ancient North Eurasian ANS - Ancient North Siberian AP - Ancient Paleosiberian AR - Amur River BP - before present bwa - Burrows-Wheeler Aligner CDX - Chinese Dai in Xishuangbanna CEU - Northern Europeans from Utah CHB - Han Chinese in Beijing, China CHS - Han Chinese South EBA - Early Bronze Age EDAR - ectodysplasin A receptor EN - Early Neolithic gatk - Genome Analysis Toolkit Ga100k - GenomeAsia100K GP - genotype probability HOA - Human Origins Array IA - Iron Age IBD - identical-by-descent ISOGG - International Society of Genetic Genealogy JPT - Japanese in Tokyo, Japan KHV - Kinh in Ho Chi Minh City, Vietnam ky - thousand years kya - thousand years ago LBIA - Late Bronze / Iron Age LGM - Last Glacial Maximum LN - Late Neolithic iii MED - Medieval MN - Middle Neolithic mtDNA - mitochondrial DNA NAT2 - N-acetylotransferase 2 PCA - Principal Component Analysis PCR - polymerase chain reaction rCRS - revised Cambridge Reference Sequence ROH - runs-of-homozygosity SGDP - Simons Genome Diversity Panel SNP - Single Nucleotide Polymorphism UP - Upper Paleolithic WLR - West Liao River YR - Yellow River YRI - Yoruba in Ibadan, Nigeria Note – the IDs of many ancient sample referenced in this analysis consist of numbers and letters. The identity of and further information regarding these samples is contained in the main text and supplementary materials. iv List of Figures Figure 1.1. A timeline of pre-and protohistoric Japanese periods. .......................................... 14 Figure 1.2. Newly sequenced genomes from throughout the Japanese archipelago. ............ 17 Figure 1.3. A topographic map showing locations of published ancient samples included in Chapter 1 analysis............................................................................................................................... 18 Figure 1.4. Summary of bioinformatic pipeline. ........................................................................ 20 Figure 1.5. Simulation scheme for modified “Out-of-Africa” dispersal with adjusted parameters for divergence time (T) and population size for Jomon. ........................................ 27 Figure 1.6. Sampling locations, dates, and genome coverage of ancient Japanese individuals. .......................................................................................................................................... 31 Figure 1.7. Pairwise outgroup-f3 statistics analysis. .............................................................. 34 Figure 1.8. Principal component analysis (PCA) of present-day East Eurasia with projected ancient individuals. ............................................................................................................................ 36 Figure 1.9. Highlighted results from ADMIXTURE analysis. .................................................. 38 Figure 1.10. Phylogenetic analysis of the Jomon lineage. ......................................................... 39 Figure 1.11. Geographic and temporal display of f4(Mbuti, X; Y, Jomon) results. ................ 42 Figure 1.12. Runs of homozygosity (ROH) analysis of the 8.8kya Jomon individual “JpKa6904”. ........................................................................................................................................ 44 Figure 1.13. Geographic and temporal display of differences in affinities between Jomon sub-periods with ancient and modern populations using f4 statistics. . ..................................... 46 Figure 1.14. A comparison of outgroup-f3 results for the Jomon dataset grouped by the island of origin. ................................................................................................................................... 48 Figure 1.15. Geographical map highlighting results from f4(Mbuti, X; Jomon, Yayoi). ....... 50 Figure 1.16. Genetic ancestry of the Yayoi period....................................................................... 51 Figure 1.17. Correlation of shared genetic drift between the Yayoi and Jomon individuals with the geographic locations of Jomon sites. ............................................................................... 52 Figure 1.18. Genetic changes in the Kofun period. .................................................................... 53 Figure 1.19. Genetic ancestry of Kofun individuals. .................................................................. 54 Figure 1.20. Dating admixture in the Kofun individuals using DATES. ................................. 56 Figure 1.21. Geographic and temporal display of f4(Mbuti, X; Kofun, Japanese) results. .... 58 Figure 1.22. Genetic ancestry of the Kofun and modern Japanese modelled as three-way admixture. .......................................................................................................................................... 60 Figure 1.23. Tripartite structure in southern Ryukyu Islands. ................................................. 62 Figure 1.24. Genomic transitions in parallel with cultural transitions in pre- and protohistoric Japan. ........................................................................................................................... 68 Figure 2.1. Assessing the performance of imputation using GLIMPSE. ................................. 85 Figure 2.2. Overlapping heterozygous SNP sites produced by imputation. ........................... 86 Figure 2.3. Haplotypic analysis of 116 imputed ancient Asian individuals. ........................... 88 Figure 2.4. Assessing geographic and temporal intra-variation within Jomon. .................... 93 Figure 2.5. Adaptative allele frequencies in ancient Asian populations and modern-day references. . ......................................................................................................................................... 97 Figure 2.6. Regional and temporal differences in EDAR heterozygosity in imputed Asian data. ................................................................................................................................................... 100 Figure 2.7. Tracing mutations in the alcohol metabolism pathway over time within Japan and China. ........................................................................................................................................ 102 Figure 3.1. A proposed model for the formation of southern Native Americans. ............... 113 Figure 3.2 Exploring ancestral sources in ancient and present-day South America. ......... 122 Figure 3.3. Genetic ancestry of LagoaSanta modelled as two-way admixture. ................... 123 v Figure 3.4. Revised model of South American ancestry based on admixture modelling. . 127 List of Tables Table 1.1. Summary of the time series of ancient Japanese data. ............................................. 32 Table 1.2. Comparing the tripartite admixture model to individual two-way models in southern Ryukyu Islands. ................................................................................................................. 63 Table 3.1 An overview of certain ancient southern Native American individuals and populations. ...................................................................................................................................... 112 Table 3.2. Testing of the fittings of models for the genetic ancestry of the ancient LagoaSanta. ....................................................................................................................................... 124 Table 3.3. Testing of the fittings of models for the genetic ancestry of Karitiana and Surui. ............................................................................................................................................................ 126 vi Research Outputs Research article: Ancient genomics reveals tripartite origins for Japanese populations. Niall P. Cooke*, Valeria Mattiangeli*, Lara M. Cassidy, Kenji Okazaki, Caroline A. Stokes, Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, Akihiko Matsumoto, Masafumi Itoh, Yoshitaka Kojima, Daniel G. Bradley, Takashi Gakuhari and Shigeki Nakagome Published in Science Advances (September 2021). (*Refers to equal contribution to this work.) Literature review: Fine-tuning of Approximate Bayesian Computation for human population genomics. Niall P. Cooke and Shigeki Nakagome Published in Current Opinion in Genetics and Development (July 2018). Conference posters and talks: A simulation study on the origin of natural selection in an admixed population. Niall P. Cooke, Daniel G. Bradley and Shigeki Nakagome Poster presentation at the Irish Society of Human Genetics Annual Scientific Meeting (ISHG) in Dublin, Ireland (September 2018), the 47th European Mathematical Genetics Meeting (EMGM) in Dublin, Ireland (April 2019) and at the Society for Molecular Biology and Evolution (SMBE) Annual Conference in Manchester, United Kingdom (July 2019). vii A time series of ancient Japanese genomes. Niall P. Cooke, Valeria Mattiangeli, Lara M. Cassidy, Kenji Okazaki, Caroline A. Stokes, Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, Akihiko Matsumoto, Masafumi Itoh, Yoshitaka Kojima, Daniel G. Bradley, Takashi Gakuhari and Shigeki Nakagome Poster presentation and lighting talk at Human Evolution- From Fossils to Ancient and Modern Genomes virtual conference (November 2021). Genomic insights into the insular hunter-gather-fisher Jomon of Japan. Niall P. Cooke, Valeria Mattiangeli, Lara M. Cassidy, Kenji Okazaki, Caroline A. Stokes, Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, Akihiko Matsumoto, Masafumi Itoh, Yoshitaka Kojima, Daniel G. Bradley, Takashi Gakuhari and Shigeki Nakagome Talk scheduled to be given at the 13th Conference on Hunting and Gathering Societies in Dublin, Ireland (July 2022). Courses undertaken:  "Teaching and Supporting Learning as a Graduate Teaching Assistant", Trinity College Dublin (5 ECTs)  “Planning and Managing Your Research and Career”, Trinity College Dublin (5 ECTs) viii Acknowledgments The first person I would like to thank is my supervisor Shigeki Nakagome for giving this great opportunity, and for all of the guidance and advice over the past few years. He has also shown great patience and humour in every encounter, and his unrelenting dedication to this project has brought my work to a much higher standard. My thanks as well to my co-supervisor Dan Bradley, whose 3rd year literature review on Neanderthals all the way back in 2015 set me down a path of population and ancient genomics. His constant practical support made this project possible, and the enthusiastic and knowledgeable feedback throughout kept it on the right track. A great debt is also owed to Valeria Mattiangeli, whose skills in the ancient lab are world class, and who is responsible for the high quality of ancient Japanese data that made this project so unique. Lara Cassidy has also had an incredibly positive influence on this thesis, from her initially getting me set up on the server and explaining some of basic commands and analytical concepts that I would go on to use on a daily basis, to her tremendous contributions in shaping our paper and making it into a much more comprehensive piece of work. The entire extended Bradley-Cassidy-lab group have also been a fantastic and friendly resource in terms of knowledge, ideas, paper sharing and bioinformatic support. I am grateful to have welcomed into their lab meetings, journal clubs and Christmas parties. My particular thanks to Bruno Ariano, whose continued communication on Slack has provided me with a lot of practical advice and reassurance, and who set a high standard for IBD plots that forced me to up my game. I also want to thank Iseult Jackson and our student Caroline Stokes for the enjoyable and rewarding experience of co-supervising an undergraduate project. Madeleine Murray has also been a welcome addition to the Nakagome team, and her preliminary imputation work was of great value in getting that analysis up and running. I also greatly appreciate the work done by our many collaborators from Kanazawa University and throughout Japan in securing the precious ancient samples I was privileged to analyse - Takashi Gakuhari, Kenji Okazaki, Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, Akihiko Matsumoto, Masafumi Itoh and Yoshitaka Kojima. I hope that I can meet at least some of you in person when I finally make it out to Japan soon. I also have a great appreciation for ix our collaborators in the GenomeAsia100K consortium, not just for sharing of resources, but also their continued communication and enthusiasm for my analysis. I am pretty sure that this is the first ancient genomics PhD project conducted within Trinity’s Department of Psychiatry, and or even in the School of Medicine – but nevertheless I have really enjoyed my time being based in the Trinity Translational Medicine Institute on the campus of St. James’s Hospital. I want to thank Marcus and Fíana, my neighbours in Office 1.17 (the Genomics office) for making it a fun place to work. The Neuropsychiatric Genetics Group and the Dept. of Psychiatry as a whole are a fantastic and inspiring bunch of people, and I always felt very welcome in spite of the fact that I was working on a completely different type of project. I am grateful for the many friends I have made from around the building too, through the One O’Clock lunches and the TTMI Social Committee’s Payday PubDays. I also want to express my gratitude for the support and advice from the “Write Stuff” writing group (Emma, Gráinne, Stephen and Anna). Unfortunately, this PhD will forever be associated in my memory with certain global events that began in March 2020 which resulted in a long 18 month-slog of remote working in a small studio apartment. I don’t think this thesis could have been completed under these circumstances without help from my family and friends. I want to thank for my parents for their continued support throughout this PhD endeavour, being impressed when I try to explain what I am working on and for sporadically sending me money for takeaways. I want to thank my siblings as well, especially my sister Laura, as well as her husband Talha. Lockdown was made more bearable through the many remote games, quizzes and Strava activities, and I want to particularly extend my thanks to Mark, Max, Billy, Dr. Pearse and very-soon-to-be-Dr. Tim, and all the member of the various prolific group chats (such as Das Affenhaus, OK Zoomer, various iterations of “the Salty Spittoon”, and Live Lab Love). ありがとうございます (“thank you”) x Table of Contents Declaration ......................................................................................................... i Abbreviations .................................................................................................... ii List of Figures ................................................................................................... iv List of Tables ...................................................................................................... v Research Outputs ............................................................................................. vi Acknowledgments ........................................................................................... viii Table of Contents ............................................................................................... x Abstract ............................................................................................................. 1 General Introduction ......................................................................................... 2 Chapter 1 - A time series of ancient genomes from the Japanese archipelago .. 13 Introduction ......................................................................................................................... 13 Materials and Methods ........................................................................................................ 16 Results ..................................................................................................................................30 Discussion ............................................................................................................................ 64 Chapter 2 - A deeper GLIMPSE: Analysis of a large dataset of imputed ancient Asian individuals .............................................................................................. 71 Introduction ......................................................................................................................... 71 Materials and Methods ........................................................................................................ 78 Results .................................................................................................................................. 83 Discussion .......................................................................................................................... 103 Chapter 3 - Exploring potential ancient Asian ancestry in southern Native Americans ....................................................................................................... 111 Introduction ........................................................................................................................ 111 Materials and Methods ....................................................................................................... 116 Results ................................................................................................................................ 120 Discussion .......................................................................................................................... 127 Final Discussion ............................................................................................ 130 Appendix ....................................................................................................... 132 Appendix 1 – Supplementary material for Chapter 1 ........................................................ 132 Supplementary Note 1.1 ........................................................................................................... 182 Supplementary Note 1.2 .......................................................................................................... 183 Supplementary Note 1.3 .......................................................................................................... 186 Appendix 2 – Supplementary material for Chapter 2 ........................................................ 193 Appendix 3 – Supplementary material for Chapter 3 ........................................................ 213 References ..................................................................................................... 222 1 Abstract The objective of this thesis is to use ancient genomic data and a variety of bioinformatic approaches to explore human evolutionary history in the Japanese archipelago and mainland Asia. Chapter 1 presents a dense genomic exploration of Japanese prehistory through the analysis of time series of ancient genomes from different cultural periods, including 12 ancient individuals which were sequenced as part of this project. This analysis reveals that the indigenous Jomon hunter-gatherer population diverged from other Asian populations ~20kya, before maintaining a small population size of ~1000 without any major significant contribution from outside of the archipelago. The isolation of the Jomon ended ~3kya with the arrival of people with northeast Asian ancestry during the Yayoi period, which is associated with the introduction of wet rice farming to the region. Surprisingly, a third wave of East Asian ancestry is observed during the imperial Kofun period. These three ancestral sources are demonstrated to characterise present-day Japanese populations, providing the first genetic evidence for a newly proposed tripartite structure for present-day Japanese populations. This structure is also shown to characterise a historical population from the southern Ryukyu Islands. Chapter 2 demonstrates through extensive testing that imputation can reliably increase the genomic resolution of ancient individuals from across Asia. IBD and ROH-based analyses of imputed data reveal broader population dynamics within mainland Asia as well as previously unobserved geographic substructure within the Jomon. Analysis of inferred adaptive alleles gives a direct insight into the evolution of traits specifically associated with Asian populations. The absence of a highly frequent and multifaceted mutation in the EDAR gene in the Jomon population suggests a more recent origin for this than previously proposed. Evidence for the atypical metabolism of alcohol is observed ~4kya, but it appears that separate mutations in the same pathway occurred at different stages. Finally, Chapter 3 explores potential early migrations from Asia into South America by searching for the ancient origin of “PopY”, an unknown population responsible for Australasian-like ancestry in certain Native American populations living in South America. Admixture modelling reveals the presence of ancestry from Ancient North Siberians and the Hoabinhian hunter-gatherers of Southeast Asia in the ~10kya “LagoaSanta” population from Brazil. Efforts to replicate this in present-day Karitiana and Surui are unsuccessful, due to the influence of more recent admixture events. 2 General Introduction Project overview The objective of this thesis is to explore the demographic and evolutionary history of humans who have lived in the Japanese archipelago and mainland Asia using ancient genomic data and state-of-the-art bioinformatic approaches. Populations from these regions have been historically underrepresented in large-scale genomic analyses (Popejoy and Fullerton, 2016), and until recent years have been almost entirely absent from the field of ancient genomics (Marciniak and Perry, 2017). The lack of available genetic resources has limited our understanding of how factors such migrations, environmental pressures, and advancements in agriculture and technology have shaped the present-day diversity observed in these areas today. Analysis of ancient genomic data over a wide time frame allows an unparalleled opportunity to directly explore the region’s changing genetic landscape, and new findings into the formative events in early human history are of core interest to numerous academic disciplines such as archaeology, anthropology, linguistics, human biology, and population genetics. This thesis will analyse both newly-sequenced and previously published ancient genomic data using current computational approaches to provide novel insights into the pre- and proto-history of Asia with three separate goals: 1. Chapter 1 contains a comprehensive genomic overview of the Japanese archipelago through the analysis of a time series of ancient data. This includes the first reporting of twelve newly sequenced ancient genomes from the hunter-gatherer Jomon population and the previously unsampled imperial Kofun period. This chapter is adapted from the paper “Ancient genomics reveals tripartite origins of Japanese populations” (Cooke et al., 2021), published in Science Advances in September 2021. 2. Chapter 2 explores the potential of imputation to increase the available genetic information from ancient Asian individuals. This chapter demonstrates for the first time that high rates of imputation accuracy can be obtained for ancient Asian populations, including the Jomon hunter-gatherer-fishers of Japan. A constructed dataset of imputed ancient individuals is then analysed to obtain a more detailed 3 overview of the genomic landscape of mainland Asia and the Japanese archipelago, and to explore origins of several adaptive traits associated with these regions. 3. Chapter 3 uses ancient Asian data to explore the potential genetic legacy of early migrations into the Americas by focusing on the origin of a mysterious excessive genetic affinity between certain Native American populations living in South America with Australasian populations. Each chapter discusses its own specific background and methodology. This introduction will provide general information regarding the approaches and techniques that are throughout this project, and a general summary of the current genomic perspective of human evolution in Asia. It is divided into the three main sections: 1) an overview of how genomic variation can be used to explore the past; 2) an introduction to the field of ancient genomics; and finally, 3) our current understanding of the genetics of prehistoric Asia. Population Genetics: using variation to explore human evolution The ~3 billion base pair sequence of the human genome is overwhelmingly similar across all Homo sapiens, with any pair of individuals sharing approximately 99.5% of their genetic sequences (Griffiths et al., 2015); this, however, still leaves a huge number of potential differences within the remaining millions of sites. The wide range of phenotypic diversity observed across the world at both the individual and the population level is testament to the huge amount of genetic variation that has amassed and been maintained within our species over the course of its evolution. Understanding how these differences emerged is hugely informative to the movements and interactions of early human populations, as well as the roles environmental, cultural, and technological changes have played in shaping this variation. Over the past century the field of population genetics has studied the processes that shape and maintain genetic variation within groups of living organisms and contextualised them within mathematical and statistical frameworks. An important early principle known as “Hardy-Weinberg Equilibrium” established a simple mathematical proof that an allelic variant could be maintained at a constant frequency through successive generations in a hypothetical setting with several assumptions that 4 do not reflect real scenarios. This provided a baseline to explore four main processes that alter genetic frequencies within a population, and bring about differences in two separate populations: 1) genetic drift, which refers to random increase or decrease in genotype frequencies that occurs during the sampling of genotypes as they are passed from parent to offspring through successive generations; 2) mutations, which occur at a molecular level due to copying errors during genetic replication and introduce new alleles into a population; 3) natural selection, which allows mutations that confer an advantage to an organism in a particular setting to rise in frequency within a certain population over time; and 4) gene flow, which refers to the exchange of genetic material from one population to another through migrations or admixture introducing new variants into a population and altering the frequencies of pre-existing variants. The extent of these processes’ effect is intrinsically linked to population size (e.g., changes introduced through genetic drift may will have a larger impact in a small population) and time (e.g., an older mutation has a better chance of being at a higher frequency than a more recent one). The earliest mathematical models developed predicted how the frequency of an allele can be expected to change in frequency forwards in time. A major conceptual shift occurred in the 1980s with the emergence of the theory of coalescence (often credited t0 John Kingman), which looks at allelic variation in the context of its evolutionary history and tries to explain how it emerged backwards in time (Hartl and Clark, 2007, Wakeley, 2009), which represented a breakthrough in how these models can be practically applied to real genetic data. In addition to theoretical frameworks, global human variation has been directly assessed using biological markers as far back as the 1910s, initially through the analysis of differences in blood markers from different populations (Hirschfeld and Hirschfeld, 1919). Following the identification of DNA as the molecular unit of inheritance in 1953 (Watson and Crick, 1953), it became the focal point for understanding genetic variation, even before it was possible to directly be sequenced - a rudimentary approach was developed that involved cutting up isolated DNA with site-specific restriction enzymes, and comparing the differences in fragment lengths (Danna and Nathans, 1971). Today, built upon a series of technological breakthroughs – the development of Sanger sequencing to directly decode DNA in the 1970s (Sanger 5 et al., 1977) the ability to rapidly amplify DNA fragments using polymerase chain reaction (PCR) in the 1980s (Mullis and Faloona, 1987), the publication of the first draft of the human genome in the early 21st century (Lander et al., 2001) and the huge reduction in cost per genome that has come about through the rise of next generation sequencing (NGS) technologies since the 2010s (as reviewed by Metzker (2010))– the primary tools for assessing human genetic variation has been genome-wide data for large numbers of individuals. The analysis of genomic data from diverse populations evolved the field of “population genetics” into “population genomics”. Many approaches have been developed to showcase genetic diversity based on genome-wide differences in allelic information. One of the most popular methods has been principal component analysis (PCA) (Patterson et al., 2006), which simplifies high dimensional genetic data into synthetic variables that pull the most distantly related samples in a dataset apart, and whose patterns can resemble population structure and mirror geographic variability (Novembre et al., 2008). Other methods such as ADMIXTURE, can define unique ancestral coefficients within individuals (Alexander et al., 2009). At the population level, allele frequency information can be used create phylogenies (Pickrell and Pritchard, 2012) or directly estimate degrees of genetic drift or gene flow using f- and D- statistics (Patterson et al., 2012), which can be further expanded into building admixture graphs (Patterson et al., 2012) or estimating ancestral sources (Haak et al., 2015). If two loci are located physically close to each other on a chromosome, they are less likely to be broken down via recombination and more likely be inherited together over successive generations. Measures called “linkage disequilibrium” represent the strength of the linkage between different genetic sites. At a broad genome level, these linkage patterns are informative about historic relationships between different populations, while at the molecular level, long or short stretches at particular sites can be informative about when a recent mutation occurred. Efforts to expand diversity in genomic resources has led to projects such as the 1000 Genomes Project (1000 Genomes Project, 2015) and the Simons Genome Diversity Project (Mallick et al., 2016), whose aim to generate sequence data from many individuals and populations from throughout the world in order to better understand variation at a global level. Even as more genomic data becomes available from diverse 6 populations, there are still limitations, however, as to what this data can reveal about the complex story of human origins, and how the species arrived in different parts of the world. Many prehistoric human ancestries are not well represented by modern data, leaving any role they might have played in shaping variation undetectable. Similarly, events that have a major impact on genetic diversity, such as extreme population replacements or rapid expansions will skewer the ability to reconstruct much older events. The emergence of a new field to address these analytical shortcomings will be the focus of the next section. A brief overview of ancient genomics One of the most ground-breaking developments for exploring human history - and one of the most exciting areas of scientific discovery over the past decade - has been the rapidly expanding field of ancient genomics. The retrieval of ancient DNA (aDNA) from a historical artifact or long deceased organism provides a direct molecular insight into genetic variation at a particular point in the past. While the concept of this type of approach is not a new one - the earliest reported successful attempt was the sequencing of two small mitochondrial DNA fragments from a quagga (an extinct zebra-like animal) in the mid-80s (Higuchi et al., 1984) – the rise of high throughput sequencing technologies and improvements in protocols to overcome the inherent obstacles faced when working with ancient DNA has made findings in this once-niche field a regular sight in high impact scientific journals. There are three major challenges to working with ancient genetic material: 1) any aDNA that has survived as part of an ancient artifact over centuries or millennia can often be in such low quantities that it makes any effort to extract and amplify it particularly prone to contamination from modern DNA sources (which is a particular issue with ancient human remains, whose DNA can be indistinguishable from the researchers who worked with the sample) (Hofreiter et al., 2001b); 2) the small amount of aDNA that has survived has been degraded into much shorter and fragments rendering them difficult to successfully isolate (Paabo, 1989); and 3) the base pair sequence of these short aDNA fragments is susceptible to chemical damage, particularly to the deamination of cytosine bases to uracil, which can cause misincorporations during amplification that lead to unwanted error at transition 7 variant sites in the sequence data (Hofreiter et al., 2001a), which has been found to be particularity prevalent the ends of the aDNA fragments (Briggs et al., 2007). After nearly four decades of experimentation and research, strict protocols and innovative technical solutions have overcome and vastly reduced the impact of these issues, while the power of NGS sequencing platforms have made it possible to salvage sufficient amounts of ancient DNA to enable comprehensive analysis. Contamination has been limited through the carrying out of extractions in a dedicated sterile facility, with researchers wearing full body anti-contamination suits and taking extreme care to keep work stations clean (Cooper and Poinar, 2000). Protocols have been established to maximise the yield of endogenous aDNA (Der Sarkissian et al., 2014, Orlando et al., 2015, Damgaard et al., 2015), with a breakthrough finding that the petrous portion of the temporal bone (located at the base of the skull) gives systematically higher rates of yielded aDNA (Gamba et al., 2014, Pinhasi et al., 2015). The treatment of samples with uracil-DNA glycosylase (UDG) has also been found to counteract the effects of the deamination and reduce sequence errors (Hofreiter et al., 2001a). Similarly, the bioinformatician responsible for analysing the data must always keep potential limitations in mind and adjust the in silico pipeline to reduce their impact, such as through relaxing parameters setting for the alignment of raw data to a reference genome to allow for some mismatches, or to restrict particular analyses to transversion variants only. Analysis of ancient human data has changed long held perceptions about early human evolution, such as by revealing the extent of the genetic contribution from early human-like hominids the Neanderthals (Green et al., 2010) and Denisovans (a sister- group to the Neanderthals first recognised as its own species only through sequencing of ancient DNA (Reich et al., 2010)). Dense sampling of ancient humans from different stages of West Eurasian prehistory has shown the profound effect that cultural changes have had on the genomic landscape, initially with the advancement with agriculture and then again by Bronze Age migrations of herders associated with pastoralism and dairy farming from the Western steppe region (Gamba et al., 2014). Samples from Asia, however, were limited to only a handful of genomes (Marciniak and Perry, 2017) until recent years, which have seen many large studies and datasets published. An overview of the ancient genomic analysis of Asia is the focus of the next section. 8 The genetics of prehistoric Asia as told through aDNA This section will give an overview of the emerging picture of the prehistory of mainland Asia, as revealed through the analysis of ancient genomes. The earliest inhabitants of East Eurasia It is unclear when exactly an anatomically modern humans first entered the Asian continent; widespread reporting of human teeth found in several caves in southern China that were dated as far back as 80 -120kya (Liu et al., 2015) have since been disputed, with alternative dating strategies and ancient DNA supporting a more recent arrival of 50 to 45kya (Sun et al., 2021, Hublin, 2021). It is generally accepted that all present-day non-African Homo sapiens derive the majority of their ancestry from a major expansion out of Africa into the Eurasian supercontinent 50,000 to 60,000 years ago (kya) (Bergstrom et al., 2021). At some point after this expansion, these early humans admixed with Neanderthals (Green et al., 2010), likely on multiple occasions (Vernot et al., 2016) resulting in the exclusive presence of a small proportion of Neanderthal DNA in all non-Africans. After this split, it appears that Asian populations received a small amount of additional gene flow from Denisovans (Reich et al., 2010, Meyer et al., 2012, Prufer et al., 2017) There are only a handful of genome-wide sequenced Asian individuals from before the Last Glacial Maximum (LGM), a period of time 26 to ~19kya in which the Earth’s ice sheets reached their maximum integrated volume (Clark et al., 2009). The oldest individual sequenced from this region is a 40,000-year-old individual (“TY”) from Tianyuan Cave, near Beijing, in China. This individual was found to be genetically closer to present-day Asians than Europeans and formed a unique ancestry that was basal to all modern populations with Asian ancestry (Yang et al., 2017). Tianyuan- related ancestry was later determined to have been widespread prior to the LGM based on a high degree of shared genetic drift with a 33kya individual from the Amur River Basin in North-eastern China (Mao et al., 2021). A pair of Upper Palaeolithic individuals dating to ~31.6kya from the Yana Rhinoceros Horn Site in the extremes of northeast Siberia (referred to as “Yana_UP”), however, were found to share comparatively more alleles with West Eurasians than was observed in Tianyuan. This 9 led to their being designated as belonging to a unique “Ancient North Siberian” (ANS) lineage (Sikora et al., 2019). A 40kya genome from the Salkhit Cave in Mongolia was found to share ancestry related to both Tianyuan and the ANS lineage (Massilani et al., 2020) showing both of these early ancestries were widespread pre-LGM. The emergence of the ancestors of the Native Americans in Siberia At some point after the onset of the LGM (~26kya) a lineage ancestral to Native Americans emerged within Siberia; however, the understanding of the nature of the relationship between the early Asians and present-day Native Americans is under constant revision as more ancient genomes from the region are analysed. A 24kya individual from the Mal’ta Upper Paleolithic site (“MA1”) from near Lake Baikal in south-central Siberia was found to be closer to present-day Native Americans than East Asians, which was designated as part of an “Ancient North Eurasian” (ANE) lineage (Raghavan et al., 2014). This lineage was demonstrated to be descended from ANS lineage associated with “Yana_UP” with additional ancestry from early West Eurasian lineages (Sikora et al., 2019). A much more recent 9.8ky old Mesolithic individual from northeast Siberia, “Kolyma1”, was more a closer relative for present-day Native American than the older MA-1 individual (Sikora et al., 2019). This individual was part of a separate lineage to ANE, termed the Ancient Palaeo-Siberian” (APS), which represented a major genetic turnover in northeast Siberia from the older ANS lineage. A ~14kya individual from the Ust-Kyakhta-3 site south of Lake Baikal (“UKY”) has also been designated as belonging to the APS lineage (Yu et al., 2020). APS has been proposed as a mixture of ANS with additional ancestry from East Asian populations (Sikora et al., 2019), although it has also been suggested that a 17kya sample on the Lena River near Lake Baikal (“Khairygas-1”), which is genetically distinct from ANS, contributed heavily to AP (Kilinc et al., 2021). The East Asian component of this lineage may have been mediated in the Amur Region (Mao et al., 2021) prior to the expansion to the Americas. At the time of writing, the APS lineage stands as the closest relatives to Native Americans, although it cannot be ruled out that further sequencing from during or after the LGM may once again change this narrative. Multiple lines of evidence, 10 however, have suggested that the peopling of America was a dynamic process and not a single occurrence (Moreno-Mayar et al., 2018a, Ardelean et al., 2020, Becerra- Valdivia and Higham, 2020, Bennett et al., 2021) as will be discussed in greater detail in Chapter 3. Insights from the growing post-LGM genomic record The number of genomes from northern East Asia dating to within the past 10k years that have been sequenced has grown exponentially in recent years, revealing a complex history of dynamic population movement and admixture between many distinct and diverse populations. The Upper Paleolithic lineages ANS and ANE lineages (as represented by “Yana_UP” and “MA-1” respectively) ultimately were replaced by the APS lineage (Sikora et al., 2019, Yu et al., 2020). A handful of present-day populations living in northeast Siberia, including the Chukchi, Koryak and Itelman, retain a close affinity to the APS, as do present-day Native Americans (Sikora et al., 2019); however continued migration in the mid to Late Holocene from southern East Asian populations resulted in the replacement of the APS with a population referred to as “Neo-Siberians”, who are ancestral to many contemporary populations living in the region today (Sikora et al., 2019). Ancestry related to the 40kya TY individual was present in the Amur River region at least 33kya (Mao et al., 2021). A post-LGM individual dating to 19kya (“AR19k”), however, was found to be genetically distinct from these older samples and have a closer affinity to present-day populations in north East Asia, suggesting a major population change had occurred in this region during the LGM (Mao et al., 2021). A distinct genetic component that is associated with the Amur River Basin, located on the border of present-day Russian and China (also referred to as AR ancestry or northeast Asian ancestry) has been observed to have begun as early as 14kya (Mao et al., 2021). Neolithic foragers sampled from Chertovy Vorota Cave (Devil’s Gate Cave) in the Primorye Region of far eastern Russia have an affinity to this ancestry, as do present-day Tungusic-speaking populations (Siska et al., 2017, Sikora et al., 2019, Mao et al., 2021, Ning et al., 2020). This ancestry is broadly associated with ancient societies that practiced hunting, fishing and animal husbandry, with farming practices 11 only adopted in the historic era (Kuzmin, 2013, Ning et al., 2020). This lineage is genetically distinct from another deeply diverged population of hunter-gatherers known as the “Hoabinhian” living in Southeast Asia (McColl et al., 2018). In addition to the 24kya MA-1, the area surrounding Lake Baikal has been extensively studied through ancient human sequence data (de Barros Damgaard et al., 2018, Yu et al., 2020, Kilinc et al., 2021). The ANE related ancestry found in MA-1 was still present ~17kya in a sample from the Afontova Gora cave (“AG2”); however, it was largely replaced in the western Cis-Baikal area by the early Neolithic by North East Asian ancestry (de Barros Damgaard et al., 2018, Kilinc et al., 2021). In the Bronze Age however, a surge in ANE ancestry was observed, likely from an unsampled source population with a high affinity to MA-1(de Barros Damgaard et al., 2018). The more eastern Trans-Baikal area in contrast retained more longstanding continuity, while the vast north-eastern Yakutia also experienced an increase in affinity to northeast Asian ancestry over time (Kilinc et al., 2021). Dense sampling of the of Eastern Steppe region of present-day Mongolia, close to the Lake Baikal, has shown more dynamic changes in its ancestral profile with AR, ANE, the Bronze Age Baikal and Western Steppe ancestries fluctuating in proportions over the past ~5kya (Jeong et al., 2018, Jeong et al., 2020). At the beginning of the Neolithic period ~8ky, the broad population history throughout East Asia appears to be more diverse in the past than is observed in the present (Wang et al., 2021b, Yang et al., 2020). Many very old lineages appear to have been present in China that do not significantly contribute to present-day variation (Wang et al., 2021b). A clear separation into north and south East Asians is observed (Yang et al., 2020), and has possibly been seen as far back as 19kya (Mao et al., 2021). Ancient genomes of Neolithic populations from the Yellow River Basin in China were found to be clearly distinct from the identifiable ancestry of the more northerly Amur River Basin (Ning et al., 2020, Wang et al., 2021a). This region is associated with complex societies around millet farming by ~6kya, and Yellow River-related (YR) ancestral profiles are genetically similar to the present day East Asian profile (Ning et al., 2020). Neolithic populations sampled from the West Liao River, located between the Amur and the Yellow River Basins, have been shown to be influenced by both AR 12 and YR ancestries during periods, which corelated to changes in subsistence strategies (Ning et al., 2020). One of the most intriguing and mysterious parts of the Asian continent is the Japanese archipelago. The archaeological record shows a distinctive hunter-gatherer-fisher population emerging in ~16.5kya which preliminary sequence data has suggested as being a deeply diverged lineage from other East Asia populations (McColl et al., 2018, Kanzawa-Kiriyama et al., 2019, Gakuhari et al., 2020). This area’s geographic isolation from continental Eurasia, and subsequent series of rapid cultural changes make it a unique microcosm in which to study the migratory patterns that accompanied agricultural spread and economic intensification in Asia. A comprehensive analysis of ancient genomes from different periods of Japanese history in the context of other ancient Asian genomic data, and the impact of these early populations on present-day Japanese populations, will form the basis of Chapter 1 of this thesis. 13 Chapter 1 - A time series of ancient genomes from the Japanese archipelago Introduction The presence of stone tools has revealed that the Japanese archipelago has been occupied by humans for at least 38,000 years (ky), during what is referred to as its Palaeolithic period (Habu, 2004). The first distinctive culture in the region, however, does not occur until ~16.5kya, at approximately the end of Last Glacial Maximum (LGM), with the appearance of pottery shards with unique cord-like markings in the archaeological record. The period associated with these ceramics, which are among the oldest pottery in the world, was termed the Jomon period (Jo translates to “cord” in Japanese and mon translates to “mark”) (Habu, 2004). The people who lived during this period (who are referred to as the “Jomon”) are considered to be a complex society of hunter-gatherers-fishers (Habu, 2004). The Jomon have been called “affluent foragers” and their associated culture is characterised by the presence of large settlements, a high population density per site, low residential mobility, logistically organized subsistence strategies (as evidenced by the presence of storage pits and large shell-middens), and the construction of large ceremonial features (Habu, 2004). The Jomon period lasted for several millennia until ~3kya with the introduction of wet-rice cultivation to the region, at the beginning of the Yayoi period, named after the area in Tokyo in which the first pottery associated with the period was discovered (Mizoguchi, 2013). This period saw major changes occurring across the entire region, not only with the widespread adoption of systematic rice paddy field agriculture, but also saw a hierarchical class structure emerging to maintain the communal collaboration required for food production (Mizoguchi, 2013). The Yayoi period was relatively quickly followed by another significant cultural change, which was represented by the first appearances of large keyhole-shaped tumuli (Mizoguchi, 2013). The Kofun period (“Kofun” referring to a mounded tomb) began ~1.7kya and saw the centralisation of political power in the region and the beginnings of the imperial Japanese governance whose presence remains to this day. In addition to technological advances, this period also saw the appearance of the first written historical records and literary texts, such as the foundational Kojiki and Nihon 14 Shoki, which compiled myths and legends associated with the creation of the Japanese state and the reigns of the earliest emperors (Mizoguchi, 2013). How these distinctive cultural periods (a timeline of which can be seen in Fig. 1.1) have shaped the ancestry of present-day populations has long been a source of speculation. A well-known and enduring hypothesis is the “dual structure” model which was first proposed by Hanihara (1991) and posits that all Japanese are the admixed descendants of the indigenous Jomon with the migrants from East Asia who brought rice cultivation during the Yayoi period. This model was originally proposed based on the morphological analysis of crania and teeth, but has found further support from archaeology, biological anthropology and linguistics (as comprehensively reviewed by Hudson et al. (2020)). Broad genetic studies of Japanese populations have also shown clear population stratification across the archipelago, which also supports at least two population sources for present-day populations (Japanese Archipelago Human Population Genetics et al., 2012, Nakagome et al., 2015, Jinam et al., 2021a, Jinam et al., 2021b). Direct sampling of ancient DNA from people living during these three distinct cultural periods has the potential to be hugely informative about the demographic origins of present-day populations, and of the genomic impact of the major cultural transitions to agriculture and later to early state phase. Unfortunately, the Japanese archipelago has not been an ideal region for the long- term preservation of ancient genetic material due to its humid climate and the high acidity of the soil found in the volcanic islands (Pinhasi et al., 2015, Hofreiter et al., 2001b). In spite of this, three Jomon genomes with >1x coverage from the past ~3,500 years have been successfully sequenced (McColl et al., 2018, Kanzawa-Kiriyama et al., Figure 1.1 A timeline of pre- and protohistoric Japanese periods. 15 2019, Gakuhari et al., 2020). This has shown that the Jomon population as being a deeply diverged lineage from other East Asians (McColl et al., 2018, Kanzawa- Kiriyama et al., 2019, Gakuhari et al., 2020), with a broad range of between 18 and 38kya being proposed for the timing of their split (Kanzawa-Kiriyama et al., 2017) and a potential Southeast Asia origin (McColl et al., 2018, Gakuhari et al., 2020). Additionally, preliminary genetic analysis of two samples related to the Late Yayoi period has demonstrated their genomic affinity both to the Jomon and to present-day Japanese (Shinoda et al., 2019). The limited number of sequenced ancient Japanese genomes impedes potential insights into the impact of cultural change on the region’s genetic profile, and how these changes shaped the ancestry of the present-day population. A denser sampling of Jomon would not only give further insight into the origin of this unique population within Asia, but it would also show how the structure of this population may have changed over the course of the thousands of years it dominated the Japanese archipelago. Integrating data from the subsequent agrarian Yayoi period, as well as the previously unsampled Kofun period, would demonstrate the unknown genetic impact of both the agricultural transition, and later the state-formation phase on shaping modern Japanese populations. Chapter goal The goal of this chapter is to analyse a uniquely assembled time series of ancient data from the Japanese archipelago with other relevant published ancient samples from throughout East Eurasia using up-to-date population genetics and bioinformatic methods. This chapter presents an analysis of ancient genomic data from throughout pre- and proto-historic Japan and includes the first reporting of 12 newly sequenced ancient genomes from across the archipelago. This approach represents a unique and original opportunity to characterise the changes in the genomic profile in Japan over the past ~9,000 years. This chapter is adapted from the paper “Ancient genomics reveals tripartite origins of Japanese populations” (Cooke et al., 2021), published in Science Advances in September 2021, and also contains updated analyses conducted after the paper’s publication. 16 Materials and Methods Materials Newly sequenced ancient Japanese genomes 14 ancient Japanese individuals from 6 archaeological sites across the west and central parts of the Japanese archipelago were screened for the presence of ancient DNA as part of this project. Four of these sites were located on the island of Honshu (Funaguru, Iwade, Kosakus and Odake), and two were located on Shikoku (Hirajo and Kamikuroiwa) (Fig. 1.2a) All ancient individuals were associated and dated to the different stages of the Jomon period (Initial, Early, Middle, and Late Jomon), with the exception of three individuals from the Iwade site, who were associated with the Kofun period (further dating information for each individual is included in Appendix Table 1.1). DNA was extracted from the petrous bone for all samples (an example of which is photographed in Fig. 1.2b) with the exception of “JpKa2” for whom only a tooth was available. Ancient samples were processed at dedicated ancient DNA facilities at Kanazawa University, Japan and Trinity College Dublin, Ireland. Wet-lab work conducted in Trinity College Dublin was led by Dr. Valeria Mattiangeli, with assistance in drilling for several samples by the present author. Additional libraries for samples “JpFu1” and “JpHi01”, which were sequenced on the NovaSeq 6000 Illumina platform, were prepared by final-year undergraduate student Caroline Stokes and included in analysis. 17 (a) (b) (a) A map of the archaeological sites from which ancient individuals were sampled as part of this analysis (with names of major islands highted in italics) (b) A photograph of one of the petrous bones (from individual “JpFu1”) from which ancient DNA was successfully extracted. Published ancient genomic data Over the course of this project, many papers were published based on ancient genomics whose newly sequenced data were made publicly available. A large dataset of published ancient genome data deemed to be relevant to this analysis was assembled to broaden the scope of this study. This included ancient sequence data from individuals from the Central and Eastern Steppe (de Barros Damgaard et al., 2018, Jeong et al., 2018), Northeast Siberia (Sikora et al., 2019), Southeast Asia (McColl et al., 2018) and continental East Asia (Ning et al., 2020, Yang et al., 2020). (Fig. 1.3; a full list of samples can be found in Appendix Table 2). This analysis was greatly enhanced by the inclusion of five additional previously published ancient samples from Japan (represented with triangles in Fig. 1.3):  two ~3.5kya individuals from the Late Jomon period excavated from the Funadomari site on Rebun Island, north of Hokkaido (“F5” and “F23”) (Kanzawa-Kiriyama et al., 2019) Figure 1.2. Newly sequenced genomes from throughout the Japanese archipelago. 18  a ~2.8kya individual from the Final Jomon period from Ikawazu, Honshu (“IK002”) (McColl et al., 2018, Gakuhari et al., 2020)  two ~2kya individuals associated with the agrarian Yayoi culture from the north-western part of Kyushu Island (“Yayoi_1” and “Yayoi_2”), who date to the Late Yayoi Period. The skeletal remains of these individuals exhibits Jomon-like characteristics rather than immigrant types but other archaeological materials clearly support their association with the Yayoi culture, and genetic analysis has differentiated them from Jomon-associated individuals (Shinoda et al., 2019) Additional information about these Japanese samples is included in Table 1.1. Figure 1.3. A topographic map showing locations of published ancient samples included in Chapter 1 analysis. Ancient data are represented by symbols designated by geographic or cultural context. Ancient sampled from Japan are marked by triangles. Further information about processed ancient individuals can be found in Appendix Table 1.2. 19 Methods Processing of ancient data A detailed description of the drilling and extraction procedure followed in the generation of ancient sequence data is given in Supplementary Note 1. A summary of the bioinformatic pipeline used to process, align, and call variations from the raw ancient sequence data, along with certain validation steps, is visualised in Fig. 1.4. Adapters were removed from raw FASTQ sequence data using either one of two methods based on whether they were single or paired end:  adapters were trimmed from single end data using cutadapt version 1.9.1 (Martin, 2011), allowing for a minimum overlap of 1 bp and removing reads that were shorter than 34 bp in length.  adapters were trimmed from paired end using AdapterRemoval v2 (Schubert et al., 2016), allowing for a minimum overlap of 1, removing “N”s and low quality bases in the sequence using the parameters–trimns and –trimqualities, filtering reads for minimum of 25 read length and 25 base quality score, and collapsing the paired end reads using –collapse. Only the outputs for “.collapsed”, “pair1.truncated” and “pair2.truncated” were included in subsequent processing. Trimmed reads were aligned to the hg19 / GRCh37 reference genome with the revised Cambridge Reference Sequence (rCRS) for mitochondrial DNA using bwa-aln in BWA version 0.7.5 (Li and Durbin, 2010) with relaxed parameters “-l 16500 -n 0.01 -o 2”. Aligned reads (i.e., BAM files) were filtered for a mapping quality of 20 and PCR duplicates were removed using samtools version 1.7 (Li et al., 2009a). Read group identifiers were added to each library using picard-tools version 1.101 (available at http://broadinstitute.github.io/picard/). The authenticity of ancient DNA was determined in non-UDG treated libraries screened using a MiSeq Illumina platform by looking at degradation patterns at 5’- and 3’-end respectively using MapDamage2.0 (Jonsson et al., 2013). At this point an estimate was made of the endogenous human DNA contents of each library by calculating the percentage of the remaining reads 20 compared to the number of trimmed reads; samples showing >~20% were selected to be sequenced to a higher coverage. All libraries generated from the same individual were merged into a single BAM file using samtools version 1.7 (Li et al., 2009a) after the addition of read group identifiers. Indel realignment was performed on each individual using GATK version 3.7-0 (McKenna et al., 2010). As the end of strands of ancient DNA are particularly prone to damage, the quality of two bases at the end of each read were manually reduced to a phred score of 2 (known as “softclipping”) to avoid calling genotypes from these regions. All ancient samples were processed, aligned, and had variants called using the same computational pipeline. Removing adapters (cutadapt for single end, AdapterRemoval for paired) Alignment to hg19 with rCRS mitochondria sequence (bwa-aln) Mapping Quality 20 (samtools) Remove PCR duplicates (samtools) Add read groups (picard-tools – AddOrReplace ReadGroups) Assessing degradation patterns in non- UDG treated libraries (MapDamage) Calculation of endogenous human DNA Merge all libraries from one individual together (samtools) Indel Realignment (GATK – RealignerTarget Creator and Softclipping (manually reducing the quality score each read) Pseudohaploid SNP calling of all samples (GATK - Pileup) Diploid SNP calling of high coverage ancients (GATK - HaplotypeCaller) Figure 1.4. Summary of bioinformatic pipeline. 21 Incorporation of published ancient data into analysis Where possible, raw FASTQ files were downloaded from an online public data repository (e.g., the European Nucleotide Archive); in cases where FASTQ files were not available, BAM files were instead downloaded. To allow a direct comparison of published with newly generated data, all files were put through the pipeline as newly sequenced individuals (Fig. 1.4). Quality assessment of downloaded data was conducted with FastQC version 11.4 (Andrews, 2010). If detected, adapters were removed using AdapterRemoval version 2.2 (Schubert et al., 2016). BAM files were re- aligned to the same reference genome using bwa-aln in BWA version 0.7.5 (Li and Durbin, 2010) using the same relaxed parameters. Throughout the analysis ancient individuals were grouped and primarily labelled by either geographic or cultural contexts. (Published data is visualised in Fig. 1.3 and further information for each sample is included in Appendix Table 1.2.) Contamination estimates and determining mtDNA haplogroup To assign mtDNA haplogroups to newly sequenced individuals and estimate the rate of mitochondrial contamination, trimmed fastqs files of newly sequenced individuals were additionally aligned to the rCRS reference mitochondrial genome. Aligned mitochondrial sequences for each individual were merged and processed using the same pipeline as the whole genome alignment. Consensus mitochondrial fasta sequences were determined for each individual based on a minimum depth of coverage of 5 and a base quality of 30 using mpileup and bcftools within the samtools package (Li et al., 2009a). Haplotypes based on these consensus sequences were determined using HAPLOFIND (Vianello et al., 2013). Mitochondrial contamination rates were estimated using an approach previously outlined in Cassidy et al. (2020): the rates of secondary (i.e., non-consensus) bases observed at the haplotype-defining and private mutations defined by HAPLOFIND were calculated. The adjusted rate calculated after removing SNPs genotyped as C or G (as these sites are more likely to be affected by post-mortem damage) is also reported. Molecular sex determination 22 The molecular sex of each ancient sample was determined using the method outlined in Skoglund et al. (2013); in summary, the ratio of reads aligning to the Y chromosome relative to the total number aligning to either sex chromosome (Ry) was calculated and assigned as a female if Ry < 0.016 or male if Ry > 0.075. This ratio was calculated using reads filtered for a higher minimum mapping quality filter of 30. Relatedness Kinship for all ancient individuals (newly generated and published) was estimated using “Relationship Estimation from Ancient DNA” (READ) (Monroy Kuhn et al., 2018). In cases where a first-degree relationship was identified, the lower coverage related individual was removed from subsequent analysis. (Individuals removed as first-degree relatives in published data are listed in Appendix Table 1.2). Y chromosome haplogroup analysis All newly sequenced ancient individuals whose molecular sex was determined to be male were piled up to the list of SNPs listed in the ISOGG SNP using GATK version 3.7-0 (McKenna et al., 2010). Only SNP sites with a minimum base quality of 30 were considered. Genotype calling and merging to published datasets To analyse ancient data in the context of present-day genetic variation, each ancient individual was genotyped for the bi-allelic SNP sites present on two different reference panels, which they were subsequently merged to: 1. the Human Origin Array (HOA) consisting of 594,896 SNP sites for 1,963 modern, ancient and reference genomes (Lazaridis et al., 2014) 2. the Simons Genome Diversity Panel (SGDP) consisting of 278 modern, ancient and reference genomes (Mallick et al., 2016) that has been filtered for autosomal transversions-only SNPs with a minor allele frequency of 1%, which 23 left 3,867,366 SNP sites in the merged data. Transversion SNP sites can be more accurately called in data as ancient samples are more susceptible to transition mutations over time. Due to the limited nature of ancient DNA, with coverages often being below 1x, many analyses were restricted to “pseudohaploid” data, in which potential heterozygous information is disregarded, and only a single allele is sampled per site. At each SNP site a high-quality single base (bq30) was randomly called per position using GATK version 3.7-0 (McKenna et al., 2010) Sites with >2 differing calls were not included. Datasets were merged using PLINK v1.90b4.4 (Purcell et al., 2007). Throughout this study, the modern Japanese population is represented by either data from the “Japanese” populations found in the HOA or SGDP, or by the “JPT” (i.e., Japanese in Tokyo) in the 1000 Genome Project Phase 3 (1000 Genomes Project, 2015). However, it is important to recognise that ancestral heterogeneity exists across the Japanese archipelago and that this diversity is not fully captured by these standard reference sets, and to keep this in mind when interpreting results. f-statistics f-statistics measure correlations in allele frequencies in sets of two (f2), three (f3) or four (f4) defined populations (Patterson et al., 2012). f3 statistics were calculated using qp3Pop (v300) and f4 using qpDstat (v662) with “f4 mode: YES”; both of these tools are found in the AdmixTools v6.0 package (Patterson et al., 2012). In both cases the population “Mbuti” was used as an outgroup. Standard errors in f4-statistics were calculated from the default block jackknife approach. Heatmaps for f3 values were created using the heatmap.2 package in R. Principal Component Analysis Principal component analysis (PCA) was conducted to see the positioning of ancient samples within present-day population genetic variation, using smartpca (v16000) (Patterson et al., 2006) with specified parameters of “killr2: YES”, “r2thresh: 0.2”, “numoutlieriter: 0”, “lsqproject: YES” and “autoshrink: YES”. Only a subset of the 24 larger dataset was included in PCA in order to just focus on Asian populations and ancient individuals. Present day populations (n=112) were chosen based on geography:  East Asia: Ami, Atayal, Dai, Daur, Han, Japanese, Korean, Lahu, Miao, Mongola, Oroqen, She, Tu, Tujia, Uygur, Xibo  Southeast Asia: Burmese, Cambodian, Dusun, Kinh, Thai, Yi  Central and South Asia: Balochi, Bengali, Brahmin, Brahui, Burusho, Hazara, Irula, Kalash, Kapu, Kusunda, Kyrgyz, Madiga, Makrani, Mala, Naxi, Pathan, Punjabi, Relli, Sindhi, Tajik, Yadava  Siberia: Aleut, Altaian, Chukchi, Eskimo (Chaplin), Eskimo (Naukan), Eskimo (Sireniki), Even, Hezhen, Itelman, Mansi, Russian, Tubalar, Ulchi These populations, along with all ancient individuals, were extracted from the larger dataset using PLINK v1.90b4.4 (Purcell et al., 2007). The subset of the data was converted from plink to EIGENSTRAT format using convert, part of the EIGENSOFT package (Price et al., 2006). Principal components were determined using the variation found within present-day individuals only with ancients projected onto this variation. Plotting was done in R using the output “smartpcaevec”. With the exception of “Yayoi_1”, only ancient individuals with at least 100,000 SNP sites were included in visualisation. The number of SNPs per individual was determined using plink -- missing. The percentage variation explained by each principal component was calculated using the output in the “smartpcaeval” file. ADMIXTURE The programme ADMIXTURE v.1.3.0 (Alexander et al., 2009) was used for unsupervised genetic clustering of present-day and ancient individuals in order to identify different ancestral components within the dataset. This analysis was conducted using a subset of the Human Origin Array panel, which included:  present-day populations from the South Asian, Southeast Asian, East Asian, Siberia, Oceania, and the Americans, with African and European ancestry being represented by Mbuti and Sardinia respectively (n = 786), and  ancient East Eurasian samples with at least 100,000 SNPs in the SGDP dataset (n = 187). 25 SNPs were pruned for sites in linkage disequilibrium using PLINK v1.90b4.4 (Purcell et al., 2007), with a sliding window size of 50 variants, a step size of 5 variants, and an r2 threshold of 0.2 (--indep-pairwise 50 5 0.2), leaving 189,487 SNPs for analysis. 10 replicates with random seed were run for each number of clusters (K) from 2 to 12, and the run with the lowest cross validation was plotted. Phylogenetic analysis Maximum-likelihood trees were inferred under various admixture models using TreeMix (v. 1.13) (Pickrell and Pritchard, 2012). A subset of ancient and present-day populations in the SGDP dataset was chosen to represent different ancestors across East Eurasia: Ust_Ishim, Yana_UP, MA1, Tianyuan, Salkhit, Papuan, Hoabinhian (La368 and Ma911), Kusunda, Jomon, Chokhopani, Shamanka_EN, Lokomotiv_EN, DevilsCave_N, USR1, Han, Ami, and Japanese. The dataset was filtered for non- missingness, which left 98,687 SNPs for analysis. Mbuti was used as an outgroup to root the tree and fitted models with no-migration or migrations from 1 to 5 to the data. A total of 1,000-1,500 replicates were run for each model, of which the tree with a likelihood the closest to a mean across the replicates was considered the most representative under a given model. Diploid calling and determining runs of homozygosity (ROH) Diploid genotype information was generated from the highest coverage individual generated by this study, the 8.8kya “JpKa6904” individual, and a number of other published high-coverage ancient genomes, including Yana1 (Sikora et al., 2019), USR1 (Moreno-Mayar et al., 2018a), Loschbour, Stuttgart_LBK (Lazaridis et al., 2014), Mesolithic Irish (SRA62), and Neolithic Irish (JP14) (Cassidy et al., 2020). (Further information about these samples is included in Appendix Table 1.3). Diploid genotypes were called using HaplotypeCaller, part of GATK version 3.7-0 (McKenna et al., 2010) based on the Phase 3 v5 1000 Genomes release panel (1000 Genomes Project, 2015) filtered for transversion sites and a minor allele frequency of 1%. These calls were further filtered for genotype quality 30 and minimum depth of 10 using vcftools v0.1.13 (Danecek et al., 2011). Transversion SNPs common to each 26 diploid ancient sample (total 660,027 SNPs) were included in runs of homozygosity (ROH) analysis. ROH was measured in each diploid genome using PLINK v1.90b4.4 (Purcell et al., 2007) with the following options: -homozyg --homozyg-density 50 --homozyg-gap 100 --homozyg-kb 500 --homozyg-snp 50 -- homozyg-window-het 1 --homozyg-window-snp 50 --homozyg-window- threshold 0.05 Demographic modelling with ROH Approximate Bayesian Computation (ABC) is a statistical framework in which the fit between observed data and that generated through simulation is quantified. This approach has been widely used in inferring human demography, as reviewed in Cooke and Nakagome (2018). ABC-based methods capture characteristics of genetic data though summary statistics. Here, the likelihoods of different potential combinations of population size (N) and split time (T) of the Jomon lineage were approximated by fitting genome-wide patterns of runs-of-homozygosity (ROH) generated with simulation data to those observed in the oldest individual in the Jomon dataset, “JpKa6904”. The ABC-based analysis presented here and published in Cooke et al. (2021) was conceived by Dr. Shigeki Nakagome and conducted in collaboration with him. The filtered diploid JpKa6904’s genome left 984,740 autosomal SNPs for ROH analysis (see “Diploid calling and determining runs of homozygosity (ROH)” for diploid processing and ROH parameters). The ROH profile was summarized into a spectrum of ROH fragments ranging from 0.5 to 100 Mb with a bin size of 1 Mb. 100 whole-genome simulations were generated using a coalescent simulator ms (Hudson, 2002) under a fixed scenario of the Out-of-Africa dispersal reconstructed from a previous study (Gutenkunst et al., 2009) with slight modifications (as visualized in Fig. 1.5). This simulation assumed 25 years per generation and 1.25×10-8 per generation (Scally and Durbin, 2012) and 0.625×10-8 per generation for a mutation rate and a recombination rate respectively. A broad search for the parameter space was first performed that is defined with the population size (N) from 500 to 2,500 and population split from 10 to 40kya (T) using the data from chromosomes 3 27 to 22 that includes likely scenarios for further testing with all autosomal chromosomes. Figure 1.5. Simulation scheme for modified “Out-of-Africa” dispersal with adjusted parameters for divergence time (T) and population size for Jomon. Simulation data was generated using ms (Hudson, 2002) to reflect the possible scenarios for when Jomon diverging from the rest of Asia and how big was their population at that time. Likelihoods of observing the ROH spectrum similar to that observed in JpKa6904 were estimated using the method developed in Osada et al. (2013) that measures the similarity of summary data (i.e. ROH spectrum) between the observed and simulated data based on a kernel density estimate and that calculates an approximated marginal likelihood (aML) of a given model. A normalized Gaussian kernel function with a bandwidth of 1 was used to compute kernel density estimates of aMLs. These estimates were then compared to identify the best-fitting model as an approximate Bayes factor (aBF); the aBFs were calculated between a model with the highest aML and any other models. 28 Admixture modelling Admixture events were modelled using qpWave v600 and qpAdm v1000 in the AdmixTools v6. 0 package (Patterson et al., 2012, Haak et al., 2015) using the merged merged datasets of ancient East Eurasians and the SGDP populations. This analysis used transversion sites with global minor allele frequencies >1% and used the parameter option of “allsnps: YES”. “Right” populations included as outgroups in the analysis consisted of 8 Eurasian populations: Sardinian (n = 3), Kusunda (n = 2), Papuan (n = 14), Dai (n = 4), Ami (n = 2), Naxi (n = 3) (26), Tianyuan (n = 1) (Yang et al., 2017), Chokhopani (n = 1) (Jeong et al., 2016), and Mal’ta (n = 1) (Raghavan et al., 2014), and this was conducted both with and without an additional subset of Jomon individuals (JpKa6904, JpOd181, and IK002). Populations chosen to be investigated as potential sources of ancestry based on f4 statistics and were only considered as sources of an admixture event only if the admixture between the sources is supported from modelling both with and without a subset of Jomon in the right populations. Nested p-values comparing the likelihoods of two models were calculated in R (i.e., does Model A fit better than Model B) using the formula: 1 − 𝑝𝑐ℎ𝑖𝑠𝑞(χெ௢ௗ௘௟_஺ ଶ − χெ௢ௗ௘௟_஻ ଶ , 𝑑𝑜𝑓 = 𝑑𝑜𝑓ெ௢ௗ௘௟_஺ − 𝑑𝑜𝑓ெ௢ௗ௘௟_஻, in which “dof” refers to its degree of freedom. Dating admixture events by DATES DATES v753 (Narasimhan et al., 2019) was used in this analysis to estimate the time of two different admixture events in the Kofun individuals:  Admixture between Jomon and Northeast Asian ancestry – in this model, Northeast Asian ancestry was represented by West Liao River individuals “WLR_BA_o” and “HMMH_MN”  Admixture between Jomon and East Asian ancestry – in this model East Asian ancestry was represented by Han Chinese in Beijing (CHB) in the 1000 Genome Phase 3 panel (1000 Genomes Project, 2015) 29 The estimated date in generation was converted into years with the assumption of 25 years per generation, which was further added into a mean age of median dates from three Kofun individuals (i.e., 1,348 years before present). The parameters used were: binsize: 0.001, maxdis: 1.0, runmode: 1, mincount: 1, and lovalfit: 0.45. The standard error was estimated from a weighted block jackknife method. Testing genetic continuity A strict test for genetic continuity was conducted between individuals associated with the Kofun period and present-day Japanese using the method outlined by Schraiber (2018). 30 Results Newly sequenced ancient genomes from pre- and protohistoric Japan A sufficiently high level of endogenous human DNA was found to be preserved in 12 of the total 14 sampled ancient specimens to warrant further shotgun sequencing to higher coverage. Non-UDG treated libraries showed post-mortem damage patterns (Appendix Fig. 1.1) consistent with that expected in ancient DNA. Whole genome coverages obtained from these 12 samples ranged from 0.88× to 7.51×. (Fig. 1.6 and Appendix Table 1.1) and the estimated level of modern human contamination was determined to be suitably low so as not to influence downstream analysis (<2.15%) (Appendix Table 1.4). Of the 12 individuals, 8 were determined to be female (6 Jomon and 2 Kofun) and 4 male (3 Jomon and 1 Kofun) (Appendix Table 1.5). Kinship analysis confirmed that none of the individuals were first- or second-degree relatives (Appendix Fig. 1.2). Analysis of uniparental markers showed a clear distinction into Jomon or Kofun samples (Table 1.1). Mitochondrial haplogroups for all Jomon individuals belong to either the N9b or M7a clades, which have been firmly associated with this population in previous studies (Kanzawa-Kiriyama et al., 2017, McColl et al., 2018, Gakuhari et al., 2020, Kanzawa-Kiriyama et al., 2019, Adachi et al., 2011); these haplotypes are rare outside of Japan in present-day populations (Tanaka et al., 2004). The three Jomon males belong to the Y chromosome haplogroup D1b1, which is present in modern Japanese populations but almost absent in other East Asians (Hammer et al., 2006). In contrast, the Kofun individuals all belong to mitochondrial haplogroups that are common in present-day East Asians (Zheng et al., 2011), while the single Kofun male has the O3a2c Y chromosome haplogroup, which is also found throughout East Asia, particularly in mainland China (Wang and Li, 2013). 31 (a) (b) Figure 1.6. Sampling locations, dates, and genome coverage of ancient Japanese individuals. (a) The individual locations for all ancient Japanese data are shown across the Japanese archipelago. Sites are marked with circles for individual genomes that were newly sequenced from this project and triangles if previously reported (Table 1.1 and Appendix Table 1.1). The colours represent three different periods of Japanese pre-(Jomon and Yayoi) and protohistory (Kofun). (b) Each sequenced individual is plotted with whole-genome coverage on the x-axis and median age (years before present) on the y-axis. The nine Jomon individuals are further split into five different sub-periods based on their ages (see Supplementary Note 1): Initial (JpKa6904), Early (JpOd274, JpOd6, JpOd282, JpOd181, and JpFu1), Middle (JpKo2), Late (JpKo13, JpHi01, F23, and F5), and Final (IK002). 32 Table 1.1. Summary of the time series of ancient Japanese data. A total of 17 ancient Japanese individuals were included in this analysis. Associated culture Sample ID Date range and median (cal BP) Coverage mtDNA contamination rate (%) Molecular sex mtDNA haplogroup Y chr haplogroup Ref. Newly sequenced in this study Jomon JpKa6904 8,646-8,991; 8,819 7.51 1.46 XX N9b3 - - JpOd274 6,119-6,289; 6,204 1.56 1.13 XY M7a D1b1d1 - JpOd6 5,934-6,179; 6,057 1.18 1.55 XX N9b3 - - JpOd181 5,751-5,917; 5,834 1.83 0.91 XY N9b1 D1b1d1 - JpOd282 5,737-5,902; 5,820 0.96 1.38 XY M7a1 D1b1d1 - JpFu1 5,478-5,590; 5,534 1.13 2.15 XX M7a1 - - JpKo2 4,294-4,514; 4,404 2.47 1.44 XX N9b - - JpKo13 3,847-3,978; 3,913 1.81 1.50 XX N9b1 - - JpHi01 3,685-3,850; 3,768 0.88 1.45 XX M7a1a - - Kofun JpIw32 1,347-1,409; 1,378 4.80 0.41 XY B5a2a1b O3a2c - JpIw31 1,303-1,377; 1,340 1.44 0.63 XX D5c1a - - JpIw33 1,295-1,355; 1,325 1.54 0.75 XX M7b1a1a1 - - Previously published Jomon F23 3,550-3,960; 3,755 34.82 1.20 XX N9b1 - Kanzawa- Kiriyama et al. (2019) F5 - 3.74 2.45 XY N9b1 D1b2b IK002 2,418– 2,720; 2,569 1.85 0.50 XX N9b1 - McColl et al. (2018) Yayoi Yayoi_1 - 0.01 2.92 XX M7a1a4 - Shinoda et al. (2019) Yayoi_2 1,931-2,001; 1,966 0.07 2.33 XY D4a1 O 33 Genetic distinction between different cultural periods Several broad analyses were conducted to get an initial idea of the genetic profile of the newly sequenced individuals and their affinities to other samples both within Japan and in the wider global dataset. The inter-individual variation within ancient and present-day Japanese was first explored by constructing a heatmap of the shared genetic drift between every combination of all individuals within the constructed SGDP dataset as measured using the statistic f3(Individual_1, Individual_2; Mbuti) (Fig. 1.7a). This analysis is able to clearly define three distinct clusters within the Japanese samples: Jomon, Yayoi, and a group containing Kofun individuals and present-day Japanese individuals. This separation indicates that each archaeological period has a distinct genomic profile with an implication of potential genetic continuity within Japan over the past 1,500 years. In spite of the large spatial and temporal variation in the Jomon dataset, notably high levels of shared drift are observed across all twelve individuals. The two Yayoi individuals also show a high affinity to each other and appear to be genetically closer to the Jomon than to the Kofun individuals. Consistent with the previous kinship analysis results (Appendix Fig. 1.2), none of the values between any pair of individuals is high enough to suggest any first- or second-degree relatives within the dataset. To further contextualise the level of shared drift within each of the different Japanese groups, the same pairwise outgroup f3 was calculated within a selected set of ancient and present-day populations (Fig. 1.7b). These results highlight the lack of genetic variation within the Jomon population, whose mean pairwise f3 value was higher than all other included ancient and present-day populations with the exception of the present-day Native American Surui and Karaitiana, who have been previously reported as being among the most homozygous modern human populations reflecting a small population size (Ceballos et al., 2018). The two individuals associated with the Yayoi period also share a high degree of drift with each other; this value may be influenced by their previously reported Jomon ancestry (Shinoda et al., 2019) or by the fact that two individuals were sampled from the same site. In contrast, the three Kofun individuals, who were also sampled from a single site, exhibit a level of diversity that is comparable to other ancient and present-day East Asian populations. 34 (a) (b) Figure 1.7. Pairwise outgroup-f3 statistics analysis. (a) A heatmap of pairwise outgroup f3- statistics comparisons between all ancient and modern Japanese individuals. Jomon samples are marked in red, Yayoi in gold, Kofun in blue, and present-day Japanese in green. (b) Each boxplot showing the values of pairwise comparisons of all individuals in a number of ancient and present-day populations. The average pairwise value for Jomon is highlighted as a vertical line. 35 The genome-wide autosomal affinities between ancient Japanese individuals and other continental Asian populations were established using principal component analysis (PCA). Ancient individuals were projected onto the genetic variation of present-day populations in the SGDP dataset from South and Central Asia, Southeast and East Asia, and Siberia (Fig. 1.8; Appendix Fig. 1.3). Once again, ancient Japanese individuals separated into their respective cultural designations along PC1. All Jomon individuals form a tight cluster, appearing distinct from other ancient populations, as well as present-day Southeast and East Asians, suggesting a unique genetic profile among the other populations. The two Yayoi individuals appear next to this Jomon cluster, consistent with their previously reported morphological similarity and genetic ancestry (Shinoda et al., 2019). A clear shift towards East Asian populations implies the presence of additional continental ancestry in Yayoi. PC2 shows a geographic cline of ancient East Eurasians from the south to the north ancient East Eurasian populations (Southern China, Yellow River, Northern China, West Liao River, Devil’s Gate Cave, Amur River, and Baikal). The three individuals from the Kofun period fall within the diversity of the Yellow River cluster. 36 Figure 1.8. Principal component analysis (PCA) of present-day East Eurasia with projected ancient individuals. Principal component analysis (PCA) visualising ancient Japanese individuals (i.e., Jomon, Yayoi, and Kofun) and continental ancient individuals (presented as coloured symbols consistent with the topographic map above) projected onto 112 present-day East Eurasians (represented by gray circles with Japanese highlighted in dark green). 37 An ADMIXTURE analysis using the “Human Origins Array” (HOA) data panel was conducted to identify ancestral components in each individual with the specified number of ancestral components (denoted as K) ranging from 2 to 12 (highlighted results in Fig. 1.9; full results in Appendix Fig. 4-5). In this analysis, a distinguishable Jomon component is observed from K=9 onwards (as represented by the colour red), which again highlights its unique profile within Asia. This component is also visible at high levels in “Yayoi_2” (“Yayoi_1” was excluded from this analysis for not meeting the minimum SNP threshold, 100,000 SNP sites) and at a reduced level in Kofun and Japanese. This Yayoi individual also features a number of other ancestral components not seen in Jomon; these proportions appear similar to the profile seen in the Amur River Basin and surrounding areas including a larger component that is dominant in Northeast Asians (represented by light blue) and another smaller component that represents much broader East Asian ancestry (represented by yellow). This East Asian component becomes much more dominant in the Kofun period and the modern Japanese population. These three initial analytical approaches create a consistent narrative about genomic changes within Japan over the past ~9,000 years. The Jomon appear to be a distinctive population within ancient and present-day Asia and seem to exhibit a low diversity within all twelve individuals in spite of representing a wide temporal and geographic range. While the individuals from the Yayoi period show a high affinity for the Jomon population, there are still clear differences in its relationship with continental populations and the composition of its ancestral profiles. The individuals from the Kofun period appear to be strongly distinct from those from the preceding cultures in all analyses, showing a higher affinity to present-day East Asians and lower shared genetic drift; they do, however, appear to retain a lingering Jomon ancestry. Overall, these results provide strong evidence that the major cultural changes in Japan were accompanied by changes in the region’s genomic profile. The next few sections will look at each of the Jomon, Yayoi and Kofun populations individually to further characterise their distinctive genetic identity. 38 Figure 1.9. Highlighted results from ADMIXTURE analysis. A subset of the results of the complete ADMIXTURE are presented here, showing the ancestral breakdown for Japanese and Ancient East Eurasian individuals (representing K = 11; complete pictures for all present-day and ancient individuals from K = 2 to K = 12 are presented in Appendix Fig. 1.5). The middle and bottom rows show selected Ancient East Eurasian populations from each geographical regions: Southern China (from left to right: Liangdao1, Liangdao2, Xitoucun), Yellow River (YR) (Middle Neolithic, YR_MN; Late Neolithic, YR_LN; Late Neolithic, Upper_YR_LN; Late Bronze Age/Iron Age, YR_LBIA; Iron Age, Upper_YR_IA), Northern China (Yumin, Bianbian, Boshan, Xiaogao), West Liao River (WLR) (Middle Neolithic, WLR_MN; Late Neolithic, WLR_LN; Bronze Age, WLR_BA; Middle Neolithic individual from Haminmangha, HMMH_MN; Bronze Age individual with a different genetic background from WLR_BA, WLR_BA_o), Amur River (AR) (Early Neolithic, AR_EN; Iron Age, AR_IA and AR_Xianbei_IA), Baikal (Early Neolithic, Shamanka_EN and Lokomotiv_EN), and Central Steppe (Botai, CentralSteppe_EMBA, Okunevo_EMBA). This set of the results shows a distinct Jomon ancestral component (represented by red), a component common in ancient samples from the Baikal region and the Amur River basin (represented by light blue), a broad East Asian component (represented by yellow) and a gray component that is most dominant in the Central Steppe. 39 Deep divergence of the Jomon lineage by geographic isolation The separation of the Jomon from other populations in the previous analyses supports the idea that they form a distinct lineage among East Eurasians, as proposed in previous studies (Gakuhari et al., 2020, Kanzawa-Kiriyama et al., 2019). The depth of this divergence was further explored here through reconstructing the phylogenetic tree of the twelve Jomon grouped as a single population with 17 other ancient and present- day populations using TreeMix (Pickrell and Pritchard, 2012) with differing numbers of admixture events (Fig. 1.10). These results infer that the Jomon lineage emerged after the early divergences of Upper Paleolithic East Eurasians (Tianyuan and Salkhit) and ancient Southeast Asian hunter-gatherers (Hoabinhnian), but prior to the splitting off of other samples including present-day East Asians, an ancient Nepali (Chokhopani), hunter-gatherers from Baikal (Shamanka_EN and Lokomotiv_EN) and Chertovy Vorota Cave (Devil’s Gate Cave) in the Primorye Region, and a Pleistocene Alaskan (USR1). An interesting additional observation from this analysis is that phylogenetic results consistently infer gene flow from Jomon to the modern Japanese across all migration models tested, with genetic contributions ranging from 8.9% to 11.5% (Appendix Fig. 1.6). This is consistent with the mean Jomon component of 9.31% in the present-day Japanese individuals estimated in the ADMIXTURE analysis (Fig. 1.9), supporting a lingering genetic contribution to present-day Japanese populations. Figure 1.10. Phylogenetic analysis of the Jomon lineage. Maximum-likelihood phylogenetic tree reconstructed by TreeMix under a model of two migrations. The tree shows a relationship among ancient (bold) and present-day (italic) populations. Coloured arrows represent the migration pathways and signals of admixture among all datasets. The migration weight represents the fraction of ancestry derived from the migration edge. All other migration models from m = 0 to m = 5 are presented in Appendix Fig. 1.8. 40 To further interrogate the phylogenetic position inferred by TreeMix for Jomon, and to get a more comprehensive picture of its genetic affinities with other ancient and present-day Asian populations, a series of formal tests were conducted based on f4 statistics (Patterson et al., 2012). These tests are used to assess symmetry in allele sharing between all tested populations with the Jomon across different specified background populations, and are identified as being significantly closer to the Jomon (Z>3), significantly closer to the background population (Z<-3), or symmetrically related to both if there no significant difference in allele sharing (-3>Z>3). A wide- spanning analysis of the SGDP panel in the form of f4(Mbuti, X; Y, Jomon) - in which X refers to ancient and present-day populations in the SGDP panel and Y refers to a series of chosen background populations - show a clear difference in the overall pattern of genomic affinities based on the choice of background populations (Fig. 1.11a-e). Every East Asian population since the beginning of the Jomon period has a significantly higher affinity to Jomon compared to Hoabinhnian, but not when compared to Devil’s Gate Cave. This pattern suggests that the divergence of a population ancestral to Jomon occurred after that of the Southeast Asian Hoabinhnian and before the population from Devil’s Cave, consistent with the tree inferred by TreeMix. A previously proposed model in which Jomon were a mixture of Hoabinhnian and East Asian-related lineages (McColl et al., 2018, Wang et al., 2021a), was not supported by direct testing with (f3(Jomon; Hoabinhnian, DevilsCave_N) = 0.193, Z = 61.355), or through two-way modelling using qpWave. (Appendix Table 1.6). These results support a phylogeny in which three distinct hunter-gatherer lineages emerged in East Eurasia. An unexpected result from f4(Mbuti, X; Y, Jomon) is that the 31.6kya Upper Paleolithic “Yana_UP” pair of individuals (Sikora et al., 2019) are the only population that are significantly closer to Jomon than Han, Dai, or Japanese respectively (Z > 3.366). (Fig. 1.11c-e). This affinity is still detectable even if when the reference populations is replaced with the other Southeast and East Asians (Appendix Table 1.7), supporting potential genetic exchange between the ancestors of Jomon and Ancient North Siberians, a lineage whose ancestry was widespread in North Eurasia prior to the LGM (Sikora et al., 2019, Massilani et al., 2020). A genetic affinity to the ~3.3kya Namazaga_CA population from South Turkmenistan (de Barros Damgaard et al., 2018) is also observable when Han or Dai are used as a background population. This 41 result is unexpected, given the geographic distance between this population and the Japanese archipelago and the relatively recent age of the population. It is possible that this detected affinity may be unduly influenced by technical biases in how this test was conducted, and that the relationship between the two ancient populations of pseudohaploid individuals (Jomon and Namazaga_CA) was overestimated when compared to modern diploid reference populations, rather than reflecting a true relationship between these populations. (a) f4(Mbuti, X; Hoabinhian, Jomon) (b) f4(Mbuti, X; DevilsCave_N, Jomon) (c) f4(Mbuti, X; Japanese, Jomon) 42 (d) f4(Mbuti, X; Han, Jomon) (e) f4(Mbuti, X; Dai, Jomon) Figure 1.11. Geographic and tem