Ancestral Legacies of an Insular Archipelago 

A genomic exploration of prehistoric  

Japan and mainland Asia 

 
Niall Cooke 

 
A thesis submitted for the degree of 

 Doctor of Philosophy 

 
School of Medicine, 

Trinity College Dublin 

 
March 2022


i 

 
Declaration 
 

I declare that this thesis has not been submitted as an exercise for a degree at this or 

any other university and it is entirely my own work.  

I agree to deposit this thesis in the University’s open access institutional repository or 

allow the Library to do so on my behalf, subject to Irish Copyright Legislation and 

Trinity College Library conditions of use and acknowledgement.  

I consent to the examiner retaining a copy of the thesis beyond the examining period, 

should they so wish (EU GDPR May 2018). 

 
 _____________________________  

Niall Pádraig Cooke         

March 2022  


ii 

 
Abbreviations 
 

ABC  - Approximate Bayesian Computation  

BAM  - Binary Alignment Map 

aBF  - approximate Bayes factor  

ADH  - alcohol dehydrogenase 

aDNA  - ancient DNA 

ALDH  - aldehyde dehydrogenase 

aML  - approximated marginal likelihood  

ANE  - Ancient North Eurasian 

ANS  - Ancient North Siberian 

AP  - Ancient Paleosiberian 

AR  - Amur River 

BP  - before present 

bwa  - Burrows-Wheeler Aligner 

CDX  - Chinese Dai in Xishuangbanna  

CEU  - Northern Europeans from Utah 

CHB  - Han Chinese in Beijing, China 

CHS  - Han Chinese South  

EBA  - Early Bronze Age 

EDAR  - ectodysplasin A receptor 

EN  - Early Neolithic 

gatk  - Genome Analysis Toolkit 

Ga100k - GenomeAsia100K 

GP  - genotype probability 

HOA  - Human Origins Array 

IA  - Iron Age 

IBD  - identical-by-descent 

ISOGG  - International Society of Genetic Genealogy  

JPT  -  Japanese in Tokyo, Japan 

KHV  - Kinh in Ho Chi Minh City, Vietnam 

ky  - thousand years 

kya  - thousand years ago 

LBIA  - Late Bronze / Iron Age 

LGM  - Last Glacial Maximum 

LN  - Late Neolithic 


iii 

 
MED  - Medieval 

MN  - Middle Neolithic 

mtDNA - mitochondrial DNA 

NAT2  - N-acetylotransferase 2  

PCA  - Principal Component Analysis 

PCR  - polymerase chain reaction 

rCRS  - revised Cambridge Reference Sequence 

ROH  - runs-of-homozygosity 

SGDP  - Simons Genome Diversity Panel  

SNP  - Single Nucleotide Polymorphism 

UP  - Upper Paleolithic 

WLR  - West Liao River 

YR  - Yellow River 

YRI  - Yoruba in Ibadan, Nigeria 

 
Note – the IDs of many ancient sample referenced in this analysis consist of numbers and letters. The 
identity of and further information regarding these samples is contained in the main text and 
supplementary materials.  


iv 

 
List of Figures 
Figure 1.1. A timeline of pre-and protohistoric Japanese periods.  .......................................... 14 
Figure 1.2. Newly sequenced genomes from throughout the Japanese archipelago. ............ 17 
Figure 1.3. A topographic map showing locations of published ancient samples included in 
Chapter 1 analysis............................................................................................................................... 18 
Figure 1.4. Summary of bioinformatic pipeline.  ........................................................................ 20 
Figure 1.5. Simulation scheme for modified “Out-of-Africa” dispersal with adjusted 
parameters for divergence time (T) and population size for Jomon.  ........................................ 27 
Figure 1.6. Sampling locations, dates, and genome coverage of ancient Japanese 
individuals. .......................................................................................................................................... 31 
Figure 1.7. Pairwise outgroup-f3 statistics analysis. .............................................................. 34 
Figure 1.8. Principal component analysis (PCA) of present-day East Eurasia with projected 
ancient individuals. ............................................................................................................................ 36 
Figure 1.9. Highlighted results from ADMIXTURE analysis.  .................................................. 38 
Figure 1.10. Phylogenetic analysis of the Jomon lineage. ......................................................... 39 
Figure 1.11. Geographic and temporal display of  f4(Mbuti, X; Y, Jomon) results. ................ 42 
Figure 1.12. Runs of homozygosity (ROH) analysis of the 8.8kya Jomon individual 
“JpKa6904”. ........................................................................................................................................ 44 
Figure 1.13. Geographic and temporal display of differences in affinities between Jomon 
sub-periods with ancient and modern populations using f4 statistics. . ..................................... 46 
Figure 1.14. A comparison of outgroup-f3 results for the Jomon dataset grouped by the 
island of origin. ................................................................................................................................... 48 
Figure 1.15. Geographical map highlighting results from f4(Mbuti, X; Jomon, Yayoi).  ....... 50 
Figure 1.16. Genetic ancestry of the Yayoi period....................................................................... 51 
Figure 1.17. Correlation of shared genetic drift between the Yayoi and Jomon individuals 
with the geographic locations of Jomon sites. ............................................................................... 52 
Figure 1.18. Genetic changes in the Kofun period.  .................................................................... 53 
Figure 1.19. Genetic ancestry of Kofun individuals.  .................................................................. 54 
Figure 1.20. Dating admixture in the Kofun individuals using DATES. ................................. 56 
Figure 1.21. Geographic and temporal display of f4(Mbuti, X; Kofun, Japanese) results. .... 58 
Figure 1.22. Genetic ancestry of the Kofun and modern Japanese modelled as three-way 
admixture.  .......................................................................................................................................... 60 
Figure 1.23. Tripartite structure in southern Ryukyu Islands. ................................................. 62 
Figure 1.24. Genomic transitions in parallel with cultural transitions in pre- and 
protohistoric Japan. ........................................................................................................................... 68 
Figure 2.1. Assessing the performance of imputation using GLIMPSE. ................................. 85 
Figure 2.2. Overlapping heterozygous SNP sites produced by imputation. ........................... 86 
Figure 2.3. Haplotypic analysis of 116 imputed ancient Asian individuals.  ........................... 88 
Figure 2.4. Assessing geographic and temporal intra-variation within Jomon. .................... 93 
Figure 2.5. Adaptative allele frequencies in ancient Asian populations and modern-day 
references. . ......................................................................................................................................... 97 
Figure 2.6. Regional and temporal differences in EDAR heterozygosity in imputed Asian 
data.  ................................................................................................................................................... 100 
Figure 2.7. Tracing mutations in the alcohol metabolism pathway over time within Japan 
and China.  ........................................................................................................................................ 102 
Figure 3.1. A proposed model for the formation of southern Native Americans.  ............... 113 
Figure 3.2  Exploring ancestral sources in ancient and present-day South America. ......... 122 
Figure 3.3. Genetic ancestry of LagoaSanta modelled as two-way admixture.  ................... 123 


v 

 
Figure 3.4. Revised model of South American ancestry based on admixture modelling.  . 127 

 
List of Tables 

Table 1.1. Summary of the time series of ancient Japanese data. ............................................. 32 
Table 1.2. Comparing the tripartite admixture model to individual two-way models in 
southern Ryukyu Islands. ................................................................................................................. 63 
Table 3.1 An overview of certain ancient southern Native American individuals and 
populations. ...................................................................................................................................... 112 
Table 3.2. Testing of the fittings of models for the genetic ancestry of the ancient 
LagoaSanta. ....................................................................................................................................... 124 
Table 3.3. Testing of the fittings of models for the genetic ancestry of Karitiana and Surui.
 ............................................................................................................................................................ 126 
  

vi 

 
Research Outputs  
 

Research article: 

Ancient genomics reveals tripartite origins for Japanese populations. 
Niall P. Cooke*, Valeria Mattiangeli*, Lara M. Cassidy, Kenji Okazaki, Caroline A. 
Stokes, Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto 
Tomioka, Akihiko Matsumoto, Masafumi Itoh, Yoshitaka Kojima, Daniel G. Bradley, 
Takashi Gakuhari and Shigeki Nakagome 

Published in Science Advances (September 2021). (*Refers to equal contribution to 

this work.) 

 
Literature review: 

Fine-tuning of Approximate Bayesian Computation for human 

population genomics. 

Niall P. Cooke and Shigeki Nakagome 

Published in Current Opinion in Genetics and Development (July 2018). 

 
Conference posters and talks: 

A simulation study on the origin of natural selection in an admixed 

population. 

Niall P. Cooke, Daniel G. Bradley and Shigeki Nakagome 

Poster presentation at the Irish Society of Human Genetics Annual Scientific Meeting 

(ISHG) in Dublin, Ireland (September 2018), the 47th European Mathematical 

Genetics Meeting (EMGM) in Dublin, Ireland (April 2019) and at the Society for 

Molecular Biology and Evolution (SMBE) Annual Conference in Manchester, United 

Kingdom (July 2019). 

  
vii 

 
A time series of ancient Japanese genomes. 

Niall P. Cooke, Valeria Mattiangeli, Lara M. Cassidy, Kenji Okazaki, Caroline A. Stokes, 
Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, 
Akihiko Matsumoto, Masafumi Itoh, Yoshitaka Kojima, Daniel G. Bradley, Takashi 
Gakuhari and Shigeki Nakagome 

Poster presentation and lighting talk at Human Evolution- From Fossils to Ancient 

and Modern Genomes virtual conference (November 2021). 

 
Genomic insights into the insular hunter-gather-fisher Jomon of Japan. 

Niall P. Cooke, Valeria Mattiangeli, Lara M. Cassidy, Kenji Okazaki, Caroline A. Stokes, 
Shin Onbe, Satoshi Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, 
Akihiko Matsumoto, Masafumi Itoh, Yoshitaka Kojima, Daniel G. Bradley, Takashi 
Gakuhari and Shigeki Nakagome 

Talk scheduled to be given at the 13th Conference on Hunting and Gathering Societies 

in Dublin, Ireland (July 2022). 

 
Courses undertaken: 

 "Teaching and Supporting Learning as 

a Graduate Teaching Assistant", Trinity College Dublin (5 ECTs) 

 “Planning and Managing Your Research and Career”, Trinity College 

Dublin (5 ECTs) 


viii 

 
Acknowledgments  
The first person I would like to thank is my supervisor Shigeki Nakagome for giving 

this great opportunity, and for all of the guidance and advice over the past few years. 

He has also shown great patience and humour in every encounter, and his unrelenting 

dedication to this project has brought my work to a much higher standard. 

My thanks as well to my co-supervisor Dan Bradley, whose 3rd year literature review 

on Neanderthals all the way back in 2015 set me down a path of population and ancient 

genomics. His constant practical support made this project possible, and the 

enthusiastic and knowledgeable feedback throughout kept it on the right track. A great 

debt is also owed to Valeria Mattiangeli, whose skills in the ancient lab are world class, 

and who is responsible for the high quality of ancient Japanese data that made this 

project so unique. Lara Cassidy has also had an incredibly positive influence on this 

thesis, from her initially getting me set up on the server and explaining some of basic 

commands and analytical concepts that I would go on to use on a daily basis, to her 

tremendous contributions in shaping our paper and making it into a much more 

comprehensive piece of work.  

The entire extended Bradley-Cassidy-lab group have also been a fantastic and friendly 

resource in terms of knowledge, ideas, paper sharing and bioinformatic support. I  am 

grateful to have welcomed into their lab meetings, journal clubs and Christmas parties. 

My particular thanks to Bruno Ariano, whose continued communication on Slack has 

provided me with a lot of practical advice and reassurance, and who set a  high 

standard for IBD plots that forced me to up my game. I also want to thank Iseult 

Jackson and our student Caroline Stokes for the enjoyable and rewarding experience 

of co-supervising an undergraduate project. Madeleine Murray has also been a 

welcome addition to the Nakagome team, and her preliminary imputation work was 

of great value in getting that analysis up and running. 

I also greatly appreciate the work done by our many collaborators from Kanazawa 

University and throughout Japan in securing the precious ancient samples I was 

privileged to analyse - Takashi Gakuhari, Kenji Okazaki,  Shin Onbe, Satoshi 

Hatakeyama, Kenichi Machida, Kenji Kasai, Naoto Tomioka, Akihiko Matsumoto, 

Masafumi Itoh and Yoshitaka Kojima. I hope that I can meet at least some of you in 

person when I finally make it out to Japan soon. I also have a great appreciation for 


ix 

 
our collaborators in the GenomeAsia100K consortium, not just for sharing of 

resources, but also their continued communication and enthusiasm for my analysis.  

I am pretty sure that this is the first ancient genomics PhD project conducted within 

Trinity’s Department of Psychiatry, and or even in the School of Medicine – but 

nevertheless I have really enjoyed my time being based in the Trinity Translational 

Medicine Institute on the campus of St. James’s Hospital. I want to thank Marcus and 

Fíana, my neighbours in Office 1.17 (the Genomics office) for making it a fun place to 

work. The Neuropsychiatric Genetics Group and the Dept. of Psychiatry as a whole are 

a fantastic and inspiring bunch of people, and I always felt very welcome in spite of the 

fact that I was working on a completely different  type of project. I am grateful for the 

many friends I have made from around the building too, through the One O’Clock 

lunches and the TTMI Social Committee’s Payday PubDays. I also want to express my 

gratitude for the support and advice from the “Write Stuff” writing group (Emma, 

Gráinne, Stephen and Anna). 

Unfortunately, this PhD will forever be associated in my memory with certain global 

events that began in March 2020 which resulted in a long 18 month-slog of remote 

working in a small studio apartment. I don’t think this thesis could have been 

completed under these circumstances without help from my family and friends. I want 

to thank for my parents for their continued support throughout this PhD endeavour, 

being impressed when I try to explain what I am working on and for sporadically 

sending me money for takeaways. I want to thank my siblings as well, especially my 

sister Laura, as well as her husband Talha. Lockdown was made more bearable 

through the many remote games, quizzes and Strava activities, and I want to 

particularly extend my thanks to Mark, Max, Billy, Dr. Pearse and very-soon-to-be-Dr. 

Tim, and all the member of the various prolific group chats (such as Das Affenhaus, 

OK Zoomer, various iterations of “the Salty Spittoon”, and Live Lab Love). 

 
ありがとうございます 

(“thank you”) 

 
x 

 
Table of Contents 
Declaration ......................................................................................................... i 

Abbreviations .................................................................................................... ii 

List of Figures ................................................................................................... iv 

List of Tables ...................................................................................................... v 

Research Outputs ............................................................................................. vi 

Acknowledgments ........................................................................................... viii 

Table of Contents ............................................................................................... x 

Abstract ............................................................................................................. 1 

General Introduction ......................................................................................... 2 

Chapter 1 - A time series of ancient genomes from the Japanese archipelago .. 13 

Introduction ......................................................................................................................... 13 

Materials and Methods ........................................................................................................ 16 

Results ..................................................................................................................................30 

Discussion ............................................................................................................................ 64 

Chapter 2 - A deeper GLIMPSE: Analysis of a large dataset of imputed ancient 
Asian individuals .............................................................................................. 71 

Introduction ......................................................................................................................... 71 

Materials and Methods ........................................................................................................ 78 

Results .................................................................................................................................. 83 

Discussion .......................................................................................................................... 103 

Chapter 3 - Exploring potential ancient Asian ancestry in southern Native 
Americans ....................................................................................................... 111 

Introduction ........................................................................................................................ 111 

Materials and Methods ....................................................................................................... 116 

Results ................................................................................................................................ 120 

Discussion .......................................................................................................................... 127 

Final Discussion ............................................................................................ 130 

Appendix ....................................................................................................... 132 

Appendix 1 – Supplementary material for Chapter 1 ........................................................ 132 

Supplementary Note 1.1 ........................................................................................................... 182 

Supplementary Note 1.2 .......................................................................................................... 183 

Supplementary Note 1.3 .......................................................................................................... 186 

Appendix 2 – Supplementary material for Chapter 2 ........................................................ 193 

Appendix 3 – Supplementary material for Chapter 3 ........................................................ 213 

References ..................................................................................................... 222 


1 

 
Abstract 
The objective of this thesis is to use ancient genomic data and a variety of bioinformatic 
approaches to explore human evolutionary history in the Japanese archipelago and 
mainland Asia. 

Chapter 1 presents a dense genomic exploration of Japanese prehistory through the 
analysis of time series of ancient genomes from different cultural periods, including 12 
ancient individuals which were sequenced as part of this project. This analysis reveals 
that the indigenous Jomon hunter-gatherer population diverged from other Asian 
populations ~20kya, before maintaining a small population size of ~1000 without any 
major significant contribution from outside of the archipelago. The isolation of the 
Jomon ended ~3kya with the arrival of people with northeast Asian ancestry during 
the Yayoi period, which is associated with the introduction of wet rice farming to the 
region. Surprisingly, a third wave of East Asian ancestry is observed during the 
imperial Kofun period. These three ancestral sources are demonstrated to characterise 
present-day Japanese populations, providing the first genetic evidence for a newly 
proposed tripartite structure for present-day Japanese populations. This structure is 
also shown to characterise a historical population from the southern Ryukyu Islands. 

Chapter 2 demonstrates through extensive testing that imputation can reliably 
increase the genomic resolution of ancient individuals from across Asia. IBD and 
ROH-based analyses of imputed data reveal broader population dynamics within 
mainland Asia as well as previously unobserved geographic substructure within the 
Jomon. Analysis of inferred adaptive alleles gives a direct insight into the evolution of 
traits specifically associated with Asian populations. The absence of a highly frequent 
and multifaceted mutation in the EDAR gene in the Jomon population suggests a more 
recent origin for this than previously proposed. Evidence for the atypical metabolism 
of alcohol is observed ~4kya, but it appears that separate mutations in the same 
pathway occurred at different stages. 

Finally, Chapter 3 explores potential early migrations from Asia into South America 
by searching for the ancient origin of “PopY”, an unknown population responsible for 
Australasian-like ancestry in certain Native American populations living in South 
America. Admixture modelling reveals the presence of ancestry from Ancient North 
Siberians and the Hoabinhian hunter-gatherers of Southeast Asia in the ~10kya  
“LagoaSanta” population from Brazil. Efforts to replicate this in present-day Karitiana 
and Surui are unsuccessful, due to the influence of more recent admixture events. 

  
2 

 
General Introduction  
 

Project overview 

The objective of this thesis is to explore the demographic and evolutionary history of 

humans who have lived in the Japanese archipelago and mainland Asia using ancient 

genomic data and state-of-the-art bioinformatic approaches. Populations from these 

regions have been historically underrepresented in large-scale genomic analyses 

(Popejoy and Fullerton, 2016), and until recent years have been almost entirely absent 

from the field of ancient genomics (Marciniak and Perry, 2017). The lack of available 

genetic resources has limited our understanding of how factors such migrations, 

environmental pressures, and advancements in agriculture and technology have 

shaped the present-day diversity observed in these areas today. Analysis of ancient 

genomic data over a wide time frame allows an unparalleled opportunity to directly 

explore the region’s changing genetic landscape, and new findings into the formative 

events in early human history are of core interest to numerous academic disciplines 

such as archaeology, anthropology, linguistics, human biology, and population 

genetics.  

This thesis will analyse both newly-sequenced and previously published ancient 

genomic data using current computational approaches to provide novel insights into 

the pre- and proto-history of Asia with three separate goals: 

1. Chapter 1 contains a comprehensive genomic overview of the Japanese archipelago 

through the analysis of a time series of ancient data. This includes the first 

reporting of twelve newly sequenced ancient genomes from the hunter-gatherer 

Jomon population and the previously unsampled imperial Kofun period. This 

chapter is adapted from the paper “Ancient genomics reveals tripartite origins of 

Japanese populations” (Cooke et al., 2021), published in Science Advances in 

September 2021. 

2. Chapter 2 explores the potential of imputation to increase the available genetic 

information from ancient Asian individuals. This chapter demonstrates for the first 

time that high rates of imputation accuracy can be obtained for ancient Asian 

populations, including the Jomon hunter-gatherer-fishers of Japan. A constructed 

dataset of imputed ancient individuals is then analysed to obtain a more detailed  


3 

 
overview of the genomic landscape of mainland Asia and the Japanese archipelago, 

and to explore origins of several adaptive traits associated with these regions. 

3. Chapter 3 uses ancient Asian data to explore the potential genetic legacy of early 

migrations into the Americas by focusing on the origin of a mysterious excessive 

genetic affinity between certain Native American populations living in South 

America with Australasian populations. 

Each chapter discusses its own specific background and methodology. This 

introduction will provide general information regarding the approaches and 

techniques that are throughout this project, and a general summary of the current 

genomic perspective of human evolution in Asia. It is divided into the three main 

sections: 1) an overview of how genomic variation can be used to explore the past; 2) 

an introduction to the field of ancient genomics; and finally, 3) our current 

understanding of the genetics of prehistoric Asia.  

 
Population Genetics: using variation to explore human evolution 

The ~3 billion base pair sequence of the human genome is overwhelmingly similar 

across all Homo sapiens, with any pair of individuals sharing approximately 99.5% of 

their genetic sequences (Griffiths et al., 2015); this, however, still leaves a huge number 

of potential differences within the remaining millions of sites. The wide range of 

phenotypic diversity observed across the world at both the individual and the 

population level is testament to the huge amount of genetic variation that has amassed 

and been maintained within our species over the course of its evolution. 

Understanding how these differences emerged is hugely informative to the movements 

and interactions of early human populations, as well as the roles environmental, 

cultural, and technological changes have played in shaping this variation.   

Over the past century the field of population genetics has studied the processes that 

shape and maintain genetic variation within groups of living organisms and 

contextualised them within mathematical and statistical frameworks. An important 

early principle known as “Hardy-Weinberg Equilibrium” established a simple 

mathematical proof that an allelic variant could be maintained at a constant frequency 

through successive generations in a hypothetical setting with several assumptions that 


4 

 
do not reflect real scenarios. This provided a baseline to explore four main processes 

that alter genetic frequencies within a population, and bring about differences in two 

separate populations: 1) genetic drift, which refers to random increase or decrease 

in genotype frequencies that occurs during the sampling of genotypes as they are 

passed from parent to offspring through successive generations; 2) mutations, which 

occur at a molecular level due to copying errors during genetic replication and 

introduce new alleles into a population; 3) natural selection, which allows 

mutations that confer an advantage to an organism in a particular setting to rise in 

frequency within a certain population over time; and 4) gene flow, which refers to 

the exchange of genetic material from one population to another through migrations 

or admixture introducing new variants into a population and altering the frequencies 

of pre-existing variants. 

The extent of these processes’ effect is intrinsically linked to population size (e.g., 

changes introduced through genetic drift may will have a larger impact in a small 

population) and time (e.g., an older mutation has a better chance of being at a higher 

frequency than a more recent one). The earliest mathematical models developed 

predicted how the frequency of an allele can be expected to change in frequency 

forwards in time.  A major conceptual shift occurred in the 1980s with the emergence 

of the theory of coalescence (often credited t0 John Kingman), which looks at allelic 

variation in the context of its evolutionary history and tries to explain how it emerged 

backwards in time (Hartl and Clark, 2007, Wakeley, 2009), which represented a 

breakthrough in how these models can be practically applied to real genetic data.  

In addition to theoretical frameworks, global human variation has been directly 

assessed using biological markers as far back as the 1910s, initially through the 

analysis of differences in blood markers from different populations (Hirschfeld and 

Hirschfeld, 1919). Following the identification of DNA as the molecular unit of 

inheritance in 1953 (Watson and Crick, 1953), it became the focal point for 

understanding genetic variation, even before it was possible to directly be sequenced - 

a rudimentary approach was developed that involved cutting up isolated DNA with 

site-specific restriction enzymes, and comparing the differences in fragment lengths 

(Danna and Nathans, 1971). Today, built upon a series of technological breakthroughs 

– the development of Sanger sequencing to directly decode DNA in the 1970s (Sanger 


5 

 
et al., 1977) the ability to rapidly amplify DNA fragments using polymerase chain 

reaction (PCR) in the 1980s (Mullis and Faloona, 1987), the publication of the first 

draft of the human genome in the early 21st century (Lander et al., 2001) and the huge 

reduction in cost per genome that has come about through the rise of next generation 

sequencing (NGS) technologies since  the 2010s (as reviewed by Metzker (2010))– the 

primary tools for assessing human genetic variation has been genome-wide data for 

large numbers of individuals. The analysis of genomic data from diverse populations 

evolved the field of “population genetics” into “population genomics”.  

Many approaches have been developed to showcase genetic diversity based on  

genome-wide differences in allelic information. One of the most popular methods has 

been principal component analysis (PCA) (Patterson et al., 2006), which simplifies 

high dimensional genetic data into synthetic variables that pull the most distantly 

related samples in a dataset apart, and whose patterns can resemble population 

structure and mirror geographic variability (Novembre et al., 2008). Other methods 

such as ADMIXTURE, can define unique ancestral coefficients within individuals 

(Alexander et al., 2009). At the population level, allele frequency information can be 

used create phylogenies (Pickrell and Pritchard, 2012) or directly estimate degrees of 

genetic drift or gene flow using  f- and D- statistics (Patterson et al., 2012), which can 

be further expanded into building admixture graphs (Patterson et al., 2012) or 

estimating ancestral sources (Haak et al., 2015). If two loci are located physically close 

to each other on a chromosome, they are less likely to be broken down via 

recombination and more likely be inherited together over successive generations. 

Measures called “linkage disequilibrium” represent the strength of the linkage 

between different genetic sites. At a broad genome level, these linkage patterns are 

informative about historic relationships between different populations, while at the 

molecular level, long or short stretches at particular sites can be informative about 

when a recent mutation occurred.  

Efforts to expand diversity in genomic resources has led to projects such as the 1000 

Genomes Project (1000 Genomes Project, 2015) and the Simons Genome Diversity 

Project (Mallick et al., 2016), whose aim to generate sequence data from many 

individuals and populations from throughout the world in order to better understand 

variation at a global level. Even as more genomic data becomes available from diverse 


6 

 
populations, there are still limitations, however, as to what this data can reveal about 

the complex story of human origins, and how the species arrived in different parts of 

the world. Many prehistoric human ancestries are not well represented by modern 

data, leaving any role they might have played in shaping variation undetectable. 

Similarly, events that have a major impact on genetic diversity, such as extreme 

population replacements or rapid expansions will skewer the ability to reconstruct 

much older events. The emergence of a new field to address these analytical 

shortcomings will be the focus of the next section. 

 
A brief overview of ancient genomics  

One of the most ground-breaking developments for exploring human history - and one 

of the most exciting areas of scientific discovery over the past decade - has been the 

rapidly expanding field of ancient genomics. The retrieval of ancient DNA (aDNA) 

from a historical artifact or long deceased organism provides a direct molecular insight 

into genetic variation at a particular point in the past. While the concept of this type of 

approach is not a new one - the earliest reported successful attempt was the 

sequencing of two small mitochondrial DNA fragments from a quagga (an extinct 

zebra-like animal) in the mid-80s (Higuchi et al., 1984) – the rise of high throughput 

sequencing technologies and improvements in protocols to overcome the inherent 

obstacles faced when working with ancient DNA has made findings in this once-niche 

field a regular sight in high impact scientific journals. 

There are three major challenges to working with ancient genetic material: 1) any 

aDNA that has survived as part of an ancient artifact over centuries or millennia can 

often be in such low quantities that it makes any effort to extract and amplify it 

particularly prone to contamination from modern DNA sources (which is a particular 

issue with ancient human remains, whose DNA can be indistinguishable from the 

researchers who worked with the sample) (Hofreiter et al., 2001b); 2) the small 

amount of aDNA that has survived has been degraded into much shorter and 

fragments rendering them difficult to successfully isolate (Paabo, 1989); and 3) the 

base pair sequence of these short aDNA fragments is susceptible to chemical damage, 

particularly to the deamination of cytosine bases to uracil, which can cause 

misincorporations during amplification that lead to unwanted error at transition 


7 

 
variant sites in the sequence data (Hofreiter et al., 2001a), which has been found to be 

particularity prevalent the ends of the aDNA fragments (Briggs et al., 2007).  

After nearly four decades of experimentation and research, strict protocols and 

innovative technical solutions have overcome and vastly reduced the impact of these 

issues, while the power of NGS sequencing platforms have made it possible to salvage 

sufficient amounts of ancient DNA to enable comprehensive analysis. Contamination 

has been limited through the carrying out of extractions in a dedicated sterile facility, 

with researchers wearing full body anti-contamination suits and taking extreme care 

to keep work stations clean (Cooper and Poinar, 2000). Protocols have been 

established to maximise the yield of endogenous aDNA (Der Sarkissian et al., 2014, 

Orlando et al., 2015, Damgaard et al., 2015), with a breakthrough finding that the 

petrous portion of the temporal bone (located at the base of the skull) gives 

systematically higher rates of yielded aDNA (Gamba et al., 2014, Pinhasi et al., 2015). 

The treatment of samples with uracil-DNA glycosylase (UDG) has also been found to 

counteract the effects of the deamination and reduce sequence errors (Hofreiter et al., 

2001a). Similarly, the bioinformatician responsible for analysing the data must always 

keep potential limitations in mind and adjust the in silico pipeline to reduce their 

impact, such as through relaxing parameters setting for the alignment of raw data to a 

reference genome to allow for some mismatches, or to restrict particular analyses to 

transversion variants only. 

Analysis of ancient human data has changed long held perceptions about early human 

evolution, such as by revealing the extent of the genetic contribution from early 

human-like hominids the Neanderthals (Green et al., 2010) and Denisovans (a sister-

group to the Neanderthals first recognised as its own species only through sequencing 

of ancient DNA  (Reich et al., 2010)). Dense sampling of ancient humans from 

different stages of West Eurasian prehistory has shown the profound effect that 

cultural changes have had on the genomic landscape, initially with the advancement 

with agriculture and then again by Bronze Age migrations of herders associated with 

pastoralism and dairy farming from the Western steppe region (Gamba et al., 2014). 

Samples from Asia, however, were limited to only a handful of genomes (Marciniak 

and Perry, 2017) until recent years, which have seen many large studies and datasets 

published.  An overview of the ancient genomic analysis of Asia is the focus of the next 

section. 


8 

 
The genetics of prehistoric Asia as told through aDNA 

This section will give an overview of the emerging picture of the prehistory of mainland 

Asia,  as revealed through the analysis of ancient genomes. 

 
The earliest inhabitants of East Eurasia 

It is unclear when exactly an anatomically modern humans first entered the Asian 

continent; widespread reporting of human teeth found in several caves in southern 

China that were dated as far back as 80 -120kya (Liu et al., 2015) have since been 

disputed, with alternative dating strategies and ancient DNA supporting a more recent 

arrival of 50 to 45kya (Sun et al., 2021, Hublin, 2021). It is generally accepted that all 

present-day non-African Homo sapiens derive the majority of their ancestry from a 

major expansion out of Africa into the Eurasian supercontinent 50,000 to 60,000 

years ago (kya) (Bergstrom et al., 2021). At some point after this expansion, these early 

humans admixed with Neanderthals (Green et al., 2010), likely on multiple occasions 

(Vernot et al., 2016) resulting in the exclusive presence of a small proportion of 

Neanderthal DNA in all non-Africans. After this split, it appears that Asian populations 

received a small amount of additional gene flow from Denisovans (Reich et al., 2010, 

Meyer et al., 2012, Prufer et al., 2017) 

There are only a handful of genome-wide sequenced Asian individuals from before the 

Last Glacial Maximum (LGM), a period of time 26 to ~19kya  in which the Earth’s ice 

sheets reached their maximum integrated volume (Clark et al., 2009). The oldest 

individual sequenced from this region is a 40,000-year-old individual (“TY”) from 

Tianyuan Cave, near Beijing, in China. This individual was found to be genetically 

closer to present-day Asians than Europeans and formed a unique ancestry that was 

basal to all modern populations with Asian ancestry (Yang et al., 2017). Tianyuan-

related ancestry was later determined to have been widespread prior to the LGM based 

on a high degree of shared genetic drift with a 33kya individual from the Amur River 

Basin in North-eastern China (Mao et al., 2021). A pair of Upper Palaeolithic 

individuals dating to ~31.6kya from the Yana Rhinoceros Horn Site in the extremes of 

northeast Siberia (referred to as “Yana_UP”), however, were found to share 

comparatively more alleles with West Eurasians than was observed in Tianyuan. This 


9 

 
led to their being designated as belonging to a unique “Ancient North Siberian” (ANS) 

lineage (Sikora et al., 2019). A 40kya genome from the Salkhit Cave in Mongolia was 

found to share ancestry related to both Tianyuan and the ANS lineage (Massilani et 

al., 2020) showing both of these early ancestries were widespread pre-LGM. 

 
The emergence of the ancestors of the Native Americans in Siberia 

At some point after the onset of the LGM (~26kya) a lineage ancestral to Native 

Americans emerged within Siberia; however, the understanding of the nature of the 

relationship between the early Asians and present-day Native Americans is under 

constant revision as more ancient genomes from the region are analysed.  

A 24kya individual from the Mal’ta Upper Paleolithic site (“MA1”) from near Lake 

Baikal in south-central Siberia was found to be closer to present-day Native Americans 

than East Asians, which was designated as part of an “Ancient North Eurasian” (ANE) 

lineage (Raghavan et al., 2014). This lineage was demonstrated to be descended from 

ANS lineage associated with “Yana_UP” with additional ancestry from early West 

Eurasian lineages (Sikora et al., 2019).   

A much more recent 9.8ky old Mesolithic individual  from northeast Siberia, 

“Kolyma1”, was more a closer relative for present-day Native American than the older 

MA-1 individual (Sikora et al., 2019). This individual was part of a separate lineage to 

ANE, termed the Ancient Palaeo-Siberian” (APS), which represented  a major genetic 

turnover in northeast Siberia from the older ANS lineage. A ~14kya individual  from 

the Ust-Kyakhta-3 site south of Lake Baikal (“UKY”)  has also been designated as 

belonging to the APS lineage (Yu et al., 2020). APS has been proposed as a mixture of 

ANS with additional ancestry from East Asian populations (Sikora et al., 2019), 

although it has also been suggested that a 17kya sample on the Lena River near Lake 

Baikal (“Khairygas-1”), which is genetically distinct from ANS, contributed heavily to 

AP (Kilinc et al., 2021). The East Asian component of this lineage may have been 

mediated in the Amur Region (Mao et al., 2021) prior to the expansion to the Americas. 

At the time of writing, the APS lineage stands as the closest relatives to Native 

Americans, although it cannot be ruled out that further sequencing from during or 

after the LGM may once again change this narrative. Multiple lines of evidence, 


10 

 
however, have suggested that the peopling of America was a dynamic process and not 

a single occurrence (Moreno-Mayar et al., 2018a, Ardelean et al., 2020, Becerra-

Valdivia and Higham, 2020, Bennett et al., 2021) as will be discussed in greater detail 

in Chapter 3. 

 
Insights from the growing post-LGM genomic record 

The number of genomes from northern East Asia dating to within the past 10k years 

that have been sequenced has grown exponentially in recent years, revealing a complex 

history of dynamic population movement and admixture between many distinct and 

diverse populations. The Upper Paleolithic lineages ANS and ANE lineages (as 

represented by “Yana_UP” and “MA-1” respectively) ultimately were replaced by the 

APS lineage (Sikora et al., 2019, Yu et al., 2020). A handful of present-day populations 

living in northeast Siberia, including the Chukchi, Koryak and Itelman, retain a close 

affinity to the APS, as do present-day Native Americans (Sikora et al., 2019); however 

continued migration in the mid to Late Holocene from southern East Asian 

populations resulted in the replacement of the APS with a population referred to as 

“Neo-Siberians”, who are ancestral to many contemporary  populations living in the 

region today (Sikora et al., 2019). 

Ancestry related to the 40kya TY individual was present in the Amur River region at 

least 33kya (Mao et al., 2021). A post-LGM individual dating to 19kya (“AR19k”), 

however, was found to be genetically distinct from these older samples and have a 

closer affinity to present-day populations in north East Asia, suggesting a major 

population change had occurred in this region during the LGM (Mao et al., 2021).  

A distinct genetic component that is associated with the Amur River Basin, located on 

the border of present-day Russian and China (also referred to as AR ancestry or 

northeast Asian ancestry) has been observed to have begun as early as 14kya (Mao et 

al., 2021). Neolithic foragers sampled from Chertovy Vorota Cave (Devil’s Gate Cave) 

in the Primorye Region of far eastern Russia have an affinity to this ancestry, as do 

present-day Tungusic-speaking populations (Siska et al., 2017, Sikora et al., 2019, Mao 

et al., 2021, Ning et al., 2020). This ancestry is broadly associated with ancient 

societies that practiced hunting, fishing and animal husbandry, with farming practices 


11 

 
only adopted in the historic era (Kuzmin, 2013, Ning et al., 2020). This lineage is 

genetically distinct from another deeply diverged population of hunter-gatherers 

known as the “Hoabinhian” living in Southeast Asia (McColl et al., 2018). 

In addition to the 24kya MA-1, the area surrounding Lake Baikal has been extensively 

studied through ancient human sequence data (de Barros Damgaard et al., 2018, Yu 

et al., 2020, Kilinc et al., 2021). The ANE related ancestry found in MA-1 was still 

present ~17kya in a sample from the Afontova Gora cave (“AG2”); however, it was 

largely replaced in the western Cis-Baikal area by the early Neolithic by North East 

Asian ancestry (de Barros Damgaard et al., 2018, Kilinc et al., 2021). In the Bronze Age 

however, a surge in ANE ancestry was observed, likely from an unsampled source 

population with a high affinity to MA-1(de Barros Damgaard et al., 2018). The more 

eastern Trans-Baikal area in contrast retained more longstanding continuity, while the 

vast north-eastern Yakutia also experienced an increase in affinity to northeast Asian 

ancestry over time (Kilinc et al., 2021). Dense sampling of the of Eastern Steppe region 

of present-day Mongolia, close to the Lake Baikal, has shown more dynamic changes 

in its ancestral profile with AR, ANE, the Bronze Age Baikal and Western Steppe 

ancestries fluctuating in proportions over the past ~5kya (Jeong et al., 2018, Jeong et 

al., 2020). 

At the beginning of the Neolithic period ~8ky, the broad population history 

throughout East Asia appears to be more diverse in the past than is observed in the 

present (Wang et al., 2021b, Yang et al., 2020). Many very old lineages appear to have 

been present in China that do not significantly contribute to present-day variation 

(Wang et al., 2021b). A clear separation into north and south East Asians is observed 

(Yang et al., 2020), and has possibly been seen as far back as 19kya (Mao et al., 2021).  

Ancient genomes of Neolithic populations from the Yellow River Basin in China were 

found to be clearly distinct from the identifiable ancestry of the more northerly Amur 

River Basin (Ning et al., 2020, Wang et al., 2021a). This region is associated with 

complex societies around millet farming by ~6kya, and Yellow River-related (YR) 

ancestral profiles are genetically similar to the present day East Asian profile (Ning et 

al., 2020). Neolithic populations sampled from the West Liao River, located between 

the Amur and the Yellow River Basins, have been shown to be influenced by both AR 


12 

 
and YR ancestries during periods, which corelated to changes in subsistence strategies 

(Ning et al., 2020). 

One of the most intriguing and mysterious parts of the Asian continent is the Japanese 

archipelago. The archaeological record shows a distinctive hunter-gatherer-fisher 

population emerging in ~16.5kya which preliminary sequence data has suggested as 

being a deeply diverged lineage from other East Asia populations (McColl et al., 2018, 

Kanzawa-Kiriyama et al., 2019, Gakuhari et al., 2020). This area’s geographic isolation 

from continental Eurasia, and subsequent series of rapid cultural changes make it a 

unique microcosm in which to study the migratory patterns that accompanied 

agricultural spread and economic intensification in Asia. A comprehensive analysis of 

ancient genomes from different periods of Japanese history in the context of other 

ancient Asian genomic data, and the impact of these early populations on present-day 

Japanese populations, will form the basis of Chapter 1 of this thesis. 

  
13 

 
Chapter 1 - A time series of ancient genomes from the 

Japanese archipelago 

Introduction  

The presence of stone tools has revealed that the Japanese archipelago has been 

occupied by humans for at least 38,000 years (ky), during what is referred to as its 

Palaeolithic period (Habu, 2004). The first distinctive culture in the region, however, 

does not occur until ~16.5kya, at approximately the end of Last Glacial Maximum 

(LGM), with the appearance of pottery shards with unique cord-like markings in the 

archaeological record. The period associated with these ceramics, which are among 

the oldest pottery in the world,  was termed the Jomon period (Jo translates to “cord” 

in Japanese and mon translates to “mark”) (Habu, 2004). The people who lived during 

this period (who are referred to as the “Jomon”) are considered to be a complex society 

of hunter-gatherers-fishers (Habu, 2004). The Jomon have been called “affluent 

foragers” and their associated culture is characterised by the presence of large 

settlements, a high population density per site, low residential mobility, logistically 

organized subsistence strategies (as evidenced by the presence of storage pits and large 

shell-middens), and the construction of large ceremonial features (Habu, 2004).  

The Jomon period lasted for several millennia until ~3kya with the introduction of 

wet-rice cultivation to the region, at the beginning of the Yayoi period, named after 

the area in Tokyo in which the first pottery associated with the period was discovered 

(Mizoguchi, 2013). This period saw major changes occurring across the entire region, 

not only with the widespread adoption of systematic rice paddy field agriculture, but 

also saw a hierarchical class structure emerging to maintain the communal 

collaboration required for food production (Mizoguchi, 2013). 

The Yayoi period was relatively quickly followed by another significant cultural 

change, which was represented by the first appearances of large keyhole-shaped 

tumuli  (Mizoguchi, 2013). The Kofun period (“Kofun” referring to a mounded tomb) 

began ~1.7kya and saw the centralisation of political power in the region and the 

beginnings of the imperial Japanese governance whose presence remains to this day. 

In addition to technological advances, this period also saw the appearance of the first 

written historical records and literary texts, such as the foundational Kojiki and Nihon 


14 

 
Shoki, which compiled myths and legends associated with 

the creation of the Japanese state and the reigns of the 

earliest emperors (Mizoguchi, 2013).  

How these distinctive cultural periods (a timeline of 

which can be seen in Fig. 1.1) have shaped the ancestry of 

present-day populations has long been a source of 

speculation. A well-known and enduring hypothesis is the 

“dual structure” model which was first proposed by 

Hanihara (1991) and posits that all Japanese are the 

admixed descendants of the indigenous Jomon with the 

migrants from East Asia who brought rice cultivation 

during the Yayoi period. This model was originally 

proposed based on the morphological analysis of crania 

and teeth, but has found further support from 

archaeology, biological anthropology and linguistics (as 

comprehensively reviewed by Hudson et al. (2020)). 

Broad genetic studies of Japanese populations have also 

shown clear population stratification across the 

archipelago, which also supports at least two population 

sources for present-day populations (Japanese 

Archipelago Human Population Genetics et al., 2012, 

Nakagome et al., 2015, Jinam et al., 2021a, Jinam et al., 

2021b). 

Direct sampling of ancient DNA from people living during 

these three distinct cultural periods has the potential to 

be hugely informative about the demographic origins of 

present-day populations, and of the genomic impact of 

the major cultural transitions to agriculture and later to early state phase. 

Unfortunately, the Japanese archipelago has not been an ideal region for the long- 

term preservation of ancient genetic material due to its humid climate and the high 

acidity of the soil found in the volcanic islands (Pinhasi et al., 2015, Hofreiter et al., 

2001b). In spite of this, three Jomon genomes with >1x coverage from the past ~3,500 

years have been successfully sequenced  (McColl et al., 2018, Kanzawa-Kiriyama et al., 

Figure 1.1 A timeline of pre-
and protohistoric Japanese 
periods. 


15 

 
2019, Gakuhari et al., 2020). This has shown that the Jomon population as being a 

deeply diverged lineage from other East Asians (McColl et al., 2018, Kanzawa-

Kiriyama et al., 2019, Gakuhari et al., 2020), with a broad range of between 18 and 

38kya being proposed for the timing of their split (Kanzawa-Kiriyama et al., 2017) and 

a potential Southeast Asia origin (McColl et al., 2018, Gakuhari et al., 2020). 

Additionally, preliminary genetic analysis of two samples related to the Late Yayoi 

period has demonstrated their genomic affinity both to the Jomon and to present-day 

Japanese (Shinoda et al., 2019). 

The limited number of sequenced ancient Japanese genomes impedes potential 

insights into the impact of cultural change on the region’s genetic profile, and how 

these changes shaped the ancestry of the present-day population. A denser sampling 

of Jomon would not only give further insight into the origin of this unique population 

within Asia, but it would also show how the structure of this population may have 

changed over the course of the thousands of years it dominated the Japanese 

archipelago. Integrating data from the subsequent agrarian Yayoi period, as well as the 

previously unsampled Kofun period, would demonstrate the unknown genetic impact 

of both the agricultural transition, and later the state-formation phase on shaping 

modern Japanese populations.  

 
Chapter goal 

The goal of this chapter is to analyse a uniquely assembled time series of ancient data 

from the Japanese archipelago with other relevant published ancient samples from 

throughout East Eurasia using up-to-date population genetics and bioinformatic 

methods. This chapter presents an analysis of ancient genomic data from throughout 

pre- and proto-historic Japan and includes the first reporting of 12 newly sequenced 

ancient genomes from across the archipelago. This approach represents a unique and 

original opportunity to characterise the changes in the genomic profile in Japan over 

the past ~9,000 years. 

This chapter is adapted from the paper “Ancient genomics reveals tripartite origins 

of Japanese populations” (Cooke et al., 2021), published in Science Advances in 

September 2021, and also contains updated analyses conducted after the paper’s 

publication. 


16 

 
Materials and Methods 

Materials 

Newly sequenced ancient Japanese genomes 

14 ancient Japanese individuals from 6 archaeological sites across the west and central 

parts of the Japanese archipelago were screened for the presence of ancient DNA as 

part of this project. Four of these sites were located on the island of Honshu 

(Funaguru, Iwade, Kosakus and Odake), and two were located on Shikoku (Hirajo and 

Kamikuroiwa) (Fig. 1.2a) All ancient individuals were associated and dated to the  

different stages of the Jomon period (Initial, Early, Middle, and Late Jomon), with the 

exception of three individuals from the Iwade site, who were associated with the Kofun 

period (further dating information for each individual is included in Appendix Table 

1.1). DNA was extracted from the petrous bone for all samples (an example of which is 

photographed in Fig. 1.2b) with the exception of “JpKa2” for whom only a tooth was 

available.  

Ancient samples were processed at dedicated ancient DNA facilities at Kanazawa 

University, Japan and Trinity College Dublin, Ireland. Wet-lab work conducted in 

Trinity College Dublin was led by Dr. Valeria Mattiangeli, with assistance in drilling 

for several samples by the present author. Additional libraries for samples “JpFu1” and 

“JpHi01”, which were sequenced on the NovaSeq 6000 Illumina platform, were 

prepared by final-year undergraduate student Caroline Stokes and included in 

analysis.   


17 

 
(a)        (b) 

 
(a) A 
map of the archaeological sites from which ancient individuals were sampled as part of this analysis 
(with names of major islands highted in italics) (b) A photograph of one of the petrous bones (from 
individual “JpFu1”) from which ancient DNA was successfully extracted. 

 
Published ancient genomic data 

Over the course of this project, many papers were published based on ancient 

genomics whose newly sequenced data were made publicly available. A large dataset 

of published ancient genome data deemed to be relevant to this analysis was 

assembled to broaden the scope of this study. This included ancient sequence data 

from individuals from the Central and Eastern Steppe (de Barros Damgaard et al., 

2018, Jeong et al., 2018), Northeast Siberia (Sikora et al., 2019), Southeast Asia 

(McColl et al., 2018) and continental East Asia (Ning et al., 2020, Yang et al., 2020). 

(Fig. 1.3; a full list of samples can be found in Appendix Table 2). 

This analysis was greatly enhanced by the inclusion of five additional previously 

published ancient samples from Japan (represented with triangles in Fig.  1.3): 

 two ~3.5kya individuals from the Late Jomon period excavated from the 

Funadomari site on Rebun Island, north of Hokkaido (“F5” and “F23”) 

(Kanzawa-Kiriyama et al., 2019) 

Figure 1.2. Newly sequenced genomes from throughout the Japanese archipelago. 


18 

 
 a ~2.8kya individual from the Final Jomon period from Ikawazu, Honshu 

(“IK002”) (McColl et al., 2018, Gakuhari et al., 2020) 

 two ~2kya individuals associated with the agrarian Yayoi culture from the 

north-western part of Kyushu Island (“Yayoi_1” and “Yayoi_2”), who date to 

the Late Yayoi Period. The skeletal remains of these individuals exhibits 

Jomon-like characteristics rather than immigrant types but other 

archaeological materials clearly support their association with the Yayoi 

culture, and genetic analysis has differentiated them from Jomon-associated 

individuals (Shinoda et al., 2019) 

Additional information about these Japanese samples is included in Table 1.1. 

 
Figure 1.3. A topographic map showing locations of published ancient samples included 
in Chapter 1 analysis. Ancient data are represented by symbols designated by geographic or cultural 
context. Ancient sampled from Japan are marked by triangles. Further information about processed 
ancient individuals can be found in Appendix Table 1.2.  


19 

 
Methods 

Processing of ancient data 

A detailed description of the drilling and extraction procedure followed in the 

generation of ancient sequence data is given in Supplementary Note 1. A summary of 

the bioinformatic pipeline used to process, align, and call variations from the raw 

ancient sequence data, along with certain validation steps, is visualised in Fig. 1.4. 

Adapters were removed from raw FASTQ sequence data using either one of two 

methods based on whether they were single or paired end:  

 adapters were trimmed from single end data using cutadapt version 1.9.1 

(Martin, 2011), allowing for a minimum overlap of 1 bp and removing reads that 

were shorter than 34 bp in length. 

 adapters were trimmed from paired end using AdapterRemoval v2 (Schubert et 

al., 2016), allowing for a minimum overlap of 1, removing “N”s and low quality 

bases in the sequence using the parameters–trimns and –trimqualities, 

filtering reads for minimum of 25 read length and 25 base quality score, and 

collapsing the paired end reads using –collapse. Only the outputs for 

“.collapsed”, “pair1.truncated” and “pair2.truncated” were included in 

subsequent processing. 

 
Trimmed reads were aligned to the hg19 / GRCh37 reference genome with the revised 

Cambridge Reference Sequence (rCRS) for mitochondrial DNA using bwa-aln in BWA 

version 0.7.5 (Li and Durbin, 2010) with relaxed parameters “-l 16500 -n 0.01 

-o 2”.  Aligned reads (i.e., BAM files) were filtered for a mapping quality of 20 and 

PCR duplicates were removed using samtools version 1.7 (Li et al., 2009a). Read group 

identifiers were added to each library using picard-tools version 1.101 (available at 

http://broadinstitute.github.io/picard/). The authenticity of ancient DNA was 

determined in non-UDG treated libraries screened using a MiSeq Illumina platform 

by looking at degradation patterns at 5’- and 3’-end respectively using MapDamage2.0 

(Jonsson et al., 2013). At this point an estimate was made of the endogenous human 

DNA contents of each library by calculating the percentage of the remaining reads 


20 

 
compared to the number of trimmed reads; samples showing >~20% were selected to 

be sequenced to a higher coverage. 

All libraries generated from the same individual were merged into a single BAM file 

using samtools version 1.7 (Li et al., 2009a) after the addition of read group identifiers. 

Indel realignment was performed on each individual using GATK version 3.7-0 

(McKenna et al., 2010). As the end of strands of ancient DNA are particularly prone to 

damage, the quality of two bases at the end of each read were manually reduced to a 

phred score of 2 (known as “softclipping”) to avoid calling genotypes from these 

regions. 

 
All ancient samples were processed, aligned, 
and had variants called using the same computational pipeline. 

Removing adapters 
(cutadapt for single 

end, AdapterRemoval 
for paired) 

Alignment to hg19 
with rCRS 

mitochondria 
sequence 
(bwa-aln) 

Mapping 
Quality 20  
(samtools) 

Remove PCR 
duplicates 
(samtools) 

Add read groups 
(picard-tools – 
AddOrReplace 
ReadGroups) 

Assessing 
degradation 

patterns in non-
UDG treated 

libraries 
(MapDamage) 

Calculation of 
endogenous 
human DNA 

Merge all 
libraries from 
one individual 

together 
(samtools) 

Indel 
Realignment 

(GATK – 
RealignerTarget 

Creator and 

Softclipping 
(manually reducing 

the quality score 
each read) 

Pseudohaploid SNP 
calling of all samples 

(GATK - Pileup) 

Diploid SNP calling of 
high coverage 

ancients 
(GATK - 

HaplotypeCaller) 

Figure 1.4. Summary of bioinformatic pipeline. 


21 

 
Incorporation of published ancient data into analysis 

Where possible, raw FASTQ files were downloaded from an online public data 

repository (e.g., the European Nucleotide Archive); in cases where FASTQ files were 

not available, BAM files were instead downloaded. To allow a direct comparison of 

published with newly generated data, all files were put through the pipeline as newly 

sequenced individuals (Fig. 1.4). Quality assessment of downloaded data was 

conducted with FastQC version 11.4 (Andrews, 2010). If detected, adapters were 

removed using AdapterRemoval version 2.2 (Schubert et al., 2016). BAM files were re-

aligned to the same reference genome using bwa-aln in BWA version 0.7.5 (Li and 

Durbin, 2010) using the same relaxed parameters. Throughout the analysis ancient 

individuals were grouped and primarily labelled by either geographic or cultural 

contexts. (Published data is visualised in Fig. 1.3 and further information for each 

sample is included in Appendix Table 1.2.) 

 
Contamination estimates and determining mtDNA haplogroup 

To assign mtDNA haplogroups to newly sequenced individuals and estimate the rate 

of mitochondrial contamination, trimmed fastqs files of newly sequenced individuals 

were additionally aligned to the rCRS reference mitochondrial genome. Aligned 

mitochondrial sequences for each individual were merged and processed using the 

same pipeline as the whole genome alignment. Consensus mitochondrial fasta 

sequences were determined for each individual based on a minimum depth of coverage 

of 5 and a base quality of 30 using mpileup and bcftools within the samtools package 

(Li et al., 2009a). Haplotypes based on these consensus sequences were determined 

using HAPLOFIND (Vianello et al., 2013).  Mitochondrial contamination rates were 

estimated using an approach previously outlined in Cassidy et al. (2020): the rates of 

secondary (i.e., non-consensus) bases observed at the haplotype-defining and private 

mutations defined by HAPLOFIND were calculated. The adjusted rate calculated after 

removing SNPs genotyped as C or G (as these sites are more likely to be affected by 

post-mortem damage) is also reported. 

 
Molecular sex determination 


22 

 
The molecular sex of each ancient sample was determined using the method outlined 

in Skoglund et al. (2013); in summary, the ratio of reads aligning to the Y chromosome 

relative to the total number aligning to either sex chromosome (Ry) was calculated and 

assigned as a female if Ry < 0.016 or male if Ry > 0.075. This ratio was calculated using 

reads filtered for a higher minimum mapping quality filter of 30. 

 
Relatedness 

Kinship for all ancient individuals (newly generated and  published) was estimated 

using “Relationship Estimation from Ancient DNA” (READ) (Monroy Kuhn et al., 

2018). In cases where a first-degree relationship was identified, the lower coverage 

related individual was removed from subsequent analysis. (Individuals removed as 

first-degree relatives in published data are listed in Appendix Table 1.2). 

 
Y chromosome haplogroup analysis 

All newly sequenced ancient individuals whose molecular sex was determined to be 

male were piled up to the list of SNPs listed in the ISOGG SNP using GATK version 

3.7-0 (McKenna et al., 2010). Only SNP sites with a minimum base quality of 30 were 

considered. 

 
Genotype calling and merging to published datasets 

To analyse ancient data in the context of present-day genetic variation, each ancient 

individual was genotyped for the bi-allelic SNP sites present on two different reference 

panels, which they were subsequently merged to:  

 
1. the Human Origin Array (HOA) consisting of 594,896 SNP sites for 1,963 

modern, ancient and reference genomes (Lazaridis et al., 2014) 

 
2. the Simons Genome Diversity Panel (SGDP) consisting of 278 modern, ancient 

and reference genomes (Mallick et al., 2016) that has been filtered for 

autosomal transversions-only SNPs with a minor allele frequency of 1%, which 


23 

 
left 3,867,366 SNP sites in the merged data. Transversion SNP sites can be 

more accurately called in data as ancient samples are more susceptible to 

transition mutations over time.  

 
Due to the limited nature of ancient DNA, with coverages often being below 1x, many 

analyses were restricted to “pseudohaploid” data, in which potential heterozygous 

information is disregarded, and only a single allele is sampled per site. At each SNP 

site a high-quality single base (bq30) was randomly called per position using GATK 

version 3.7-0 (McKenna et al., 2010) Sites with >2 differing calls were not included. 

Datasets were merged using PLINK v1.90b4.4 (Purcell et al., 2007). 

 
Throughout this study, the modern Japanese population is represented by either data 

from the “Japanese” populations found in the HOA or SGDP, or by the “JPT” (i.e., 

Japanese in Tokyo) in the 1000 Genome Project Phase 3 (1000 Genomes Project, 

2015). However, it is important to recognise that ancestral heterogeneity exists across 

the Japanese archipelago and that this diversity is not fully captured by these standard 

reference sets, and to keep this in mind when interpreting results. 

 
f-statistics  

f-statistics measure correlations in allele frequencies in sets of two (f2), three (f3) or 

four (f4) defined populations (Patterson et al., 2012). f3 statistics were calculated using 

qp3Pop (v300) and f4 using qpDstat (v662) with “f4 mode: YES”; both of these tools 

are found in the AdmixTools v6.0 package (Patterson et al., 2012). In both cases the 

population “Mbuti” was used as an outgroup. Standard errors in f4-statistics were 

calculated from the default block jackknife approach. Heatmaps for f3 values were 

created using the heatmap.2 package in R. 

 
Principal Component Analysis 

Principal component analysis (PCA) was conducted to see the positioning of ancient 

samples within present-day population genetic variation, using smartpca (v16000) 

(Patterson et al., 2006) with specified parameters of “killr2: YES”, “r2thresh: 0.2”, 

“numoutlieriter: 0”, “lsqproject: YES” and “autoshrink: YES”. Only a subset of the 


24 

 
larger dataset was included in PCA in order to just focus on Asian populations and 

ancient individuals. Present day populations (n=112) were chosen based on 

geography: 

 East Asia: Ami, Atayal, Dai, Daur, Han, Japanese, Korean, Lahu, Miao, 

Mongola, Oroqen, She, Tu, Tujia, Uygur, Xibo 

 Southeast Asia: Burmese, Cambodian, Dusun, Kinh, Thai, Yi 

 Central and South Asia: Balochi, Bengali, Brahmin, Brahui, Burusho, 

Hazara, Irula, Kalash, Kapu, Kusunda, Kyrgyz, Madiga, Makrani, Mala, Naxi, 

Pathan, Punjabi, Relli, Sindhi, Tajik, Yadava  

 Siberia: Aleut, Altaian, Chukchi, Eskimo (Chaplin), Eskimo (Naukan), Eskimo 

(Sireniki), Even, Hezhen, Itelman, Mansi, Russian, Tubalar, Ulchi 

 
These populations, along with all ancient individuals, were extracted from the larger 

dataset using PLINK v1.90b4.4 (Purcell et al., 2007). The subset of the data was 

converted from plink to EIGENSTRAT format using convert, part of the EIGENSOFT 

package (Price et al., 2006). Principal components were determined using the 

variation found within present-day individuals only with ancients projected onto this 

variation. Plotting was done in R using the output “smartpcaevec”. With the exception 

of “Yayoi_1”, only ancient individuals with at least 100,000 SNP sites were included 

in visualisation. The number of SNPs per individual was determined using plink --

missing. The percentage variation explained by each principal component was 

calculated using the output in the “smartpcaeval” file. 

 
ADMIXTURE 

The programme ADMIXTURE v.1.3.0 (Alexander et al., 2009) was used for 

unsupervised genetic clustering of present-day and ancient individuals in order to 

identify different ancestral components within the dataset. This analysis was 

conducted using a subset of the Human Origin Array panel, which included: 

 present-day populations from the South Asian, Southeast Asian, East Asian, 

Siberia, Oceania, and the Americans, with African and European ancestry 

being represented by Mbuti and Sardinia respectively (n = 786), and  

 ancient East Eurasian samples with at least 100,000 SNPs in the SGDP dataset 

(n = 187).  


25 

 
SNPs were pruned for sites in linkage disequilibrium using PLINK v1.90b4.4 (Purcell 

et al., 2007), with a sliding window size of 50 variants, a step size of 5 variants, and an 

r2 threshold of 0.2 (--indep-pairwise 50 5 0.2), leaving 189,487 SNPs for analysis. 10 

replicates with random seed were run for each number of clusters (K) from 2 to 12, 

and the run with the lowest cross validation was plotted. 

 
Phylogenetic analysis 

Maximum-likelihood trees were inferred under various admixture models using 

TreeMix (v. 1.13) (Pickrell and Pritchard, 2012). A subset of ancient and present-day 

populations in the SGDP dataset was chosen to represent different ancestors across 

East Eurasia: Ust_Ishim, Yana_UP, MA1, Tianyuan, Salkhit, Papuan, Hoabinhian 

(La368 and Ma911), Kusunda, Jomon, Chokhopani, Shamanka_EN, Lokomotiv_EN, 

DevilsCave_N, USR1, Han, Ami, and Japanese. The dataset was filtered for non-

missingness, which left 98,687 SNPs for analysis. Mbuti was used as an outgroup to 

root the tree and fitted models with no-migration or migrations from 1 to 5 to the data. 

A total of 1,000-1,500 replicates were run for each model, of which the tree with a 

likelihood the closest to a mean across the replicates was considered the most 

representative under a given model. 

 
Diploid calling and determining runs of homozygosity (ROH) 

Diploid genotype information was generated from the highest coverage individual 

generated by this study, the 8.8kya “JpKa6904” individual, and a number of other 

published high-coverage ancient genomes, including Yana1 (Sikora et al., 2019), USR1 

(Moreno-Mayar et al., 2018a), Loschbour, Stuttgart_LBK (Lazaridis et al., 2014),  

Mesolithic Irish (SRA62), and Neolithic Irish (JP14) (Cassidy et al., 2020).  (Further 

information about these samples is included in Appendix Table 1.3). 

Diploid genotypes were called using HaplotypeCaller, part of GATK version 3.7-0 

(McKenna et al., 2010) based on the Phase 3 v5 1000 Genomes release panel (1000 

Genomes Project, 2015)  filtered for transversion sites and a minor allele frequency of 

1%. These calls were further filtered for genotype quality 30 and minimum depth of 10 

using vcftools v0.1.13 (Danecek et al., 2011). Transversion SNPs common to each 


26 

 
diploid ancient sample (total 660,027 SNPs) were included in runs of homozygosity 

(ROH) analysis. ROH was measured in each diploid genome using PLINK v1.90b4.4 

(Purcell et al., 2007) with the following options: -homozyg --homozyg-density 

50 --homozyg-gap 100 --homozyg-kb 500 --homozyg-snp 50 --

homozyg-window-het 1 --homozyg-window-snp 50 --homozyg-window-

threshold 0.05 

 
Demographic modelling with ROH 

Approximate Bayesian Computation (ABC) is a statistical framework in which the fit 

between observed data and that generated through simulation is quantified. This 

approach has been widely used in inferring human demography, as reviewed in Cooke 

and Nakagome (2018). ABC-based methods capture characteristics of genetic data 

though summary statistics. Here, the likelihoods of different potential combinations 

of population size (N) and split time (T) of the Jomon lineage were approximated by 

fitting genome-wide patterns of runs-of-homozygosity (ROH) generated with 

simulation data to those observed in the oldest individual in the Jomon dataset, 

“JpKa6904”. The ABC-based analysis presented here and published in Cooke et al. 

(2021) was conceived by Dr. Shigeki Nakagome and conducted in collaboration with 

him. 

The filtered diploid JpKa6904’s genome left 984,740 autosomal SNPs for ROH 

analysis (see “Diploid calling and determining runs of homozygosity (ROH)” for 

diploid processing and ROH parameters). The ROH profile was summarized into a 

spectrum of ROH fragments ranging from 0.5 to 100 Mb with a bin size of 1 Mb. 

100 whole-genome simulations were generated using a coalescent simulator ms 

(Hudson, 2002) under a fixed scenario of the Out-of-Africa dispersal reconstructed 

from a previous study (Gutenkunst et al., 2009) with slight modifications (as 

visualized in Fig. 1.5). This simulation assumed 25 years per generation and 1.25×10-8 

per generation (Scally and Durbin, 2012) and 0.625×10-8 per generation for a 

mutation rate and a recombination rate respectively. A broad search for the parameter 

space was first performed that is defined with the population size (N) from 500 to 

2,500 and population split from 10 to 40kya (T) using the data from chromosomes 3 


27 

 
to 22 that includes likely scenarios for further testing with all autosomal 

chromosomes.  

 
Figure 1.5. Simulation scheme for modified “Out-of-Africa” dispersal with adjusted 
parameters for divergence time (T) and population size for Jomon. Simulation data was 
generated using ms (Hudson, 2002) to reflect the possible scenarios for when Jomon diverging from 
the rest of Asia and how big was their population at that time. 

 
Likelihoods of observing the ROH spectrum similar to that observed in JpKa6904 

were estimated using the method developed in Osada et al. (2013) that measures the 

similarity of summary data (i.e. ROH spectrum) between the observed and simulated 

data based on a kernel density estimate and that calculates an approximated marginal 

likelihood (aML) of a given model. A normalized Gaussian kernel function with a 

bandwidth of 1 was used to compute kernel density estimates of aMLs. These estimates 

were then compared to identify the best-fitting model as an approximate Bayes factor 

(aBF); the aBFs were calculated between a model with the highest aML and any other 

models.  

 
28 

 
Admixture modelling 

Admixture events were modelled using qpWave v600 and qpAdm v1000 in the 

AdmixTools v6. 0 package (Patterson et al., 2012, Haak et al., 2015) using the merged 

merged datasets of ancient East Eurasians and the SGDP populations. This analysis 

used transversion sites with global minor allele frequencies >1% and used the 

parameter option of “allsnps: YES”.  

“Right” populations included as outgroups in the analysis consisted of 8 Eurasian 

populations: Sardinian (n = 3), Kusunda (n = 2), Papuan (n = 14), Dai (n = 4), Ami (n 

= 2), Naxi (n = 3) (26), Tianyuan (n = 1) (Yang et al., 2017), Chokhopani (n = 1) (Jeong 

et al., 2016), and Mal’ta (n = 1) (Raghavan et al., 2014), and this was conducted both 

with and without an additional subset of Jomon individuals (JpKa6904, JpOd181, and 

IK002).  

Populations chosen to be investigated as potential sources of ancestry based on f4 

statistics and were only considered as sources of an admixture event only if the 

admixture between the sources is supported from modelling both with and without a 

subset of Jomon in the right populations. 

Nested p-values comparing the likelihoods of two models were calculated in R (i.e., 

does Model A fit better than Model B) using the formula: 1 − 𝑝𝑐ℎ𝑖𝑠𝑞(χெ௢ௗ௘௟_஺
ଶ −

χெ௢ௗ௘௟_஻
ଶ , 𝑑𝑜𝑓 =  𝑑𝑜𝑓ெ௢ௗ௘௟_஺ − 𝑑𝑜𝑓ெ௢ௗ௘௟_஻, in which “dof” refers to its degree of freedom. 

 
Dating admixture events by DATES 

DATES v753 (Narasimhan et al., 2019) was used in this analysis to estimate the time 

of two different admixture events in the Kofun individuals: 

 Admixture between Jomon and Northeast Asian ancestry – in this 

model, Northeast Asian ancestry was represented by West Liao River 

individuals “WLR_BA_o” and “HMMH_MN”  

 Admixture between Jomon and East Asian ancestry – in this model 

East Asian ancestry was represented by Han Chinese in Beijing (CHB) in the 

1000 Genome Phase 3 panel (1000 Genomes Project, 2015)   


29 

 
The estimated date in generation was converted into years with the assumption of 25 

years per generation, which was further added into a mean age of median dates from 

three Kofun individuals (i.e., 1,348 years before present). The parameters used were: 

binsize: 0.001, maxdis: 1.0, runmode: 1, mincount: 1, and 

lovalfit: 0.45. The standard error was estimated from a weighted block jackknife 

method. 

 
Testing genetic continuity 

A strict test for genetic continuity was conducted between individuals associated with 

the Kofun period and present-day Japanese using the method outlined by Schraiber 

(2018).  


30 

 
Results  

Newly sequenced ancient genomes from pre- and protohistoric Japan 

A sufficiently high level of endogenous human DNA was found to be preserved in 12 of 

the total 14 sampled ancient specimens to warrant further shotgun sequencing to 

higher coverage. Non-UDG treated libraries showed post-mortem damage patterns 

(Appendix Fig. 1.1) consistent with that expected in ancient DNA. Whole genome 

coverages obtained from these 12 samples ranged from 0.88× to 7.51×. (Fig. 1.6 and 

Appendix Table 1.1) and the estimated level of modern human contamination was 

determined to be suitably low so as not to influence downstream analysis (<2.15%) 

(Appendix Table 1.4).  

Of the 12 individuals, 8 were determined to be female (6 Jomon and 2 Kofun) and 4 

male (3 Jomon and 1 Kofun) (Appendix Table 1.5). Kinship analysis confirmed that 

none of the individuals were first- or second-degree relatives (Appendix Fig. 1.2).  

Analysis of uniparental markers showed a clear distinction into Jomon or Kofun 

samples (Table 1.1). Mitochondrial haplogroups for all Jomon individuals belong to 

either the N9b or M7a clades, which have been firmly associated with this population 

in previous studies (Kanzawa-Kiriyama et al., 2017, McColl et al., 2018, Gakuhari et 

al., 2020, Kanzawa-Kiriyama et al., 2019, Adachi et al., 2011); these haplotypes are 

rare outside of Japan in present-day populations (Tanaka et al., 2004). The three 

Jomon males belong to the Y chromosome haplogroup D1b1, which is present in 

modern Japanese populations but almost absent in other East Asians (Hammer et al., 

2006). In contrast, the Kofun individuals all belong to mitochondrial haplogroups that 

are common in present-day East Asians (Zheng et al., 2011), while the single Kofun 

male has the O3a2c Y chromosome haplogroup, which is also found throughout East 

Asia, particularly in mainland China (Wang and Li, 2013).  

  
31 

 
(a)      (b) 

 
Figure 1.6. Sampling locations, dates, and genome coverage of ancient Japanese 
individuals. (a) The individual locations for all ancient Japanese data are shown across the Japanese 
archipelago. Sites are marked with circles for individual genomes that were newly sequenced from this 
project and triangles if previously reported (Table 1.1 and Appendix Table 1.1). The colours represent 
three different periods of Japanese pre-(Jomon and Yayoi) and protohistory (Kofun). (b) Each 
sequenced individual is plotted with whole-genome coverage on the x-axis and median age (years before 
present) on the y-axis. The nine Jomon individuals are further split into five different sub-periods based 
on their ages (see Supplementary Note 1): Initial (JpKa6904), Early (JpOd274, JpOd6, JpOd282, 
JpOd181, and JpFu1), Middle (JpKo2), Late (JpKo13, JpHi01, F23, and F5), and Final (IK002). 


32 

 
 Table 1.1. Summary of the time series of ancient Japanese data.  

A total of 17 ancient Japanese individuals were included in this analysis. 

  
Associated 
culture 

Sample ID 

Date range 
and 
median 
(cal BP) 

Coverage 
mtDNA 
contamination 
rate (%) 

Molecular 
sex 

mtDNA 
haplogroup 

Y chr 
haplogroup 

Ref. 

Newly sequenced in this study 

Jomon 

JpKa6904 
8,646-8,991; 
8,819 

7.51 1.46 XX N9b3 - - 

JpOd274 
6,119-6,289; 
6,204 

1.56 1.13 XY M7a D1b1d1 - 

JpOd6 
5,934-6,179; 
6,057 

1.18 1.55 XX N9b3 - - 

JpOd181 
5,751-5,917; 
5,834 

1.83 0.91 XY N9b1 D1b1d1 - 

JpOd282 
5,737-5,902; 
5,820 

0.96 1.38 XY M7a1 D1b1d1 - 

JpFu1 
5,478-5,590; 
5,534 

1.13 2.15 XX M7a1 - - 

JpKo2 
4,294-4,514; 
4,404 

2.47 1.44 XX N9b - - 

JpKo13 
3,847-3,978; 
3,913 

1.81 

 
1.50 XX N9b1 - - 

JpHi01 
3,685-3,850; 
3,768 

0.88 1.45 XX M7a1a - - 

Kofun 

JpIw32 
1,347-1,409; 
1,378 

4.80 0.41 XY B5a2a1b O3a2c - 

JpIw31 
1,303-1,377; 
1,340 

1.44 0.63 XX D5c1a - - 

JpIw33 
1,295-1,355; 
1,325 

1.54 0.75 XX M7b1a1a1 - - 

Previously published 

Jomon 

F23 
3,550-3,960; 
3,755 

34.82 1.20 XX N9b1 - Kanzawa-
Kiriyama et al. 
(2019) 

F5 - 3.74 2.45 XY N9b1 D1b2b 

IK002 
2,418–
2,720; 2,569 

1.85 0.50 XX N9b1 - 
McColl et al. 
(2018) 

Yayoi 

Yayoi_1 - 0.01 2.92 XX M7a1a4 - 

Shinoda et al. 
(2019) 

Yayoi_2 
1,931-2,001; 
1,966 

0.07 2.33 XY D4a1 O 


33 

 
Genetic distinction between different cultural periods 

Several broad analyses were conducted to get an initial idea of the genetic profile of 

the newly sequenced individuals and their affinities to other samples both within 

Japan and in the wider global dataset. The inter-individual variation within ancient 

and present-day Japanese was first explored by constructing a heatmap of the shared 

genetic drift between every combination of all individuals within the constructed 

SGDP dataset as measured using the statistic f3(Individual_1, Individual_2; Mbuti) 

(Fig. 1.7a). This analysis is able to clearly define three distinct clusters within the 

Japanese samples: Jomon, Yayoi, and a group containing Kofun individuals and 

present-day Japanese individuals.  This separation indicates that each archaeological 

period has a distinct genomic profile with an implication of potential genetic 

continuity within Japan over the past 1,500 years. In spite of the large spatial and 

temporal variation in the Jomon dataset, notably high levels of shared drift are 

observed across all twelve individuals. The two Yayoi individuals also show a high 

affinity to each other and appear to be genetically closer to the Jomon than to the 

Kofun individuals. Consistent with the previous kinship analysis results (Appendix Fig. 

1.2), none of the values between any pair of individuals is high enough to suggest any 

first- or second-degree relatives within the dataset. 

To further contextualise the level of shared drift within each of the different Japanese 

groups, the same pairwise outgroup f3 was calculated within a selected set of ancient 

and present-day populations (Fig. 1.7b). These results highlight the lack of genetic 

variation within the Jomon population, whose mean pairwise f3 value was higher than 

all other included ancient and present-day populations with the exception of the 

present-day Native American Surui and Karaitiana, who have been previously 

reported as being among the most homozygous modern human populations reflecting 

a small population size (Ceballos et al., 2018). The two individuals associated with the 

Yayoi period also share a high degree of drift with each other; this value may be 

influenced by their previously reported Jomon ancestry (Shinoda et al., 2019) or by 

the fact that two individuals were sampled from the same site. In contrast, the three 

Kofun individuals, who were also sampled from a single site, exhibit a level of diversity 

that is comparable to other ancient and present-day East Asian populations. 


34 

 
(a) 

 
(b) 

 
Figure 1.7. Pairwise outgroup-f3 statistics analysis. (a) A heatmap of pairwise outgroup f3-
statistics comparisons between all ancient and modern Japanese individuals. Jomon samples are 
marked in red, Yayoi in gold, Kofun in blue, and present-day Japanese in green. (b) Each boxplot 
showing the values of pairwise comparisons of all individuals in a number of ancient and present-day 
populations. The average pairwise value for Jomon is highlighted as a vertical line.  

 
35 

 
The genome-wide autosomal affinities between ancient Japanese individuals and 

other continental Asian populations were established using principal component 

analysis (PCA). Ancient individuals were projected onto the genetic variation of 

present-day populations in the SGDP dataset from South and Central Asia, Southeast 

and East Asia, and Siberia (Fig. 1.8; Appendix Fig. 1.3). Once again, ancient Japanese 

individuals separated into their respective cultural designations along PC1. All Jomon 

individuals form a tight cluster, appearing distinct from other ancient populations, as 

well as present-day Southeast and East Asians, suggesting a unique genetic profile 

among the other populations. The two Yayoi individuals appear next to this Jomon 

cluster, consistent with their previously reported morphological similarity and genetic 

ancestry (Shinoda et al., 2019). A clear shift towards East Asian populations implies 

the presence of additional continental ancestry in Yayoi. PC2 shows a geographic cline 

of ancient East Eurasians from the south to the north ancient East Eurasian 

populations (Southern China, Yellow River, Northern China, West Liao River, Devil’s 

Gate Cave, Amur River, and Baikal). The three individuals from the Kofun period fall 

within the diversity of the Yellow River cluster. 

 
36 

 
Figure 1.8. Principal component analysis (PCA) of present-day East Eurasia with 
projected ancient individuals. Principal component analysis (PCA) visualising ancient Japanese 
individuals (i.e., Jomon, Yayoi, and Kofun) and continental ancient individuals (presented as coloured 
symbols consistent with the topographic map above) projected onto 112 present-day East Eurasians 
(represented by gray circles with Japanese highlighted in dark green).   


37 

 
An ADMIXTURE analysis using the “Human Origins Array” (HOA) data panel was 

conducted to identify ancestral components in each individual with the specified 

number of ancestral components (denoted as K) ranging from 2 to 12 (highlighted 

results in Fig. 1.9; full results in Appendix Fig. 4-5). In this analysis, a distinguishable 

Jomon component is observed from K=9 onwards (as represented by the colour red), 

which again highlights its unique profile within Asia. This component is also visible at 

high levels in “Yayoi_2” (“Yayoi_1” was excluded from this analysis for not meeting 

the minimum SNP threshold, 100,000 SNP sites) and at a reduced level in Kofun and 

Japanese. This Yayoi individual also features a number of other ancestral components 

not seen in Jomon; these proportions appear similar to the profile seen in the Amur 

River Basin and surrounding areas including a larger component that is dominant in 

Northeast Asians (represented by light blue) and another smaller component that 

represents much broader East Asian ancestry (represented by yellow). This East Asian 

component becomes much more dominant in the Kofun period and the modern 

Japanese population. 

These three initial analytical approaches create a consistent narrative about genomic 

changes within Japan over the past ~9,000 years. The Jomon appear to be a distinctive 

population within ancient and present-day Asia and seem to exhibit a low diversity 

within all twelve individuals in spite of representing a wide temporal and geographic 

range. While the individuals from the Yayoi period show a high affinity for the Jomon 

population, there are still clear differences in its relationship with continental 

populations and the composition of its ancestral profiles. The individuals from the 

Kofun period appear to be strongly distinct from those from the preceding cultures in 

all analyses, showing a higher affinity to present-day East Asians and lower shared 

genetic drift; they do, however, appear to retain a lingering Jomon ancestry. Overall, 

these results provide strong evidence that the major cultural changes in Japan were 

accompanied by changes in the region’s genomic profile. The next few sections will 

look at each of the Jomon, Yayoi and Kofun populations individually to further 

characterise their distinctive genetic identity. 

 
38 

 
Figure 1.9. Highlighted results from ADMIXTURE analysis. A subset of the results of the 
complete ADMIXTURE are presented here, showing the ancestral breakdown for Japanese and Ancient 
East Eurasian individuals (representing K = 11; complete pictures for all present-day and ancient 
individuals from K = 2 to K = 12 are presented in Appendix Fig. 1.5). The middle and bottom rows show 
selected Ancient East Eurasian populations from each geographical regions: Southern China (from left 
to right: Liangdao1, Liangdao2, Xitoucun), Yellow River (YR) (Middle Neolithic, YR_MN; Late 
Neolithic, YR_LN; Late Neolithic, Upper_YR_LN; Late Bronze Age/Iron Age, YR_LBIA; Iron Age, 
Upper_YR_IA), Northern China (Yumin, Bianbian, Boshan, Xiaogao), West Liao River (WLR) (Middle 
Neolithic, WLR_MN; Late Neolithic, WLR_LN; Bronze Age, WLR_BA; Middle Neolithic individual 
from Haminmangha, HMMH_MN; Bronze Age individual with a different genetic background from 
WLR_BA, WLR_BA_o), Amur River (AR) (Early Neolithic, AR_EN; Iron Age, AR_IA and 
AR_Xianbei_IA), Baikal (Early Neolithic, Shamanka_EN and Lokomotiv_EN), and Central Steppe 
(Botai, CentralSteppe_EMBA, Okunevo_EMBA). This set of the results shows a distinct Jomon 
ancestral component (represented by red), a component common in ancient samples from the Baikal 
region and the Amur River basin (represented by light blue), a broad East Asian component 
(represented by yellow) and a gray component that is most dominant in the Central Steppe.  

 
39 

 
Deep divergence of the Jomon lineage by geographic isolation 

The separation of the Jomon from other populations in the previous analyses supports 

the idea that they form a distinct lineage among East Eurasians, as proposed in 

previous studies (Gakuhari et al., 2020, Kanzawa-Kiriyama et al., 2019). The depth of this 

divergence was further explored here through reconstructing the phylogenetic tree of 

the twelve Jomon grouped as a single population with 17 other ancient and present-

day populations using TreeMix (Pickrell and Pritchard, 2012) with differing numbers 

of admixture events (Fig. 1.10). These results infer that the Jomon lineage emerged 

after the early divergences of Upper Paleolithic East Eurasians (Tianyuan and Salkhit) 

and ancient Southeast Asian hunter-gatherers (Hoabinhnian), but prior to the 

splitting off of other samples including present-day East Asians, an ancient Nepali 

(Chokhopani), hunter-gatherers from Baikal (Shamanka_EN and Lokomotiv_EN) 

and Chertovy Vorota Cave (Devil’s Gate Cave) in the Primorye Region, and a 

Pleistocene Alaskan (USR1).  

An interesting additional observation from this analysis is that phylogenetic results 

consistently infer gene flow from Jomon to the modern Japanese across all migration 

models tested, with genetic contributions ranging from 8.9% to 11.5% (Appendix Fig. 

1.6). This is consistent with the mean Jomon component of 9.31% in the present-day 

Japanese individuals estimated in the ADMIXTURE analysis (Fig. 1.9), supporting a 

lingering genetic contribution to present-day Japanese populations. 

 
Figure 1.10. Phylogenetic analysis of the Jomon lineage. Maximum-likelihood phylogenetic 
tree reconstructed by TreeMix under a model of two migrations. The tree shows a relationship among 
ancient (bold) and present-day (italic) populations. Coloured arrows represent the migration pathways 
and signals of admixture among all datasets. The migration weight represents the fraction of ancestry 
derived from the migration edge. All other migration models from m = 0 to m = 5 are presented in 
Appendix Fig. 1.8.   


40 

 
To further interrogate the phylogenetic position inferred by TreeMix for Jomon, and 

to get a more comprehensive picture of its genetic affinities with other ancient and 

present-day Asian populations, a series of formal tests were conducted based on f4 

statistics (Patterson et al., 2012). These tests are used to assess symmetry in allele 

sharing between all tested populations with the Jomon across different specified 

background populations, and are identified as being significantly closer to the Jomon 

(Z>3), significantly closer to the background population (Z<-3), or symmetrically 

related to both if there no significant difference in allele sharing (-3>Z>3). A wide-

spanning analysis of the SGDP panel in the form of  f4(Mbuti, X; Y, Jomon) - in which 

X refers to ancient and present-day populations in the SGDP panel and Y refers to a 

series of chosen background populations - show a clear difference in the overall pattern 

of genomic affinities based on the choice of background populations (Fig. 1.11a-e). 

Every East Asian population since the beginning of the Jomon period has a 

significantly higher affinity to Jomon compared to Hoabinhnian, but not when 

compared to Devil’s Gate Cave. This pattern suggests that the divergence of a 

population ancestral to Jomon occurred after that of the Southeast Asian Hoabinhnian 

and before the population from Devil’s Cave, consistent with the tree inferred by 

TreeMix. A previously proposed model in which Jomon were a mixture of 

Hoabinhnian and East Asian-related lineages (McColl et al., 2018, Wang et al., 2021a), 

was not supported by direct testing with (f3(Jomon; Hoabinhnian, DevilsCave_N) = 

0.193, Z = 61.355), or through two-way modelling using qpWave. (Appendix Table 

1.6). These results support a phylogeny in which three distinct hunter-gatherer 

lineages emerged in East Eurasia. 

An unexpected result from f4(Mbuti, X; Y, Jomon) is that the 31.6kya Upper Paleolithic 

“Yana_UP” pair of individuals (Sikora et al., 2019) are the only population that are 

significantly closer to Jomon than Han, Dai, or Japanese respectively (Z > 3.366). (Fig. 

1.11c-e). This affinity is still detectable even if when the reference populations is 

replaced with the other Southeast and East Asians (Appendix Table 1.7), supporting 

potential genetic exchange between the ancestors of Jomon and Ancient North 

Siberians, a lineage whose ancestry was widespread in North Eurasia prior to the LGM 

(Sikora et al., 2019, Massilani et al., 2020). A genetic affinity to the ~3.3kya 

Namazaga_CA population from South Turkmenistan (de Barros Damgaard et al., 

2018) is also observable when Han or Dai are used as a background population. This 


41 

 
result is unexpected, given the geographic distance between this population and the 

Japanese archipelago and the relatively recent age of the population. It is possible that 

this detected affinity may be unduly influenced by technical biases in how this test was 

conducted, and that the relationship between the two ancient populations of 

pseudohaploid individuals (Jomon and  Namazaga_CA) was overestimated when 

compared to modern diploid reference populations, rather than reflecting a true 

relationship between these populations. 

 
(a)  f4(Mbuti, X; Hoabinhian, Jomon) 

 
(b) f4(Mbuti, X; DevilsCave_N, Jomon) 

 
(c) f4(Mbuti, X; Japanese, Jomon) 

 
42 

 
(d) f4(Mbuti, X; Han, Jomon) 

 
(e) f4(Mbuti, X; Dai, Jomon) 

 
Figure 1.11. Geographic and tem