VK545 Ancient DNA Revisited
Posted: Mon, 2024-Jan-29 7:28 pm
This is a follow-on to the Population Genomics of the Viking World thread.
As best I can determine, the VK545 sample from the Ship Street, Dublin, Ireland site was aligned and analyzed using the GRCh37/hg19 reference. YFull shows this sample as R1b-A225 under R1b-A223. However, GRCh37/hg19 is an older assembly and is no longer the best reference for this kind of analysis.
[ https://www.yfull.com/branch-info/R-Y3646/#t4-tab ]
Given the enhanced coverage YFull is seeing with current Y-DNA sequencing aligned to the CP086569.2 T2T reference, I thought it would be interesting to realign the VK545 sample to it as well. I downloaded the VK545.final.bam file from ENA and realigned it in three steps: Samtools to convert the BAM file to FASTQ, BWA-MEM to align the reads to the CP086569.2 reference, and Samtools again to remove duplicates. I then used Bcftools mpileup/call to generate a VCF. Some intermediate steps in the pipeline have been left out for simplicity. I did a whole-genome realignment, not just a chrY realignment.
[ https://www.ebi.ac.uk/ena/browser/view/ ... show=reads ]
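For anyone wanting to try something similar, the steps described above might be strung together roughly as follows. This is only a sketch and not the exact pipeline used: the intermediate steps that were left out are not reconstructed here, and the file names, thread count, single-end read handling, the `fixmate` step, and the haploid `--ploidy 1` call restricted to the chrY contig are all assumptions. The function only builds the shell commands as strings so they can be reviewed before anything is run.

```python
import shlex


def build_realign_commands(bam, ref_fasta, prefix, y_contig="CP086569.2", threads=4):
    """Return shell commands for a BAM -> FASTQ -> BWA-MEM -> dedup -> VCF run.

    All file names and option choices are illustrative assumptions, not the
    settings actually used for VK545.
    """
    q = shlex.quote
    fastq = f"{prefix}.fq"
    fixmate = f"{prefix}.fixmate.bam"
    sorted_bam = f"{prefix}.sorted.bam"
    dedup = f"{prefix}.dedup.bam"
    vcf = f"{prefix}.chrY.vcf"
    return [
        # 1. Convert the published BAM back to FASTQ (assumes single-end or merged reads).
        f"samtools fastq {q(bam)} > {q(fastq)}",
        # 2. Index the new reference for BWA (only needed once per reference).
        f"bwa index {q(ref_fasta)}",
        # 3. Realign with BWA-MEM; fixmate -m adds the tags that markdup relies on.
        f"bwa mem -t {threads} {q(ref_fasta)} {q(fastq)} | samtools fixmate -m - {q(fixmate)}",
        # 4. Coordinate-sort, remove duplicates (-r), and index the result.
        f"samtools sort -o {q(sorted_bam)} {q(fixmate)}",
        f"samtools markdup -r {q(sorted_bam)} {q(dedup)}",
        f"samtools index {q(dedup)}",
        # 5. Pile up and call haploid variants on the chrY contig only.
        f"bcftools mpileup -f {q(ref_fasta)} -r {q(y_contig)} {q(dedup)}"
        f" | bcftools call -mv --ploidy 1 -o {q(vcf)}",
    ]


if __name__ == "__main__":
    # "t2t_reference.fa" and the "VK545.t2t" prefix are hypothetical names.
    for cmd in build_realign_commands("VK545.final.bam", "t2t_reference.fa", "VK545.t2t"):
        print(cmd)
```

Printing the commands first makes it easy to adjust them, for example for paired-end data, before running each one by hand.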
Doing this, I got a very interesting result. I had previously written a PHP script to help me analyze VCFs. It is geared towards kits that have upgraded their NGS results rather than ancient Y-DNA, but it works well enough for ancient samples. It first looks for all known variants upstream of R1b-DF104 in the sample VCF, confirms them, and then ignores them. It then looks for all known variants under R1b-DF104 and reports those, along with any unknown variants. Here is a table of the results:
The TIER column represents the subclade depth under R1b-DF104, which is TIER 0. A TIER value of 100 means the variant is unknown. The remaining columns are standard VCF columns.
| TIER | POS | ID | REF | ALT | QUAL | FILTER | INFO |
| 1 | 13391744 | DF105 | G | A | 46.4146 | . | CLADE=R1b-DF105;DP=2 |
| 1 | 21254183 | DF109 | A | T | 8.99921 | . | CLADE=R1b-DF105;DP=1 |
| 2 | 13575770 | FGC65032 | C | T | 8.99921 | . | CLADE=R1b-FGC65031;DP=1 |
| 4 | 20100623 | MF165555 | C | T | 7.30814 | . | CLADE=R1b-FTB39547;DP=1 |
| 5 | 20588609 | BY18188 | C | T | 7.30814 | . | CLADE=R1b-BY18120;DP=1 |
| 5 | 20721167 | FT119260 | G | A | 8.99921 | . | CLADE=R1b-FT115566;DP=1 |
| 6 | 11711058 | FGC8438 | G | A | 5.75677 | . | CLADE=R1b-BY18320;DP=1 |
| 6 | 16714581 | A11307 | G | A | 8.99921 | . | CLADE=R1b-A11307;DP=1 |
| 7 | 10467844 | FTB91864 | G | A | 8.99921 | . | CLADE=R1b-BY48495;DP=1 |
| 7 | 20877246 | FTC5774 | G | A | 8.13869 | . | CLADE=R1b-FTC5557;DP=1 |
| 8 | 7304629 | BY18132 | G | A | 8.99921 | . | CLADE=R1b-BY18132;DP=1 |
| 8 | 7318089 | M9520 | C | T | 8.99921 | . | CLADE=R1b-B24;DP=1 |
| 8 | 8634006 | Y129700 | G | A | 8.99921 | . | CLADE=R1b-FT285980;DP=1 |
| 8 | 17468716 | FTB12149 | G | A | 8.99921 | . | CLADE=R1b-BY47745;DP=1 |
| 8 | 18252929 | BY20817 | G | A | 8.99921 | . | CLADE=R1b-FT82182;DP=1 |
| 8 | 20099847 | BY132284 | G | A | 8.99921 | . | CLADE=R1b-BY146806;DP=1 |
| 9 | 6894312 | PH432 | G | A | 3.22451 | . | CLADE=R1b-FT109536;DP=1 |
| 9 | 15071932 | BY106599 | G | A | 8.99921 | . | CLADE=R1b-BY98600;DP=1 |
| 10 | 3436741 | FTC32491 | C | T | 8.99921 | . | CLADE=R1b-FTA14514;DP=1 |
| 10 | 15154507 | FGC62848 | G | A | 8.13869 | . | CLADE=R1b-FGC62843;DP=1 |
| 10 | 15525920 | Y52254 | C | T | 3.22451 | . | CLADE=R1b-BY18352;DP=1 |
| 11 | 8341530 | BY73963 | G | A | 5.04598 | . | CLADE=R1b-BY65078;DP=1 |
| 12 | 16208403 | Y26014 | G | A | 8.99921 | . | CLADE=R1b-Y26014;DP=1 |
| 13 | 17421295 | FT207643 | C | T | 8.99921 | . | CLADE=R1b-BY137737;DP=1 |
| 13 | 27258802 | FT178070 | C | T | 8.99921 | . | CLADE=R1b-BY16967;DP=1 |
| 14 | 12603472 | FGC19840 | G | A | 6.51248 | . | CLADE=R1b-FGC19856;DP=1 |
| 100 | 13157239 | . | G | A | 78.4149 | . | QD=26;DP=3;VDB=0.470313;SGB=-0.511536;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,1,2;MQ=60 |
| 100 | 15092006 | . | A | G | 104.415 | . | QD=26;DP=4;VDB=0.0320192;SGB=-0.556411;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,1,3;MQ=60 |
| 100 | 16408342 | . | G | A | 74.4149 | . | QD=24;DP=3;VDB=0.0900131;SGB=-0.511536;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,2,1;MQ=60 |
| 100 | 18136756 | . | G | A | 77.4149 | . | QD=25;DP=3;VDB=0.71601;SGB=-0.511536;MQSBZ=0;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,2,1;MQ=60 |
| 100 | 18189304 | . | C | T | 62.4147 | . | QD=20;DP=3;VDB=0.318564;SGB=-0.511536;FS=0;MQ0F=0;AC=1;AN=1;DP4=0,0,3,0;MQ=60 |
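The way the TIER values in the table are assigned can be illustrated with a short sketch. This is written in Python rather than PHP and is not the actual script: the `KnownVariant` type, the `classify` function, the use of a negative tier to mark upstream variants, and the `UPSTREAM_EXAMPLE` entry at position 1000 are all hypothetical. Only the DF105, DF109, and 13157239 rows are taken from the table.

```python
from dataclasses import dataclass

UNKNOWN_TIER = 100  # same convention as the table: 100 means "not a known variant"


@dataclass(frozen=True)
class KnownVariant:
    name: str
    clade: str
    tier: int  # depth under R1b-DF104; a negative value marks an upstream variant


def classify(records, known):
    """Split VCF records into confirmed upstream names and reportable rows.

    records: iterable of (pos, ref, alt) tuples taken from the sample VCF.
    known:   dict mapping (pos, ref, alt) -> KnownVariant.
    """
    upstream, report = [], []
    for pos, ref, alt in records:
        hit = known.get((pos, ref, alt))
        if hit is None:
            report.append((UNKNOWN_TIER, pos, ".", ref, alt))
        elif hit.tier < 0:
            upstream.append(hit.name)  # confirmed, then left out of the report
        else:
            report.append((hit.tier, pos, hit.name, ref, alt))
    report.sort()  # orders rows by tier, then position
    return upstream, report


if __name__ == "__main__":
    known = {
        (1000, "C", "T"): KnownVariant("UPSTREAM_EXAMPLE", "R1b-EXAMPLE", -1),  # made up
        (13391744, "G", "A"): KnownVariant("DF105", "R1b-DF105", 1),
        (21254183, "A", "T"): KnownVariant("DF109", "R1b-DF105", 1),
    }
    records = [(1000, "C", "T"), (13157239, "G", "A"), (13391744, "G", "A")]
    upstream, report = classify(records, known)
    print("confirmed upstream:", upstream)
    for row in report:
        print(row)
```

Matching on position plus REF and ALT, rather than position alone, avoids reporting a known name when the sample carries a different allele at the same site.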
Since the read depth is mostly 1, these calls are, of course, somewhat problematic. DF105 has 2 reads and DF109 has 1 read. The other calls under R1b-DF104 are probably noise or contamination given their low QUAL scores, although DF109 has a similarly low score. I don't have a tool to examine whether DF108 is negative or a no-call in the realigned BAM file. The 5 variants at the bottom of the table, however, are relatively strong, with read depths of 3 and 4. That assumes they are not sequencing artifacts, and it seems unlikely that they are.
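The distinction drawn above between weakly and strongly supported calls can be made explicit with a simple filter. The sketch below uses DP and QUAL values copied from the table. The `split_by_support` function and its thresholds (at least 2 reads and QUAL of at least 20) are assumptions chosen for illustration, not established cutoffs for ancient DNA.

```python
def split_by_support(calls, min_depth=2, min_qual=20.0):
    """Separate calls into (strong, weak) labels using depth and QUAL cutoffs.

    calls: iterable of (label, depth, qual) tuples.
    The default thresholds are illustrative assumptions.
    """
    strong, weak = [], []
    for label, depth, qual in calls:
        if depth >= min_depth and qual >= min_qual:
            strong.append(label)
        else:
            weak.append(label)
    return strong, weak


if __name__ == "__main__":
    # (ID or position, DP, QUAL) for a subset of rows from the table.
    calls = [
        ("DF105", 2, 46.4146),
        ("DF109", 1, 8.99921),
        ("FGC65032", 1, 8.99921),
        ("13157239", 3, 78.4149),
        ("15092006", 4, 104.415),
        ("16408342", 3, 74.4149),
        ("18136756", 3, 77.4149),
        ("18189304", 3, 62.4147),
    ]
    strong, weak = split_by_support(calls)
    print("strong:", strong)
    print("weak:", weak)
```

With these cutoffs, DF105 and the 5 unnamed variants land in the strong group, while DF109 and FGC65032 fall into the weak group.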
At this point, it is unlikely that these 5 variants are upstream of R1b-DF104. So IF DF108 is negative in the VK545 sample, the sample MAY split the R1b-DF105 phylogenetic node and sit in a parallel branch defined by those 5 variants. Since we have not seen this in current testing, perhaps this is an extinct branch. If so, then this is very interesting.
On the other hand, if DF108 is a no-call and likely positive, then I would assume that the VK545 sample represents a potentially new direct subclade under R1b-DF105, or possibly sits within one of the other direct subclades, depending on whether those variant positions are negative or no-calls. This is still quite interesting.
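Whether a position such as DF108 is negative or a no-call can, in principle, be checked without a viewer by looking at the pileup for that single site, for example with `samtools mpileup -a -f <reference> -r <contig>:<pos>-<pos> <bam>`. The sketch below parses one line of that text output. The `count_bases` and `classify_site` helpers, the example lines, the placeholder position 12345, and the derived allele "A" are all made up for illustration. The T2T coordinate of DF108 is not given here.

```python
import re


def count_bases(pileup_bases):
    """Count reference-matching and alternate bases in one mpileup base string."""
    s = re.sub(r"\^.", "", pileup_bases)  # drop read-start marker and mapping quality
    s = s.replace("$", "")                # drop read-end markers
    kept, i = [], 0
    while i < len(s):
        if s[i] in "+-":
            # Indel descriptions look like +2AG or -1T: a length, then that many bases.
            j = i + 1
            while j < len(s) and s[j].isdigit():
                j += 1
            if j > i + 1:
                i = j + int(s[i + 1:j])
                continue
        kept.append(s[i])
        i += 1
    ref_count = sum(1 for c in kept if c in ".,")
    alt_counts = {}
    for c in kept:
        base = c.upper()
        if base in ("A", "C", "G", "T"):
            alt_counts[base] = alt_counts.get(base, 0) + 1
    return ref_count, alt_counts


def classify_site(mpileup_line, derived_allele):
    """Return 'no-call', 'ancestral', or 'derived' for one mpileup output line."""
    fields = mpileup_line.rstrip("\n").split("\t")
    if int(fields[3]) == 0:
        return "no-call"  # no reads cover the position at all
    ref_count, alt_counts = count_bases(fields[4])
    if alt_counts.get(derived_allele.upper(), 0) > 0:
        return "derived"
    if ref_count > 0:
        return "ancestral"  # covered, and the reads match the reference
    return "no-call"


if __name__ == "__main__":
    # Made-up example lines: contig, position, ref base, depth, bases, qualities.
    for line in (
        "CP086569.2\t12345\tG\t0\t*\t*",
        "CP086569.2\t12345\tG\t2\t.,\tII",
        "CP086569.2\t12345\tG\t1\t^]A\tI",
    ):
        print(classify_site(line, "A"))
```

This treats "ancestral" as matching the reference base, which only holds when the reference itself carries the ancestral allele at that site. With a depth of 1 or 2, either answer remains tentative.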
Regardless, it seems clear that the VK545 sample is NOT R1b-A225+ since there are no confirming reads for it. I don't know whether anyone else has done or is doing realignment of ancient samples, but from my experiment it appears to be a worthwhile effort. Further, if anyone has the IGV tool or something similar, I will be happy to provide a copy of the realigned BAM file for viewing. Please PM me if you are interested.