Subsampling with samtools

Subsampling is a way to filter your data. The subsampling script (below) will select bamfiles that meet a certain threshold of alignments, and then randomly subsample to meet that exact number of alignments. So you are comparing all individuals with the exact same number of alignments.

  • Look at Individual Alignments with samtools, can be done on bams (_.sort.bam), etc.
    • samtools flagstat ALAME_AH_2_R1.sort.flt.bam
  • Example output:
94696 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
94696 + 0 mapped (100.00%:-nan%)
94696 + 0 paired in sequencing
47492 + 0 read1
47204 + 0 read2
94696 + 0 properly paired (100.00%:-nan%)
94696 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
1458 + 0 with mate mapped to a different chr
655 + 0 with mate mapped to a different chr (mapQ>=5)
  • Look at sizes of alignments

    • du -hs *.sort.flt.bam | sort -h
    • Note that these are sorted filtered bam files
  • Need to pick a threshold for sizes, we generally like alignments over 100,000, as it’s conservative but robust.

  • Example output (notice first few samples didn’t have very many reads)

332K    NFAME_IH_3_R1.sort.flt.bam
468K    NFAME_IH_2_R1.sort.flt.bam
8.0M    ALAME_AH_2_R1.sort.flt.bam
25M NFMFA_SC_3_R1.sort.flt.bam
31M NFAME_IH_1_R1.sort.flt.bam
33M MFAME_GC_1_R1.sort.flt.bam
34M NFAME_RR_1_R1.sort.flt.bam
36M ALAME_AH_1_R1.sort.flt.bam
37M RUBIC_PH_2_R1.sort.flt.bam
38M ALAME_AC_2_R1.sort.flt.bam
38M ALAME_AH_3_R1.sort.flt.bam
38M MFAME_GC_2_R1.sort.flt.bam
38M NFMFA_SC_2_R1.sort.flt.bam
39M NFAME_RR_2_R1.sort.flt.bam
44M RUBIC_PH_3_R1.sort.flt.bam

Subsampling Script


#!/bin/bash -l

#SBATCH -o slurm_outs/02c_subsample-%j.out
#SBATCH -J subsample

list=$1
num=$2

wc=$(wc -l ${list} | awk '{print $1}')

x=1
while [ $x -le $wc ] 
do
        string="sed -n ${x}p ${list}" 
        str=$($string)

        var=$(echo $str | awk -F"\t" '{print $1}')   
        set -- $var
        c1=$1

        count=$(samtools view -c ${c1}.bam)

        if [ $num -le $count ]
        then
                frac=$(bc -l <<< $num/$count)
                samtools view -bs $frac ${c1}.bam > ${c1}_${num}.bam
        fi

        x=$(( $x + 1 ))

done

Filter out individuals/bamfiles with low alignments

Make a bam sublist

  • Make a bamlist of samples that exceed some threshold of alignments or reads (e.g, typically 120k for salmon).

Run a few PCA with these different thresholds

  • To assess where alignments become too low, run PCA’s with different subsamples (e.g. 90k, 80k, 50k) until structure begins to fall apart.
  • Make a list of the bam files (and remove bam)
  • ls *flt.bam | sed 's/\.bam//g' > sublist

Subset to a certain number, here 100k reads

  • sbatch sub_sample.sh sublist 100000

Create a new subsampled bamlist

  • ls *_100000.bam > _subbamlist