Subsampling bams

Subsampling with `samtools`

Subsampling is a way to filter your data. The subsampling script (below) will select bamfiles that meet a certain threshold of alignments, and then randomly subsample to meet that exact number of alignments. So you are comparing all individuals with the exact same number of alignments.

Look at Individual Alignments with samtools, can be done on bams (_.sort.bam), etc.
- samtools flagstat ALAME_AH_2_R1.sort.flt.bam
Example output:

94696 + 0 in total (QC-passed reads + QC-failed reads)
0 + 0 duplicates
94696 + 0 mapped (100.00%:-nan%)
94696 + 0 paired in sequencing
47492 + 0 read1
47204 + 0 read2
94696 + 0 properly paired (100.00%:-nan%)
94696 + 0 with itself and mate mapped
0 + 0 singletons (0.00%:-nan%)
1458 + 0 with mate mapped to a different chr
655 + 0 with mate mapped to a different chr (mapQ>=5)

Look at sizes of alignments
- du -hs *.sort.flt.bam | sort -h
- Note that these are sorted filtered bam files
Need to pick a threshold for sizes, we generally like alignments over 100,000, as it’s conservative but robust.
Example output (notice first few samples didn’t have very many reads)

332K    NFAME_IH_3_R1.sort.flt.bam
468K    NFAME_IH_2_R1.sort.flt.bam
8.0M    ALAME_AH_2_R1.sort.flt.bam
25M NFMFA_SC_3_R1.sort.flt.bam
31M NFAME_IH_1_R1.sort.flt.bam
33M MFAME_GC_1_R1.sort.flt.bam
34M NFAME_RR_1_R1.sort.flt.bam
36M ALAME_AH_1_R1.sort.flt.bam
37M RUBIC_PH_2_R1.sort.flt.bam
38M ALAME_AC_2_R1.sort.flt.bam
38M ALAME_AH_3_R1.sort.flt.bam
38M MFAME_GC_2_R1.sort.flt.bam
38M NFMFA_SC_2_R1.sort.flt.bam
39M NFAME_RR_2_R1.sort.flt.bam
44M RUBIC_PH_3_R1.sort.flt.bam

Subsampling Script


#!/bin/bash -l

#SBATCH -o slurm_outs/02c_subsample-%j.out
#SBATCH -J subsample

list=$1
num=$2

wc=$(wc -l ${list} | awk '{print $1}')

x=1
while [ $x -le $wc ] 
do
        string="sed -n ${x}p ${list}" 
        str=$($string)

        var=$(echo $str | awk -F"\t" '{print $1}')   
        set -- $var
        c1=$1

        count=$(samtools view -c ${c1}.bam)

        if [ $num -le $count ]
        then
                frac=$(bc -l <<< $num/$count)
                samtools view -bs $frac ${c1}.bam > ${c1}_${num}.bam
        fi

        x=$(( $x + 1 ))

done

Filter out individuals/bamfiles with low alignments

Make a bam sublist

Make a bamlist of samples that exceed some threshold of alignments or reads (e.g, typically 120k for salmon).

Run a few PCA with these different thresholds

To assess where alignments become too low, run PCA’s with different subsamples (e.g. 90k, 80k, 50k) until structure begins to fall apart.
Make a list of the bam files (and remove bam)
ls *flt.bam | sed 's/\.bam//g' > sublist

Subset to a certain number, here 100k reads

sbatch sub_sample.sh sublist 100000

Create a new subsampled bamlist

ls *_100000.bam > _subbamlist

Subsampling bams

Subsampling with samtools

Subsampling Script

Filter out individuals/bamfiles with low alignments

Subsampling `bams`

Subsampling with `samtools`