nudup.py is a Python script that marks or removes PCR-introduced duplicate molecules based on the molecular tagging technology used in Tecan products. It can be used for both single-end and paired-end reads. It requires a SAM/BAM file and a FASTQ file as inputs.
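To illustrate the general idea of tag-based duplicate marking, here is a minimal sketch. It is not nudup.py's implementation. It assumes each read has already been reduced to a hypothetical tuple of (name, chromosome, position, strand, molecular tag), and it treats any later read with the same position, strand, and tag as a PCR duplicate.

```python
def mark_duplicates(reads):
    """Return the names of reads flagged as duplicates.

    `reads` is an iterable of (name, chrom, pos, strand, tag) tuples.
    This tuple layout is an assumption made for illustration only.
    """
    seen = set()
    duplicates = []
    for name, chrom, pos, strand, tag in reads:
        key = (chrom, pos, strand, tag)
        if key in seen:
            # Same alignment position and same molecular tag: likely a PCR copy.
            duplicates.append(name)
        else:
            seen.add(key)
    return duplicates


example_reads = [
    ("r1", "chr1", 100, "+", "ACGT"),
    ("r2", "chr1", 100, "+", "ACGT"),  # same position and tag as r1
    ("r3", "chr1", 100, "+", "TTGA"),  # same position, different tag
]
flagged = mark_duplicates(example_reads)
```

In this sketch r3 is kept even though it shares a position with r1, because its tag differs. That distinction is what molecular tagging adds over position-only duplicate removal.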
nudup.py can be obtained here.
The README indicates that nudup.py has only been tested on legacy software versions.
For expected behavior, I recommend configuring your environment to match the stated prerequisites:
> module load EBModules
> module load SAMtools/1.14-GCC-10.3.0
> module load Python/2.7.16-GCCcore-8.3.0
> which python
/grid/it/data/elzar/easybuild/software/Python/2.7.16-GCCcore-8.3.0/bin/python
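As an optional sanity check after loading the modules, a small script along these lines can confirm which interpreter and which samtools are active. This is a sketch, not part of nudup.py: the function names are hypothetical, and the Python 2.7 target is taken from the module loaded above.

```python
import subprocess
import sys


def python_matches(version_info, required=(2, 7)):
    """Return True when the interpreter's major.minor equals `required`."""
    return tuple(version_info[:2]) == tuple(required)


def samtools_version_line():
    """Return the first line of `samtools --version`, or None if it cannot run."""
    try:
        out = subprocess.check_output(["samtools", "--version"])
    except (OSError, subprocess.CalledProcessError):
        return None
    lines = out.decode("utf-8", "replace").splitlines()
    return lines[0] if lines else None


if __name__ == "__main__":
    print("Python 2.7 interpreter: %s" % python_matches(sys.version_info))
    print("samtools: %s" % (samtools_version_line() or "not found on PATH"))
```

The code is written to run under both Python 2.7 and Python 3, so it can be used before and after switching modules.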
Try running nudup.py without any options to ensure there are no errors:
> python nudup.py
usage: nudup.py [-2] [-f INDEX.fq|READ.fq] [-o OUT_PREFIX] [-s START]
                [-l LENGTH] [-T TEMP_DIR] [--old-samtools] [--rmdup-only] [-v]
                [-h]
                IN.sam|IN.bam
nudup.py: error: too few arguments
nudup.py works with aligned and sorted BAM or SAM files from next-generation sequencing (NGS).
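Because the input is expected to be sorted, it can help to inspect the file header before running the tool. The sketch below parses header text (for example, the output of `samtools view -H IN.bam`) and returns the `SO` value from the `@HD` line. The function name is hypothetical, and I am assuming that "sorted" here means coordinate-sorted. The header is only a hint, since some tools do not update it.

```python
def sort_order(header_text):
    """Return the SO value from the @HD line of a SAM header, or None if absent."""
    for line in header_text.splitlines():
        if line.startswith("@HD"):
            # Header fields are tab-separated TAG:VALUE pairs.
            for field in line.split("\t")[1:]:
                if field.startswith("SO:"):
                    return field[3:]
    return None


example_header = "@HD\tVN:1.6\tSO:coordinate\n@SQ\tSN:chr1\tLN:1000"
order = sort_order(example_header)
```

A return value other than `coordinate` (such as `unsorted`, `queryname`, or `None`) suggests sorting the file with `samtools sort` first.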
As a test, I cloned this repository, which contains some test BAM files.
> python nudup.py -o test_bam_dedup ../rnaseq/test/hisat2_k20.bam
2023-05-11 14:52:36,599 [ INFO] - Deduplicating NuGEN single end reads...
2023-05-11 14:52:36,654 [ INFO] - Processing sorted SAM/BAM with molecular tag sequence in read name (assumes sorted)
2023-05-11 14:52:36,929 [ INFO] - Aligned count: 24290
2023-05-11 14:52:36,929 [ INFO] - Unaligned count: 0
2023-05-11 14:52:36,929 [ INFO] - Molecular tag dups count: 0 (0.0000 rate)
2023-05-11 14:52:36,929 [ INFO] - Deduplication success.
2023-05-11 14:52:36,929 [ INFO] - Created output file test_bam_dedup.sorted.markdup.bam with duplicates marked
2023-05-11 14:52:36,929 [ INFO] - Created output file test_bam_dedup.sorted.dedup.bam with duplicates removed
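When running many samples, the counts can be pulled out of the log text rather than read by eye. The following is a minimal sketch that assumes the log lines keep the format shown above. The function name is hypothetical, and the text must be captured from whichever stream nudup.py writes its log to.

```python
import re

# Matches lines such as "... [ INFO] - Aligned count: 24290".
COUNT_RE = re.compile(
    r"\[\s*INFO\]\s*-\s*(Aligned|Unaligned|Molecular tag dups) count:\s*(\d+)"
)


def parse_counts(log_text):
    """Return a dict mapping each count label found in the log to its integer value."""
    counts = {}
    for line in log_text.splitlines():
        match = COUNT_RE.search(line)
        if match:
            counts[match.group(1)] = int(match.group(2))
    return counts


example_log = """\
2023-05-11 14:52:36,929 [ INFO] - Aligned count: 24290
2023-05-11 14:52:36,929 [ INFO] - Unaligned count: 0
2023-05-11 14:52:36,929 [ INFO] - Molecular tag dups count: 0 (0.0000 rate)
2023-05-11 14:52:36,929 [ INFO] - Deduplication success.
"""
counts = parse_counts(example_log)
```

For the run above, the parsed values are 24290 aligned reads, 0 unaligned reads, and 0 molecular tag duplicates.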
Robert Petkus 5/11/23