Usage

Finding BAM files

This command searches a directory, including any subdirectories, for BAM files, which it identifies by the '.bam' extension. It distinguishes symbolic links from actual files and records the location of the index file (if present) and the combined size of the BAM and index files. The output is sent to STDOUT and can be redirected to a file with '>'.

perl identity.pl --find --bam --input /path/directory/ --output outputFile.txt

Interpreting the output of --find --bam

The results are formatted as follows:

  1. File The file path and name
  2. FileName The file name without the path
  3. Location The file path without the name
  4. Link Whether the file is a symbolic link (Y) or not (N)
  5. LinkPath The location of the original file if it is a link (otherwise '.')
  6. Index The file path and name of the BAM index file if present (otherwise '.')
  7. Size The size in bytes of the BAM file and the BAM index together (0 for symbolic links)
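
A minimal sketch of reading this output in Python, assuming the columns are tab-separated and appear in the order listed above (the delimiter is not stated here, so treat it as an assumption). The FoundFile type, the parse_find_output function, and the sample paths are hypothetical names made up for illustration; they are not part of identity.pl.

```python
from typing import NamedTuple


class FoundFile(NamedTuple):
    """One row of --find output, in the documented column order."""
    file: str
    file_name: str
    location: str
    is_link: bool
    link_path: str
    index: str
    size: int


def parse_find_output(text: str) -> list[FoundFile]:
    """Parse --find output, ASSUMING tab-separated columns."""
    records = []
    for line in text.splitlines():
        row = line.split("\t")
        # Skip blank lines and any header row (non-numeric Size column).
        if len(row) != 7 or not row[6].isdigit():
            continue
        records.append(
            FoundFile(row[0], row[1], row[2], row[3] == "Y",
                      row[4], row[5], int(row[6]))
        )
    return records


# Invented example rows, not real output.
sample = (
    "/data/a.bam\ta.bam\t/data/\tN\t.\t/data/a.bam.bai\t1500\n"
    "/data/b.bam\tb.bam\t/data/\tY\t/archive/b.bam\t.\t0\n"
    "/data/c.bam\tc.bam\t/data/\tN\t.\t.\t900\n"
)
records = parse_find_output(sample)

# Real (non-link) BAM files that have no index recorded.
missing_index = [r.file for r in records if not r.is_link and r.index == "."]

# Total bytes on disk; symbolic links contribute 0 by definition.
total_bytes = sum(r.size for r in records)
```

Listing real files without an index is one possible use of the table, for example to decide which BAM files still need indexing before further work.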

Finding FinalReport files

This command searches a directory, including any subdirectories, for FinalReport files, which it identifies by the presence of 'FinalReport' in the file name and the '.txt' extension. It distinguishes symbolic links from actual files and records the size of the FinalReport file. The output is sent to STDOUT and can be redirected to a file with '>'.

perl identity.pl --find --fr --input /path/directory/ --output outputFile.txt

Interpreting the output of --find --fr

The results are formatted as follows:

  1. File The file path and name
  2. FileName The file name without the path
  3. Location The file path without the name
  4. Link Whether the file is a symbolic link (Y) or not (N)
  5. LinkPath The location of the original file if it is a link (otherwise '.')
  6. Index Not applicable to FinalReport files so recorded as '.'
  7. Size The size in bytes of the FinalReport file (0 for symbolic links)
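
The search rule described for FinalReport files can be approximated in a few lines of Python. This is only a sketch of the stated rule ('FinalReport' in the name, '.txt' extension, links reported with size 0), not the script's actual implementation; the function name find_final_reports is hypothetical, and the case-sensitive matching is an assumption.

```python
import os


def find_final_reports(root: str) -> list[tuple[str, bool, int]]:
    """Return (path, is_link, size_in_bytes) for files that look like FinalReports."""
    found = []
    # os.walk descends into every subdirectory below root.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Assumed rule: case-sensitive substring plus '.txt' extension.
            if "FinalReport" in name and name.endswith(".txt"):
                path = os.path.join(dirpath, name)
                is_link = os.path.islink(path)
                # Symbolic links are reported with size 0, as in the Size column.
                size = 0 if is_link else os.path.getsize(path)
                found.append((path, is_link, size))
    return sorted(found)
```

Calling find_final_reports("/path/directory/") would return one tuple per candidate file, which can be compared against the table produced by --find --fr.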

Creating a SNP barcode index for BAM files

This command generates a 289 SNP barcode for each sample, which is stored in FASTA format in a default index file using the file name and sample ID to make the contig name. The user needs to specify the file type (--bam), the genome build (--hg19 or --hg18), and the number of threads (for example, --threads 8), which should be equal to the number of available processors on the node.

perl identity.pl --index --bam --hg19 --threads 8 --input /path/directory/ --output indexFile.fa

Interpreting the output of --index

Generally you should not need to open the index generated by this script, since the script performs all the subsequent analysis that is required. However, it may be useful to inspect the index in the event of problems. If the --output option was used, the index will be in the specified file; otherwise the barcodes are appended to the default index, a file named 'FRBAM_289_idIndex.fa' in a subdirectory named '/indexForIdentity' where the script is installed.

The index contains a 289 SNP barcode in FASTA format for every FinalReport or BAM file that was present in the input directory. The one exception is if a file has exactly the same sample name, file name, and barcode as a previous entry in which case it will not be added. For a FinalReport the contig name will be composed of the sample ID and the FinalReport name separated by a pipe:

>4255644144_A|/cluster/sanders/genotypeFr/Reclustered_FinalReport134.txt

For a BAM file the contig name will be composed of the sample ID and the BAM file name separated by a pipe:

>11232.p1|/cluster/sanders/bamFiles/Samp_11232.p1_SSC_BWA_sort_RG.bam

The SNP barcode is composed of the following characters: 'A's, representing a homozygous SNP; 'C's, representing a heterozygous SNP; and 'N's, representing a no call. The SNPs are ordered by hg19 chromosome and position.
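
When inspecting an index by hand, a short Python sketch like the one below can split each contig name at the pipe and tally the barcode characters. The function name read_index is hypothetical, and the barcodes in the sample are shortened, invented strings (real barcodes have 289 characters). Whether the script wraps long barcodes over several lines is not stated, so the sketch accepts both layouts.

```python
from collections import Counter


def read_index(text):
    """Yield (sample_id, file_path, barcode) for each FASTA entry in the index text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield (*header, "".join(chunks))
            # Contig name is "<sample ID>|<file name>"; split on the first pipe.
            sample_id, _, path = line[1:].partition("|")
            header, chunks = (sample_id, path), []
        else:
            # Sequence lines are concatenated in case the barcode is wrapped.
            chunks.append(line)
    if header is not None:
        yield (*header, "".join(chunks))


# Headers follow the documented examples; barcodes are invented and shortened.
sample_index = (
    ">4255644144_A|/cluster/sanders/genotypeFr/Reclustered_FinalReport134.txt\n"
    "AACNAC\n"
    ">11232.p1|/cluster/sanders/bamFiles/Samp_11232.p1_SSC_BWA_sort_RG.bam\n"
    "AAC\n"
    "NNA\n"
)
entries = list(read_index(sample_index))

# Count homozygous (A), heterozygous (C), and no-call (N) positions per entry.
calls = {sample_id: Counter(barcode) for sample_id, _path, barcode in entries}
```

A high proportion of 'N' characters in an entry would suggest that few of the SNPs could be called for that file.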

Creating a SNP barcode index for FinalReport files

This command generates a 289 SNP barcode for each sample, which is stored in FASTA format in a default index file using the file name and sample ID to make the contig name. The user needs to specify the file type (--fr), the genome build (--hg19 or --hg18), and the number of threads (for example, --threads 8), which should be equal to the number of available processors on the node.

perl identity.pl --index --fr --hg19 --threads 8 --input /path/directory/ --output indexFile.fa

Interpreting the output of --index

Generally you should not need to open the index generated by this script, since the script performs all the subsequent analysis that is required. However, it may be useful to inspect the index in the event of problems. If the --output option was used, the index will be in the specified file; otherwise the barcodes are appended to the default index, a file named 'FRBAM_289_idIndex.fa' in a subdirectory named '/indexForIdentity' where the script is installed.

The index contains a 289 SNP barcode in FASTA format for every FinalReport or BAM file that was present in the input directory. The one exception is if a file has exactly the same sample name, file name, and barcode as a previous entry in which case it will not be added. For a FinalReport the contig name will be composed of the sample ID and the FinalReport name separated by a pipe:

>4255644144_A|/cluster/sanders/genotypeFr/Reclustered_FinalReport134.txt

For a BAM file the contig name will be composed of the sample ID and the BAM file name separated by a pipe:

>11232.p1|/cluster/sanders/bamFiles/Samp_11232.p1_SSC_BWA_sort_RG.bam

The SNP barcode is composed of the following characters: 'A's, representing a homozygous SNP; 'C's, representing a heterozygous SNP; and 'N's, representing a no call. The SNPs are ordered by hg19 chromosome and position.
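
The exception noted above, where an entry is skipped only when its sample name, file name, and barcode all match an earlier entry, can be pictured as a set-membership check. The Python sketch below illustrates that rule under the assumption that all three fields must match exactly; it is not the script's own code, and add_unique is a hypothetical name.

```python
def add_unique(existing, new_entries):
    """Append entries to an index, skipping exact (sample, file, barcode) repeats."""
    seen = set(existing)
    merged = list(existing)
    for entry in new_entries:
        # An entry is a (sample_id, file_name, barcode) tuple.
        if entry not in seen:
            seen.add(entry)
            merged.append(entry)
    return merged


# Invented, shortened barcodes for illustration.
index = [("11232.p1", "/bam/Samp_11232.p1.bam", "AACN")]
index = add_unique(index, [
    ("11232.p1", "/bam/Samp_11232.p1.bam", "AACN"),  # exact repeat: skipped
    ("11232.p1", "/bam/Samp_11232.p1.bam", "ACCN"),  # barcode differs: kept
])
```

Under this reading, re-indexing the same directory does not grow the index, while a file whose barcode has changed produces a second entry.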

Identifying duplicate samples in the index

This command identifies duplicate samples within a single index. There are three situations in which a duplicate may exist:

  1. The data have been processed twice resulting in a different file name
  2. A sample is labelled incorrectly, possibly reflecting incorrect data generation (i.e. sample mix up)
  3. Monozygotic twins

Closely related family members who are not monozygotic twins will not be reported as duplicates.

perl identity.pl --dup --input indexFile1.fa --output outputFile.txt

Interpreting the output of --dup

Each line in the output file represents one file in the index. If a file in the index has multiple matches, the output contains one line per match. A good match is defined as over 80% BLAT similarity; a bad match is between 20% and 80% BLAT similarity; less than 20% BLAT similarity counts as no match. A bad match is only shown if there are no good matches.

  1. Input_file_in_indexFile1.fa The name of the file in the index
  2. Input_sample_in_indexFile1.fa The name of the sample in the index
  3. Matching_file_in_indexFile1.fa The name of the matching file in the index; if there is no matching file then 'No_match' is shown
  4. Matching_sample_in_indexFile1.fa The name of the matching sample in the index; if there is no matching file then 'No_match' is shown
  5. Input_Index_in_indexFile1.fa The index 'barcode' of the file in the index
  6. Matching_Index_in_indexFile1.fa The index 'barcode' of the matching file in the index; if there is no matching file then '.' is shown
  7. Percent_match The BLAT percent similarity between two index barcodes
  8. Number_of_matching_files The number of good matches for the file
  9. Matching_file_number If there are multiple matches these are numbered in order
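
The match thresholds can be written out as a small Python sketch. The treatment of values of exactly 80% and exactly 20% is an assumption (both are classed as bad matches here, following "over 80%" and "less than 20%"), and the function names classify_match and reported_matches are hypothetical.

```python
def classify_match(percent: float) -> str:
    """Classify a BLAT percent similarity as 'good', 'bad', or 'none'."""
    if percent > 80:
        return "good"
    if percent >= 20:
        # Assumption: the boundaries 20 and 80 themselves count as bad matches.
        return "bad"
    return "none"


def reported_matches(percents):
    """Return the similarities that would be shown for one input file.

    Bad matches are shown only when there are no good matches.
    """
    good = [p for p in percents if classify_match(p) == "good"]
    if good:
        return sorted(good, reverse=True)
    bad = [p for p in percents if classify_match(p) == "bad"]
    return sorted(bad, reverse=True)
```

For example, a file with candidate similarities of 95, 85, and 50 would show only the two good matches, while a file with 50 and 10 would show the single bad match.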

Comparing two indexes

This command is designed for matching FinalReports and BAM files but can be used to identify identical samples between any two datasets.

perl identity.pl --compare --input indexFile1.fa indexFile2.fa --output outputFile.txt

Interpreting the output of --compare

Each line in the output file represents one file in the first index. If a file in the first index has multiple matches in the second index, the output contains one line per match. A good match is defined as over 80% BLAT similarity; a bad match is between 20% and 80% BLAT similarity; less than 20% BLAT similarity counts as no match. A bad match is only shown if there are no good matches. Note that duplicate samples in the second index that do not match any files in the first index will not be identified (consider running the --dup command).

  1. Input_file_in_indexFile1.fa The name of the file in the first index
  2. Input_sample_in_indexFile1.fa The name of the sample in the first index
  3. Matching_file_in_indexFile2.fa The name of the matching file in the second index; if there is no matching file then 'No_match' is shown
  4. Matching_sample_in_indexFile2.fa The name of the matching sample in the second index; if there is no matching file then 'No_match' is shown
  5. Input_Index_in_indexFile1.fa The index 'barcode' of the file in the first index
  6. Matching_Index_in_indexFile2.fa The index 'barcode' of the file in the second index; if there is no matching file then '.' is shown
  7. Percent_match The BLAT percent similarity between two index barcodes
  8. Number_of_matching_files The number of good matches for the file
  9. Matching_file_number If there are multiple matches these are numbered in order
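
A possible way to summarise this table in Python is sketched below, assuming tab-separated columns in the order listed above. The column keys, the function names, and every value in the sample rows (including what appears in columns 7 to 9 for an unmatched file) are invented for illustration and should be checked against real output.

```python
COLUMNS = (
    "input_file", "input_sample", "matching_file", "matching_sample",
    "input_index", "matching_index", "percent_match",
    "number_of_matching_files", "matching_file_number",
)


def load_matches(text):
    """Parse --compare (or --dup) output into dicts, ASSUMING tab-separated columns."""
    rows = []
    for line in text.splitlines():
        fields = line.split("\t")
        # Skip malformed lines and a header row, if one is present.
        if len(fields) != len(COLUMNS) or fields[0].startswith("Input_file_in_"):
            continue
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


def unmatched_inputs(rows):
    """Files in the first index with no match in the second index."""
    return sorted({r["input_file"] for r in rows if r["matching_file"] == "No_match"})


def sample_mismatches(rows):
    """Matched pairs whose sample names differ, which may indicate a mix-up."""
    return [
        (r["input_sample"], r["matching_sample"])
        for r in rows
        if r["matching_file"] != "No_match"
        and r["input_sample"] != r["matching_sample"]
    ]


# Invented example rows with shortened barcodes.
sample = (
    "/fr/FinalReport1.txt\tS1\t/bam/S1.bam\tS1\tAACN\tAACN\t100\t1\t1\n"
    "/fr/FinalReport2.txt\tS2\t/bam/S9.bam\tS9\tACCN\tACCN\t100\t1\t1\n"
    "/fr/FinalReport3.txt\tS3\tNo_match\tNo_match\tAAAA\t.\t0\t0\t0\n"
)
rows = load_matches(sample)
unmatched = unmatched_inputs(rows)
mismatches = sample_mismatches(rows)
```

Pairs returned by sample_mismatches are worth checking by hand, since a matching barcode under two different sample names is one of the situations in which a labelling error would show up.
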