Anchored substring extraction:
You have a short key string, say 20 bases, that you would like to find in longer input sequences. Each time you find an instance of it, you would like to extract a substring anchored at that location, say 100 bases upstream and 100 bases downstream of it. Assume you have already downloaded the related data files and saved them into a directory. Here is what can be done:
0. Load a sample file into Vect;
1. Create rules that will concatenate all DNA sequence fragments into single-line sequences, assuming you have multiple sequences in your files;
2. Pipe your source sequence data into a simple user rule with the following content:
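sub { # Do not name your subroutine to avoid name conflicts.
    my ($source) = @_;
    return if !defined($source);
    my @outputs;
    my $start = 0;
    # Find each occurrence of the key, scanning from left to right.
    while (($start = index($source, "tttattaa", $start)) >= 0) {
        # Begin 100 bases upstream, clamped to the start of the sequence.
        my $left = $start - 100;
        $left = 0 if $left < 0;
        # 120 = 20-base key + 100 bases downstream.
        my $length = $start - $left + 120;
        push @outputs, substr($source, $left, $length);
        # Resume the search after the extracted region.
        $start += 120;
    }
    return (@outputs);
}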
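Just change "tttattaa" to your 20-base substring. The value 120 is the key length (20) plus the 100 downstream bases, so adjust it if your key length or range differs.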
3. Pipe the data from the above rule to the output, and wrap lines as needed.
4. Generate your Perl code.
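The rule in step 2 hard-codes the key and the two distances. To try the same logic outside Vect, here is a minimal sketch with those values turned into parameters. The name extract_anchored and the toy sequence are made up for illustration; this is not code that Vect generates.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Vect: returns a window around every
# occurrence of $key in $source, with up to $flank bases on each side.
sub extract_anchored {
    my ($source, $key, $flank) = @_;
    return if !defined($source) || !defined($key) || $key eq '';
    my @outputs;
    my $start = 0;
    while (($start = index($source, $key, $start)) >= 0) {
        my $left = $start - $flank;
        $left = 0 if $left < 0;              # clamp at the sequence start
        my $length = $start - $left + length($key) + $flank;
        push @outputs, substr($source, $left, $length);
        $start += length($key) + $flank;     # skip past the extracted window
    }
    return @outputs;
}

# Toy input (assumed data): two copies of the key, 3 flanking bases each side.
my $sequence = "acgtttattaaggccatgcatgtttattaacc";
print "$_\n" for extract_anchored($sequence, "tttattaa", 3);
```

On the toy input this prints acgtttattaaggc and atgtttattaacc. The second window is one base short because the sequence ends before the full downstream range is reached. As in the rule above, a key that occurs inside an already extracted downstream range is skipped.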
Note that Vect does not create files; it always sends its output to the console. You can, however, redirect the output of the Vect-generated Perl program to a file using the I/O redirection operator '>', like
perl_program file1 file2 file3 ... > single_output_file
on a Unix-like machine (including Mac OS X).
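To see the effect of step 1 without Vect, here is a hand-written sketch that joins wrapped sequence lines into single-line sequences and prints them to standard output, so its output can be redirected with '>' in the same way. It assumes FASTA-style input, where a line starting with '>' begins a new record, and it lowercases the bases to match a lowercase key. Both are assumptions, and the name read_single_line_sequences is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper, not Vect output: joins the wrapped lines of each
# FASTA-style record read from $fh into one single-line sequence.
sub read_single_line_sequences {
    my ($fh) = @_;
    my @sequences;
    while (my $line = <$fh>) {
        $line =~ s/\s+\z//;                      # drop newline and trailing blanks
        next if $line eq '';
        if ($line =~ /^>/) {
            push @sequences, '';                 # a header line starts a new record
        } else {
            push @sequences, '' if !@sequences;  # sequence data before any header
            $sequences[-1] .= lc $line;          # lowercase to match a lowercase key
        }
    }
    return @sequences;
}

# Demo on an in-memory file (assumed data), so the sketch runs as-is.
my $fasta = ">seq1\nACGT\nTTAA\n>seq2\nGGCC\n";
open my $fh, '<', \$fasta or die "cannot open in-memory file: $!";
print "$_\n" for read_single_line_sequences($fh);
close $fh;
```

The demo prints acgtttaa and ggcc. To read real files, pass a filehandle opened on one of them in place of the in-memory one.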
An example is attached here. Download it to your computer, open it in Vect, and then get a GenBank report file to see the results.