ISU Complex Computation Lab - GenBank Report Data Extraction

GenBank Report Data Extraction



Extraction and Conversion of Gene Names

We will now extract the gene names from the Arabidopsis file. In Vect, make sure you are in the 'Input Data' panel with the AC006439.txt file open. Notice that each gene name appears on several lines of the Arabidopsis file (under the mRNA, gene, and CDS blocks), unlike the protein id, which appears only under the CDS block. We therefore need block conditions so that only one of the several identical gene names is extracted. The figure below highlights the various locations of each gene name.

Right-click on the letters 'CDS' and choose New Block Open Condition from the pull-down menu. A green box will appear.
Right-click on the letters 'mRNA' and choose New Block Close Condition from the pull-down menu. A red box will appear.

Right-click and drag over the '/gene=' text and choose New Line Selection from the pull-down menu.
Select the gene name with a left-click. Now only gene names located in the 'CDS' block will be chosen and all others will be excluded.
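The effect of the block conditions and the line selection can be sketched outside of Vect. The Python below is only an illustration of the idea, not the Perl that Vect generates. The function name `raw_gene_names`, the sample records, and the gene and protein names in them are invented placeholders, and the feature order (mRNA, gene, CDS) is an assumption about the file layout.

```python
def raw_gene_names(lines):
    """Collect the text after '/gene=' on lines inside a CDS..mRNA block."""
    in_block = False
    selected = []
    for line in lines:
        stripped = line.strip()
        key = stripped.split(None, 1)[0] if stripped else ""
        if key == "CDS":        # block open condition
            in_block = True
        elif key == "mRNA":     # block close condition
            in_block = False
        elif in_block and stripped.startswith("/gene="):
            # line selection: keep what follows '/gene=', quotes included
            selected.append(stripped[len("/gene="):])
    return selected


# Invented sample in GenBank feature-table style (not taken from AC006439.txt).
sample = """\
     mRNA            join(3000..3500,4000..5000)
                     /gene="GENE_A"
     gene            3100..4900
                     /gene="GENE_A"
     CDS             join(3100..3500,4000..4900)
                     /gene="GENE_A"
                     /protein_id="PLACEHOLDER_1"
     mRNA            6000..7000
                     /gene="GENE_B"
     gene            6100..6900
                     /gene="GENE_B"
     CDS             6100..6900
                     /gene="GENE_B"
""".splitlines()

print(raw_gene_names(sample))
```

With this layout each gene name is picked up once, still wrapped in its quotes. If the features appeared in a different order, the same open and close conditions could select a name more than once.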

Note: If there are no grey highlighted regions in the 'Input Data' panel, then you have not selected any data, and nothing can be moved over to the next panel.

Select 'Move' from the icon panel to move to the 'Convert Data' panel. Give your rule a descriptive name (here, just use ‘raw gene names’).

The next step is to get the gene names alone, without the quotes around them. Gene names will be extracted through the 'Quoted Data Rule.' In the 'Convert Data' panel, select 'Insert' from the icon panel and select the 'To Extracted Quoted Data from other Rule.' Give your rule a descriptive name (here, just use ‘gene names’) and select the rule you wish to apply it to (here, ‘raw gene names’). Select the two grey highlighted boxes labeled 'nothing' and put a quote mark (") in each. Your data should be similar to the figure below.
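What the quoted-data rule does to each value can be illustrated with a short Python sketch. This is an assumption about the rule's behavior, namely that it keeps the text between the opening and closing marks entered in the two boxes. The helper name `extract_quoted` and the sample values are hypothetical.

```python
import re


def extract_quoted(values, open_mark='"', close_mark='"'):
    """Return the text found between open_mark and close_mark in each value."""
    pattern = re.compile(
        re.escape(open_mark) + r"(.*?)" + re.escape(close_mark)
    )
    extracted = []
    for value in values:
        match = pattern.search(value)
        if match:  # values without both marks are skipped
            extracted.append(match.group(1))
    return extracted


# Invented raw values, as a 'raw gene names' rule might hold them.
raw = ['"GENE_A"', '"GENE_B"']
print(extract_quoted(raw))
```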

Select Rule 2 (here, ‘gene names’), then select the 'Copy' button from the icon panel to move your data to the 'Output Data' panel. In the 'Output Data' panel, users can add any text formatting to the data set and view the changes by selecting the 'Output' icon in the icon panel.

The tag should not be modified, but it can be moved around. If users wish to limit the output to a set line width, the tag may be edited by adding ':width' before the closing bracket (>). This keeps the body from flowing past the specified width. Example: <gene sequence:60>.
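The ':width' suffix can be pictured with the following Python sketch. It assumes the width means a hard break every N characters, which may differ from how Vect actually wraps the body. The names `parse_tag` and `wrap_body` are hypothetical and are not part of Vect.

```python
import re

# Matches '<name>' or '<name:width>', e.g. '<gene sequence:60>'.
TAG_PATTERN = re.compile(r"<([^:<>]+)(?::(\d+))?>")


def parse_tag(tag):
    """Split an output tag into its rule name and optional width."""
    match = TAG_PATTERN.fullmatch(tag)
    if match is None:
        raise ValueError("not a valid tag: " + tag)
    name, width = match.group(1), match.group(2)
    return name, (int(width) if width else None)


def wrap_body(body, width):
    """Break body into lines of at most `width` characters (assumed hard wrap)."""
    if width is None:
        return [body]
    return [body[i:i + width] for i in range(0, len(body), width)]


name, width = parse_tag("<gene sequence:60>")
for line in wrap_body("ACGT" * 40, width):  # invented 160-character body
    print(line)
```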

To show the Perl code, move to the 'Perl Program' panel and select 'Compile.' Your Perl program appears as shown below. To run the generated program, select the 'Run' icon. A new window will appear with the results of your Perl program.

For more examples, see the previous mini tutorial on gene sequence extraction and conversion.