Vect Reference Manual

Input Data Panel (continued)

4. Forced Line-Column Selection

A line-column that lies completely inside a field can be selected by holding down the Shift key while left-clicking and dragging over the desired region. The Shift key forces the selection to ignore the field it falls in and creates a column whose boundaries are measured relative to the beginning of the line.
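The difference between a field selection and a forced line-column selection can be sketched in Python. This is only an illustration, not Vect's own code: the helper names `select_field` and `select_line_column` are hypothetical, fields are assumed to be whitespace-delimited words, and the sample line is made up.

```python
def select_field(line: str, field_index: int) -> str:
    """Return the whitespace-delimited field at field_index (0-based), or ''."""
    fields = line.split()
    return fields[field_index] if field_index < len(fields) else ""


def select_line_column(line: str, start: int, end: int) -> str:
    """Return the characters in [start, end), counted from the beginning
    of the line, regardless of where field boundaries fall."""
    return line[start:end]


# Hypothetical sample line: the second field is "AB000123".
line = "LOCUS AB000123 1500 bp"

whole_field = select_field(line, 1)               # the complete field
inside_field = select_line_column(line, 8, 14)    # a column inside that field

print(whole_field)
print(inside_field)
```

The field selection returns the whole word, while the line-column selection returns only the characters between the two offsets, even though they fall in the middle of a field.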

5. Line Restriction

When extracting information from a document, it is common to look for text that occurs on lines containing specific key words or phrases. To let you focus on these key lines, Vect provides the Line Restriction option.

To restrict extraction to those lines containing a desired word or number, right-click and drag over the key word or phrase. If you are selecting exactly one field (word), a single right-click on the word accomplishes the same result. The chosen text will be highlighted and a drop-down menu will appear. Select New Line Selection Condition from this menu. The chosen text will become highlighted in yellow on every line in which it occurs. Data extraction will now occur only on these lines.

At this point you will notice that the lines that have not been restricted in this manner have become highlighted in pink. The pink lines are no longer available for data selection. For example, if you now perform a field selection, only fields on the restricted (non-pink) lines will be selected (highlighted in grey). Here ‘restricted’ means that all data accumulation for this rule is confined to these (non-pink) lines. Note, however, that restricting lines in this way does not automatically select the data on them. You must still select data from these lines using the methods described in the previous section. Only data highlighted in grey will be extracted from the document.

In a formatted document, specific key words often have a fixed indentation, so that their position on the line is always the same. Vect recognizes this and by default considers not only the key word but also its position on the line (Position Dependent). In this way it recognizes only these specially indented occurrences of the word and not incidental occurrences that may appear in the body of the text. This default can be changed by right-clicking on the yellow highlighted key word and selecting Position Independent from the drop-down menu. Every line containing the chosen key word, regardless of its position on the line, will then become a restricted line that is available for data extraction.
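The two matching modes can be illustrated with a minimal Python sketch. It is not Vect's implementation: the function `restricted_lines`, its `position` parameter, and the sample lines are assumptions made for the example.

```python
def restricted_lines(lines, keyword, position=None):
    """Return the indices of lines that satisfy a line selection condition.

    position=None -> Position Independent: keyword may occur anywhere.
    position=k    -> Position Dependent: keyword must start at column k.
    """
    selected = []
    for i, line in enumerate(lines):
        if position is None:
            hit = keyword in line
        else:
            hit = line.startswith(keyword, position)
        if hit:
            selected.append(i)
    return selected


# Hypothetical sample document.
doc = [
    "AUTHOR    Smith, J.",
    "TITLE     Notes on the AUTHOR field",
    "  AUTHOR  Jones, K.",
]

dependent = restricted_lines(doc, "AUTHOR", position=0)
independent = restricted_lines(doc, "AUTHOR")

# Restriction alone extracts nothing; data must still be selected
# from the restricted lines (here: the second field of each line).
extracted = [doc[i].split()[1] for i in dependent]

print(dependent, independent, extracted)
```

With the position-dependent condition only the first line qualifies; with the position-independent condition all three lines do, including the incidental occurrence in the title.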

6. Block Restriction

Similar to Line Restriction, Block Restriction allows restricting the data available for selection to a smaller region. While Line Restriction allows restricting data selection to specific lines based on a single condition, Block Restriction limits data selection to blocks of data specified by opening and closing conditions.

The opening and closing conditions for the Block Restriction are selected in the same way as described for the Line Restriction (right-click and drag method). The opening condition is marked by choosing New Block Open Condition from the drop-down menu and will be highlighted in green. New Block Close Condition is chosen to mark the closing condition, which will be highlighted in red.

To deselect either of the block conditions, right-click on the highlighted keyword and select Cancellation from the drop-down menu.

As in the Line Restriction method, Block Restriction defaults to Position Dependent for these keywords. This can be modified in the same way as described in the previous section.

By default, the Block Restriction allows the Block Open and Block Close keywords to appear on the same line. Occasionally this is not desirable. To prevent it, make both the opening and closing conditions Selection Exclusive. This is done by right-clicking on each keyword and choosing Selection Exclusive from the drop-down menu.

A sample use of this option is shown in the following diagram. In this example, the Block Open condition is set on ‘/translation=’ and the Block Close condition is set on the quotation mark (").

Here the Block Close condition (highlighted red) must be made Position Independent, since the amino acid sequence could end at any position on a line. Both the Block Open (green) and the Block Close (red) conditions must be made Selection Exclusive to prevent them from occurring on the same line (i.e. to prevent the quotation mark that immediately follows the opening condition from being used as the close condition). Note that in this case the Block Open condition can remain Position Dependent, since in this document format it always occurs at the same position on the line. The required settings for these options will depend on the document’s formatting.

It is not necessary that every Block Open condition have a matching Block Close condition. In cases where no Block Close condition is selected, the end of the file marks the end of the current block.
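The ‘/translation=’ example can be approximated with a short Python sketch. It is one possible reading of the behaviour described above, not Vect's implementation: the function `block_lines`, the `exclusive` flag, and the sample record are hypothetical, both keywords are matched anywhere on the line, and the opening and closing lines are assumed to belong to the block.

```python
def block_lines(lines, open_kw, close_kw, exclusive=True):
    """Return the indices of lines inside blocks delimited by open_kw/close_kw.

    exclusive=True models Selection Exclusive: the close keyword is not
    looked for on the line that opened the block.
    A block that is never closed runs to the end of the input.
    """
    inside = False
    selected = []
    for i, line in enumerate(lines):
        if not inside:
            if open_kw in line:
                inside = True
                selected.append(i)
                # Without exclusivity, a close keyword that follows the
                # open keyword on the same line ends the block at once.
                rest = line.split(open_kw, 1)[1]
                if not exclusive and close_kw in rest:
                    inside = False
        else:
            selected.append(i)
            if close_kw in line:
                inside = False
    return selected


# Hypothetical GenBank-style fragment.
record = [
    '     CDS             1..120',
    '                     /translation="MKVLAAGIVGLLLAQ',
    '                     PSTRWQ"',
    'ORIGIN',
]

with_exclusive = block_lines(record, '/translation=', '"')
without_exclusive = block_lines(record, '/translation=', '"', exclusive=False)

print(with_exclusive, without_exclusive)
```

Without exclusivity the quotation mark directly after ‘/translation=’ closes the block on its opening line, so the second sequence line is lost; with exclusivity the block extends to the line carrying the closing quotation mark.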

Nested block conditions can be defined to further restrict the selectable regions, which is a powerful feature of block selections. Nested block conditions bring up the concept of “Level”. The level of the data represents how deeply nested it is within a nested block condition. This can be seen in the column to the left of the Data Input Panel. In this side column we first see the line numbers given, and then immediately to the right of the line numbers are the level numbers.
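How a level number per line might be derived from nested block conditions can be sketched as follows. This is a simplified illustration under stated assumptions, not Vect's algorithm: the function `line_levels` is hypothetical, level 0 is assumed to mean "outside every block", an inner block is assumed to open only while its parent block is open, and the sample keywords are made up.

```python
def line_levels(lines, conditions):
    """Return a nesting level for each line.

    conditions is a list of (open_kw, close_kw) pairs, outermost first.
    Level 0 means outside every block, level 1 inside the outermost
    block, level 2 inside the block nested within it, and so on.
    """
    depth = 0
    levels = []
    for line in lines:
        # The next-deeper block can open only from the current depth.
        if depth < len(conditions) and conditions[depth][0] in line:
            depth += 1
            levels.append(depth)
            continue
        levels.append(depth)
        # The closing line still belongs to the block it closes.
        if depth > 0 and conditions[depth - 1][1] in line:
            depth -= 1
    return levels


# Hypothetical sample input with two nested block conditions.
sample = [
    "header",
    "BEGIN record",
    "  name: x",
    "  START seq",
    "    ACGT",
    "  STOP seq",
    "END record",
    "trailer",
]
levels = line_levels(sample, [("BEGIN", "END"), ("START", "STOP")])

for number, (level, text) in enumerate(zip(levels, sample), start=1):
    print(number, level, text)
```

The printed layout mirrors the side column described above: the line number first, then the level number, then the line itself.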

Useful Points to Remember

• Field and column data selection is done with a left-click or left-drag; Line and Block Restrictions use a right-click or right-drag.

• Block and Line Restrictions do not automatically select the data in the specified block or line. They only restrict the selectable areas of the input data. You must still explicitly select the fields using ‘field selection’ and ‘column selection’; the selected data will then be highlighted in grey.

• After the Line or Block Restriction is defined and the text is selected for extraction, review the entire input file to make sure the desired text is being selected. Depending on the text format, you may have to adjust the line or block definitions to extract the text you want.

Last modified June 13, 2008. All rights reserved.
