FileScanner is a Java application (executable jar net.ligaya.p3.filescanner) that scans "unstructured" (table) data files, for example in HTML format, and converts them into "structured" data files. Scanning is driven by a format file, which specifies the "syntax" of the files to be scanned as well as the syntax of the generated file.

JAVA REQUIREMENTS

The Java interpreter (java.exe) must be in the path (Control Panel->System->Advanced->Environment Variables, where the path is one of the User Variables). The path must include the directory containing java.exe, e.g. c:\j2sdk1.4.2_10\bin.

COMMAND LINE ARGUMENTS

FileScanner has four mandatory command line arguments (separated by at least one blank):

  • file-list file name
  • format-file file type
  • execution trace file
  • error file

Example (invoking FileScanner from the command line):

java -jar net.ligaya.p3.filescanner.jar c:\pro3kbs\mykb\files.tmp 3ff
c:\pro3kbs\mykb\fscanner.log c:\pro3kbs\mykb\fscanner.err /OT=scantest.3dl
/ff=c:\pro3kbs\mykb\

(Note that Pro/3 has a facility to invoke FileScanner automatically.)

FILE-LIST FILE

The file-list file is an ASCII file with one record (line) per file to be scanned.

Example

C:\OSE-HTML\TEST\st000313.stk
C:\OSE-HTML\TEST\st000314.stk
C:\OSE-HTML\TEST\st000315.stk
C:\OSE-HTML\TEST\st000317.stk
C:\OSE-HTML\TEST\st000321.stk
C:\OSE-HTML\TEST\st000322.stk
C:\OSE-HTML\TEST\st000323.stk
C:\OSE-HTML\TEST\st000324.stk
C:\OSE-HTML\TEST\st000329.stk
C:\OSE-HTML\TEST\st000330.stk
C:\OSE-HTML\TEST\st000331.stk
C:\OSE-HTML\TEST\st000403.stk
C:\OSE-HTML\TEST\st000404.stk
C:\OSE-HTML\TEST\st000405.stk
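
Reading such a file-list file could be sketched as follows. This is only an illustration and not FileScanner's own code: the class name FileListReader and the decision to skip blank lines are assumptions.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of FileScanner.
class FileListReader {
    /** Returns the non-blank lines of the file-list file, trimmed, in file order. */
    static List<String> readFileList(Path fileList) {
        try {
            List<String> result = new ArrayList<>();
            for (String line : Files.readAllLines(fileList, StandardCharsets.US_ASCII)) {
                String trimmed = line.trim();
                if (!trimmed.isEmpty()) {   // assumption: blank lines are ignored
                    result.add(trimmed);
                }
            }
            return result;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        if (args.length != 1) {
            System.err.println("usage: FileListReader <file-list file>");
            return;
        }
        for (String file : readFileList(Paths.get(args[0]))) {
            System.out.println(file);
        }
    }
}
```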

FORMAT FILE

The format file specifies the "syntax" of the input file and how the components scanned from it are written to the output file. The codes used in the format file are described below.

Example

[STA]="Oms i går"
[END]="<!-- /BORSDATA -->"
[SUB]="</TD>" ";"
[SUB]="</td>" ";"
[SUB]="&nbsp;" ""
[SKB]="<" ">"
[NUL]="" ";"
[SIR]=";"
[EIR]=WL ";" EOF
[INP]=WL <ticker> ";" ";" ";" ";" <close> ";" <high> ";" <low> ";" <volume> ";" ";" ";" ";"
[OHD]=";" L "is reported;daily stock transaction" L
[OUT]=<close> ";" <high> ";" <low> ";" <ticker> ";" <volume> ";" FILENAME(1,8) L
----- nodes -----
<ticker>=aN
<close>=n.2
<high>=n.2
<low>=n.2
<volume>=i

There are four optional command-line options, given after the mandatory arguments:

  • /OT=xxx where xxx is either a file type or a file name (including type). One output file is generated if xxx is a file name. One output file per input file is generated if xxx is a file type.
  • /FF=xxx where xxx is the path for format file(s).
  • /OF=xxx where xxx is the path for output file(s).
  • /TR=

The /TR= option activates execution tracing to the execution trace file (for debugging purposes).

Default values:

  • /OT=ftc
  • /FF=iii where iii is the path of the corresponding input file.
  • /OF=iii where iii is the path of the corresponding input file or the file-list file if /OT is a file name.

Note that the format file corresponding to input file nnnnn.ttt is ttt.fff, where fff is the format-file file type. For example, with format-file type 3ff the input file st000313.stk is scanned using stk.3ff.
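
The option handling and the format-file naming rule above can be sketched as follows. This is a minimal illustration under stated assumptions, not FileScanner's actual code: the class and method names are hypothetical, option names are treated as case-insensitive (the command-line example mixes /OT= and /ff=), and input files are assumed to use backslash paths and to have a file type.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of FileScanner.
class OptionSketch {
    /** Collects /XX=value options that follow the four mandatory arguments. */
    static Map<String, String> parseOptions(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        for (int i = 4; i < args.length; i++) {
            String arg = args[i];
            int eq = arg.indexOf('=');
            if (arg.startsWith("/") && eq > 1) {
                // assumption: option names are case-insensitive
                options.put(arg.substring(1, eq).toUpperCase(Locale.ROOT), arg.substring(eq + 1));
            }
        }
        return options;
    }

    /** Format file for an input file: (FF path or input path) + input type + "." + format type. */
    static String formatFileFor(String inputFile, String formatType, Map<String, String> options) {
        int slash = inputFile.lastIndexOf('\\');
        int dot = inputFile.lastIndexOf('.');
        String inputDir = inputFile.substring(0, slash + 1);   // default for /FF
        String inputType = inputFile.substring(dot + 1);       // "stk" for st000313.stk
        String dir = options.getOrDefault("FF", inputDir);
        if (!dir.isEmpty() && !dir.endsWith("\\")) {
            dir += "\\";
        }
        return dir + inputType + "." + formatType;
    }

    public static void main(String[] args) {
        Map<String, String> options = parseOptions(args);
        System.out.println(options);
        if (args.length >= 2) {
            System.out.println(formatFileFor("C:\\OSE-HTML\\TEST\\st000313.stk", args[1], options));
        }
    }
}
```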

FileScanner Format File

The format file contains several records of various types. The record types are listed below.

I. FORMAT-FILE FORMAT

Tag-record
Tag-record
:
----- nodes -----
Node-record
Node-record
:

II. TAG RECORDS

Tag-records have the general format [tag]=tag_definition tag_definition … .

Scanning is carried out in the following steps:

  1. Preprocessing of the file, including optional removal of the front and tail parts, deletion of bracketed substrings, and string substitutions.
  2. Optional scanning up to first input record.
  3. Checking for end-of-input-records condition.
    1. if not end-of-records
      1. Scanning of input record.
      2. Generation of output record.
      3. Repeat from 3.
    2. if end-of-records
      1. Exit
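
The control flow of steps 2 and 3 can be sketched as a simple loop. The RecordHandler interface and all names below are hypothetical stand-ins for the [EIR], [INP] and [OUT] handling; FileScanner's real internals are not described in this document.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scan loop, not FileScanner's own code.
class ScanLoopSketch {
    interface RecordHandler {
        /** Step 3: true when the end-of-input-records condition holds at pos. */
        boolean atEndOfRecords(String text, int pos);

        /** Steps 3.1.1 and 3.1.2: scans one input record, appends one output record, returns the new position. */
        int scanRecord(String text, int pos, List<String> out);
    }

    static List<String> run(String preprocessed, int firstRecordPos, RecordHandler handler) {
        List<String> output = new ArrayList<>();
        int pos = firstRecordPos;   // position reached by step 2
        while (!handler.atEndOfRecords(preprocessed, pos)) {
            int next = handler.scanRecord(preprocessed, pos, output);
            if (next <= pos) {      // guard against a handler that never advances
                throw new IllegalStateException("no progress at position " + pos);
            }
            pos = next;             // step 3.1.3: repeat from step 3
        }
        return output;
    }

    public static void main(String[] args) {
        // Toy handler: one record per line, end of records at end of text.
        RecordHandler lines = new RecordHandler() {
            public boolean atEndOfRecords(String text, int pos) {
                return pos >= text.length();
            }
            public int scanRecord(String text, int pos, List<String> out) {
                int nl = text.indexOf('\n', pos);
                int end = nl < 0 ? text.length() : nl;
                out.add(text.substring(pos, end));
                return end + 1;
            }
        };
        System.out.println(run("STL;12.50\nNHY;7.25\n", 0, lines));
    }
}
```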

Preprocessing:

[END] = string

Directs the scanner to ignore all text after (and including) the first occurrence of the given string. The string may or may not be present in the scanned file.

[SKB] = string ...

Skips all text within the bracket strings (including the bracket strings themselves). The first string specifies the opening bracket and the second string the closing bracket.
Note!
It is assumed (i) that every left bracket has a matching right bracket, and (ii) that brackets are not nested.

[STA] = string

Directs the scanner to ignore all text prior to (and including) the first occurrence of the given string. The string may or may not be present in the scanned file.

[SUB] = string ...

Substitutes all occurrences of the first string with the second string.
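
The four preprocessing tags can be sketched with plain string operations. This is an illustration only: the class and method names are hypothetical, and the order in which FileScanner applies the tags is not stated here. The usage in main assumes the order of the example format file (front part, tail part, substitutions, bracket deletion).

```java
// Hypothetical sketch of the preprocessing tags, not FileScanner's own code.
class PreprocessSketch {
    /** [STA]: drop everything up to and including the first occurrence of marker, if present. */
    static String dropFront(String text, String marker) {
        int i = text.indexOf(marker);
        return i < 0 ? text : text.substring(i + marker.length());
    }

    /** [END]: drop everything from the first occurrence of marker onwards, if present. */
    static String dropTail(String text, String marker) {
        int i = text.indexOf(marker);
        return i < 0 ? text : text.substring(0, i);
    }

    /** [SUB]: replace every occurrence of from with to (literal text, not a regular expression). */
    static String substitute(String text, String from, String to) {
        return text.replace(from, to);
    }

    /** [SKB]: delete every open...close span, brackets included (matched, non-nested brackets assumed). */
    static String skipBracketed(String text, String open, String close) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        while (true) {
            int start = text.indexOf(open, pos);
            if (start < 0) break;
            int end = text.indexOf(close, start + open.length());
            if (end < 0) break;            // unmatched opening bracket: keep the rest as it is
            sb.append(text, pos, start);   // keep the text before the bracket
            pos = end + close.length();
        }
        sb.append(text.substring(pos));
        return sb.toString();
    }

    public static void main(String[] args) {
        String text = "junk START<TR><TD>ABC</TD><TD>12.50&nbsp;</TD></TR><!-- /END -->tail";
        text = dropFront(text, "START");
        text = dropTail(text, "<!-- /END -->");
        text = substitute(text, "</TD>", ";");
        text = substitute(text, "&nbsp;", "");
        text = skipBracketed(text, "<", ">");
        System.out.println(text);
    }
}
```

Note that in this order the </TD> substitution has to run before the bracket deletion, since the bracket deletion would otherwise remove the </TD> tags first.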

Global settings used in input record scanning:

[ERR] = string ...

Directs the scanner to recognize the given strings as error values. Records with error values are not written to the output, and a warning is written to the log file.

[NUL] = string string ...

Directs the scanner to regard the first string in each string pair as a null value. The second string in each pair must be the string that follows the null value in the input text. The values actually substituted when a null-value string is encountered are "0" (i), "0.0" (n), "" (a) and the date 01-Jan-1900 (d).
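
As an illustration, the null-value rule could look like the sketch below. The class and method names are hypothetical, and writing the null date as the text "01-Jan-1900" is an assumption about the output form.

```java
// Hypothetical sketch of [NUL] handling, not FileScanner's own code.
class NullValueSketch {
    /** Value substituted for a null field, keyed by the first letter of the node format. */
    static String nullValueFor(String nodeFormat) {
        switch (nodeFormat.charAt(0)) {
            case 'i': return "0";
            case 'n': return "0.0";
            case 'a': return "";
            case 'd': return "01-Jan-1900";   // assumption: the date is written in this form
            default:
                throw new IllegalArgumentException("no documented null value for " + nodeFormat);
        }
    }

    /** True if nullString, immediately followed by follower, starts at pos (one [NUL] string pair). */
    static boolean isNullAt(String text, int pos, String nullString, String follower) {
        return text.startsWith(nullString + follower, pos);
    }

    public static void main(String[] args) {
        // With [NUL]="" ";" an empty field directly followed by ";" counts as null.
        System.out.println(isNullAt("ABC;;12", 4, "", ";"));
        System.out.println(nullValueFor("n.2"));
    }
}
```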

Input record scanning:

[SIR] = input_field ... (any of the following: string, W, L, WL)

Scans the specified fields for the purpose of reaching the start of the input records.

[EIR] = input_field ... (any of the following: string, W, L, WL, EOF)

Specifies a field pattern that signifies the end of the input records (only needed if the input records end before the end of the file after preprocessing).

[INP] = input field input field …

Defines the format of what is considered the file's (input) record format. There will normally be many records in the file.

[END] = input field input field ...

Defines the format of what constitutes the end of all input records (this is end-of-file by default).

[OUT] = output field output field …

Defines a footer-record which will be generated after all the output records.

Output record formatting:

[OHD] = output field output field …

Defines the format of the output header record. One output header record is written before the output records.

[OUT] = output field output field …

Defines the format of the output record. One output record will be generated for each input record.

input field

  • string (scans up to and including the first occurrence of the given string - the given string is enclosed in quotes)
  • node (scans a named node)
  • L (scans up to and including first line break)
  • W (scans up to the end of any white space, excluding line breaks)
  • WL (scans up to the end of any white space, including line breaks)
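
The string, L, W and WL input fields can be pictured as operations on a cursor over the preprocessed text. The sketch below is an illustration under assumptions: the class InputCursor and its method names are hypothetical, and W is taken to mean blanks and tabs only. The text returned by scanTo is roughly what an aN node would hold when it is followed by that string field.

```java
// Hypothetical sketch of the input field primitives, not FileScanner's own code.
class InputCursor {
    private final String text;
    private int pos;

    InputCursor(String text) {
        this.text = text;
    }

    int position() {
        return pos;
    }

    /** string field: moves past the first occurrence of s; returns the text before it, or null if s is absent. */
    String scanTo(String s) {
        int i = text.indexOf(s, pos);
        if (i < 0) return null;
        String skipped = text.substring(pos, i);
        pos = i + s.length();
        return skipped;
    }

    /** L: moves past the next line break (or to the end of the text if there is none). */
    void scanLine() {
        int i = text.indexOf('\n', pos);
        pos = (i < 0) ? text.length() : i + 1;
    }

    /** W: skips blanks and tabs, stopping at a line break. */
    void skipWhite() {
        while (pos < text.length() && (text.charAt(pos) == ' ' || text.charAt(pos) == '\t')) {
            pos++;
        }
    }

    /** WL: skips any white space, line breaks included. */
    void skipWhiteAndLines() {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
    }

    public static void main(String[] args) {
        InputCursor c = new InputCursor("  \n  STL;12.50;\nnext");
        c.skipWhiteAndLines();              // WL
        System.out.println(c.scanTo(";"));  // text before the first ";"
        System.out.println(c.scanTo(";"));  // text before the second ";"
    }
}
```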

output field

  • string (outputs given string)
  • node (outputs a named node)
  • L (outputs a line break)
  • Q (outputs a double quote)
  • FILENAME(n,m) (outputs all or part of the scanned file's name, excluding path and type; n is the start position (1,2,..) and m is the length, or 0 for up to the end of the name)
  • FILEPATH(n,m) (outputs all or part of the scanned file's path; n is the start position (1,2,..) and m is the length, or 0 for up to the end of the path)
  • FILETYPE(n,m) (outputs all or part of the scanned file's type; n is the start position (1,2,..) and m is the length, or 0 for up to the end of the type)
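
The (n,m) arguments amount to a 1-based substring. A minimal sketch, assuming that out-of-range values are clamped to the string (the document does not say what FileScanner does in that case); the class and method names are hypothetical:

```java
// Hypothetical sketch of the (n,m) portion rule, not FileScanner's own code.
class FileNameFields {
    /** Portion of value from 1-based position n with length m; m == 0 means up to the end. */
    static String portion(String value, int n, int m) {
        int start = Math.min(Math.max(n, 1) - 1, value.length());   // assumption: clamp to the string
        int end = (m == 0) ? value.length() : Math.min(start + m, value.length());
        return value.substring(start, end);
    }

    /** File name without path and type, as used by FILENAME. */
    static String baseName(String file) {
        String name = file.substring(file.lastIndexOf('\\') + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }

    public static void main(String[] args) {
        String name = baseName("C:\\OSE-HTML\\TEST\\st000313.stk");
        System.out.println(portion(name, 1, 8));   // FILENAME(1,8)
        System.out.println(portion(name, 3, 0));   // FILENAME(3,0)
    }
}
```

FILEPATH and FILETYPE would apply the same portion rule to the path and the type of the file.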

other fields

  • EOF (end of file)

III. NODE RECORDS

A node record defines a node:

<node_name> = format

  • a (examples: a9 a14 aN)
    • n scans n positions
    • N scans up to the start of next input field which must be a string
  • d (examples: dDD_MM_YY dDD_MMM_YYYY)
    • DD_MM_YY
    • DD_MMM_YY
    • DD_MM_YYYY
    • DD_MMM_YYYY
  • i (examples: iB i. i, i.5)
    • B ignore blanks between digits
    • . optional periods in the integer (999.999)
    • , optional commas in the integer (999,999)
    • .n n character wide field with optional periods in the integer (999.999)
    • ,n n character wide field with optional commas in the integer (999,999)
    • i scans up to first non-blank and non-digit character
  • n US numeric format (999,999.99) (examples: n.3 n.12)
    • .n number with n (0,1,..) decimals after the decimal point (.) and optional commas in the integer part
  • w Norwegian numeric format (999.999,99) (examples: w.0 w.12)
    • .n number with n (0,1,..) decimals after the decimal comma (,) and optional periods in the integer part
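
The difference between the n and w formats can be illustrated with the sketch below, which converts an already isolated field to a number. The class and method names are hypothetical, and the sketch does not check the number of decimals given in the node format.

```java
import java.math.BigDecimal;

// Hypothetical sketch of the n and w numeric formats, not FileScanner's own code.
class NumericFormats {
    /** n format: US notation, commas group the integer part and "." is the decimal mark. */
    static BigDecimal parseUs(String field) {
        return new BigDecimal(field.trim().replace(",", ""));
    }

    /** w format: Norwegian notation, periods group the integer part and "," is the decimal mark. */
    static BigDecimal parseNorwegian(String field) {
        return new BigDecimal(field.trim().replace(".", "").replace(',', '.'));
    }

    public static void main(String[] args) {
        System.out.println(parseUs("999,999.99"));
        System.out.println(parseNorwegian("999.999,99"));
    }
}
```

Both calls in main yield the same value, since the two formats differ only in which character groups the digits and which marks the decimals.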