FileScanner is a Java application (executable jar net.ligaya.p3.filescanner) which scans
"unstructured" (table) data files (e.g. in HTML format) and converts them into
"structured" data files.
The scanning is based on a format file which specifies the
"syntax" of the files to be scanned as well as the
syntax of the generated file.
JAVA REQUIREMENTS
The Java interpreter (java.exe) must be in the
path (Control Panel->System->Advanced->Environment
Variables, where the path is one of the User Variables). The path
must include the Java bin directory, e.g. c:\j2sdk1.4.2_10\bin.
COMMAND LINE ARGUMENTS
FileScanner has four mandatory command line arguments
(separated by at least one blank):
- file-list file name
- format-file file type
- execution trace file
- error file
Example (invoking FileScanner from the command line):
java -jar net.ligaya.p3.filescanner.jar c:\pro3kbs\mykb\files.tmp 3ff c:\pro3kbs\mykb\fscanner.log c:\pro3kbs\mykb\fscanner.err /OT=scantest.3dl /ff=c:\pro3kbs\mykb\
(Note that Pro/3 has a facility to
automatically invoke FileScanner)
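When FileScanner is started from another Java program instead of a shell, the same argument order applies. The following is a minimal sketch only: the class FileScannerCommand and its build method are hypothetical helpers, not part of FileScanner, and the jar is assumed to be in the working directory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of FileScanner.
final class FileScannerCommand {

    // Returns the argument vector: "java -jar <jar>", the four mandatory
    // arguments in the documented order, then any optional /XX= options.
    static List<String> build(String fileList, String formatType,
                              String traceFile, String errorFile,
                              String... options) {
        List<String> cmd = new ArrayList<>(Arrays.asList(
                "java", "-jar", "net.ligaya.p3.filescanner.jar",
                fileList, formatType, traceFile, errorFile));
        cmd.addAll(Arrays.asList(options));
        return cmd;
    }
}
```

The resulting list can be handed to java.lang.ProcessBuilder. Keeping the mandatory arguments as separate parameters makes it hard to pass them in the wrong order.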
FILE-LIST FILE
The file-list file is an ASCII-file with one record per file
to be scanned.
Example
C:\OSE-HTML\TEST\st000313.stk
C:\OSE-HTML\TEST\st000314.stk
C:\OSE-HTML\TEST\st000315.stk
C:\OSE-HTML\TEST\st000317.stk
C:\OSE-HTML\TEST\st000321.stk
C:\OSE-HTML\TEST\st000322.stk
C:\OSE-HTML\TEST\st000323.stk
C:\OSE-HTML\TEST\st000324.stk
C:\OSE-HTML\TEST\st000329.stk
C:\OSE-HTML\TEST\st000330.stk
C:\OSE-HTML\TEST\st000331.stk
C:\OSE-HTML\TEST\st000403.stk
C:\OSE-HTML\TEST\st000404.stk
C:\OSE-HTML\TEST\st000405.stk
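A file-list file like the one above can be written by hand or generated. The sketch below shows one way to generate it, assuming all input files sit in one directory and share a file type; FileListWriter is a hypothetical helper and not part of FileScanner.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of FileScanner.
final class FileListWriter {

    // Writes one record (absolute path) per file in dir that has the given
    // file type, and returns the records that were written.
    static List<String> write(Path dir, String type, Path fileList) throws IOException {
        List<String> records;
        try (Stream<Path> files = Files.list(dir)) {
            records = files
                    .filter(Files::isRegularFile)
                    .filter(p -> p.getFileName().toString().endsWith("." + type))
                    .map(p -> p.toAbsolutePath().toString())
                    .sorted()
                    .collect(Collectors.toList());
        }
        // The file-list file is documented as an ASCII file.
        Files.write(fileList, records, StandardCharsets.US_ASCII);
        return records;
    }
}
```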
FORMAT FILE
The format file specifies the "syntax" of the input
file and how the scanned components from the file are written
to the output file. The codes used in the format file are
described below.
Example
[STA]="Oms i går"
[END]="<!-- /BORSDATA -->"
[SUB]="</TD>" ";"
[SUB]="</td>" ";"
[SUB]=" " ""
[SKB]="<" ">"
[NUL]="" ";"
[SIR]=";"
[EIR]=WL ";" EOF
[INP]=WL <ticker> ";" ";" ";" ";" <close> ";" <high> ";" <low> ";" <volume> ";" ";" ";" ";"
[OHD]=";" L "is reported;daily stock transaction" L
[OUT]=<close> ";" <high> ";" <low> ";" <ticker> ";" <volume> ";" FILENAME(1,8) L
----- nodes -----
<ticker>=aN
<close>=n.2
<high>=n.2
<low>=n.2
<volume>=i
There are four optional execution options:
- /OT=xxx where xxx is either a file type or a file name (incl. type). One output file is generated if xxx is a file name. One output file per input file is generated if xxx is a file type.
- /FF=xxx where xxx is the path for format file(s).
- /OF=xxx where xxx is the path for output file(s).
- /TR= activates execution trace to the execution trace file (for debugging purposes).
Default values:
- /OT=ftc
- /FF=iii where iii is the path of the corresponding input file.
- /OF=iii where iii is the path of the corresponding input file, or of the file-list file if /OT is a file name.
Note that the format file corresponding to input file
nnnnn.ttt is ttt.fff, where fff is the format-file file type.
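The naming rule in the note can be written out as a small function. This is a sketch of the rule as stated, not FileScanner's own code; FormatFileName is a hypothetical name.

```java
// Hypothetical helper illustrating the naming rule; not FileScanner code.
final class FormatFileName {

    // Input "nnnnn.ttt" plus format-file type "fff" gives "ttt.fff".
    static String forInput(String inputFile, String formatType) {
        // Drop any path first, so that a dot in a directory name is not
        // mistaken for the separator between name and type.
        int slash = Math.max(inputFile.lastIndexOf('\\'), inputFile.lastIndexOf('/'));
        String name = inputFile.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        if (dot < 0 || dot == name.length() - 1) {
            throw new IllegalArgumentException("no file type in: " + inputFile);
        }
        return name.substring(dot + 1) + "." + formatType;
    }
}
```

With the command line example, input file C:\OSE-HTML\TEST\st000313.stk and format-file file type 3ff give the format file name stk.3ff, which is looked up in the /FF path.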
FileScanner
Format File
The format file contains several records of various types. The
record types are listed below.
I. FORMAT-FILE FORMAT
Tag-record
Tag-record
:
----- nodes -----
Node-record
Node-record
:
II. TAG RECORDS
Tag-records have the general format [tag]=tag_definition tag_definition.
Scanning is carried out according to the following steps:
1. Preprocessing of the file, optionally including removal of front- and tail-parts, deletion of bracketed sub-strings and string substitutions.
2. Optional scanning up to the first input record.
3. Checking for the end-of-input-records condition.
   - If not end-of-records:
     - Scanning of input record.
     - Generation of output record.
     - Repeat from 3.
   - If end-of-records:
     - Exit.
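The loop formed by step 3 and its sub-steps can be sketched as follows. This is an illustration of the control flow only, not FileScanner's implementation: the record layout (two fields, each terminated by ";") and the end condition (only white space left) are hard-coded assumptions, whereas FileScanner reads them from the format file.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the scan loop; record layout is hard-coded.
final class ScanLoopSketch {

    // Scans records of the assumed form  name ";" value ";"  and returns
    // one output record  value ";" name  per input record.
    static List<String> scan(String text) {
        List<String> out = new ArrayList<>();
        int pos = 0; // preprocessing and skipping to the first record are omitted
        while (true) {
            // Step 3: end-of-input-records check (here: only white space left).
            int p = pos;
            while (p < text.length() && Character.isWhitespace(text.charAt(p))) {
                p++;
            }
            if (p >= text.length()) {
                break; // end-of-records: exit
            }
            // Scan one input record: two fields, each ended by ";".
            int first = text.indexOf(';', p);
            int second = first < 0 ? -1 : text.indexOf(';', first + 1);
            if (second < 0) {
                throw new IllegalStateException("incomplete record at " + p);
            }
            String name = text.substring(p, first);
            String value = text.substring(first + 1, second);
            // Generate one output record, with the fields reordered.
            out.add(value + ";" + name);
            pos = second + 1; // repeat from step 3
        }
        return out;
    }
}
```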
Preprocessing:
[END] = string
Directs the scanner to ignore all text after (and including) the first occurrence of the given string. The string may or may not be present in the scanned file.
[SKB] = string ...
Skips all text within the bracket strings (including the bracket strings themselves). The first string specifies the opening bracket, while the second string specifies the closing bracket.
Note! It is assumed (i) that there is always a right bracket matching a left bracket, and (ii) that brackets are not nested.
[STA] = string
Directs the scanner to ignore all text prior to (and including) the first occurrence of the given string. The string may or may not be present in the scanned file.
[SUB] = string ...
Substitutes all occurrences of the first string with the second string.
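The effect of the four preprocessing tags on a text can be sketched with plain string operations. The class and method names are hypothetical, and the handling of an unmatched opening bracket (the rest of the text is kept) is an assumption, since the documentation assumes that case never occurs.

```java
// Hypothetical sketch of the preprocessing tags; not FileScanner code.
final class Preprocessor {

    // [STA]: drop everything up to and including the first occurrence.
    static String cutBefore(String text, String sta) {
        int i = text.indexOf(sta);
        return i < 0 ? text : text.substring(i + sta.length());
    }

    // [END]: drop everything from the first occurrence onwards.
    static String cutAfter(String text, String end) {
        int i = text.indexOf(end);
        return i < 0 ? text : text.substring(0, i);
    }

    // [SUB]: replace all occurrences of the first string with the second.
    static String substitute(String text, String from, String to) {
        return text.replace(from, to);
    }

    // [SKB]: remove every open...close span, brackets included.
    static String skipBrackets(String text, String open, String close) {
        StringBuilder sb = new StringBuilder();
        int pos = 0;
        while (true) {
            int o = text.indexOf(open, pos);
            int c = o < 0 ? -1 : text.indexOf(close, o + open.length());
            if (c < 0) {
                // No further complete bracket pair: keep the remainder.
                sb.append(text, pos, text.length());
                return sb.toString();
            }
            sb.append(text, pos, o);
            pos = c + close.length();
        }
    }
}
```

In the example format file, "</TD>" is substituted by ";" and "<" ">" brackets are skipped. The substitution has to take effect first, because skipping the brackets first would also remove the "</TD>" tags. Whether FileScanner applies the tags in format-file order or in a fixed order is not stated, so the order is left to the caller in this sketch.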
Global settings used in input record scanning:
[ERR] = string ...
Directs the scanner to recognize the given strings as error values. Records with error values are not written to the output, and a warning is written in the log file.
[NUL] = string string ...
Directs the scanner to regard the first string in each string pair as a null value. The second string in each pair must be the string following the null value in the input text. The null values actually used when null value strings are encountered are "0" (i), "0.0" (n), "" (a) and the date 01-Jan-1900 (d).
Input record scanning:
[SIR] = input_field ... (any of the following: string, W, L, WL)
Scans the specified fields for the purpose of reaching the start of the input records.
[EIR] = input_field ... (any of the following: string, W, L, WL, EOF)
Specifies a field pattern which signifies the end of the input records (only specified if the end of the input records is before end-of-file after preprocessing).
[INP] = input field input field
Defines the format of what is considered the file's (input) record format. There will normally be many records in the file.
[END] = input field input field ...
Defines the format of what constitutes the end of all input records (this is end-of-file by default).
[OUT] = output field output field
Defines a footer-record which will be generated after all the output records.
Output record formatting:
[OHD] = output field output field
Defines the format of the output header record. One output header record is written before the output records.
[OUT] = output field output field
Defines the format of the output record. One output record is generated for each input record.
input field
- string (scans up to and including the first occurrence of the given string; the given string is enclosed in quotes)
- node (scans a named node)
- L (scans up to and including the first line break)
- W (scans up to the end of white space, if any, excluding line breaks)
- WL (scans up to the end of white space, if any, including line breaks)
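Each input field (other than a node) moves the scan position forward in the text. The sketch below expresses that as functions from a position to a new position. The names are hypothetical, and two details are assumptions: W treats only blanks and tabs as white space, and a line break is taken to end with "\n".

```java
// Hypothetical sketch of the non-node input fields; not FileScanner code.
final class InputFields {

    // string: position just past the first occurrence of s, or -1 if absent.
    static int scanString(String text, int pos, String s) {
        int i = text.indexOf(s, pos);
        return i < 0 ? -1 : i + s.length();
    }

    // L: position just past the first line break (or end of text).
    static int scanL(String text, int pos) {
        int i = text.indexOf('\n', pos);
        return i < 0 ? text.length() : i + 1;
    }

    // W: skip blanks and tabs, but stop at a line break.
    static int scanW(String text, int pos) {
        while (pos < text.length()
                && (text.charAt(pos) == ' ' || text.charAt(pos) == '\t')) {
            pos++;
        }
        return pos;
    }

    // WL: skip all white space, line breaks included.
    static int scanWL(String text, int pos) {
        while (pos < text.length() && Character.isWhitespace(text.charAt(pos))) {
            pos++;
        }
        return pos;
    }
}
```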
output field
- string (outputs the given string)
- node (outputs a named node)
- L (outputs a line break)
- Q (outputs a double quote)
- FILENAME(n,m) (outputs all or part of the scanned file's filename, i.e. excluding path and type. n is the start position (1,2,..) and m is the length (or 0 if up to the end of the name))
- FILEPATH(n,m) (outputs all or part of the scanned file's path. n is the start position (1,2,..) and m is the length (or 0 if up to the end of the path))
- FILETYPE(n,m) (outputs all or part of the scanned file's filetype. n is the start position (1,2,..) and m is the length (or 0 if up to the end of the type))
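The (n,m) arguments select a substring with a 1-based start position and a length, where length 0 means "to the end". A minimal sketch, assuming out-of-range values are clamped to the string (how FileScanner treats them is not stated); FileNameFields is a hypothetical name.

```java
// Hypothetical sketch of the (n,m) selection; not FileScanner code.
final class FileNameFields {

    // n is the 1-based start position, m the length (0 = up to the end).
    // Assumption: values outside the string are clamped, not rejected.
    static String part(String s, int n, int m) {
        int start = Math.min(Math.max(n, 1) - 1, s.length());
        int end = (m == 0) ? s.length() : Math.min(start + m, s.length());
        return s.substring(start, end);
    }

    // The file name without path and type, as used by FILENAME.
    static String fileName(String fullPath) {
        int slash = Math.max(fullPath.lastIndexOf('\\'), fullPath.lastIndexOf('/'));
        String name = fullPath.substring(slash + 1);
        int dot = name.lastIndexOf('.');
        return dot < 0 ? name : name.substring(0, dot);
    }
}
```

For the input file C:\OSE-HTML\TEST\st000313.stk, FILENAME(1,8) as used in the example format file corresponds to part(fileName(path), 1, 8) and gives st000313.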
other fields
III. NODE RECORDS
A node record defines a node:
<node_name> = format
- a (examples: a9 a14 aN)
  - n scans n positions
  - N scans up to the start of the next input field, which must be a string
- d (examples: dDD_MM_YY dDD_MMM_YYYY)
  - DD_MM_YY
  - DD_MMM_YY
  - DD_MM_YYYY
  - DD_MMM_YYYY
- i (examples: iB i. i, i.5)
  - B ignore blanks between digits
  - . optional periods in the integer (999.999)
  - , optional commas in the integer (999,999)
  - .n n character wide field with optional periods in the integer (999.999)
  - ,n n character wide field with optional commas in the integer (999,999)
  - i scans up to the first non-blank and non-digit character
- n US numeric format (999,999.99) (examples: n.3 n.12)
  - .n number with n (0,1,..) decimals after the decimal point (.) and optional commas in the integer part
- w Norwegian numeric format (999.999,99) (examples: w.0 w.12)
  - .n number with n (0,1,..) decimals after the decimal separator (,) and optional periods in the integer part
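The difference between the n and w formats is only which character is the decimal separator and which is the optional grouping character. The sketch below parses both into the same value. NumericNodes is a hypothetical helper, and rejecting a field with the wrong number of decimals is an assumption about the intended behaviour.

```java
import java.math.BigDecimal;

// Hypothetical helper; shows the two separator conventions only.
final class NumericNodes {

    // n.<decimals>: US format such as 999,999.99 (commas optional).
    static BigDecimal parseUs(String field, int decimals) {
        return parsePlain(field.trim().replace(",", ""), decimals);
    }

    // w.<decimals>: Norwegian format such as 999.999,99 (periods optional).
    static BigDecimal parseNorwegian(String field, int decimals) {
        return parsePlain(field.trim().replace(".", "").replace(',', '.'), decimals);
    }

    // Assumption: a field with the wrong number of decimals is rejected.
    private static BigDecimal parsePlain(String plain, int decimals) {
        BigDecimal value = new BigDecimal(plain);
        if (value.scale() != decimals) {
            throw new NumberFormatException(
                    "expected " + decimals + " decimals: " + plain);
        }
        return value;
    }
}
```

Under these assumptions, "1,234.50" scanned as n.2 and "1.234,50" scanned as w.2 give the same number.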