Wednesday, April 22, 2015

Shell: loop through multiple files and pass them to the script

Short intro

The script file ./main.sh is a bash script that processes htm/html files. HTML is the name of the folder holding them. Problem: pass every .htm file in that folder to the script as a parameter for processing.

Passing files as parameters

# cd HTML
# ll
-rwxr-xr-x 1 oracle root   25828 Apr 13 15:03 filter
-rwxr-xr-x 1 oracle root    9955 Apr 13 15:03 filter.c
-rw-r--r-- 1 oracle root    2142 Apr 22 08:37 loader.ctl
-rw-r--r-- 1 oracle root    9792 Apr 22 08:46 loader.log
-rwxr-xr-x 1 oracle root    2648 Apr 22 10:04 main.sh
-rw-r--r-- 1 root root     781 Apr 22 09:49 10L31JKYRF5UH4.htm
-rw-r--r-- 1 root root     641 Apr 22 09:49 10L31JUKER1WP1.htm
-rw-r--r-- 1 root root     904 Apr 22 09:49 10L31JULIT5LI3.htm
-rw-r--r-- 1 root root     858 Apr 22 09:49 10L31JUOER5GT3.htm
-rw-r--r-- 1 root root     683 Apr 22 09:49 10L31JUPEM9TH9.htm
........................
# for FILE in *.htm; do ./main.sh "$FILE"; done
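
The same loop can be wrapped in a small function that also copes with an empty folder and with file names containing spaces. This is only a minimal sketch, assuming bash; the function name process_all is made up for illustration, and the command to run (./main.sh in this post) is passed in as an argument.

```shell
#!/bin/bash
# Run a command once per *.htm file found in a directory.
# process_all is a hypothetical helper name, not part of the original script.
process_all() {
    local dir="$1"; shift
    local file
    for file in "$dir"/*.htm; do
        # When nothing matches, the glob stays literal; skip that case.
        [ -e "$file" ] || continue
        # Quote the variable so names with spaces are passed as one argument.
        "$@" "$file"
    done
}

# Usage (assumed layout, run from inside the HTML folder):
#   process_all . ./main.sh
```

Passing the command as arguments keeps the loop reusable, for example with echo for a dry run.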

Tuesday, April 21, 2015

TIFF: alternatiff, npapi, Chrome 42

Sad news from Chrome

April 2015

In April 2015 (Chrome 42) NPAPI support will be disabled by default in Chrome and we will unpublish extensions requiring NPAPI plugins from the Chrome Web Store.

September 2015

In September 2015 (Chrome 45) we will remove the override and NPAPI support will be permanently removed from Chrome. Installed extensions that require NPAPI plugins will no longer be able to load those plugins.

Temporary solution

Sad news for plugin users; in my case it's Alternatiff. I have 9 other plugins whose loss is less painful. After the recent update to Chrome v42 the Alternatiff plugin is no longer loaded. Check it here: chrome://plugins.
There is a temporary fix for now, found on the Java page: https://java.com/en/download/faq/chrome.xml#npapichrome

Here is what you do:

1. In your URL bar, enter:
chrome://flags/#enable-npapi 
2. Click the Enable link for the Enable NPAPI configuration option.
3. Click the Relaunch button that now appears at the bottom of the configuration page.

    Monday, April 20, 2015

    SQL Loader: multiple tables, multiple problems

    Short intro

    Continuous development of the Ajax Crawler importer led to data import problems. Oracle SQL Loader was used in all previous versions of my crawler, but this time the structure is multi-table in both the datafile and the database. I spent almost 2 weeks on this simple subject, and even after someone pointed out the solution I was not able to find more than 2 references online, so hopefully this is the third one for you: "position(1)". You must reset the loader position if you are importing into multiple tables, even though this directive looks like a fixed-length argument. The table and datafile examples are shortened, just enough to explain the control file.

    Tables

    I'm importing invoices into three tables: inv_invoices_imp, inv_invoice_lines_imp, inv_invoice_comments_imp. inv_invoices_imp contains the invoice header, inv_invoice_lines_imp contains accounting and invoice line data, and inv_invoice_comments_imp contains user comments.

    Datafile

    The datafile is an HTML file containing 6 different tables/blocks: some text, header table, some text, lines table, comments table, some text. One datafile contains the data of one invoice.

    SQL Loader config

    Oracle SQL Loader can read, parse and load almost any type of data: fixed-length or delimiter-separated fields, single or multiple sources and destinations. In our case we have a multi-structured datafile and three different destination tables. Options used: truncate table, skip rows, conditional rows, fillers, sequences, foreign keys. If you are stuck with SQL Loader loading only empty lines and reporting no errors in the log files, here is a checklist: column names, data types, missed separators, encoding. The only full list I was able to find is here, except for the position(1) part.

    Full loader.ctl file

    OPTIONS (SKIP=1)
    LOAD DATA
    CHARACTERSET UTF8
    INTO TABLE inv_invoices_imp
    TRUNCATE
    --APPEND
    WHEN (1:1) = 'H'
    FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' trailing nullcols
    (dummy0 FILLER, VENDOR_NUM, ACCOUNT_NUM, VENDOR_NAME, VENDOR_ORG, INVOICE_NUM, INVOICE_DATE "to_date(:INVOICE_DATE,'MM/DD/YYYY')",
     DUE_DATE "to_date(:DUE_DATE,'MM/DD/YYYY')", VALUTA, AMOUNT "to_number(:AMOUNT,'99999999999.9999')", VALUTA_EX "to_number(:VALUTA_EX,'99999999999.9999')",
     AMOUNT_NOK "to_number(:AMOUNT_NOK,'99999999999.9999')", KID, BILAGSNR, dummy1 FILLER, dummy2 FILLER,  TAX "to_number(:TAX,'99999999999.9999')",
     dummy3 FILLER, dummy4 FILLER, dummy5 FILLER, dummy6 FILLER, dummy7 FILLER, dummy8 FILLER, dummy9 FILLER, DERESREF, dummy10 FILLER,
     dummy11 FILLER, dummy12 FILLER, INVOICE_ID EXPRESSION "INV_INVOICES_IMP_SEQ.nextval"
    )
    INTO TABLE inv_invoice_lines_imp
    TRUNCATE
    WHEN (1:1) = 'L'
    FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' trailing nullcols
    (dummy0 FILLER POSITION(1), S1, S1_NAME, S2, S3, BELOP "to_number(:BELOP, '999999999999.9999')", DESCRIPTION, VAT_ID,
     VAT_AMOUNT "to_number(:VAT_AMOUNT, '999999999999.9999')", BELOP_NOK "to_number(:BELOP_NOK, '999999999999.9999')",
     S4, S5, S6, S7, dummy1 FILLER, FAKTURAID EXPRESSION "INV_INVOICES_IMP_SEQ.currval", ID EXPRESSION "INV_INVOICE_LINES_IMP_SEQ.nextval"
    )
    INTO TABLE inv_invoice_comments_imp
    TRUNCATE
    WHEN (1:1) = 'C'
    FIELDS TERMINATED BY ';' OPTIONALLY ENCLOSED BY '"' trailing nullcols
    (dummy FILLER POSITION(1), CUSER, ACTION, CDATE, DESCR, FAKTURAID EXPRESSION "INV_INVOICES_IMP_SEQ.currval", 
     ID EXPRESSION "INV_INVOICE_COMMENTS_IMP_SEQ.nextval"
    )

    Previous problems and config explanation

    UTF8 - charset spec. I suggest you use it even if both your file and your database are already unicode.
    FILLER - a useful argument. The column name that goes with it does not have to exist in the database table, but within one table there can't be duplicates, so name them dummy1, dummy2, etc. If you don't know it yet: in the field list you specify the order of the data in the datafile using destination table column names. Use FILLER on the data columns that you want to skip.
    to_date, to_number - a must if your destination column is a date or a number. I suggest importing everything as VARCHAR2 at first, then converting to the desired datatypes and checking them one by one.
    EXPRESSION .NEXTVAL - worth a mention. The value does not have to be in the datafile, but it is essential for creating the foreign key relation with the related tables.
    POSITION(1) - hopefully this is the directive you are here for. It is used twice, once in each of the two child tables, on the first field of the field list. When loading into more than one table, the position has to be reset for each table after the first one by putting POSITION(1) on its first field, even though it looks like a fixed-length directive. If you miss this directive you will end up with nice empty table rows containing only sequences and foreign keys, and no errors in the log file. With some luck you might see the "all fields were null" message, but you must be very lucky, and it usually shows up because of some other related error.
    EXPRESSION .CURRVAL - nothing magical, but this is how you establish the relation with your parent table. Hopefully your data complexity is similar. I'm also using the child tables' ID sequences in the control file just to show you the full view. Child ID generation is only needed if you use the conventional data load path.
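
To see the POSITION(1) reset in isolation, here is a stripped-down two-table control file, written out by a small shell function. It is only a sketch: the table names parent_imp and child_imp, their columns and the function name write_demo_ctl are all hypothetical, and APPEND is used instead of TRUNCATE so that nothing gets emptied by accident.

```shell
#!/bin/bash
# Write a minimal two-table SQL Loader control file to the given path.
# parent_imp, child_imp and their columns are hypothetical names.
write_demo_ctl() {
    cat > "$1" <<'EOF'
LOAD DATA
INTO TABLE parent_imp
APPEND
WHEN (1:1) = 'H'
FIELDS TERMINATED BY ';' TRAILING NULLCOLS
(marker FILLER, name)
INTO TABLE child_imp
APPEND
WHEN (1:1) = 'L'
FIELDS TERMINATED BY ';' TRAILING NULLCOLS
-- second table: rewind to the start of the record on the first field
(marker FILLER POSITION(1), descr)
EOF
}

# Usage: write_demo_ctl demo.ctl
```

Only the second INTO TABLE block carries the reset; the first table starts reading at the beginning of the record anyway.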

    Sample datafile (chopped)

    <...>
    H;Leverandørnr;Bankkontonr;Leverandørnavn;Organisasjonsnr;Fakturanr;Fakturadato;Forfallsdato;Valuta;Fakturabeløp;Valutakurs;FakturabeløpNOK;KID;Bilagsnr;Scannebatch;Duplikat;Mvabeløp;Nettobeløp;Fakturatype;Val.dok;Selskapskode;Selskap;Refusjon postnr sted;refusjon Land;Deres Ref;Refusjon navn;Refusjon adresse;
    H;40013;62190581506;TUR-RETUR AS - NO 870 989 587;870989587;105358;2/19/2015;3/1/2015;NOK;5064;1;5064;103071053583;80746991;;N;403;4661;1;;FT;GatoFly AS;;;;;;
    L;Konto;Kontonavn;Avdeling;Prosjekt;Beløp;Bilagstekst;MVA-kode;MVA beløp;Beløp NOK;Anlegg;Produkt;Salgssted;Kanal;Sats
    L;7135;Reisekostnader;4500;1400;220;Nye FT. opphold H.Hernes 26-28.2/1-3.3;0;0;220;;;;;0
    L;7135;Reisekostnader;4500;1400;4844;Nye FT. opphold H.Hernes 26-28.2/1-3.3;1D;358.81;4844;;;;;8
    L;Fakturahistorikk
    C;Bruker;Handling;Dato;Kommentarer
    C;BTIP Connector ;Lagret av BTIPC ;2/24/2015 11:12:12 AM ;E-invoice saved by BTIPC
    C;brigde ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;Autosirk- referanse blank
    C;brigde ;Grunnlagsdata endret ;2/24/2015 11:28:04 AM ;fakturatype-1
    C;BTHANDLER ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;matchSupplierAccount. match på konto.40013
    C;BTHANDLER ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;Endret flytstatus
    C;BTHANDLER ;Kommentar lagt til ;2/24/2015 11:28:04 AM ;setCompName OK.
    <...>

    As you can see, the data is semicolon separated, the first column is the destination identifier, and the date and number formats are visible as well. The data has some junk text lines, but that does not matter now. No external IDs or references are used. Invoice line records ("L") can be anywhere in the file; it does not matter that here they sit between the header and the comments. The first column and some others are marked as FILLER in the control file.
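
Before calling the loader it can help to count how many records carry each marker, so the numbers can be compared with the row counts in the loader log. A minimal sketch, assuming the H/L/C markers described above; the function name count_markers is made up for illustration.

```shell
#!/bin/bash
# Count records per leading marker (H, L, C, anything else) in a datafile.
# count_markers is a hypothetical helper name.
count_markers() {
    awk -F';' '
        # Tally by first field; lines without a known marker go under "other".
        { key = ($1 == "H" || $1 == "L" || $1 == "C") ? $1 : "other"; n[key]++ }
        END { printf "H=%d L=%d C=%d other=%d\n", n["H"], n["L"], n["C"], n["other"] }
    ' "$1"
}

# Usage: count_markers data_file.html.out
```

Note that the counts include the column title records, which also start with H, L or C.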

    HTML to datafile

    I will reveal some more cards for you. The data file was an old HTML file: it was missing end-tags, it used a couple of self-invented tags, and the data formatting was not very handy either. Here is the full source of the bash script I used to prepare the file for reading.

    #!/bin/bash
    file="$1"
    echo "processing $file"

    echo converting to unicode
    iconv -f utf-16 -t utf-8 "$file" > "$file".out
    echo done

    echo HTML cleanup
    less "$file".out | tr ',' '.' | sed 's/ //g' | sed 's/\cM//g' | sed 's/\cW//g' | sed 's/<\/TR>/<\/TR> /g' | sed 's| sed 's/ / \n/g' | sed ':a;N;$!ba;s|\n
    sed 's/ / \n/g' > "$file".clean

    echo cleanup complete

    echo header and lines separation
    ./filter -t 2 -c 2 -f "$file".clean > "$file".tmp
    ./filter -t 3 -f "$file".clean > "$file".lines
    ./filter -t 4 -f "$file".clean > "$file".comments
    echo done separating

    echo transposing headers
    cols=2; for ((i=1; i<=cols; i++)); do awk -F ";" -v col="$i" 'BEGIN{ORS=";"} {print $col}' "$file".tmp | tr '\n' ' '; echo; done > "$file".header
    echo transposed

    echo cleanup
    rm -f "$file".out
    rm -f "$file".clean
    rm -f "$file".tmp
    echo cleaned up

    echo single file
    sed -e 's/^/H;/' "$file".header > "$file".out
    sed -e 's/^/L;/' "$file".lines >> "$file".out
    sed -e 's/^/C;/' "$file".comments >> "$file".out
    echo joined

    echo sql loader start
    sqlldr schema/******@sid data="$file".out control=loader.ctl discard="$file".discard
    echo loaded

    Bash script usage is simple:
    # script.sh data_file.html

    Conversion explanations

    Conversion - my html file was encoded in utf16, so the first step is to get a readable file instead of a binary-looking one.
    HTML cleanup - examples and more explanations are available in the previous post Crawling AjAx part 2. In this case I had to add the missing end-tags, generate data separators, and move some new lines back and forth to get a readable file.
    Filter - a modified HTML table selection script. The source is also available in Crawling AjAx part 2. This script picks the desired table and column data from a formatted HTML file.
    Transposition - a new problem: the header table data is vertical, while lines and comments are horizontal. The header has to be separated and made horizontal as well.
    Last steps - the transposed data is joined back into a working file, and each table's data gets a distinctive line marker to be used with SQL Loader. The last step is the SQL Loader call. You can skip the cleanup step to see the temporary working files if needed.
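
The transposition step can also be done in a single awk pass instead of one pass per column. This is an alternative sketch, not the code used in the script above; the function name transpose is made up, and the input is assumed to be semicolon-separated with one name;value pair per line.

```shell
#!/bin/bash
# Turn a vertical semicolon-separated table into a horizontal one:
# input column i becomes output line i, every cell followed by ";".
# transpose is a hypothetical helper name.
transpose() {
    awk -F';' '
        {
            # Remember every cell by (row, column) and track the widest row.
            for (i = 1; i <= NF; i++)
                cell[NR, i] = $i
            if (NF > cols) cols = NF
        }
        END {
            # Emit one output line per input column.
            for (i = 1; i <= cols; i++) {
                line = ""
                for (r = 1; r <= NR; r++)
                    line = line cell[r, i] ";"
                print line
            }
        }
    ' "$1"
}

# Usage: transpose data_file.html.tmp > data_file.html.header
```

As in the original loop, every output line ends with a semicolon, which the loader tolerates thanks to trailing nullcols.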

    Contact

    Contact me simakas[at]gmail.com for details or original source code if needed.