**Lenny Forziati** · 02-17-2006, 04:42 PM

I think you misunderstand what extract_all_strings() does. You are asking it to give you everything between name="ngr" and name="lonb", which it seems to be doing just fine. It is not parsing your HTML and looking at tags, words, or anything else, just ASCII content.

You need to identify the text immediately preceding and immediately following the text you want to extract and use those as your delimiters. I don't see any common delimiters around all blocks, so you're probably going to need to make several calls to extract_string() with different delimiters each time.

Also if everything in red is what you want to have in plain_text, you cannot use *html_to_plain(). *html_to_plain()'s purpose is to strip all HTML tags from a string.
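
To make the point about delimiters concrete, here is a minimal Python sketch (not Xbasic) of delimiter-based extraction as described above. The helper name `extract_between` and the sample HTML are hypothetical, and returning an empty string when a delimiter is missing is an assumption, not documented extract_string() behavior.

```python
def extract_between(text, start, end):
    """Return the text between the first `start` and the next `end`.

    Assumption: returns "" when either delimiter is not found.
    No HTML parsing happens here; this is plain substring matching.
    """
    i = text.find(start)
    if i == -1:
        return ""
    i += len(start)
    j = text.find(end, i)
    if j == -1:
        return ""
    return text[i:j]


# Hypothetical input, shaped like the fields discussed in this thread.
sample = '<input name="ngr" value=T&#032;26489&#032;95989><input name="latd" value=53>'

# Tight delimiters return just the value.
ngr = extract_between(sample, 'name="ngr" value=', '>')
latd = extract_between(sample, 'name="latd" value=', '>')

# Loose delimiters return everything in between, markup included.
wide = extract_between(sample, 'name="ngr"', 'name="latd"')

print(ngr)   # T&#032;26489&#032;95989
print(latd)  # 53
print(wide)  # " value=T&#032;26489&#032;95989><input " (with the surrounding spaces)
```

The `wide` result shows why delimiters that are far from the wanted text return the markup in between as well.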

Originally posted by Graham Wickens

I can't seem to get the "extract_all_strings" function to work properly.

here is the code:

html_text = extract_all_strings(html_text,"name=\"ngr\"","name=\"lonb\"")

plain_text = *html_to_plain(html_text)

htm.write_line(plain_text)

for example of input htm file see attached file [Input html.doc]

text in red is what I am trying to extract.

what I am getting is in attached file [Output Text.txt]

any ideas welcomed.

**Graham Wickens** · 02-17-2006, 07:30 PM

Ok Lenny, I tried your suggestion, splitting it into separate scans for each value:

html_text = extract_all_strings(html_text,"name=\"ngr\" value=",">","~")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

html_text = extract_all_strings(html_text,"name=\"latd\" value=",">","~")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

html_text = extract_all_strings(html_text,"name=\"latm\" value=",">","~")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

html_text = extract_all_strings(html_text,"name=\"lats\" value=",">","~")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

html_text = extract_all_strings(html_text,"name=\"lond\" value=",">","~")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

html_text = extract_all_strings(html_text,"name=\"lonm\" value=",">","~")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

html_text = extract_all_strings(html_text,"name=\"lons\" value=",">")
plain_text = *html_to_plain(html_text)
htm.write_line(plain_text)

The first one works, gives me the value I wanted, the rest just give (CRLF) even though I defined "~" instead of (CRLF) on all scans except the last one, so that I had all the related values on one line for subsequent editing!!

**Graham Wickens** · 02-17-2006, 07:43 PM

err! I think I found the first mistake. I reran it with the following code, but it still ignores the "~" and still inserts (CRLF), but at least it now gets the values I was after.

html_text = File.to_string(path0+file_name[j])

html1_text = extract_all_strings(html_text,"name=\"ngr\" value=",">","~")
plain_text = *html_to_plain(html1_text)
htm.write_line(plain_text)

html2_text = extract_all_strings(html_text,"name=\"latd\" value=",">","~")
plain_text = *html_to_plain(html2_text)
htm.write_line(plain_text)

html3_text = extract_all_strings(html_text,"name=\"latm\" value=",">","~")
plain_text = *html_to_plain(html3_text)
htm.write_line(plain_text)

html4_text = extract_all_strings(html_text,"name=\"lats\" value=",">","~")
plain_text = *html_to_plain(html4_text)
htm.write_line(plain_text)

html5_text = extract_all_strings(html_text,"name=\"lond\" value=",">","~")
plain_text = *html_to_plain(html5_text)
htm.write_line(plain_text)

html6_text = extract_all_strings(html_text,"name=\"lonm\" value=",">","~")
plain_text = *html_to_plain(html6_text)
htm.write_line(plain_text)

html7_text = extract_all_strings(html_text,"name=\"lons\" value=",">")
plain_text = *html_to_plain(html7_text)
htm.write_line(plain_text)

**Lenny Forziati** · 02-17-2006, 07:48 PM

First, there's no reason for you to use extract_all_strings(); you should just be using extract_string(). The difference is that extract_all_strings() returns all matches. Since you only have a single match for each set of tags, it will be more efficient to stop looking for more matches once the first is found. This is what extract_string() will do for you.

Second, even if you use extract_all_strings(), there is no need to specify the delimiter as "~". What this would do is insert a "~" between matches instead of a crlf(). But again, you'll only have a single match so you don't need a delimiter at all. You're not getting a crlf() back from extract_all_strings(), you must be getting nothing (""). Your write_line() adds a crlf().

Third, since all but your first result are plain text, there is no need to use *html_to_plain on them. If your data will also be just numbers such as your sample document, I'd remove those extra calls for efficiency.

Finally, here's what works for me. I tested this in the Interactive Window, which is a great way to experiment with the expressions instead of writing to a file, opening the file to see if it was right, then starting all over again.

Code:

dim html_text as c
html_text = get_from_file("c:\george.txt")
?extract_string(html_text,"name=\"ngr\" value=",">")
= "T&#032;26489&#032;95989"

?*html_to_plain(extract_string(html_text,"name=\"ngr\" value=",">"))
= "T2648995989"

?extract_string(html_text,"name=\"latd\" value=",">")
= "53"

? extract_string(html_text,"name=\"latm\" value=",">")
= "0"

? extract_string(html_text,"name=\"lats\" value=",">")
= "00"

? extract_string(html_text,"name=\"lond\" value=",">")
= "06"

? extract_string(html_text,"name=\"lonm\" value=",">")
= "07"

? extract_string(html_text,"name=\"lons\" value=",">")
= "00"
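
The behavior described in the first two points can be modelled with a short Python sketch (again, not Xbasic). The names `extract_all_between` and `write_line` are hypothetical stand-ins; the only things taken from the explanation above are that matches are joined by the delimiter (crlf by default), that a single match needs no delimiter, that no match gives "", and that write_line() appends its own crlf.

```python
CRLF = "\r\n"


def extract_all_between(text, start, end, delimiter=CRLF):
    """Collect every span between `start` and `end`, joined by `delimiter`."""
    matches = []
    pos = 0
    while True:
        i = text.find(start, pos)
        if i == -1:
            break
        i += len(start)
        j = text.find(end, i)
        if j == -1:
            break
        matches.append(text[i:j])
        pos = j + len(end)
    # The delimiter only appears BETWEEN matches, so one match has none.
    return delimiter.join(matches)


def write_line(lines, text):
    """Stand-in for a file write_line(): always appends a crlf."""
    lines.append(text + CRLF)


sample = '<input name="latd" value=53>'
one = extract_all_between(sample, 'name="latd" value=', '>', "~")
none = extract_all_between(sample, 'name="lond" value=', '>', "~")
two = extract_all_between('<b>x</b><b>y</b>', '<b>', '</b>', "~")

out = []
write_line(out, one)
write_line(out, none)

print(repr(one))   # '53'
print(repr(none))  # ''
print(repr(two))   # 'x~y'
print(out)         # ['53\r\n', '\r\n']
```

The last line of output shows how an empty result still produces a bare crlf in the file, which is easy to mistake for the extraction function returning a crlf.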

**Graham Wickens** · 02-18-2006, 09:22 AM

Thanks for your assistance and patience, I finally got the data I was after with this:

FOR j = 3 TO i-1
    html_text = File.to_string(path0+file_name[j])
    html1_text = extract_string(html_text,"name=\"ngr\" value=",">")
    grid_text = stritran(html1_text,"&#032;","")
    if ut(grid_text) = "" then
        goto nextone
    end if
    latd_text = extract_string(html_text,"name=\"latd\" value=",">")
    latm_text = extract_string(html_text,"name=\"latm\" value=",">")
    lats_text = extract_string(html_text,"name=\"lats\" value=",">")
    lond_text = extract_string(html_text,"name=\"lond\" value=",">")
    lonm_text = extract_string(html_text,"name=\"lonm\" value=",">")
    lons_text = extract_string(html_text,"name=\"lons\" value=",">")
    htm.write_line(latd_text+" "+right("00"+latm_text,2)+" "+right("00"+lats_text,2)+" / "+right("000"+lond_text,3)+" "+right("00"+lonm_text,2)+" "+right("00"+lons_text,2)+" = "+left(grid_text,4)+substr(grid_text,5,3))
    nextone:
    statusbar.set_text("Convert HTML to Text for file "+file_name[j])
next
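
For readers following the formatting in the write_line() call above, here is a Python sketch of the same padding and slicing, fed with the sample values from earlier in the thread. The helpers `right`, `left` and `substr` are hypothetical Python stand-ins for the Xbasic functions of the same name; treating substr() as 1-based is an assumption.

```python
def right(s, n):
    """Last n characters of s."""
    return s[-n:]


def left(s, n):
    """First n characters of s."""
    return s[:n]


def substr(s, start, length):
    """`length` characters starting at 1-based position `start` (assumed)."""
    return s[start - 1:start - 1 + length]


# Values as returned by the Interactive Window session earlier in the thread.
grid_text = "T&#032;26489&#032;95989".replace("&#032;", "")
latd, latm, lats = "53", "0", "00"
lond, lonm, lons = "06", "07", "00"

# Prepending zeros and keeping the right-most characters pads each field
# to a fixed width: "0" becomes "00", "06" becomes "006".
line = (latd + " " + right("00" + latm, 2) + " " + right("00" + lats, 2)
        + " / " + right("000" + lond, 3) + " " + right("00" + lonm, 2)
        + " " + right("00" + lons, 2)
        + " = " + left(grid_text, 4) + substr(grid_text, 5, 3))

print(line)  # 53 00 00 / 006 07 00 = T264899
```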

Guidance with extract_all_strings function
