**csda1** · 01-19-2010, 05:30 PM

Re: Help with FileFind.GREP() Expression

Hi James,

Originally posted by jhackney

I'm performing a filefind.grep function on a directory that contains several text files to return the file name(s) of the files that match the words contained in a regular expression (see code below). I have worked the expression 100 ways and cannot retrieve a match of the form search text...other text...search text. I can successfully run the grep on either the beginning string or the ending string, but not both. In addition, I use a program called RegexBuddy that confirms that my expression matches the files in question. Where is my expression going wrong? And why do they call them regular expressions? There is nothing regular about them. Any help is appreciated!

Code:

STRING A = word(linetxt,1,",")
STRING B = word(linetxt,3,",")
sexpr = STRING A +"\(.*\r\n\)*.*"+ STRING B
delete expression_result
expression_result = (filefind.grep("*.txt", sexpr , 0 , "$(Filename), $(stop)","FI"))

A couple of things first. STRING A should be STRING_A, and the same goes for STRING B. Do not use spaces in variable names, field names, function names, etc. Alpha will sometimes allow it, but it is asking for trouble, as the space will often be converted to an underscore. Stick to underscores and alphanumerics for names.

Deleting a variable is also a bad idea, and all variables should be DIM'd. See my tips here.

Also, as far as I know, "F" is not an option for grep expressions, but it can be used in the return string.

Now to your grep expression. I don't know what you are trying to find in the text. Is it a string 1 followed by some other text you specified, or just any text, and then followed by string 1 again? Also, do you want the text to be matched to all be on 1 line? On the chance that the strings might contain regex special characters, they should be escaped if you don't know their contents.

This works for me

Code:

STRING_A = regex_escape(word(linetxt,1,","))
STRING_B = regex_escape(word(linetxt,3,","))
'sexpr = STRING_A +"\(.*\r\n\)*.*"+ STRING_B
sexpr=".*"+STRING_A+".*"+STRING_B
dim expression_result as c
expression_result = filefind.grep("*.txt", sexpr , 0 , "$(Filename)$(stop),","I")
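For readers who know other regex engines better, the same idea (escape both search strings, then join them with a wildcard) can be sketched in Python. This is an illustration only, not Xbasic: `build_same_line_pattern` and the sample strings are hypothetical, and Python's `re.escape` merely plays the role that `regex_escape()` plays above.

```python
import re

def build_same_line_pattern(first: str, second: str) -> re.Pattern:
    """Match `first` followed later on the same line by `second`."""
    # Escape both strings so characters such as '.' or '(' are taken literally.
    return re.compile(re.escape(first) + ".*" + re.escape(second), re.IGNORECASE)

pattern = build_same_line_pattern("Acme (USA) Inc.", "12 Main St.")
same_line = "Vendor: ACME (USA) Inc., 12 Main St., Springfield"
split_lines = "Vendor: Acme (USA) Inc.\n12 Main St., Springfield"

print(bool(pattern.search(same_line)))    # both strings on one line
print(bool(pattern.search(split_lines)))  # strings on different lines
```

In Python, `.` does not cross a line break by default, so this form only finds the two strings when they share a line. Whether that is what you want depends on the answer to the "all on 1 line?" question above.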

**jhackney** · 01-20-2010, 01:08 AM

Re: Help with FileFind.GREP() Expression

CSDA, thanks for the quick reply. String A and String B are not the real variables; I just used them for presentation purposes. I'm old school enough about my syntax that I'm not even comfortable using underscores. Sorry for the confusion in my submission.

I can't tell you why the "F" is in there other than it represents the frustration I have had over the last 48 hours trying to nail down this expression. It could also represent the "F" expletives I was throwing at the screen, although I don't recall seeing this option in the POSIX standards other than for the reason you mentioned.

What I am trying to do is develop a broad expression that will scan through approx 1,000 "semi-structured" text documents. Some documents will only be 1 page while others may reach 10. The data I am looking for is likely to be on the first page but can be anywhere from the first line to the last line. The goal is to match a character string that represents a company's name (string A) and then a character string that represents a company's address (string B). The address may be on the line that follows or at the bottom of the page. Because the documents are not structured like a form, there will never be a consistent location for the strings. As such, I also plan to match the 2 strings in reverse order just in case. The text between the strings may contain any and all possible word characters plus whitespace, tabs, CRLF, etc. When we are done, we are only interested in the documents where the company name and address both match our strings. One will not do without the other.

I am going to try your suggestion and I will let you know what the results are. Until then, I hope I have provided a good explanation of the expression's purpose.

Thanks again!

**G Gabriel** · 01-20-2010, 03:10 AM

Re: Help with FileFind.GREP() Expression

You certainly could do this with regex, but it might be overkill. You could use something as simple as contains() or $:

Code:

v_company="Alpha Software"
v_address="70 Blanchard Road"
Text=<<%str%
#Alpha Software, Inc. makes wonderful software called alpha5.
#They are located at:
#70 Blanchard Road
#Burlington, MA 01803-5100
#%str%
?(contains(text,v_company)).and.(contains(text,v_address))
= .T.
?(contains(text,"Alpha*")).and.(contains(text,"70 Blanchard*"))
= .T.
?(v_company $ text).and.(v_address $ text)
= .T.
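The same two-substring test, sketched in Python for comparison. This is an illustration rather than Xbasic; `has_both` is a hypothetical helper, and Python's `in` operator is case-sensitive, which may or may not match how contains() and $ behave.

```python
def has_both(text: str, company: str, address: str) -> bool:
    """True only when both strings occur somewhere in the text, in any order."""
    return company in text and address in text

text = (
    "Alpha Software, Inc. makes wonderful software called alpha5.\n"
    "They are located at:\n"
    "70 Blanchard Road\n"
    "Burlington, MA 01803-5100\n"
)

print(has_both(text, "Alpha Software", "70 Blanchard Road"))
print(has_both(text, "Alpha Software", "1 Unknown Street"))
```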

**jhackney** · 01-20-2010, 01:28 PM

Re: Help with FileFind.GREP() Expression

It's nice to hear from you again Gabriel. It has been a while.

When I started this script I was headed down the same path you suggested. After an hour or so I pulled out a calculator to estimate the total number of iterations I would be performing. Assuming approx 1,000 text documents to review and approx 400 search combinations, the script would execute approx 400,000 iterations. Using filefind.grep() I can effectively combine the 1,000 text documents into 1 search target and cut the iterations down to 400. The grep method appears to be the most efficient way to handle my problem, especially when you consider you don't have to actually get any of the text documents before they are processed.

Please correct me if you do not think this is the case or if there is another way for me to keep the processing iterations limited using another method.

**csda1** · 01-20-2010, 01:56 PM

Re: Help with FileFind.GREP() Expression

Originally posted by jhackney

When I started this script I was headed down the same path you suggested. After an hour or so I pulled out a calculator to estimate the total number of iterations I would be performing. Assuming approx 1,000 text documents to review and approx 400 search combinations, the script would execute approx 400,000 iterations. Using filefind.grep() I can effectively combine the 1,000 text documents into 1 search target and cut the iterations down to 400. The grep method appears to be the most efficient way to handle my problem, especially when you consider you don't have to actually get any of the text documents before they are processed.

Please correct me if you do not think this is the case or if there is another way for me to keep the processing iterations limited using another method.

If you are running the search for many criteria, you are paying a high disk or network overhead if you are rereading each file many times (400 in the case you outlined). It would be better to read a file in, process it for the 400 searches you need to do (can you stop searching once you have found a match?), and then move on to the next file.

Depending on the total size of all the text files together, you could even bring them all into main memory in an Alpha 5 variable. If the text is properly indexed by the start and end line of each file, one search could find 1 or more occurrences within the text and identify the file associated with each matching line. This would be fastest, as you are reading the files once and searching for each criterion once.
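A rough sketch of the "read each file once, then run every search against it in memory" loop, written in Python for illustration only. `match_documents`, the file names, and the sample criteria are all hypothetical; in practice the `documents` mapping would be filled by reading each file from disk a single time.

```python
from typing import Dict, List, Tuple

def match_documents(
    documents: Dict[str, str],
    criteria: List[Tuple[str, str]],
) -> List[Tuple[str, str]]:
    """Return (filename, company) for every criterion whose two strings both occur."""
    hits = []
    for filename, text in documents.items():   # each document is visited once
        for company, address in criteria:      # every criterion is tested in memory
            if company in text and address in text:
                hits.append((filename, company))
    return hits

documents = {
    "a.txt": "Acme Corp\nInvoice 17\n12 Main St.\n",
    "b.txt": "Bolt Ltd\n99 Side Rd.\nAlso mentions Acme Corp\n",
}
criteria = [("Acme Corp", "12 Main St."), ("Bolt Ltd", "99 Side Rd.")]

print(match_documents(documents, criteria))
```

The number of comparisons is the same as with repeated greps, but the file I/O drops from once per criterion to once per file.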

The lower level string searching commands and regex are the fastest way to find matching strings.

**G Gabriel** · 01-20-2010, 02:06 PM

Re: Help with FileFind.GREP() Expression

James
Your math is not exactly correct.
Say you have 400 combos of company/address and say you have 1000 documents to search.
The first question really is: could one document have more than one combo?
I am guessing the answer is no. If so, once a combo is found you move on to the next one.
When you start searching a document for these combos, the number of iterations needed to find its match will range from 1 to 400. What is the average? That depends on which combos you search for first. If you sort the combos from most to least common, you could cut the iterations significantly.
Regex will not change that at all, unless these companies have something in common that you can incorporate in your pattern; and even if they do, the addresses cannot possibly have anything in common.
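To make the ordering argument concrete, here is a small Python sketch. It assumes, as guessed above, that each document contains exactly one combo and that searching stops at the first hit. `count_checks` and the toy data (ten documents, three combos) are hypothetical.

```python
from collections import Counter

def count_checks(doc_combos, search_order):
    """Total checks when every document is tested in `search_order`, stopping at its hit."""
    position = {combo: i + 1 for i, combo in enumerate(search_order)}
    return sum(position[combo] for combo in doc_combos)

# Which combo each of ten documents contains: "A" is common, "C" is rare.
doc_combos = ["A"] * 6 + ["B"] * 3 + ["C"] * 1

most_common_first = [combo for combo, _ in Counter(doc_combos).most_common()]
least_common_first = list(reversed(most_common_first))

print(count_checks(doc_combos, most_common_first))   # 6*1 + 3*2 + 1*3 = 15
print(count_checks(doc_combos, least_common_first))  # 6*3 + 3*2 + 1*1 = 25
```

If every combo must be checked against every document regardless of earlier hits, the early exit no longer applies and the count is simply documents times combos.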

**jhackney** · 01-20-2010, 04:26 PM

Re: Help with FileFind.GREP() Expression

I actually got it to work!:D and may have found a v10 bug. If you set the regex option to "I" you must also include the default flag "S". Otherwise, the function will not work even though "S" is the default and the regex option is optional.

I've inserted the code below. I really like the filefind.grep() method. I just completed a search against my test subject of 530 text documents using 422 search combinations of the variables inserted below. The script performed the search and exported the result criteria into a table in approximately 75 seconds. That's about 0.18 seconds per search combination. I'm impressed.

The POSIX regular expression I ended up using (a) finds the first string I am searching for, then (b) matches all text to the end of the file regardless of the number of lines, special characters, and word characters included, then (c) searches backwards until it finds the second string I am looking for. If both strings are matched, the file name is returned.

For those of you who enjoy working with expressions as much as I do, I use a $40 program called RegexBuddy to create my search expressions. I use this instead of the UI included in Alpha and highly recommend it to anyone. In addition to helping with the expression, it has a built-in grep routine that will evaluate the expression against a file or directory of files. It also has a real-time testing UI that displays the results as you type against a test string.

Code:

'dims and preceding code not shown'
coname = word(linetxt,1,",")
conadd = word(linetxt, 3, ",")
coid = word(linetxt,4,",")
sexpr = "\("+coname+"\)[[:cntrl:][:print:]]*\("+conadd+"\)"
expression_result = (filefind.grep("*.txt", sexpr , 0 , "$(Filename), $(stop)" + coname + ", " + coid + crlf(),"SI"))
If (IsNull(expression_result) = .f.) then
  tbl.populate_from_string("file_name, co_name, co_id", crlf(), expression_result)
end if
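For comparison, the "first string, then anything across any number of lines, then second string" idea can be sketched in Python. This is an illustration, not the Xbasic above: `build_either_order_pattern` and the sample documents are hypothetical, the search strings are escaped here (the code above does not do that), and an alternation is added to cover the reverse order mentioned earlier in the thread.

```python
import re

def build_either_order_pattern(name: str, address: str) -> re.Pattern:
    """Match name ... address or address ... name, across any number of lines."""
    n, a = re.escape(name), re.escape(address)
    # DOTALL lets '.' cross line breaks; the greedy '.*' first runs to the end
    # of the text and then backtracks until the second string is found.
    return re.compile(f"{n}.*{a}|{a}.*{n}", re.DOTALL | re.IGNORECASE)

documents = {
    "name_first.txt": "Acme Corp\nInvoice 17\n12 Main St.\n",
    "address_first.txt": "12 Main St.\r\nAttn: Acme Corp\r\n",
    "name_only.txt": "Acme Corp\nno address here\n",
}

pattern = build_either_order_pattern("Acme Corp", "12 Main St.")
matching_files = [name for name, text in documents.items() if pattern.search(text)]
print(matching_files)
```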

Thanks to all for your help.

**G Gabriel** · 01-20-2010, 05:01 PM

Re: Help with FileFind.GREP() Expression

All is well that ends well. Just for the sake of offering another alternative to those who are averse to regex:
Along with what was mentioned before, you could use other functions such as at() or scansmatch().

**jhackney** · 01-20-2010, 05:10 PM

Re: Help with FileFind.GREP() Expression

IRA, I like your idea. Even though I have already fixed my issue, I would like to try what you suggested with the single file properly indexed. Is it correct to assume the process would look something like this:

1. Read all txt documents to 1 file. Will StringScanner do?
2. As each txt document is read into the file, write the file name and its corresponding placement (line numbers) to a separate list to establish an index.
3. Start search on 1 of 422 search combinations and as each match is found:

3a. Return filename based on index
3b. Return company name based on match
3c. Skip to start of next file based on index
3d. repeat until eof.

4. Repeat until end of 422.
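The indexing part of the steps above (1, 2, and 3a) might look roughly like this in Python. It is a sketch under assumptions, not Xbasic: `combine` and `file_for_line` are hypothetical helpers, the documents are already in memory, and the "index" is simply a list of the line number at which each file starts.

```python
import bisect

def combine(documents):
    """Concatenate all documents into one list of lines and record where each starts."""
    lines, starts, names = [], [], []
    for name, text in documents.items():
        starts.append(len(lines))   # index of this file's first line
        names.append(name)
        lines.extend(text.splitlines())
    return lines, starts, names

def file_for_line(line_no, starts, names):
    """Map a line number in the combined text back to the file it came from."""
    return names[bisect.bisect_right(starts, line_no) - 1]

documents = {
    "a.txt": "Acme Corp\nInvoice 17\n12 Main St.",
    "b.txt": "Bolt Ltd\n99 Side Rd.",
}
lines, starts, names = combine(documents)

# Find every line holding an address, then look up its file through the index.
hits = [file_for_line(i, starts, names) for i, line in enumerate(lines) if "99 Side Rd." in line]
print(starts, hits)
```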

I like it. Any suggestions? I will let you know what the results are. Thanks for your help.

**jhackney** · 01-20-2010, 05:49 PM

Re: Help with FileFind.GREP() Expression

Gabriel, sorry for the late reply. You can tell I don't post much by the lingo I use. Also, my background is in forensic auditing, which has its own dictionary. What I meant by iterations was the minimum number of times a search string hits the target. I use this as a baseline to estimate processing time before the script is compiled. In this case, each of the 1,000 documents must have at least one match from the search criteria. Unfortunately, if you use just the company name, you may get more than one match, as the documents occasionally include the names of other companies. Address is used as the unique identifying criterion because it contains digits and letters and has the highest probability of returning a unique match from the data we have. With this in mind, each search combination must be compared against each document. Even if a document has been matched, I am required to check it against the other search combinations to ensure that our logic is conclusive. Therefore, 400 searches compared against 1,000 documents will have a minimum of 400,000 iterations. If you can combine the documents into one, or vice versa, without jeopardizing the identity or flow of the original document, your iterations would drop significantly.

thanks for your help.

**csda1** · 01-20-2010, 06:25 PM

Re: Help with FileFind.GREP() Expression

Hi James,

First, I don't think the "S" option being the regex default is an A5V10 bug, but if you have an example that highlights a difference, that might be helpful. Typically it's a regex expression issue, not the option.

Originally posted by jhackney

IRA, I like your idea. Even though I have already fixed my issue, I would like to try what you suggested with the single file properly indexed.

I've processed 65 MB files with no problems. However, some string functions get really slow as the string length increases. This may make it necessary to search smaller pieces, but you should test.

Originally posted by jhackney

Is it correct to assume the process would look something like this:

1. Read all txt documents to 1 file. Will StringScanner do?

I don't recommend StringScanner at all. I don't even think Alpha uses it internally. It is slow compared to other methods (but may have been fast in earlier versions that had no regex, etc.).

Originally posted by jhackney

2. As each txt document is read into the file, write the file name and its corresponding placement (line numbers) to a separate list to establish an index.
3. Start search on 1 of 422 search combinations and as each match is found:

3a. Return filename based on index
3b. Return company name based on match
3c. Skip to start of next file based on index
3d. repeat until eof.

4. Repeat until end of 422.

I like it. Any suggestions? I will let you know what the results are. Thanks for your help.

For #3, I would recommend doing a regex_grep of the entire file. If you place a unique delimiter (e.g. chr(28); see my tips here) between each file that you read, you can search for the delimiter, any text, the company, any text, and the address, and return the line number of the match (which will be the line of the delimiter). Then convert the line number back to the name of the file: use WORDAT() to find the line number in a list of file starting line numbers, then use WORD() to extract the corresponding filename from a filename list, and append the company name you were searching for to each item.
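A rough Python sketch of the delimiter approach, for illustration only. The helper names are hypothetical, the documents are assumed to be in memory already, and the list lookup at the end merely stands in for the WORDAT()/WORD() step; it is not a translation of those functions.

```python
import re

DELIM = chr(28)  # a character assumed never to occur inside the documents

def combine_with_delimiters(documents):
    """Join documents, each preceded by a delimiter line; record each delimiter's line number."""
    parts, names, delimiter_lines = [], [], []
    line_no = 0
    for name, text in documents.items():
        names.append(name)
        delimiter_lines.append(line_no)
        block = DELIM + "\n" + text.rstrip("\n") + "\n"
        parts.append(block)
        line_no += block.count("\n")
    return "".join(parts), names, delimiter_lines

def files_matching(combined, names, delimiter_lines, company, address):
    """Files whose segment holds the company followed, anywhere later, by the address."""
    gap = f"[^{DELIM}]*"  # any text, including line breaks, but never past a delimiter
    pattern = re.compile(DELIM + gap + re.escape(company) + gap + re.escape(address))
    found = []
    for match in pattern.finditer(combined):
        line = combined.count("\n", 0, match.start())  # line number of the delimiter
        found.append(names[delimiter_lines.index(line)])
    return found

documents = {
    "a.txt": "Acme Corp\nInvoice 17\n12 Main St.",
    "b.txt": "Bolt Ltd\n99 Side Rd.",
}
combined, names, delimiter_lines = combine_with_delimiters(documents)
print(files_matching(combined, names, delimiter_lines, "Bolt Ltd", "99 Side Rd."))
```

Because the gap can never cross a delimiter, a company name in one file cannot pair up with an address in the next. This sketch only handles the company appearing before the address; the reverse order would need a second pattern.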
