
Announcement


The Alpha Software Forum Participation Guidelines

The Alpha Software Forum is a free forum created for the Alpha Software Developer Community to ask for help, exchange ideas, and share solutions. Alpha Software strives to create an environment where all members of the community can feel safe to participate. In order to ensure the Alpha Software Forum is a place where all feel welcome, forum participants are expected to behave as follows:
  • Be professional in your conduct
  • Be kind to others
  • Be constructive when giving feedback
  • Be open to new ideas and suggestions
  • Stay on topic


Be sure all comments and threads you post are respectful. Posts that contain any of the following content will be considered a violation of your agreement as a member of the Alpha Software Forum Community and will be moderated:
  • Spam.
  • Vulgar language.
  • Quotes from private conversations without permission, including pricing and other sales related discussions.
  • Personal attacks, insults, or subtle put-downs.
  • Harassment, bullying, threatening, mocking, shaming, or deriding anyone.
  • Sexist, racist, homophobic, transphobic, ableist, or otherwise discriminatory jokes and language.
  • Sexually explicit or violent material, links, or language.
  • Pirated, hacked, or copyright-infringing material.
  • Encouraging of others to engage in the above behaviors.


If a thread or post is found to contain any of the content outlined above, a moderator may choose to take one of the following actions:
  • Remove the Post or Thread - the content is removed from the forum.
  • Place the User in Moderation - all posts and new threads must be approved by a moderator before they are posted.
  • Temporarily Ban the User - user is banned from forum for a period of time.
  • Permanently Ban the User - user is permanently banned from the forum.


Moderators may also rename posts and threads if they are too generic or do not properly reflect the content.

Moderators may move threads if they have been posted in the incorrect forum.

Threads/Posts questioning specific moderator decisions or actions (such as "why was a user banned?") are not allowed and will be removed.

The owners of Alpha Software Corporation (Forum Owner) reserve the right to remove, edit, move, or close any thread for any reason; or ban any forum member without notice, reason, or explanation.

Community members are encouraged to click the "Report Post" icon in the lower left of a given post if they feel the post is in violation of the rules. This will alert the Moderators to take a look.

Alpha Software Corporation may amend the guidelines from time to time and may also vary the procedures it sets out where appropriate in a particular case. Your agreement to comply with the guidelines will be deemed agreement to any changes to them.



Bonus TIPS for Successful Posting

Try a Search First
It is highly recommended that a Search be done on your topic before posting, as many questions have been answered in prior posts. As with any search engine, the shorter the search term, the more "hits" will be returned, but the more specific the search term is, the greater the relevance of those "hits". Searching for "table" might well return every message on the board while "tablesum" would greatly restrict the number of messages returned.

When you do post
First, make sure you are posting your question in the correct forum. For example, if you post an issue regarding Desktop applications on the Mobile & Browser Applications board, not only will your question not be seen by the appropriate audience, it may also be removed or relocated.

The more detail you provide about your problem or question, the more likely someone is to understand your request and be able to help. A sample database with a minimum of records (and its support files, zipped together) will make it much easier to diagnose issues with your application. Screen shots of error messages are especially helpful.

When explaining how to reproduce your problem, please be as detailed as possible. Describe every step, click-by-click and keypress-by-keypress. Otherwise when others try to duplicate your problem, they may do something slightly different and end up with different results.

A note about attachments
You may only attach one file to each message. Attachment file size is limited to 2MB. If you need to include several files, you may do so by zipping them into a single archive.

If you forgot to attach your files to your post, please do NOT create a new thread. Instead, reply to your original message and attach the file there.

When attaching screen shots, it is best to attach an image file (.BMP, .JPG, .GIF, .PNG, etc.) or a zip file of several images, as opposed to a Word document containing the screen shots. Because Word documents are prone to viruses, many message board users will not open your Word file, therefore limiting their ability to help you.

Similarly, if you are uploading a zipped archive, you should simply create a .ZIP file and not a self-extracting .EXE as many users will not run your EXE file.

Xbasic - How to skin a cat, the best way. Matching a tag.


    Xbasic - How to skin a cat, the best way. Matching a tag.

    This is for you xbasic gurus, I guess.

I have two tables (dbf). The first contains only a few fields and a couple hundred lines: basically a term and an associated code that goes with that term.

The second table is about 700,000+ items with many fields, but basically one field could possibly contain a term from the first table, and if so I need to insert the associated code for that term into another field. Just to make things interesting, there can be several items within the target field, so the only way I have found to isolate terms is to use "search for word(s)" in the Query Genie (e.g. containsi() in xbasic). I need to rip thru all of the codes file against every line item in the master file.

So I thought I would just run the Query Genie, enter a couple of terms, and look at the generated xbasic to see what the most efficient method to scan the file would be. Imagine that... you can capture the xbasic for the filter, but I cannot figure out how to capture the complete xbasic for the process as applied against the entire table. I understand that it would return a result that would then have to be read thru anyway, and I do not need to see a browse or form.

So next I thought about reading the terms into an array and spinning thru the array per master-file record, doing it all in one process (upon a term match, grab the associated code and stuff'n go)... but wondered if there was a more efficient method than running a couple hundred items against 700,000+ lines.

Anyone care to share their ideas based on the above? This has to be done quarterly, worst case monthly, and even once is one time too many to think about doing manually, in any case.

    Regards,
    Keith
    Keith Weatherhead
    Discus Data, Ltd
    [email protected]

    #2
    Re: Xbasic - How to skin a cat, the best way. Matching a tag.

Sounds like StringDictionary would be most suitable.

The StringDictionary is much faster than collections when used for very large lists. It supports only character data.
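For anyone unfamiliar with it: a StringDictionary is, loosely, a fast string-keyed lookup structure. A rough analogy in Python (purely illustrative; the names are made up and the thread's own language is Xbasic):

```python
# Rough analogy: exact-key lookups in a string-keyed map are effectively
# constant time, which is why it beats scanning a collection for very
# large lists. All values here are invented.
codes = {"TermA": "code1", "TermB": "code2", "TermC": "code3"}

print(codes.get("TermB"))               # -> code2

# The limitation raised later in the thread: a dictionary matches whole
# keys only, so it cannot find "TermB" buried inside a longer field.
print(codes.get("term1 TermB term3"))   # -> None
```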



      #3
      Re: Xbasic - How to skin a cat, the best way. Matching a tag.

      Ray,

      Thanx for the thought... after looking at the StringDictionary function I am not sure it would do what I need...

      The Master File, field to search, might look like "term1 term2 term3 term4, ..."

      The Code file is
      TermA code1
      TermB code2
      TermC code3

      etc

      TermsA/B/C *could* match Term1/2/3 or may not...

so there are multiple possibilities, whereas StringDictionary appears to want to check a single value against a list. That is why I had to use containsi() (the "contains the word(s)" option) to find matches, although I would have to test the terms one at a time to set the proper CODE value. That is why I came up with a two-dimensional array (term, code) and thought that picking the "term", running it thru containsi(), and upon a match setting "code" might be the way to address the situation. I admit I have never used (nor tried) StringDictionary before, so maybe I am not fully understanding its capabilities.
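The (term, code) scan described above could be sketched like this (Python used purely for illustration; containsi() is mimicked with a lower-cased substring test, and all names are invented):

```python
# Sketch of the (term, code) array approach: for each master record,
# scan the term list and collect the code of any term found in the field.
term_codes = [("TermA", "code1"), ("TermB", "code2"), ("TermC", "code3")]

def containsi(haystack, needle):
    """Case-insensitive substring test, like Xbasic's containsi()."""
    return needle.lower() in haystack.lower()

def match_codes(field):
    """Return the codes of every term found in the field."""
    return [code for term, code in term_codes if containsi(field, term)]

print(match_codes("term1 TERMB term3"))   # -> ['code2']
```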

      Keith
      Keith Weatherhead
      Discus Data, Ltd
      [email protected]



        #4
        Re: Xbasic - How to skin a cat, the best way. Matching a tag.

        Wouldn't this be easier if the field with multiple terms was "split" into several fields, each containing a single term? After adding new fields this could be done with an update operation.



          #5
          Re: Xbasic - How to skin a cat, the best way. Matching a tag.

          Tom,

Why certainly, except that the original file has to be maintained, unfortunately, and checked for changes going forward on at least a quarterly basis. So I would have to disassemble, update, and reassemble... I was hoping not to have to do that. There is a final validation phase I have left out of the equation so far, which needs to be done in the updated/reassembled database, just to add a bit of salt to the wound. But that is for another rainy day.

          Keith
          Keith Weatherhead
          Discus Data, Ltd
          [email protected]



            #6
            Re: Xbasic - How to skin a cat, the best way. Matching a tag.

            Hi Keith,

Going through the table multiple times is a very inefficient method. You want to pass through the data exactly once.

Basically, you load up a string or array, and create a function to use in an update operation on the table (it may be possible to create an expanded expression and forgo the function). The function would temporarily change the field to CRLF()-delimited words, load up two CRLF-delimited strings from the 1st table with the terms and corresponding codes, then use strtran_multi() or stritran_multi(), then change the result back from the CRLF-delimited form to the format for the field.
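A rough sketch of this single-pass multi-replace idea (Python for illustration; the helper below only mimics the behavior described for stritran_multi(), and the term/code lists are invented):

```python
import re

# Sketch of the multi-replace approach: one operation per record replaces
# every known term with its corresponding code.
terms = ["TermA", "TermB", "TermC"]
codes = ["code1", "code2", "code3"]

def multi_replace(text, from_list, to_list):
    """Case-insensitive multi-replace: each from_list[i] becomes
    to_list[i]. Loosely mimics the stritran_multi() behavior described."""
    for frm, to in zip(from_list, to_list):
        text = re.sub(re.escape(frm), to, text, flags=re.IGNORECASE)
    return text

print(multi_replace("TermB TermX TERMA", terms, codes))
# -> code2 TermX code1
```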
            Regards,

            Ira J. Perlow
            Computer Systems Design


            CSDA A5 Products
            New - Free CSDA DiagInfo - v1.39, 30 Apr 2013
            CSDA Barcode Functions

            CSDA Code Utility
            CSDA Screen Capture




              #7
              Re: Xbasic - How to skin a cat, the best way. Matching a tag.

Would like to help you skin this cat... unfortunately, I don't see the cat! I have no idea what the question is.
I applaud those who responded; presumably they understood what I failed to understand due to my limited knowledge of the animal kingdom, more specifically felines...
I vaguely see one table that has a field, perhaps a memo or an RTF field, that has the word "term" associated with some code, and another table with a field that has the same word "Term", except that instead of "Term 1" as in the first table it is "Term A" in the second... or maybe the correlation does not hold... and you want to associate it with the code from the first table... something along these lines.

              Perhaps an example or a zip file might help.



                #8
                Re: Xbasic - How to skin a cat, the best way. Matching a tag.

I think it is meant to indicate that multiple matches (TERMA, TERMB) could exist in the string "TERM1TERM2TERMA...TERMB...", in which case StringDictionary would not be suitable.
I did, however, test stritran with three comparisons against a 180-character string over 1 million iterations (2 million changes), as follows:
                Code:
?frstr   'from strings (CRLF-delimited)
= asdfgh
kloiuy
sdfGhJ
?tostr   'to strings
= marios
calugi
MASTER
chstr="asdfghjkloiuytrewqasdfghjkloiuytrewqASDFGHJKL;OPIUYTREWQASDFGHJKLOIUYTREWWSDFGHJKLOIUYTREsdfghjkloiuytreDFGYTREWEDRFGTHJKOIUYTREDFGHJKOIUYTREDFGHHGFDXCVBNMJHGFDxxcvbnmjhgfdcvbnmhhgfd"
nstr=chstr   'initialize the working string
stt=now()
for i=1 to 1000000
    nstr=stritran_multi(nstr,frstr,tostr)   'first change
    nstr=stritran_multi(nstr,tostr,frstr)   'change back
next
ste=now()-stt
?ste
= 79.473 seconds
Second run: 66 seconds.
Geeky, I know, but I was interested in the time scale. Pretty fast for the string operation itself; add to that some file read time.
                Last edited by Ray in Capetown; 08-23-2013, 05:35 AM.



                  #9
                  Re: Xbasic - How to skin a cat, the best way. Matching a tag.

Hi all, sorry for the delay; I was out of the country and yesterday was the return trip.

Ira, yes, I agree: one time thru each file, by all means, however memory will get a thrashing, lol. My thought: read the terms/codes into an array, then read the master file and on each record spin thru the array of terms/codes to see if a match exists; if so, store the code in that master-file record. ONCE thru each file!

Thanx for the comments everyone. To answer a few other questions/comments... it IS possible for multiple matches to occur, because the *single* field in the master file *could* contain multiple term keys. That would not be my preference, but that is the task at hand, not of my doing originally. Any record with multiple matches has to be investigated as to which CODE is appropriate for the record, so just because I find a match, I cannot stop running the TERMS at that point until I exhaust the list. I will just flag that multiple hits occurred, and those items will be the first to be reviewed to determine their proper coding.
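In outline, the one-pass-with-flagging plan might look like this (Python purely for illustration; the real job would be Xbasic against DBF tables, and these term/code values are invented):

```python
# Sketch: one pass through the master records, collecting every term hit.
# A record with more than one hit is flagged for manual review.
term_codes = [("TermA", "code1"), ("TermB", "code2"), ("TermC", "code3")]

def tag_record(field_value):
    """Return all matching codes plus a review flag for multiple hits."""
    hits = [code for term, code in term_codes
            if term.lower() in field_value.lower()]
    return {"codes": hits, "review": len(hits) > 1}

print(tag_record("xxx TERMB yyy"))
# -> {'codes': ['code2'], 'review': False}
print(tag_record("TermA stuff termc"))
# -> {'codes': ['code1', 'code3'], 'review': True}
```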

                  Keith
                  Keith Weatherhead
                  Discus Data, Ltd
                  [email protected]



                    #10
                    Re: Xbasic - How to skin a cat, the best way. Matching a tag.

                    What is the structure of each table?
                    What types of fields are these?
                    Are there any common fields between the tables?
                    What "code" are you referring to?
                    A code as in an item code? or an xbasic code?
                    On the average, how many characters are these "terms"? Do they fit any particular pattern? Ditto for the "codes"?
                    Perhaps others have answers to all of the above..



                      #11
                      Re: Xbasic - How to skin a cat, the best way. Matching a tag.

                      Hi Ray,

                      Basically the methodology I suggested. In version 9 and earlier, the stritran_multi() will run much slower, but should still be relatively fast.

                      Keith,
This is the way to do it in an update. Forget arrays; they are a dead end for your needs (in terms of speed). And you must use an update operation to move through the 700k records quickly (vs. looping in xbasic). If you are changing a field used in an index, read my Record Locking 101 tips and pay attention to the appends/updates section.
                      Regards,

                      Ira J. Perlow
                      Computer Systems Design






                        #12
                        Re: Xbasic - How to skin a cat, the best way. Matching a tag.

OK, let me put a better example on the table to help you understand. While I cannot use the specifics of the actual situation due to non-disclosure issues, this is exactly the same issue using things I can talk about. All data in these examples is character based.

I am part of a specialized due-diligence audit team that specializes in asset-based securitization financing of trade receivables. Lots of big words there; what it means is that big companies, Fortune 500 (at a minimum) and bigger, are borrowing using their A/R as collateral on the deal. There are all sorts of issues that have to be contemplated for the banks doing the financing of these deals.

One of the things we have to do is make sure that the company is tagging their government accounts. Why? Well, all I can say is that you cannot perfect a lien over a government entity in the US. You either understand that or not; it does not matter and is immaterial to the rest of the conversation. What is important is that we have to be able to identify all potential governmental customer entities. Some you may not think of, but it works like this...

You have a Customer Master File with things like Account_Number, Customer Name, Address, City, State, Zip, Phone, Customer Type, Credit Limit, etc., yada, yada, yada. Now if the company flags their government accounts, that gives us a nice starting point, but some companies do not flag them, or do not even have a field in the Customer Master to hold that info. That matters not to us, as we still have to quantify potential government accounts.

So how do we go about it? Usually manually, though this project is going to help get us out of manual mode in the future: we look at the Customer Names to try to identify these government accounts. We look for things like: University of, Univ of, U of, City of, Town of, Dept of, AAFES, etc. Now to make things a bit messier, on most deals we have to make a list of things we believe to be Federal entities and Municipal entities, and some deals even make us split Municipal into State and Municipal. If only one person added customers to the Customer Master it would be better, but many companies allow branch-level locations to add customers for their area/region. One such company DOES have a procedure manual detailing how to enter Customer Names in a reasonable manner; however, most of their employees fail at reading comprehension, so we get things like: Army Air Force Exchange Service, AAFES, AAFES, A.A.F.E.S., A.A.F.E. Service... and that is just one sample of the craziness we come across. Another one is: USPO, Post Office, US Post Office, PO of, etc. Think for a second of all the possible naming options that could be employed, not counting pure spelling errors.

So we developed a table, over time, that contains as many of these types of search tags as we could come up with. Now we want to bounce that off the Customer Master's Customer Name field. Could there be multiple hits? YES. Can there be false positives? YES... but we would rather have those than skip an account, as we can easily declassify something tagged wrong, but we do not want to understate, which is a much worse issue. While looking for "City" as in "City of Chicago", we would get "Gas City" and "Party City", which are obviously NOT government entities and would have to be untagged in the final review.

So NOW you should have a better idea of what type of matching I am trying to do: taking a list of tags (in this case potential government entities) and, upon a hit, setting whether it is Federal, State, or Municipal. My original situation is very similar in concept; I just cannot talk about those terms.
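A toy sketch of that tag table and classification (Python for illustration; the tags here are a tiny invented sample, with F = Federal, S = State, M = Municipal):

```python
# Tiny invented sample of the search-tag table: substring -> entity class.
TAGS = [
    ("university of", "S"),
    ("city", "M"),
    ("aafes", "F"),
]

def classify(customer_name):
    """Return every entity class whose tag appears in the name.
    False positives ("Gas City", "Party City") are expected and get
    untagged in the final review."""
    name = customer_name.lower()
    return sorted({cls for tag, cls in TAGS if tag in name})

print(classify("City of Chicago"))   # -> ['M']
print(classify("AAFES Store #12"))   # -> ['F']
print(classify("Gas City Ltd"))      # -> ['M']  (false positive, by design)
print(classify("Acme Plumbing"))     # -> []
```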

                        Keith
                        Keith Weatherhead
                        Discus Data, Ltd
                        [email protected]



                          #13
                          Re: Xbasic - How to skin a cat, the best way. Matching a tag.

Thanks for the education on government covert operations. Thank goodness I am no Fortune 500, or 5000 for that matter! I was more interested in the database technical aspects.
So, table one is the "Customer Master File".
Is that a table you developed? A dbf table? Or one that you obtain (maybe through some covert government venues? The NSA? A guy in the basement of a parking lot? Or a guy in a dark alley?)
IOW, is that table consistent in terms of structure, or is it different from company to company?
Do you need to search multiple fields in that table, or just one?
Also, you said all data is character based.
Are any of these fields a memo or an RTF? Not that it makes too much difference; just a technicality.

Just from what you said so far, you will need a regex function.

What you are trying to do is not much different from someone searching Google having misspelled a word or two, with Google returning any and all possible matches to those words; or someone searching a dictionary having misspelled the word; or, for that matter, validating an email address.
You could go round and round... but with this many variations seen, not to mention the unseen, you will need a regex function. That is one option.

Alternatively, and this will sound crazy at first but it is not: you do things the other way around, meaning:
Say each company has a record in the "Customer Master File". You take the field that you want to search and create a set, so to speak, breaking each word or sentence into a child record that is still linked to the parent record; then it becomes a whole lot easier searching the child table.

Or, alternatively, as you mentioned earlier, in lieu of the child table you create an array. The only problem with that is (and I live in the dark ages of v8) that matching words in an array didn't work well for me... but it might in the newer versions.

You can create a calc field using regex that will turn true if there is a match. The problem with that is simple: it is not doable as-is, because evidently you had to create a table for all possible trigger words, meaning there are a lot of them, which will exceed the expression filter character limit... but you could get around that by writing a UDF that iterates over these different variations.
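For instance, a single pattern can cover several of the AAFES variations mentioned earlier in the thread (a hypothetical sketch in Python; the pattern is illustrative, not exhaustive):

```python
import re

# Hypothetical pattern covering a few of the AAFES spellings seen in the
# thread: "AAFES", "A.A.F.E.S.", "A.A.F.E. Service",
# "Army Air Force Exchange Service".
AAFES_RE = re.compile(
    r"\b(a\.?a\.?f\.?e\.?(s\.?|\s*service)"
    r"|army\s+air\s+force\s+exchange\s+service)",
    re.IGNORECASE,
)

for name in ["AAFES", "A.A.F.E.S.", "A.A.F.E. Service",
             "Army Air Force Exchange Service", "Acme Fencing"]:
    print(name, "->", bool(AAFES_RE.search(name)))
```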

                          P.S.
How do you suppose Larry Page got to be so rich?
BTW, it wasn't he who created these text searches. I remember using an application 30 years or so ago, and I think the developer was a gentleman by the name of Nelson, who back then pioneered these searches.



                            #14
                            Re: Xbasic - How to skin a cat, the best way. Matching a tag.

                            Hi Keith,

This is all very doable by just taking the source input and then using an update operation to fill the field (or another field) with 1 or more tags (F, S, M) based upon matching the input. If you are processing a lot of large tables, or reprocessing tables repeatedly, then speed is very important. Regex, stritran, and other built-in functions can handle this, but coding and methodology are very important. I have a lot of items on my Tips page that discuss speed and function choices. One other thing: if you are only looking for a finite number of tags (F, S, M), there are methods that might be faster with that small number than with an indeterminate number of tags.

As an example, a recent client had "optimized" the best that they could (and they are a very experienced A5 developer), but I used entirely different (non-obvious) methods to process the data and increased speed 7 times, from 70 minutes to 10 minutes. For another client, 87 times for their code. The point is that if you need the best possible speed, it is difficult to get without a lot of speed-development experience under your belt.

Small differences can have big effects. E.g., there are some string functions (in some A5 versions) that work fast under 2K characters but slow down for larger strings. In most, but not all, cases later A5 versions of functions are the same or faster than older versions, but there are exceptions.

Another aspect of your problem is that there are ways to eliminate other false positives, e.g. taking the first letters of the words (or the initials, if portions are initials) and matching those initials, or looking for leading, middle, or trailing matches for certain portions of the name.
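The initials trick might look like this (Python, hypothetical helper; it reduces a multi-word name to its first letters so it can be compared against an abbreviation):

```python
def initials(name):
    """First letters of each word, uppercased, punctuation stripped:
    'Army Air Force Exchange Service' reduces to 'AAFES'."""
    words = [w.strip(".,") for w in name.split()]
    return "".join(w[0].upper() for w in words if w)

full = initials("Army Air Force Exchange Service")
abbr = "A.A.F.E.S.".replace(".", "")
print(full, abbr, full == abbr)   # -> AAFES AAFES True
```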

So if and when this exceeds your abilities, and assuming you actually have a budget (not always the case for many self-taught A5 developers), consider hiring an expert in the area. I see many people on the board looking to learn, and they ask a lot of questions, but they seldom get the fastest solution. http://www.iadn.com lists a wide variety of A5 developers, including myself, who can often help.
                            Regards,

                            Ira J. Perlow
                            Computer Systems Design






                              #15
                              Re: Xbasic - How to skin a cat, the best way. Matching a tag.

We are not government, nor are we covert... neither are our customers, which is why we have to determine just how much business they do with government entities, because there are special rules when it comes to dealing with them. If the percentage of the customer's business is too heavily government-weighted, these programs punish them for being that concentrated in that type of business. Not my rules; we are doing an "auditing type" of function and have certain rules that we have to play by and that govern our procedures. And no, we are not CPAs either.

Unfortunately, the files are different from company to company and can be different from one division to another in the same company. What really stinks is companies with multiple customer files (for different divisions) that contain common customers with no linkage between them (customer numbers are different between divisions), and yes, there are BIG companies that are that disjointed!!! For simplicity's sake I almost always use DBF files, and the fields are either Character or Numeric (up to 18,2) in size. Many fields that appear to be numeric in these large systems would require edit formats or similar, since they like to maintain leading zeros; it is easier for me to just import them as character, so I do that all the time. One of the common reports we like to get our hands on out of SAP is the FBL5N report: A/R transactions for the period you choose, open or closed items. But there are over 100 fields that can be dumped, and EVERYONE gives us a different layout based on what they feel is appropriate, so even the same report varies from one SAP-based company to another.

I do not create any of the files; I usually get a dump in CSV, TSV, fixed-length records, or Excel, depending on the size and the level of the user gathering the info for us. If I can do what I need in Excel, then I usually just do so; however, when I end up with 20,000 lines and up, I import into A5 and use a bit more power to do the tasks required. I wear the heck out of the Query Genie for many things because it is just too easy to use and gets the job done. Then there are projects like this that are not reasonable in any kind of manual mode. I have to customize scripts I have developed over and over all the time. I try, where possible, to rename field headers/names, as the script does care about the field order, and I have sometimes imported into a somewhat standardized field layout. Working with customers using SAP makes it a bit easier, as their reports are the most standard, but we deal with SAP, Oracle, PeopleSoft, JD Edwards, DB2, AS/400, and a slug of legacy systems, which, as you might guess, only resemble each other in file content naming (e.g. "Customer File"); layout, lengths, and order of fields vary exponentially.

Well, we are not there to correct the spelling of records in their customer file... if we see something obvious, we will, out of courtesy, inform them so they can correct it. Some customers like and appreciate the heads-up; others are offended by having it pointed out. So for spelling errors we tread lightly until we know how they respond. If we miss an account due to a true spelling error, so be it. We go to great lengths to allow for styles of input and call it good at that point in time. We usually find many errors even at companies that make an effort to maintain the information, which some in reality do not at all. We are not hired to sit there for weeks doing a line-by-line audit; in fact, the longest we spend on-site is a week, and we have a ton of procedures to get done in that time frame. Most places it is 2 to 3 days, in and out. These kinds of tasks are easily accomplished at our offices, but even still we do not have so little to do that we can make a government project out of it, all puns intended!!!

I keep going back to my original array idea: read all the terms and codes into an array, then start a read loop on the master file and, for each record, spin through the items in the array and do a containsi() against the Customer Name field. Upon a match, store the associated code for the term getting the hit. Each file is only read one time, and it will be compute-bound for the loop of array items, with the compute burden broken up by I/O operations. If it runs for an hour, so be it; this isn't done daily or even multiple times. When I get the records tagged, I need to export them for review and correction by the customer company, and then we move on to the next thing. I have not really heard a simpler or more straightforward approach yet. While I want to make it reasonably efficient, I also want it easily modified, as I was just told today I will get a chance to try it against a much bigger data set next month; their customer file is supposed to have in excess of 7.5 million customers... might need to get an i7 notebook first, lol!!

                              Keith
                              Keith Weatherhead
                              Discus Data, Ltd
                              [email protected]

