Alpha Software Mobile Development Tools:   Alpha Anywhere    |   Alpha TransForm subscribe to our YouTube Channel  Follow Us on LinkedIn  Follow Us on Twitter  Follow Us on Facebook

Announcement

Collapse

The Alpha Software Forum Participation Guidelines

The Alpha Software Forum is a free forum created for Alpha Software Developer Community to ask for help, exchange ideas, and share solutions. Alpha Software strives to create an environment where all members of the community can feel safe to participate. In order to ensure the Alpha Software Forum is a place where all feel welcome, forum participants are expected to behave as follows:
  • Be professional in your conduct
  • Be kind to others
  • Be constructive when giving feedback
  • Be open to new ideas and suggestions
  • Stay on topic


Be sure all comments and threads you post are respectful. Posts that contain any of the following content will be considered a violation of your agreement as a member of the Alpha Software Forum Community and will be moderated:
  • Spam.
  • Vulgar language.
  • Quotes from private conversations without permission, including pricing and other sales related discussions.
  • Personal attacks, insults, or subtle put-downs.
  • Harassment, bullying, threatening, mocking, shaming, or deriding anyone.
  • Sexist, racist, homophobic, transphobic, ableist, or otherwise discriminatory jokes and language.
  • Sexually explicit or violent material, links, or language.
  • Pirated, hacked, or copyright-infringing material.
  • Encouraging of others to engage in the above behaviors.


If a thread or post is found to contain any of the content outlined above, a moderator may choose to take one of the following actions:
  • Remove the Post or Thread - the content is removed from the forum.
  • Place the User in Moderation - all posts and new threads must be approved by a moderator before they are posted.
  • Temporarily Ban the User - user is banned from forum for a period of time.
  • Permanently Ban the User - user is permanently banned from the forum.


Moderators may also rename posts and threads if they are too generic or do not property reflect the content.

Moderators may move threads if they have been posted in the incorrect forum.

Threads/Posts questioning specific moderator decisions or actions (such as "why was a user banned?") are not allowed and will be removed.

The owners of Alpha Software Corporation (Forum Owner) reserve the right to remove, edit, move, or close any thread for any reason; or ban any forum member without notice, reason, or explanation.

Community members are encouraged to click the "Report Post" icon in the lower left of a given post if they feel the post is in violation of the rules. This will alert the Moderators to take a look.

Alpha Software Corporation may amend the guidelines from time to time and may also vary the procedures it sets out where appropriate in a particular case. Your agreement to comply with the guidelines will be deemed agreement to any changes to it.



Bonus TIPS for Successful Posting

Try a Search First
It is highly recommended that a Search be done on your topic before posting, as many questions have been answered in prior posts. As with any search engine, the shorter the search term, the more "hits" will be returned, but the more specific the search term is, the greater the relevance of those "hits". Searching for "table" might well return every message on the board while "tablesum" would greatly restrict the number of messages returned.

When you do post
First, make sure you are posting your question in the correct forum. For example, if you post an issue regarding Desktop applications on the Mobile & Browser Applications board , not only will your question not be seen by the appropriate audience, it may also be removed or relocated.

The more detail you provide about your problem or question, the more likely someone is to understand your request and be able to help. A sample database with a minimum of records (and its support files, zipped together) will make it much easier to diagnose issues with your application. Screen shots of error messages are especially helpful.

When explaining how to reproduce your problem, please be as detailed as possible. Describe every step, click-by-click and keypress-by-keypress. Otherwise when others try to duplicate your problem, they may do something slightly different and end up with different results.

A note about attachments
You may only attach one file to each message. Attachment file size is limited to 2MB. If you need to include several files, you may do so by zipping them into a single archive.

If you forgot to attach your files to your post, please do NOT create a new thread. Instead, reply to your original message and attach the file there.

When attaching screen shots, it is best to attach an image file (.BMP, .JPG, .GIF, .PNG, etc.) or a zip file of several images, as opposed to a Word document containing the screen shots. Because Word documents are prone to viruses, many message board users will not open your Word file, therefore limiting their ability to help you.

Similarly, if you are uploading a zipped archive, you should simply create a .ZIP file and not a self-extracting .EXE as many users will not run your EXE file.
See more
See less

Finding dupe records with incorrect id

Collapse
X
 
  • Filter
  • Time
  • Show
Clear All
new posts

    Finding dupe records with incorrect id

    What would be the best way to identify records that are duplicated when the id has one or two digits wrong in the bad record? In this case the SSN (9 digits). There are some with 2 or 3 digits transposed. Most are 1 digit wrong. Just a few with 2 digits wrong.

    The fields I have include: Last, First, Middle, Suffix, DOB, and SSN.

    Sometimes the name is also misspelled so doing a operation including last and first name would not be 100%. Especially, if the person gets married and changes their last name.

    Their system is DOS based and I believe because they don't properly check to see if the data is already inputted they have multiple records of the same person. They might have entered the data a year or two ago or even a month ago and then reenter the data again and I believe as long as the SSN is different it will be allowed.

    I am looking for a good way to get the data cleaned up and then address the issue of best practice of entering data.

    #2
    Re: Finding dupe records with incorrect id

    I would construct a query form with an embedded browse of the fields I need to search and have the column titles sort the records. Then you can query your records by pulling those fields you want to search in above the browse. If you put in enough of the data you need to get a subset to check in query by form mode, it will make the task a bit easier.
    Robin

    Discernment is not needed in things that differ, but in those things that appear to be the same. - Miles Sanford

    Comment


      #3
      Re: Finding dupe records with incorrect id

      My favourite subject, and I could go on about this for days discussing the transposition of Middle East and Far East names etc.
      My approach is to construct a field in the table and make it a Calculated Field. ( I have made a few assumptions that DoB is in Date format and SSN is numeric).

      The content would be;
      Trim(Last)+Trim(First)+Trim(Middle)+Trim(Suffix)+DTOC(DOB)+STR(SSN)
      Recalculate all records to populate the field.
      Sort the Calculated Field and examine the records.

      The problem you will come across identifying dupes is that DoB may empty, may be in the future, or before 1900,
      First and Last and DoB might be correct but the SSN being different will mean that you cannot use the inbuilt Mark Duplicates effectively.
      First and Last may have been transposed.

      If you have an address, then there is often something in that which can be used. People think this is an easy task but it really isn't. Not only do you have to get the data cleansed, but the volume can be awesome.
      I had a team of 29 people cleansing duplicate data for 6 months in a Social Care system.

      In the future, Dr Wayne's approach to handling potential duplicates is one way of doing it, but users will always find a way around.
      My preference is to force a lookup/find when entering a new record, and if you keep a tally of who adds dupes, (using the Audit Trail feature or your own process), I suggest you run a weekly check and return the list to the data entry people and get them cleaned up. The 29 people I had to use were dealing with a backlog of thousands so the problem was too big. Weekly cleansing usually means just a few need to be resolved.
      See our Hybrid Option here;
      https://hybridapps.example-software.com/


      Apologies to anyone I haven't managed to upset yet.
      You are held in a queue and I will get to you soon.

      Comment


        #4
        Re: Finding dupe records with incorrect id

        Hi Gary,

        As Ted implied, it's a tough problem. Depending upon all your data, what their relationships are and total number of records involved, I would suggest some of the following methods.

        1st, normalize all addresses to a standard form. This means effectively, in the US, getting the full 11 digit zipcode (last 2 digits are last digits of address number). This potentially groups records by location (physical building). Although, it is possible for people to move, it catches a lot of issues.

        2nd, normalize the DOB (Birthdate) by checking for reasonableness. E.g, all birthdates should be prior to the date the record was created (assuming you have a create date), and depending upon the data you are collecting, it might always be even older than the create date (e.g. for adults, an age might be assumed to be at least 18 years old when the record was created. It might also be assumed that the age of a person would always be between the age of 0 and 110 years on creation of the record. You also might be able to check for any birthdates that are prior to to a specific date, say 01/01/1960 because of the type of people that are in the database. You can also validate that all birthdates are never in the future from the create date. However, you also have to potentially allow for blank dates if permitted in your checking.

        3rd, all SSN numbers should be all numbers and have no alphabetic or special characters in them (maybe allowing spaces or dashes depending upon your format).

        Finally, make sure all character fields do not have leading spaces.

        Once those items are accomplished, the best way to dedup a database with your data would be to create a browse (fairly large) with all relevant fields that would allow viewing for dedup issues. You do this best by various sorts assuming some data is correct, and then looking at nearby rows. It requires multiple passes through the data. The sorts would be like this
        upper(cdate(DOB)+Left(ltrim(LAST),1))
        upper(cdate(DOB)+Left(ltrim(LAST),2))
        upper(cdate(DOB)+Left(ltrim(LAST),3))

        upper(SUBSTR(alltrim(SSN),1,3)+alltrim(LAST))
        upper(SUBSTR(alltrim(SSN),2,3)+alltrim(LAST))
        upper(SUBSTR(alltrim(SSN),3,3)+alltrim(LAST))
        upper(SUBSTR(alltrim(SSN),4,3)+alltrim(LAST))
        upper(SUBSTR(alltrim(SSN),5,3)+alltrim(LAST))
        upper(SUBSTR(alltrim(SSN),6,3)+alltrim(LAST))
        upper(SUBSTR(alltrim(SSN),7,3)+alltrim(LAST))

        upper(SUBSTR(alltrim(SSN),1,3)+alltrim(SSN))
        upper(SUBSTR(alltrim(SSN),2,3)+alltrim(SSN))
        upper(SUBSTR(alltrim(SSN),3,3)+alltrim(SSN))
        upper(SUBSTR(alltrim(SSN),4,3)+alltrim(SSN))
        upper(SUBSTR(alltrim(SSN),5,3)+alltrim(SSN))
        upper(SUBSTR(alltrim(SSN),6,3)+alltrim(SSN))
        upper(SUBSTR(alltrim(SSN),7,3)+alltrim(SSN))

        upper(alltrim(LAST)+alltrim(FIRST))
        upper(alltrim(FIRST)+alltrim(LAST))

        The reason the above works, is that it assumes that for any particular sort, portions of the data must be correct, but would still be enough to be fairly unique. Add that to another piece of good data and it will group the data nearby. In the above, choosing 3 adjacent characters of the SSN is what I showed, but you could do it with more or less of the numbers depending on data size and error rates. If you have extremely large data (e.g. the SSN of everyone in the US), there will be many lines of similar info, no matter what sort you do. So the process gets much more complicated. But for relatively small data (compared to the variability of fields), say 1000000, it does work.

        To kind of do it a bit more automatically, you could output all the record info that has 2 or more records based upon uniqueness of something like below, where x,y and z are numbers that returns a reasonable number of potential dups without too much extraneous records.
        upper(cdate(DOB)+Left(ltrim(LAST),x))
        SUBSTR(alltrim(SSN),x,y)
        upper(Left(ltrim(LAST),z))+SUBSTR(alltrim(SSN),x,y)
        SUBSTR(alltrim(SSN),x,y)+upper(Left(ltrim(LAST),z))

        Also, if you do have location information, some of the above passes should include the full or partial 11 digit zip to group by physical location.

        Hope this helps.
        Last edited by csda1; 04-10-2013, 08:41 AM.
        Regards,

        Ira J. Perlow
        Computer Systems Design


        CSDA A5 Products
        New - Free CSDA DiagInfo - v1.39, 30 Apr 2013
        CSDA Barcode Functions

        CSDA Code Utility
        CSDA Screen Capture


        Comment


          #5
          Re: Finding dupe records with incorrect id

          Thanks all, Ted's with his calculated field and Ira's with his SSN.

          If I had control of their application I would put in a check dupe operation. Unfortunately, they only have 2 available options for software and both are DOS base.

          What I have come up with for now is the following:
          Create two indexes
          XSSN_1) SOUNDEX(LAST) +UPPER(LEFT(FIRST,3))+LEFT(SSN,3)+RIGHT(SSN,4)
          XSSN_2) SOUNDEX(LAST) +UPPER(LEFT(FIRST,3))+RIGHT(SSN,6)
          XSSN_3) SOUNDEX(LAST) +UPPER(LEFT(FIRST,3))+LEFT(SSN,5)

          I used soundex due to the fact that there are some instances with the last name spelled wrong and likely match with slight misspellings. It won't catch new married names or getting the front end of the name wrong.

          I included minimal characters of first name due to the fact that twins are likely to have almost identical SSN's. This probably won't work when data entry might misspell first name. Usually involves wrong vowels, letters repeated, or switched.

          I am working on the right formula for different SSN combo's.

          Then the following update operation with the result going into a field to be used for the report.
          str(dbcount("table","XSSN_1",soundex(LAST) +upper(left(first,3)) +left(SSN,3) +right(SSN,4)) +dbcount("table","XSSN_2",soundex(LAST) +upper(left(first,3))+right(SSN,6)) +dbcount("table","XSSN_3",soundex(LAST) +upper(left(first,3)) +left(SSN,5)) ,1)

          I generate a report for any records that has a result of more than 3 in the field.

          It appears that it captures all or almost all dupes.

          Comment


            #6
            Re: Finding dupe records with incorrect id

            Hi Gary,

            I did forget to mention, but you've obviously thought about, is that people from the same area (and twins as you have stated) typically have more similar SSN #'s at the beginning digits. Thus it is better to look at portions of the later digits as a more randomized number, and thus more likely to be unique enough to provide deduping if error free.

            Be careful with soundex. It's not that great, but can be used as 1 way.

            Also, most of the times, the 1 character of the 1st and last names (sometimes it's actually the 1st phonetic sound) is generally more correct in most tables.

            If information is typed in from a verbal communication (e.g. a call center). But there are also typical spelling and hearing errors that may effect this, e.g. B, D and P are real close, 5 and 9, etc.
            Regards,

            Ira J. Perlow
            Computer Systems Design


            CSDA A5 Products
            New - Free CSDA DiagInfo - v1.39, 30 Apr 2013
            CSDA Barcode Functions

            CSDA Code Utility
            CSDA Screen Capture


            Comment


              #7
              Re: Finding dupe records with incorrect id

              Thanks Ira,

              They don't have to worry about call center setting. But when the person filling out the form does it in script or near script and can be difficult to decipher too. The only time I think they might have the hearing issue is when they call to update their address or phone.

              The main problem is initial entry and then they don't check to see if already in.

              Comment

              Working...
              X