My javascript validated the e-mail addresses, my PHP code validated them why not let mysql have a go at it too?

Here is a view for mysql (don't even think about trying to use this with MySQL < 5.0.1) that will validate email addresses with a select statement, no need for UDF or anything fancy
this is based on code by Narayana Vyas Kondreddi http://vyaskn.tripod.com/handling_email_addresses_in_sql_server.htm

SQL:
  1. CREATE VIEW `invalid_emails` AS SELECT `table_with_email_column`.`email` AS
  2. `invalidemail` FROM `table_with_email_column` WHERE ((locate(_latin1'
  3. ', ltrim(rtrim(`table_with_email_column`.`email`))) <> 0) OR
  4. (LEFT(ltrim(`table_with_email_column`.`email`), 1) = _latin1'@') OR
  5. (RIGHT(rtrim(`table_with_email_column`.`email`), 1) = _latin1'.') OR
  6. ((locate(_latin1'.', `table_with_email_column`.`email`,locate(_latin1'@', `table_with_email_column`.`email`))
  7. - locate(_latin1'@', `table_with_email_column`.`email`)) <= 1) OR
  8. ((length(ltrim(rtrim(`table_with_email_column`.`email`))) -
  9. length(REPLACE(ltrim(rtrim(`table_with_email_column`.`email`)), _latin1'@', _latin1'')))
  10. <> 1) OR
  11. (locate(_latin1'.', reverse(ltrim(rtrim(`table_with_email_column`.`email`)))) <
  12. 3) OR (locate(_latin1'.@', `table_with_email_column`.`email`) <> 0) OR
  13. (locate(_latin1'..', `table_with_email_column`.`email`) <> 0));

So you can do stuff like delete all lines from the table that have an invalid email:

SQL:
  1. DELETE FROM table_with_email_column WHERE `table_with_email_column`.`email` IN (SELECT
  2. invalidemail FROM invalid_emails);

or just select all the valid lines :

SQL:
  1. SELECT FROM table_with_email_column WHERE `table_with_email_column`.`email` NOT IN (SELECT
  2. invalidemail FROM invalid_emails);

Sadly, this view is not updateble ...

Probably this is not at all precise... so I would really like to find a nice test dataset for emails that I can use to validate the validation against the RFC (with 4 letter TLDs and the works....)

3 Responses to “Validate email addresses in Mysql with a pure select”

  1. Vincent Caron Says:

    There’s room for improvement, and I would say it’s quite an understatement.

    The _latin1() conversions are useless since literals are ASCII characters here (dot, at, letters and numbers), I don’t see any reason why it would be there. Since RFC822 mandates a clear ASCII subset for email addresses characters, we don’t even have to check if MySQL is using the right charset context (see set names ...).

    Most importantly, regular expressions were designed to actually match string patterns. They are efficient, especially if they are anchored to the beginning of a string (ie. start with a ^), and they are way much more powerful than a bunch of “locate/trim”. Not to mention the consiseness.

    Something like this will find blatantly invalid email addresses in a user table, with any MySQL 3/4/5:

    SELECT * FROM users WHERE email NOT REGEXP '^[^@]+@[^@]+\.[^@]{2,}$';

    That reads: starts with at least one character which is not @, then a @, then another non-@ non-empty sequence, then a dot, then at least two non-@ characters (ie. TLD) and end of string. This is not RFC822 conformant, this is a “tolerant” filter which checks for obvious insanity. This is what you want most of the time.

    As for triple validation (JS, PHP, MySQL): I see many risks of painful user experience while trying to pass through 3 validators which have a great chance of being not synchronized or at least rely on subtly different regex implementations. Now that we’re Ajax, we should replace client-side validation with a discrete round-trip via PHP and have a single, coherent validation test.

    I’ll just consider the MySQL validation for bulk/batch data, for example to tune a regex and see how many existing users it will tolerate or reject.

  2. Ori Pekelman Says:

    Ok probably you are right and going with REGEXP is better, but the result of this expression is much too permissive on my test dataset. Also it seems slower, my implementation invalidated 10829 rows in the set (0.86 sec), yours 190 rows (1.39 sec) from a total of 136 011 rows. So performance wise the REGEXP solutions seems less optimal.

    The following REGEXP is a bit better I believe (also matches only legal characters) :[A-Z0-9._%-]+@[A-Z0-9.-]+\.[A-Z]{2,4} (matched 263 lines in 1.39 sec). Still the result seems weird. A lot of strings that should not have matched the expression pass through. What is Myql thinking? or is this regexp not corret?

    For the triple validation you are right ofcourse it makes no sense whatsoever. I used this code to rapidly clean large datasets when importing a very messy database I had to import. For normal system use validating with ony server side code is a much better approach.

    BTW the limitation on using my sql code with mysql 4 is just the use of the VIEW the select should work (haven’t tested though).

  3. Ori Pekelman Says:

    Ok, a bit more work (and thank you jean-françois :) )
    And here is a beautiful REGEXP that matches more or less my initial code… Still it would be good to have an RFC perfect test dataset….. (anyone ?)
    [sql]
    SELECT * FROM users WHERE email NOT REGEXP ‘^[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]@[a-zA-Z0-9][a-zA-Z0-9._-]*[a-zA-Z0-9]\.[a-zA-Z]{2,4}$’
    [/sql]
    Performance is still lousy : 11314 rows in set (1.97 sec)

Leave a Reply

Creative Commons License
This work is licensed under a Creative Commons Attribution 2.0 License.