Every major web browser has had a bug filed against it for the "Unicode blindness injection" security vulnerability. In short, Unicode letters can look the same as their ASCII equivalents but lead to a different URL (thereby permitting phishing and man-in-the-middle attacks).
My solution consists of verifying that Unicode glyphs look different from ASCII glyphs (yes, I like the word "glyph"). In my example screenshot, the words in parentheses are entirely ASCII; those preceding them contain a "wrong letter":
- the K in KDE is Cyrillic (U+041A)
- the S in RULES is Cyrillic (U+0405, Dze, which I'm aware doesn't exist in the Russian alphabet, although Macedonian uses it)
- the P in APPLE is also Cyrillic (U+0420)
- the T in MICROSOFT is Greek (U+03A4)
However, it does a poor job of identifying the K. If this is considered useful, I may be inclined to fix it for inclusion in KDE; otherwise, I'll leave it as abandonware, as I do with everything else.
The idea is that if there's a unicode letter, and the "error report" is high enough, you might warn the user prior to visiting the page.
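For anyone who wants to check the four substituted letters listed above without the screenshot, here is a minimal Python sketch using the standard unicodedata module. The SPOOFED table and the non_ascii_letters helper are names made up for this illustration, and which of the two Ps in APPLE was swapped is an assumption. It only lists non-ASCII characters; it does not do the glyph comparison the post describes.

```python
import unicodedata

# Words from the screenshot, each with one substituted letter.
# Assumption: the first P in APPLE is the one that was replaced.
SPOOFED = {
    "KDE": "\u041aDE",
    "RULES": "RULE\u0405",
    "APPLE": "A\u0420PLE",
    "MICROSOFT": "MICROSOF\u03a4",
}

def non_ascii_letters(text):
    """Return (index, code point, Unicode name) for each non-ASCII character."""
    return [
        (i, f"U+{ord(ch):04X}", unicodedata.name(ch, "UNKNOWN"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

for plain, spoofed in SPOOFED.items():
    print(plain, non_ascii_letters(spoofed))
```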
the IDN problem
How well does it work for identifying IDN against IDN? I mean, suppose you have an all-Greek domain name, and you inject a Cyrillic letter in the middle?
Does it catch same-script "attacks", like using "ı" or "ì" instead of "i"? (Think of www.mıcrosoft.com, which does exist.) At the same time, will it report an attack when those very same letters are used in valid domains? Some people do use them, and they give words completely different meanings, which registrars might consider distinct enough to allow registration.
The best solution I've seen so far is to try this:
For instance, in Portuguese, we use a-z plus à, á, ã, â, é, ê, í, ó, õ, ô, ú, ü and ç. Any letter outside that set should trigger a warning for anyone using a Portuguese localisation.
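A rough Python sketch of that per-localisation check, assuming a hand-maintained table of allowed letters. ALLOWED_BY_LOCALE and unexpected_letters are hypothetical names, and only the Portuguese entry from the list above is filled in.

```python
import string
import unicodedata

# a-z plus the accented letters listed above for Portuguese.
PORTUGUESE = set(string.ascii_lowercase) | set("àáãâéêíóõôúüç")

# Hypothetical table; a real one would need one entry per localisation.
ALLOWED_BY_LOCALE = {"pt": PORTUGUESE}

def unexpected_letters(hostname, locale):
    """Return the letters in hostname that the given localisation does not use."""
    allowed = ALLOWED_BY_LOCALE[locale]
    # Normalise to NFC so that "a" + combining tilde compares equal to "ã".
    normalised = unicodedata.normalize("NFC", hostname).lower()
    return sorted({ch for ch in normalised if ch.isalpha() and ch not in allowed})

print(unexpected_letters("www.mıcrosoft.com", "pt"))  # the dotless i is flagged
print(unexpected_letters("www.ação.com.br", "pt"))    # nothing to flag
```

Digits, dots and hyphens are ignored because only letters are checked against the table.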
As for the highlighting of backgrounds, think of visually-impaired people. Think of other programs outside Konqueror, like KMail or Kopete (including KMail's To, Cc and Bcc in Composer!).
colored text?
Hmm, why not just render text that's from different character sets with different colors?
So you would get "https://www.p(a)ypal.com", where the (a) is red or something...
I also like the idea of issuing a warning messagebox when mixed charsets are detected.
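A small Python sketch of the marking idea, assuming "different character set" simply means "not ASCII". The function name mark_non_ascii is made up here, and the ANSI escape codes only produce red text on terminals that support them.

```python
RED = "\033[31m"    # ANSI escape: switch to red text
RESET = "\033[0m"   # ANSI escape: back to the default colour

def mark_non_ascii(text, start="(", end=")"):
    """Wrap every non-ASCII character in the given start/end markers."""
    return "".join(f"{start}{ch}{end}" if ord(ch) > 127 else ch for ch in text)

spoofed = "https://www.p\u0430ypal.com"     # U+0430 is the Cyrillic letter "а"
print(mark_non_ascii(spoofed))              # parenthesised, as in the example above
print(mark_non_ascii(spoofed, RED, RESET))  # red on an ANSI terminal
```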
Jason
--
KStars: A desktop planetarium for KDE
Great idea, but...
Did you also test for "false positives"? E.g. "müller" should not give a match, at least for German-speaking people.
An easier approach might be to define subsets of glyphs that shouldn't be mixed, and to show a warning message if, e.g., Cyrillic and ASCII characters are mixed in a hostname.
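A minimal Python sketch of that mixed-script warning. Python's standard library has no direct "script" lookup, so this uses the first word of each letter's Unicode name as a stand-in, which is a heuristic and an assumption of this example; scripts_used and should_warn are made-up names.

```python
import unicodedata

def scripts_used(hostname):
    """Heuristic: treat the first word of each letter's Unicode name as its script."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in hostname
        if ch.isalpha()
    }

def should_warn(hostname):
    """Warn when letters from more than one script appear in the hostname."""
    return len(scripts_used(hostname)) > 1

print(should_warn("müller.de"))            # all letters are Latin
print(should_warn("www.p\u0430ypal.com"))  # Latin mixed with a Cyrillic "а"
```

With this heuristic "ü" counts as Latin, so "müller" is not flagged.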
Conflicting encoding pairs
Maybe it would be sufficient to know which pairs of encodings have problematic glyphs and then highlight all characters from one of them if displayed in combination.
Highlighting could be something like using a different background color.
For the example müller this could either be ASCII + Latin1 or Latin1 only.
As ASCII and Latin1 do not have conflicting glyphs, there would be no highlighting in either case.
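Sketched in Python, with the caveat that the CONFLICTING table below is a hypothetical two-entry sample and that non-Latin1 characters are classified by the first word of their Unicode name (a heuristic, not a real charset lookup). Square brackets stand in for the background colour.

```python
import unicodedata

# Hypothetical sample of charset pairs that share look-alike glyphs.
CONFLICTING = {
    frozenset({"ASCII", "CYRILLIC"}),
    frozenset({"ASCII", "GREEK"}),
}

def charset_of(ch):
    """Classify a character as ASCII, LATIN1, or by the first word of its Unicode name."""
    if ord(ch) < 128:
        return "ASCII"
    if ord(ch) < 256:
        return "LATIN1"
    return unicodedata.name(ch, "UNKNOWN").split()[0]

def highlight(text):
    """Bracket the non-ASCII letters of every conflicting pair present in text."""
    present = {charset_of(ch) for ch in text if ch.isalpha()}
    flagged = set()
    for pair in CONFLICTING:
        if pair <= present:             # both charsets of the pair occur
            flagged |= pair - {"ASCII"}
    return "".join(
        f"[{ch}]" if ch.isalpha() and charset_of(ch) in flagged else ch
        for ch in text
    )

print(highlight("müller"))    # ASCII + Latin1: no conflict, nothing highlighted
print(highlight("\u041aDE"))  # ASCII + Cyrillic: the Cyrillic K is bracketed
```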
perhaps both?
Maybe run my test (with a relatively low threshold), but only when the charsets don't match, as you said.
whoa!
Totally insane!
(Yes, 100% indicates that it is 100% certain that something funny is going on).
really interesting
This is the best solution to the problem I have heard of yet. You could even precompute the similarities for each font, and simply store a list of characters that are similar, per-font. It would require a lot of additional work on sophisticated matching algorithms, I think. But I think it is doable and it is really the *only* way to solve the problem once and for all.
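To illustrate what such a stored list could look like once it exists, here is a Python sketch. The LOOKALIKES table is a tiny hand-written sample, not something computed from any font, and skeleton and imitates are hypothetical helper names.

```python
# Hand-written sample entries; a real table would be precomputed per font.
LOOKALIKES = {
    "\u041a": "K",  # Cyrillic Ka
    "\u0405": "S",  # Cyrillic Dze
    "\u0420": "P",  # Cyrillic Er
    "\u03a4": "T",  # Greek Tau
    "\u0131": "i",  # Latin dotless i
}

def skeleton(text):
    """Replace every character that has a known ASCII look-alike."""
    return "".join(LOOKALIKES.get(ch, ch) for ch in text)

def imitates(candidate, trusted):
    """True if candidate differs from trusted but looks the same per the table."""
    return candidate != trusted and skeleton(candidate) == trusted

print(imitates("\u041aDE", "KDE"))              # Cyrillic K posing as Latin K
print(imitates("m\u0131crosoft", "microsoft"))  # dotless i posing as i
print(imitates("müller", "muller"))             # ü is not in the table
```

The lookup itself is cheap; the hard part remains producing the table.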
I don't really like my idea
I think a database of matches is a lot of data (in theory you would only need one for all the fonts, though).
The only way to solve the problem is to just have everyone use Latin script.
Another solution may be to indicate what classes of letters are in a string according to the unicode definition ("This string contains Latin and Cyrillic glyphs").
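A short Python sketch of producing that kind of message. It takes the first word of each letter's Unicode name as the class, which is a heuristic assumed for this example rather than the script property from the Unicode definition; describe_scripts is a made-up name.

```python
import unicodedata

def describe_scripts(text):
    """Build a message naming the classes of letters found in text."""
    names = sorted({
        unicodedata.name(ch, "UNKNOWN").split()[0].capitalize()
        for ch in text
        if ch.isalpha()
    })
    if not names:
        return "This string contains no letters"
    if len(names) == 1:
        listing = names[0]
    else:
        listing = ", ".join(names[:-1]) + " and " + names[-1]
    return f"This string contains {listing} glyphs"

print(describe_scripts("p\u0430ypal"))
```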
Are you sure that the database...
Are you sure that the database would grow that big? Are there so many different scripts that resemble the Latin alphabet?
Or do the Asian scripts suffer from the same problem?
er, ok
Right, it'd be for fewer than 65,536 characters, which isn't much if only one database is necessary.
I just don't want to have to make the data!
However, I now think that just checking whether only one kind of script is used may be the best solution.