REBOL3 tracker
Ticket #0002013

Type: Issue    Status: reviewed    Date: 4-Apr-2013 03:21
Version: r3 master    Category: Datatype    Submitted by: Ladislav
Platform: All    Severity: major    Priority: normal

Summary: How shall LOAD handle "external URL strings" containing non-ASCII characters?
Description: The example code below examines how LOAD handles "external URL strings" when transforming them to the "internal representation".

Notice that the internal representation is not Unicode; rather, it is a "special representation" that represents one Unicode code point by a sequence of code points. The transformation is not without logic, although it looks strange at first sight.
Example code:
>> to string! load "http://a.b.c/d?e=č"
== "http://a.b.c/d?e=Ã?"
; which actually is "http://a.b.c/d?e=^(00c4)^(008d)"

; I originally expected the result to be
== "http://a.b.c/d?e=č"

; or
== "http://a.b.c/d?e=%c4%8d"

; yet another variant I would have expected is LOAD causing an error in such a case
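
The observed result is consistent with each byte of the UTF-8 encoding of č being stored as a separate code point. The following Python sketch reproduces the three candidate results from the example above. Python is used only as a neutral illustration; it is not REBOL code and makes no claim about how LOAD is actually implemented.

```python
# Illustration only: models the three outcomes discussed in the ticket.
external = "http://a.b.c/d?e=č"

# UTF-8 encodes č (U+010D) as the two bytes C4 8D.
utf8_bytes = external.encode("utf-8")

# Observed: each byte becomes its own code point (U+00C4, U+008D).
# Decoding as Latin-1 maps every byte to the code point of the same value.
observed = utf8_bytes.decode("latin-1")

# Expected variant 1: the code point is preserved as is.
preserved = external

# Expected variant 2: the non-ASCII bytes are percent-encoded.
percent_encoded = "".join(
    chr(b) if b < 0x80 else f"%{b:02x}" for b in utf8_bytes
)

print([hex(ord(c)) for c in observed[-2:]])                  # ['0xc4', '0x8d']
print(len(preserved), len(observed), len(percent_encoded))   # 18 19 23
```

The differing lengths show why the choice matters: LENGTH? and PICK on the loaded value give different answers under each variant.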

Assigned to: n/a    Fixed in: -    Last update: 11-Jan-2016 03:34


Comments
(0003754)
BrianH
4-Apr-2013 20:48

Well, many schemes should have a way to deal with non-ASCII characters, converting them to one or more encodings when they are used. For instance, an HTTP scheme should be able to convert to one encoding for domain names (Punycode IDN) and another for the rest (UTF-8). As another example, an ODBC scheme should be able to directly handle Unicode server, table and field names, and they should not be converted to octets. I say that they should be allowed; then let the scheme handle any conversion needed, or complain when it has no way to do so.
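
As a rough illustration of the HTTP case described above, the Python sketch below converts the host with Punycode (via Python's built-in `idna` codec, which implements IDNA 2003) and percent-encodes the rest as UTF-8. The helper name `encode_for_http` and the host `č.example` are hypothetical, chosen only for this example; this is not a description of any REBOL scheme handler.

```python
from urllib.parse import quote

def encode_for_http(host: str, path_and_query: str) -> str:
    """Hypothetical helper: Punycode for the host, UTF-8 percent-encoding for the rest."""
    ascii_host = host.encode("idna").decode("ascii")
    # Keep URL syntax characters as they are; percent-encode everything else.
    ascii_rest = quote(path_and_query, safe="/?=&")
    return f"http://{ascii_host}{ascii_rest}"

# Assumed input: a non-ASCII host label and a non-ASCII query value.
wire_url = encode_for_http("č.example", "/d?e=č")
print(wire_url)
```

The same character ends up in two different ASCII forms depending on where it appears in the URL, which is why this conversion is a matter for the scheme rather than for LOAD.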
(0003767)
Ladislav
5-Apr-2013 00:32

"Well, many schemes should have a way to deal with non-ASCII characters" - you are missing a couple of trivial things here:

* The ticket discusses the behaviour of LOAD, which isn't scheme-dependent at present.
* Since you are discussing scheme-dependent behaviour in this LOAD ticket, should I understand that you are proposing that LOAD behave differently for different schemes, i.e., use different (incompatible) internal representations for URLs belonging to different schemes?
(0003777)
BrianH
5-Apr-2013 03:06

No, I still want LOAD to be scheme-independent. As mentioned in #2014, with the right internal representation of the url! type, one logically composed of codepoints instead of octets, LOAD could handle characters outside of the ASCII range pretty easily. That logical representation could have a UTF-8 physical representation internally, if you like.

Actually, your example code has to string! in it, and that is another issue. Let's assume for a moment that, in terms of external behavior, I would like url! to remain part of the any-string! typeset, and that, as with all of the other any-string! types, PICK on a url! value returns a char! (codepoint), not an integer as with the binary! type. Let's also assume that to string! of a url! returns the string equivalent of its internal data (regardless of internal encoding changes, such as from UTF-8 to UCS-2). Then length? of a url! would be the same as length? of to string! of that url!, and every character in that string would be the same as the corresponding character in the url!.

>> to string! load "http://a.b.c/d?e=č"
; I originally expected the result to be
== "http://a.b.c/d?e=č"

Sounds good to me. Also, I would like this:

>> strict-equal? "http://a.b.c/d?e=č" to string! load "http://a.b.c/d?e=č"
== true
>> length? load "http://a.b.c/d?e=č"
== 18
>> same? length? load "http://a.b.c/d?e=č" length? "http://a.b.c/d?e=č"
== true
>> same? pick load "http://a.b.c/d?e=č" 18 pick "http://a.b.c/d?e=č" 18
== true
>> to integer! pick load "http://a.b.c/d?e=č" 18
== 269

I would not mind if MOLD generates the percent encoding, as long as all of these are true:

>> mold load "http://a.b.c/d?e=č"
== "http://a.b.c/d?e=%c4%8d"
>> http://a.b.c/d?e=č
== http://a.b.c/d?e=%c4%8d
>> strict-equal? load "http://a.b.c/d?e=č" load "http://a.b.c/d?e=%c4%8d"
== true
>> length? load "http://a.b.c/d?e=%c4%8d"
== 18
>> same? pick "http://a.b.c/d?e=č" 18 pick load "http://a.b.c/d?e=%c4%8d" 18
== true
>> to integer! pick load "http://a.b.c/d?e=%c4%8d" 18
== 269

The internal model should be Unicode characters (possibly UTF-8 encoded, possibly the same encoding as string!), but it would be nice if the syntax could also match the URL RFC as stated in #1986 as long as LOAD does the verifying and decoding of the percent encoding itself.

The scheme-dependent behavior would not be done by LOAD; it would be done by the port scheme handlers when they are processing url! values that LOAD has already generated.
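
The properties requested in this comment can be modelled in a few lines. The Python sketch below is only an illustration under stated assumptions: `load_url` and `mold_url` are hypothetical names standing in for LOAD and MOLD, and a Python `str` (a sequence of code points) stands in for the proposed internal representation of url!.

```python
from urllib.parse import unquote

def load_url(text: str) -> str:
    """Hypothetical model of LOAD: decode percent-encoded UTF-8 into code points."""
    return unquote(text, encoding="utf-8", errors="strict")

def mold_url(url: str) -> str:
    """Hypothetical model of MOLD: percent-encode the UTF-8 bytes of non-ASCII characters."""
    return "".join(
        c if ord(c) < 0x80 else "".join(f"%{b:02x}" for b in c.encode("utf-8"))
        for c in url
    )

a = load_url("http://a.b.c/d?e=č")
b = load_url("http://a.b.c/d?e=%c4%8d")

print(a == b)            # True
print(len(a), len(b))    # 18 18
print(ord(b[17]))        # 269 (REBOL's PICK is 1-based, so this is position 18)
print(mold_url(a))       # http://a.b.c/d?e=%c4%8d
```

Note that this model decodes every percent escape, including escaped syntax characters, which is the behaviour the follow-up comment identifies as problematic.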
(0003780)
BrianH
5-Apr-2013 04:23

Note that if LOAD decodes percent encoding the way it does now for ASCII characters, it causes problems later on in schemes, which can't tell URL syntax characters that were originally percent-encoded (and so meant as data) from ones that weren't percent-encoded (and so should be treated as syntax). We would need some kind of internal escaping of the problematic characters, so that they can be re-encoded appropriately by the scheme (that is the scheme-dependent behavior I was talking about).

Also, note that if LOAD doesn't decode percent encoding, it will be more difficult to generate url! values and, in the case of Unicode characters outside of the ASCII range, it will lead to data corruption. We don't want people building url! values to have to handle their own UTF-8 percent encoding if we can avoid it; that would lead to a lot of buggy code.

See #2014 for the discussion about how to fix the underlying data model so that we can deal with issues like this one.
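
The ambiguity described in this comment can be shown with a query string whose value contains escaped syntax characters. The Python sketch below is an illustration only; the query `e=a%26f%3d1` is an assumed example, not taken from the ticket.

```python
from urllib.parse import parse_qsl, unquote

raw_query = "e=a%26f%3d1"   # one parameter whose value is the data "a&f=1"

# Parse first, then decode each part: the data survives intact.
parsed_then_decoded = parse_qsl(raw_query)

# Decode everything up front (as a LOAD that decodes all escapes would do):
# the data characters & and = are now indistinguishable from syntax.
decoded_then_parsed = parse_qsl(unquote(raw_query))

print(parsed_then_decoded)   # [('e', 'a&f=1')]
print(decoded_then_parsed)   # [('e', 'a'), ('f', '1')]
```

Once the escapes are decoded, a scheme handler has no way to recover which reading was intended unless the decoded value keeps some internal marker for the characters that were originally escaped.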
(0003781)
BrianH
5-Apr-2013 04:24

This ticket is related to #482 and #1986, or perhaps a combination of the two.
(0003786)
rebolek
5-Apr-2013 10:50

I guess it should be http://curecode.org/rebol3/ticket.rsp?id=1986 ?
(0004688)
Ladislav
11-Jan-2016 03:34

In the core-tests suite.

Date User Field Action Change
11-Jan-2016 03:34 ladislav Comment : 0004688 Added -
8-Jan-2016 19:53 ladislav Status Modified submitted => reviewed
5-Apr-2013 19:07 BrianH Comment : 0003781 Modified -
5-Apr-2013 11:20 Ladislav Comment : 0003784 Removed -
5-Apr-2013 10:50 rebolek Comment : 0003786 Added -
5-Apr-2013 10:44 Ladislav Comment : 0003784 Modified -
5-Apr-2013 10:37 Ladislav Category Modified Syntax => Datatype
5-Apr-2013 10:34 Ladislav Comment : 0003784 Added -
5-Apr-2013 04:41 BrianH Comment : 0003777 Modified -
5-Apr-2013 04:36 BrianH Comment : 0003777 Modified -
5-Apr-2013 04:32 BrianH Comment : 0003777 Modified -
5-Apr-2013 04:29 BrianH Comment : 0003777 Modified -
5-Apr-2013 04:28 BrianH Comment : 0003780 Modified -
5-Apr-2013 04:25 BrianH Comment : 0003781 Modified -
5-Apr-2013 04:24 BrianH Comment : 0003781 Added -
5-Apr-2013 04:23 BrianH Comment : 0003780 Added -
5-Apr-2013 03:06 BrianH Comment : 0003777 Added -
5-Apr-2013 00:48 Ladislav Code Modified -
5-Apr-2013 00:48 Ladislav Description Modified -
5-Apr-2013 00:48 Ladislav Summary Modified How shall LOAD handle "URL strings" containing non-ascii characters? => How shall LOAD handle "external URL strings" containing non-ascii characters?
5-Apr-2013 00:32 Ladislav Comment : 0003767 Modified -
5-Apr-2013 00:32 Ladislav Comment : 0003767 Added -
4-Apr-2013 21:05 BrianH Comment : 0003754 Modified -
4-Apr-2013 20:48 BrianH Comment : 0003754 Added -
4-Apr-2013 11:54 Ladislav Type Modified Bug => Issue
4-Apr-2013 11:54 Ladislav Code Modified -
4-Apr-2013 11:54 Ladislav Description Modified -
4-Apr-2013 11:54 Ladislav Summary Modified LOAD does not handle correctly "URL strings" containing non-ascii characters => How shall LOAD handle "URL strings" containing non-ascii characters?
4-Apr-2013 11:49 Ladislav Code Modified -
4-Apr-2013 11:29 Ladislav Code Modified -
4-Apr-2013 09:59 Ladislav Summary Modified LOAD does not handle correctl "URL strings" containing non-ascii characters => LOAD does not handle correctly "URL strings" containing non-ascii characters
4-Apr-2013 09:59 Ladislav Summary Modified LOAD does not handle "URL strings" containing non-ascii characters correctly => LOAD does not handle correctl "URL strings" containing non-ascii characters
4-Apr-2013 03:41 Ladislav Code Modified -
4-Apr-2013 03:34 Ladislav Code Modified -
4-Apr-2013 03:32 Ladislav Code Modified -
4-Apr-2013 03:32 Ladislav Description Modified -
4-Apr-2013 03:21 Ladislav Severity Modified minor => major
4-Apr-2013 03:21 Ladislav Category Modified Unspecified => Syntax
4-Apr-2013 03:21 Ladislav Code Modified -
4-Apr-2013 03:21 Ladislav Description Modified -
4-Apr-2013 03:21 Ladislav Ticket Added -