REBOL3 tracker
  0.9.12 beta
Ticket #0001916 User: anonymous

Type: Bug  Status: submitted  Date: 15-Dec-2012 11:16
Version: 2.101.0  Category: Native  Submitted by: BrianH
Platform: All  Severity: major  Priority: normal

Summary TRANSCODE function API model needs revamp
Description I've examined all of the code that currently uses TRANSCODE to determine behavioral patterns; it's a low-level function and we said that we'd reevaluate its model once we had more data. Based on this, it was notable that TRANSCODE was never used without the /next or /only options. On occasions where you could use TRANSCODE without those options, TO BLOCK! was used instead. I think I've figured out why.

Right now, TRANSCODE returns a block value, with the continuation (the source binary at the position after the decoded portion of the code) appended to the end of the block. If you are using TRANSCODE /only or /next, that returned block only has one value in it, plus the continuation, always a two-value return block. And almost always that two-value block is passed to SET/ANY or SET [var1: var2:] (using set-words for FUNCT). In all cases, the return block is discarded.

Having the return value of TRANSCODE with those options passed to SET block makes sense: fixed-length return blocks with particular values in predictable positions are what SET block was made for, and this is Rebol's most common high-level method of multi-value return. For high-level code, making an extra intermediate block is worth the convenience, even if it's immediately thrown away.

On the other hand, if you use TRANSCODE without the options, you end up with a result that is not only unusable by SET because of its unpredictable length, but also inconvenient to use at all, because there's an extra value tacked onto the end of the block. To use the block, you have to save the last value in some variable (if you need it at all), and then do a CLEAR BACK TAIL on the block to make the rest of it useful. There's nothing convenient about that.

So, TRANSCODE returns a block which contains useful values, but is not itself useful when you use the /next or /only options, or is so unnecessarily awkward to use when you don't use the options that the function is never used without them. Not very Rebol-like.

TRANSCODE needs to return its value and continuation in a usable form and predictable location every time, and it needs to be as efficient about it as possible, both in development of code using TRANSCODE and in execution. This means not appending a continuation to the block of values when you don't use /only or /next, because that has to be undone before the value block is usable. It means thinking of the /next or /only returns as two values, rather than as a block with one value in it then another unrelated one added. Rebol-style multi-value return.
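
As a rough illustration of the values-plus-continuation pair described above, that return shape can be approximated today with a small wrapper around the current behavior. This is only a sketch: TRANSCODE-PAIR is a hypothetical name, not an existing function, and it relies on the current behavior described in this ticket (continuation appended as the last item of the result).

```rebol
; Hypothetical wrapper, not part of Rebol: TRANSCODE-PAIR is an illustrative name.
; Assumes the current behavior described in this ticket, where TRANSCODE without
; options returns the decoded values with the continuation appended as the last item.
transcode-pair: func [
    "Returns [[values] continuation] built from the current TRANSCODE result."
    source [binary!] "Must be UTF-8 encoded"
    /local values rest
][
    values: transcode source
    rest: take/last values   ; detach the trailing continuation binary
    reduce [values rest]     ; fixed two-item block, suitable for SET
]

; Usage: both results land in predictable positions
set [code: end:] transcode-pair to binary! "1 [2 3] four"
```

The wrapper still pays for the intermediate block and the TAKE/last; a native change would avoid appending the continuation in the first place.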

There are four proposed models below. Two with intermediate wrapper-blocks, suitable for use with SET; one of those with the wrapper-block being optional, only returning the values block when you don't specify the option. Two that are passed a word that is set to the continuation, or none if you want to ignore it, with no intermediate block needed; one of those with that word argument being an option. I'll reserve my opinions about which is better for the comments.

This will also make TRANSCODE/part practical (see #1915), which lowers the overhead of embedded scripts and Rebol template languages like RSP. I expect that will be the most common use of TRANSCODE outside of the LOAD infrastructure.
Example code
; Current behavior examples (mostly adapted from sys-load.r)
set/any [code: end:] transcode/only data   ; Being safe, the most common use
set [code: end:] transcode/next data       ; We want unset to trigger errors
decompress first transcode/next data       ; Don't care about end position
code: head clear back tail transcode data  ; Don't care about end position, no options
code: transcode data  end: last code  clear back tail code
  ; What you have to do to use transcode without options and get the end position
  ; People just use to block! instead, as I will with the next two examples
code: to block! copy/part data end        ; Partial transcode, will be very common, note the copy overhead (see #1915)
code: to block! copy/part data len  end: skip data len
  ; Partial transcode where you must save the end position, rare because it's usually already known


; 1. Proposed spec with intermediate wrapper-block tweak:
transcode: native [
	"Translates UTF-8 binary source to values. Returns [[values] binary]."   ;
	source [binary!] "Must be UTF-8 encoded"
	/part length [binary! integer!] "Length of source to translate"  ; See #1915
	/next "Next complete value (blocks as single value) as [value binary]"
	/only "Only a single value (blocks dissected) as [value binary]"
	/error "Do not throw errors - return error object as value in place"
]

; Examples
set/any [code: end:] transcode/only data  ; Same
set [code: end:] transcode/next data      ; Same
decompress first transcode/next data      ; Same
code: first transcode data                ; Clean code
set [code: end:] transcode data           ; Clean code
code: first transcode/part data end       ; No copy overhead, clean code
set [code: end:] transcode/part data len  ; No copy overhead, clean code


; 2. Proposed spec with intermediate wrapper-block option:
transcode: native [
	"Translates UTF-8 binary source to block of values."
	source [binary!] "Must be UTF-8 encoded"
	/part length [binary! integer!] "Length of source to translate"  ; See #1915
	/next "Next complete value (blocks as single value) as [value binary]"  ; /cont implied
	/only "Only a single value (blocks dissected) as [value binary]"        ; /cont implied
	/error "Do not throw errors - return error object as value in place"
	/cont "Return source at position after values also, as [[values] binary]"
]

; Examples
set/any [code: end:] transcode/only data       ; Same
set [code: end:] transcode/next data           ; Same
decompress first transcode/next data           ; Same
code: transcode data                           ; No intermediate block, cleanest code
set [code: end:] transcode/cont data           ; Clean code, /cont option needed
code: transcode/part data end                  ; No copy overhead or intermediate block, cleanest code
set [code: end:] transcode/cont/part data len  ; No copy overhead, /cont option needed


; 3. Proposed spec with continuation-word argument:
transcode: native [
	"Translates UTF-8 binary source to [values]. Sets word to end position."
	source [binary!] "Must be UTF-8 encoded"
	after [word! none!] "Word set to source position after decoded values"
	/part length [binary! integer!] "Length of source to decode"  ; See #1915
	/next "Translate next complete value (blocks as single value)"
	/only "Translate only a single value (blocks dissected)"
	/error "Do not throw errors - return error object as value in place"
]

; Examples
set/any 'code transcode/only data 'end  ; No intermediate block, still have to set/any, no set-words for funct
code: transcode/next data 'end          ; No intermediate block, simpler
decompress transcode/next data none     ; No intermediate block, none means not interested in end position
code: transcode data none               ; No intermediate block, simpler
code: transcode data 'end               ; No intermediate block, simpler
code: transcode/part data none end      ; No copy overhead or intermediate block, none in weird location
code: transcode/part data 'end len      ; No copy overhead or intermediate block, 'end in weird location, simplest code for this


; 4. Proposed spec with continuation-word option:
transcode: native [
	"Translates UTF-8 binary source to block of values."
	source [binary!] "Must be UTF-8 encoded"
	/part length [binary! integer!] "Length of source to decode"  ; See #1915
	/next "Translate next complete value (blocks as single value)"
	/only "Translate only a single value (blocks dissected)"
	/error "Do not throw errors - return error object as value in place"
	/then "Save the position after the decoded values"  ; Put last to encourage using it last, which looks better
	after [word! none!] "Word set to source at position"
]

; Examples
set/any 'code transcode/only/then data 'end  ; No intermediate block, still have to set/any, no set-words for funct
code: transcode/next/then data 'end          ; No intermediate block
decompress transcode/next data               ; No intermediate block, simplest code for this
code: transcode data                         ; No intermediate block, cleanest code
code: transcode/then data 'end               ; No intermediate block
code: transcode/part data end                ; No copy overhead or intermediate block, cleanest code
code: transcode/part/then data len 'end      ; No copy overhead or intermediate block, 'end not in weird location

Assigned to: n/a  Fixed in: -  Last Update: 3-Jul-2013 01:46


Comments
(0003268)
BrianH
15-Dec-2012 11:17

First, some basics about the common uses of TRANSCODE.

The most common current use is with the /next or /only options, with the results most often going to SET/any. Not SET, because TRANSCODE with those options tends to be in more low-level code where people are more careful with error triggering. With the /next or /only options, you almost always need the continuation too; ignoring it is rare since incremental translation is the main use of those options.

Using TRANSCODE to do a translation of more than just the first value is rare. A full source translation can currently be done more easily with TO BLOCK!. TRANSCODE/error allows incremental translation with possible recovery from errors, but that is a really difficult task that no one has taken on yet. Nonetheless, TRANSCODE/error really needs that continuation if you want to have a hope of recovering.

The big win will come with TRANSCODE/part (#1915), because that solves a real problem that TO BLOCK! can't without full source copy overhead. That probably doesn't need the continuation set since you know the offset ahead of time, and can get a reference to the source at that offset whenever you want. However, this is a case where the relative overhead of an intermediate wrapper-block is trivial.

Now, for the proposals, by the numbers.

Proposal 3 is likely to have the least overhead in the function itself, closely followed by the rest. Proposal 2 optimizes for the most common usage patterns, but you would have to have /next, /only or /error all imply /cont, or else you'll have to specify /cont on most of the common cases, like you have to with proposal 4's /then option. Plus, the option processing overhead. And there's no decent name for the /cont option, afaict. You're better off without the options, sticking with fixed behavior.

That leaves proposals 1 and 3, the intermediate-block tweak and the continuation-word argument, no options. TRANSCODE is low-level enough that we can get away with a weird proposal like 3 if the overhead is lowered enough. Proposal 1 is the closest to being Rebol-like, leading to a greater likelihood that people can understand the code you would write to that option; it would require the fewest changes to the TRANSCODE function and none at all to existing code that runs on it, but bring us huge benefits when we implement the /part option.

I'll say that I prefer proposal 1, with the willingness to switch to 3 if it brings us overhead reductions that are big enough to outweigh the feeling that you're programming in Pascal. I would be more than happy to find out that proposal 1 was more efficient than 3 though.
(0003269)
BrianH
15-Dec-2012 19:52

Severity of major because it's a behavioral change that may require code changes, how much code depending on which proposal we do. For proposal 1 or 2 no known existing code will need changing.
(0003430)
BrianH
6-Feb-2013 08:31

Upon reviewing the native code, it looks like the intermediate block options are looking worse. The values end up having to be passed to SET or SET/any block!, which has more overhead than SET or SET/any word! or set-word assignment. This means that the overhead of setting the word has been moved to somewhere less efficient.
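
A quick way to compare these assignment forms from user code is a timing sketch like the one below. It assumes the DT (delta-time) helper is available, as it is in R3 builds; the iteration count and variable names are arbitrary, and it measures only the assignment step, not the allocation of the wrapper block inside TRANSCODE.

```rebol
; Timing sketch only; assumes DT (delta-time) is available, as in R3 builds.
; The iteration count is arbitrary and results will vary by build and machine.
n: 1000000
t-block: dt [loop n [set [a: b:] [1 2]]]   ; SET with a block of set-words
t-word:  dt [loop n [set 'a 1  set 'b 2]]  ; SET with individual words
t-plain: dt [loop n [a: 1  b: 2]]          ; plain set-word assignment
print ["set block:" t-block]
print ["set word: " t-word]
print ["set-words:" t-plain]
```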

As for the word-passing proposals, it looks like the cost of processing the arguments of proposals 3 and 4 would be the same. Given how the code looks between 3 and 4, 4 looks a bit better, at the cost of one more slot in the stack frame.
(0003595)
fork
7-Mar-2013 20:52

I spent a fair bit of quality time with Red's lexer (https://github.com/dockimbel/Red/blob/master/red/lexer.r), which takes binary UTF-8 input and processes it in the PARSE dialect. So once I understood what TRANSCODE did, it jumped out to me that this was a generally useful thing for a PARSE of a binary! to be able to do.

e.g. Just as it is useful to write:

parse ["Hello" 10 20] [string! copy value 2 integer!]

It could be useful to write:

value: []
parse rejoin [#{FFFF} to-binary "{Hello} 12-Dec-2012" #{0000}] [2 #{FF} transcode value 2 #{0000}]


I can imagine scenarios where binary wire formats might have a bunch of stuff surrounding a little pocket of UTF-8 encoded Rebol that was slipped in with to binary! mold code. It's a nice package for the functionality to come in, because if you care about the update in position you can capture it with a set-word!, and if you don't particularly need that position you can just keep going.

What the dialect format should be in lieu of the /part or /only I'm not sure. (e.g. how to specify the refinements; one could use refinement syntax without actually doing refinement lookup, but I don't know the impact). I gave an example of transcode value 2 to effectively mean:

value: []
parse rejoin [#{FFFF} to-binary "{Hello} 12-Dec-2012" #{0000}] [
    2 #{FF}
    pos:
    (
        value: transcode/part pos
        newpos: last value
        take back tail value
    )
    :newpos
    #{0000}
]


Then transcode value to end could have the default behavior of going to the end, etc. But I've not designed parse stuff before so I don't know the tradeoffs here.

This looks to be the shape of what's going on, and it could be a nice general parse feature. I proposed that if there is some really common case of the shape of a block for the most common invocation of TRANSCODE, then PARSE could be finessed so it recognized that particular construction and short circuited the parse engine to optimize for it. e.g. if the loader/etc. had a very specific form of call like:

value: []
parse bin-input [transcode value 1 nextpos:]


The parse native could simply go "Hey, is the length of the rule block 4? Is the first symbol 'transcode? Is the second a word!? Is the third the integer 1? Is the fourth symbol a set-word?" If those things match it goes straight to work. You might even find it turns out faster doing that than running through the refinement lookup for the original function. So before rejecting this idea out of hand because it "won't perform" let's consider if the design makes sense... it looks a lot cleaner to me.
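
The shape check described above can be sketched in Rebol itself. This is only an illustration: FAST-TRANSCODE-RULE? is a hypothetical name, the real test would live inside the PARSE native, and the TRANSCODE parse keyword is a proposal rather than an existing feature.

```rebol
; Hypothetical predicate written in Rebol for illustration; the real check would
; live inside the PARSE native. The TRANSCODE parse keyword is a proposal only.
fast-transcode-rule?: func [
    "True if RULE has the shape [transcode word 1 set-word:]."
    rule [block!]
][
    to logic! all [
        4 = length? rule            ; exactly four items in the rule block
        'transcode = first rule     ; first symbol is the word TRANSCODE
        word? second rule           ; second is a plain word
        integer? third rule         ; third is an integer...
        1 = third rule              ; ...and that integer is 1
        set-word? fourth rule       ; fourth is a set-word for the new position
    ]
]

print fast-transcode-rule? [transcode value 1 nextpos:]
print fast-transcode-rule? [transcode value 2 nextpos:]
```

Any rule block that fails the check would simply fall through to the general parse engine.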
(0003596)
BrianH
7-Mar-2013 21:50

Sounds like a good idea in principle, Fork.

The API model of the PARSE operation could use some work though, because TRANSCODE can do many things depending on its options, but each PARSE operation can only do one thing with no options because we can't use path expressions (for various reasons we don't need to go into here). So in order to get the full benefit of TRANSCODE, we'd need to add or extend multiple operations. And we need to consider that TRANSCODE currently is optimized to work on binaries and doesn't include a string parser at all.

For TRANSCODE/next, we could just extend the datatype/typeset operations from block parsing to binary parsing, where the binary source would be interpreted as UTF-8 Rebol syntax. One whole value would be grabbed then type-checked against the datatype or typeset specified. R2 had something similar for matching some datatypes, but we could go all the way and make the SET, COPY, RETURN and QUOTE operations work here as well. SET would set the word to the constructed value, RETURN would return the constructed value, COPY would set the word to a copy of the matched portion of the source. QUOTE would transcode one value then compare it to the literal value provided using the block parse QUOTE rules.

We wouldn't need TRANSCODE without options because SOME and ANY of our incremental transcode operations would integrate better.

We wouldn't need TRANSCODE/error because we're already doing incremental parsing, so simply failing at the point where the match fails would be enough for us to backtrack and try an alternate parse rule.

We could use Carl's proposed LIMIT operation to implement /part.

The only tricky one would be TRANSCODE/only, but I think that we might be able to have INTO do this one.

And to extend this to string parsing, we'd have to write a version of TRANSCODE that works on strings (which could have other benefits).

Does this make sense? It would be a lot of work, but maybe some in the community would be up for it. We'd still need the TRANSCODE function itself, but the actual parsing code could be shared with PARSE.
(0003879)
BrianH
3-Jul-2013 01:46

Proposal to add the functionality of TRANSCODE to PARSE in #2035.

Date User Field Action Change
3-Jul-2013 01:46 BrianH Comment : 0003879 Added -
7-Mar-2013 21:53 BrianH Comment : 0003596 Modified -
7-Mar-2013 21:50 BrianH Comment : 0003596 Added -
7-Mar-2013 20:52 fork Comment : 0003595 Added -
6-Feb-2013 08:31 BrianH Comment : 0003430 Added -
15-Dec-2012 22:07 BrianH Description Modified -
15-Dec-2012 19:52 BrianH Comment : 0003269 Added -
15-Dec-2012 19:50 BrianH Description Modified -
15-Dec-2012 19:50 BrianH Code Modified -
15-Dec-2012 19:50 BrianH Severity Modified minor => major
15-Dec-2012 11:22 BrianH Comment : 0003268 Modified -
15-Dec-2012 11:20 BrianH Comment : 0003268 Modified -
15-Dec-2012 11:17 BrianH Comment : 0003268 Added -
15-Dec-2012 11:16 BrianH Ticket Added -