Ticket #0002189

Type	Wish	Status	submitted	Date	4-Dec-2014 16:55
Version	r3 master	Category	Parse	Submitted by	Ladislav
Platform	All	Severity	minor	Priority	normal

Summary	Define a WHITESPACE charset
Description	I think it would be useful to have this defined; it seems to be used frequently enough to justify it.
Example code	whitespace: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]
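
A minimal usage sketch, assuming the charset above is defined as written and that the build in use follows standard R3 PARSE and FIND behaviour. The input strings are illustrative only.

```rebol
; Charset from the ticket: control characters through space, DEL, and no-break space.
whitespace: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]

; Skip leading whitespace (two spaces and a tab here) before a keyword.
print parse "  ^-hello" [any whitespace "hello"]

; FIND accepts a bitset, so the same charset locates the first whitespace character.
print mold find "foo bar" whitespace
```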

Assigned to	n/a	Fixed in	-	Last Update	4-Dec-2014 23:21

Comments

(0004544)
fork
4-Dec-2014 23:21

(Hi Ladislav nice to hear from you, do check in on chat sometime if you have a moment...)

Predefining character sets is a crucial idea, especially when advocating for the ease of use of PARSE. There has been significant discussion on how to do it. The Unicode standard actually has character classes, and it would be desirable to be able to offer sets for them:

http://www.fileformat.info/info/unicode/category/index.htm

The concept of defining it as a function is a nice one; it would for instance allow `whitespace` to be meaningful as well as `whitespace/ascii`. It also allows the sets to be generated and cached on demand. You could use it in FIND or PARSE or whatever...
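
A rough sketch of what "generated and cached on demand" could look like in today's Rebol. Every name here (`get-whitespace`, the two cache words) is hypothetical, and the contents of the `/ascii` subset are an assumption made for illustration, not something that has been settled:

```rebol
; Hypothetical caches; NONE means "not built yet".
whitespace-cache: none
whitespace-ascii-cache: none

; Hypothetical accessor: builds each bitset on first use, then reuses it.
get-whitespace: func [/ascii] [
    either ascii [
        ; Assumed ASCII subset: space, tab, newline, carriage return.
        any [whitespace-ascii-cache  whitespace-ascii-cache: charset " ^-^/^M"]
    ][
        ; Wider set taken from the ticket's example code.
        any [whitespace-cache  whitespace-cache: charset [#"^A" - #" " #"^(7F)" #"^(A0)"]]
    ]
]

; Works anywhere a bitset value is accepted, for instance FIND.
print mold find "foo bar" get-whitespace/ascii
```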

...however, it will not work with PARSE unless PARSE allows function evaluation. I added that in a PR, so it's certainly possible. At one point I thought arbitrary evaluation with function parameters would be okay if the parameters wound up inline with parse dialect code. I now agree with Carl (and others) that only zero-arity functions should be allowed inline in parse code. Under that premise this would be legal:

some-rule: function [/b] [
    either b [[some "b"]] [[some "a"]]
]

parse "aaaabbbb" [some-rule some-rule/b]

While this would be rejected, and hit an error on the first attempt to use a non-zero-arity call:

some-rule: function [value [char!]] [
    compose [some (value)]
]

parse "aaaabbbb" [some-rule #"a" some-rule #"b"]
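
For comparison, a parameterized rule can already be expanded before PARSE ever sees it, with no change to PARSE at all. This is only a sketch of that workaround, assuming COMPOSE's default behaviour of splicing block results; `make-rule` is a hypothetical name used for illustration:

```rebol
; Hypothetical helper: builds a rule block for a given character.
make-rule: func [value [char!]] [
    compose [some (value)]
]

; COMPOSE evaluates each paren and splices the resulting blocks,
; so PARSE receives an ordinary rule block with no function calls in it.
rule: compose [(make-rule #"a") (make-rule #"b")]
print mold rule
print parse "aaaabbbb" rule
```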

I've written up a deeper rationale behind why this is not a loss of meaningful generality--with the benefit of not making PARSE rules any more nuts than they can get already. :-)

Surveys of our proposals for these classes can be found in chat search, so if you stop by we can dig up what those were. Offhand I believe we were going with `digit`, `letter`, `whitespace`, `symbol`...with refinements on each to do narrowing. So `letter/latin8/uppercase` would be more specific, while `letter` would be very general and match anything in the Unicode spec that was a letter.

Date	User	Field	Action	Change
4-Dec-2014 23:23	Fork	Comment : 0004544	Modified	-
4-Dec-2014 23:22	Fork	Comment : 0004544	Modified	-
4-Dec-2014 23:21	Fork	Comment : 0004544	Added	-
4-Dec-2014 16:55	Ladislav	Ticket	Added	-