Fixing a Greenplum performance issue
The issue
Two years ago I posted an issue to the Greenplum project1, targeting the long-known issue of poor performance of writable external tables.
I wrote this piece of code to change the bufsize to 16M, reducing the number of calls to 'curl_easy_perform' and hence improving WET (writable external table) performance.
diff --git a/src/backend/access/external/url.c b/src/backend/access/external/url.c
index a81e687..1aaa6ba 100644
--- a/src/backend/access/external/url.c
+++ b/src/backend/access/external/url.c
@@ -2295,7 +2295,9 @@ static size_t curl_fwrite(char *buf, int nbytes, URL_FILE* file, CopyState pstat
 	 */
 	if(!curl->out.ptr)
 	{
-		const int bufsize = 64 * 1024 * sizeof(char);
+		const int bufsize = 16 * 1024 * 1024 * sizeof(char); /* Enlarge the bufsize to reduce calls to curl_easy_perform,
+		                                                        hence WET performance should be improved.
+		                                                        The size itself is magic */
 		MemoryContext oldcontext = CurrentMemoryContext;
 		MemoryContextSwitchTo(CurTransactionContext); /* TODO: is there a better cxt to use? */
Gpfdist's buffer size should also be enlarged, using the option -m 33554432.
The issue was then closed after a more robust patch was accepted by the Greenplum team; that patch uses a GUC instead of a hard-coded magic number to specify the buffer size.
Proof
At that time the source code of Greenplum had just been released to the public, and the binaries I had were different from the released source.

Instead of recompiling the source code to verify my change, I had to use the following process to prove my findings:

Suppose a table 'onefield' with only one VARCHAR field, whose maximum field length is 440 bytes:
\d onefield
Append-Only Columnar Table "public.onefield"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 fld    | character varying |
Checksum: t
Distributed randomly

SELECT max(length(fld)) from onefield;
 max
-----
 440
(1 row)
Time: 6134.606 ms
Use a regular WET (writable external table):
\d w16m_ext_1
External table "public.w16m_ext_1"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 fld    | character varying |
Type: writable
Encoding: UTF8
Format type: text
Format options: delimiter 'off' null '\N' escape '\'
External location: gpfdist://p_etl01:16665/a16m_1.txt
Export all records of table 'onefield' to 'w16m_ext_1':
INSERT INTO w16m_ext_1 SELECT * from onefield;
INSERT 0 171602816
Time: 168113.381 ms
The same table, but using COPY:
COPY onefield TO '/data/testing/onefield_copy';
COPY 171602816
Time: 70774.670 ms
The same data, but aggregating many rows into one row to make lines of approximately 18 MB:
truncate w16m;
TRUNCATE TABLE
Time: 137.081 ms

INSERT INTO w16m SELECT string_agg(fld, E'\n') FROM onefield GROUP BY ctid::int8>>18, gp_segment_id;
INSERT 0 1344
Time: 80662.342 ms

SELECT max(length(fld)) from w16m;
   max
----------
 19412359
(1 row)
Time: 7475.346 ms
Export the data to the WET:
\d w16m_ext
External table "public.w16m_ext"
 Column |       Type        | Modifiers
--------+-------------------+-----------
 fld    | character varying |
Type: writable
Encoding: UTF8
Format type: text
Format options: delimiter 'off' null '\N' escape '\'
External location: gpfdist://p_etl01:16665/a16m.txt

INSERT INTO w16m_ext SELECT * from w16m;
INSERT 0 1344
Time: 74387.568 ms
Data size:
-rw------- 1 gpadmin gpadmin 29G Dec 24 14:19 a16m_1.txt
-rw------- 1 gpadmin gpadmin 29G Dec 24 14:07 a16m.txt
-rw-r--r-- 1 gpadmin gpadmin 29G Dec 24 14:08 onefield_copy
Conclusion
Enlarging the data size of each POST request invoked by 'curl_fwrite' dramatically improves WET performance: from 168113.381 ms down to 74387.568 ms.
Upshot
Yao Yandong made a more robust patch based on my findings. Instead of a hard-coded bufsize, the patch introduced a new GUC letting DBAs adjust this size. The patch was merged into Greenplum in release 4.3.7.1. 2
Footnotes:
A Log of Network Trouble Shooting
The problem
This story happened in the real world, but the customer's name is fictional.

One of my customers, The Big Customer (TBC for short), is running our product.

One day they came to me and described the following weird phenomenon:
They observed a lot of IP packets sent from containers in one component of our product (a virtual machine of a cluster) to other servers without SNAT (MASQUERADE) being applied; the receiving server is not a member of the cluster. As a result, these packets got bounced into TBC’s core network, triggering alerts.
These containers host user applications and use internal IP addresses (usually 10.254.x.x). When processes inside these containers connect to outside servers, the packets need to be SNATed by the hosting virtual machine; this is done by a MASQUERADE rule in the iptables “nat” table.
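For illustration, such a MASQUERADE rule typically looks like the following. This is a sketch only; the outbound interface name and the exact subnet mask are my assumptions, not details taken from the actual system:

```shell
# Rewrite the source address of packets leaving the VM from the
# containers' internal range; "eth0" and the /16 are illustrative.
iptables -t nat -A POSTROUTING -s 10.254.0.0/16 -o eth0 -j MASQUERADE
```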
On the problematic system, a lot of packets with the TCP FIN flag set were sent out of containers with the source IP unchanged; the MASQUERADE rule did not take effect. When these packets reach the server, they are considered invalid because their source address is unrecognized, so RST packets are emitted towards those source addresses, in this case the 10.254.x.x addresses, and these packets get routed into TBC’s core network and trigger alerts.
As stated by the administrator of TBC’s core network, they had been receiving these weird alerts for a long time, and only recently could they trace these packets back to their origin: the virtual machines of our product.
Analysis
Checking connection status
In the container, lots of TCP connections are in the CLOSE-WAIT state.
Packet analysis using tcpdump
Packet analysis using tcpdump on the virtual machine: 12 conversations were analyzed; three of them were partially captured and 5 were erroneous. The summary:
Source IP/Port | Dest IP/Port | Connection status | Remark |
10.254.0.54.38758 | xxx.xxx.yyy.yyy.8088 | Partially Captured | SYN packet not seen |
xxx.xxx.xxx.xxx.39474 | xxx.xxx.yyy.yyy.8088 | Partially Captured, erroneous | SYN Packet not seen. FIN packet from client not masqueraded |
xxx.xxx.xxx.xxx.40162 | xxx.xxx.yyy.yyy.8088 | No error | |
xxx.xxx.xxx.xxx.40888 | xxx.xxx.yyy.yyy.8088 | No error | |
xxx.xxx.xxx.xxx.41740 | xxx.xxx.yyy.yyy.8088 | Erroneous | FIN packet from client not masqueraded |
xxx.xxx.xxx.xxx.42478 | xxx.xxx.yyy.yyy.8088 | Erroneous | FIN packet from client not masqueraded |
xxx.xxx.xxx.xxx.43218 | xxx.xxx.yyy.yyy.8088 | No error | |
xxx.xxx.xxx.xxx.43994 | xxx.xxx.yyy.yyy.8088 | No error | |
xxx.xxx.xxx.xxx.44710 | xxx.xxx.yyy.yyy.8088 | Erroneous | FIN packet from client not masqueraded |
xxx.xxx.xxx.xxx.45426 | xxx.xxx.yyy.yyy.8088 | Erroneous | FIN packet from client not masqueraded |
xxx.xxx.xxx.xxx.46244 | xxx.xxx.yyy.yyy.8088 | No error | |
xxx.xxx.xxx.xxx.46964 | xxx.xxx.yyy.yyy.8088 | Partially Captured | FIN packet not seen from either side |
Inspecting a typical erroneous conversation:
15:00:25.416697 IP xxx.xxx.xxx.xxx.41740 > xxx.xxx.yyy.yyy.8088: Flags [S], seq 3714847479, win 28280, options [mss 1414,sackOK,TS val 2156888755 ecr 0,nop,wscale 7], length 0
15:00:25.416995 IP xxx.xxx.yyy.yyy.8088 > xxx.xxx.xxx.xxx.41740: Flags [S.], seq 4230280871, ack 3714847480, win 28960, options [mss 1460,sackOK,TS val 823948244 ecr 2156888755,nop,wscale 7], length 0
15:00:25.417080 IP xxx.xxx.xxx.xxx.41740 > xxx.xxx.yyy.yyy.8088: Flags [.], ack 4230280872, win 221, options [nop,nop,TS val 2156888755 ecr 823948244], length 0
15:00:25.417555 IP xxx.xxx.xxx.xxx.41740 > xxx.xxx.yyy.yyy.8088: Flags [P.], seq 3714847480:3714848324, ack 4230280872, win 221, options [nop,nop,TS val 2156888755 ecr 823948244], length 844
15:00:25.417805 IP xxx.xxx.yyy.yyy.8088 > xxx.xxx.xxx.xxx.41740: Flags [.], ack 3714848324, win 240, options [nop,nop,TS val 823948245 ecr 2156888755], length 0
15:00:25.420292 IP xxx.xxx.yyy.yyy.8088 > xxx.xxx.xxx.xxx.41740: Flags [P.], seq 4230280872:4230281169, ack 3714848324, win 240, options [nop,nop,TS val 823948245 ecr 2156888755], length 297
15:00:25.420596 IP xxx.xxx.xxx.xxx.41740 > xxx.xxx.yyy.yyy.8088: Flags [.], ack 4230281169, win 230, options [nop,nop,TS val 2156888756 ecr 823948245], length 0
15:00:46.760856 IP xxx.xxx.yyy.yyy.8088 > xxx.xxx.xxx.xxx.41740: Flags [F.], seq 4230281169, ack 3714848324, win 240, options [nop,nop,TS val 823953579 ecr 2156888756], length 0
15:00:46.800503 IP xxx.xxx.xxx.xxx.41740 > xxx.xxx.yyy.yyy.8088: Flags [.], ack 4230281170, win 230, options [nop,nop,TS val 2156894101 ecr 823953579], length 0
~~~~~~~~~~~~~~~
15:03:21.948707 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156932888 ecr 823953579], length 0
~~~~~~~~~~~~~~~
15:03:22.152480 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156932939 ecr 823953579], length 0
15:03:22.356643 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156932990 ecr 823953579], length 0
15:03:22.764540 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156933092 ecr 823953579], length 0
15:03:23.583041 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156933296 ecr 823953579], length 0
15:03:25.216453 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156933705 ecr 823953579], length 0
15:03:28.492444 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156934524 ecr 823953579], length 0
15:03:35.036477 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156936160 ecr 823953579], length 0
15:03:48.124523 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156939432 ecr 823953579], length 0
15:04:14.332468 IP 10.254.0.54.41740 > xxx.xxx.yyy.yyy.8088: Flags [F.], seq 3714848324, ack 4230281170, win 230, options [nop,nop,TS val 2156945984 ecr 823953579], length 0
The evidence that all these packets belong to the same conversation is that the sequence numbers are continuous. The two marked lines show that the client sends its FIN packet more than 2 minutes after it ACKs the FIN sent from the server side. Meanwhile, the nf_conntrack_tcp_timeout_close_wait setting on the virtual machine is 60 seconds.
When the conntrack module of the virtual machine sees both the server-side FIN packet (SEQ 4230281169) and the client’s ACK to that FIN (ACK 4230281170), it marks the connection as CLOSE_WAIT, and the entry describing the connection is deleted after 60 seconds (nf_conntrack_tcp_timeout_close_wait). When the client sends out its FIN packet almost 2 minutes later (SEQ 3714848324), the packet no longer matches any conntrack entry and is marked as INVALID. The SNAT (MASQUERADE, in this case) rule of iptables depends on conntrack to remember each connection’s state, so INVALID packets are not masqueraded; such packets are simply routed to their destination with the source address unchanged. When the server receives these un-masqueraded packets, it has no information about their source address and bounces them back with the RST flag set. The client never receives any response to its FIN packets, and retries until another timeout occurs.
In this scenario, every FIN packet resent by the client causes a bounced packet.
Analyzing the code
The server xxx.xxx.yyy.yyy.8088 is a zabbix server gathering metrics from containers. An agent program resides in each application container and sends metrics data at a defined interval by making an HTTP POST request to the zabbix server. Analysis of the agent’s code showed that it does not explicitly release HTTP response objects, which leaves the underlying sockets open (or half-open, in this case). Until these objects are collected by the JVM’s GC, the sockets are not closed.
Such behavior inevitably introduces an interval between the server-side close and the client-side close; this is exactly the interval shown above between the client’s ACK of the server’s FIN and the client’s own FIN packet, and it is the root cause of this issue.
Conclusion and suggestion
The implicit release of response objects is the root cause of this issue: it introduces an interval between the server-side close and the client-side close. This interval is longer than the virtual machine’s nf_conntrack_tcp_timeout_close_wait, so the conntrack entry is discarded before the client sends out its FIN packet; that FIN is then marked as invalid and is not masqueraded. Suggestions for mitigating this problem:
- improve code quality: release such resources explicitly instead of relying on GC
- or add an iptables rule like

iptables -A FORWARD -m conntrack --ctstate INVALID -j DROP

to drop invalid packets.
Packrat parser in R6RS
Packrat algorithm
The packrat parsing algorithm is an algorithm for parsing PEGs1; it can parse a PEG in guaranteed linear time.2 There are lots of resources about PEG and the packrat algorithm at http://bford.info/packrat/. Tony Garnock-Jones published an implementation of packrat on his blog, later rewrote it in MzScheme, and on top of the parser wrote a JSON read-write library. He also wrote a document about how to use it. This packrat library is included in Chicken Scheme's egg system3 and has been ported as an R7RS library4.
A brief walk through of packrat
PEG is very similar to regexps. Here is a comparison of concepts between PEG and regexp:
|                       | PEG                     | Regexp        |
|-----------------------+-------------------------+---------------|
| atoms                 | any non-terminal symbol | any character |
|                       | terminal symbol         | $, \>         |
|                       | empty                   | empty         |
| combination operators | sequence                | string        |
|                       | ordered choice          | a \| b        |
|                       | zero-or-more            | a*            |
|                       | one-or-more             | a+            |
|                       | optional                | a?            |
Generally PEG is more powerful than regular expressions; for example, PEG can handle nested parentheses while regexps can't, because PEG is recursive. This is also why regular expressions can't process JSON data of arbitrary depth.
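As a small illustration, a PEG for balanced parentheses refers to itself, which no finite regular expression can express (written in the same informal grammar notation used in the walkthrough below):

syntax
parens = ( parens ) parens
       | empty

A string like (()()) is accepted by recursive descent through this rule, while unbalanced input like (() fails.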
My Packrat extensions
I have written some extensions to the original packrat library; they are intended to make life with packrat easier.

The code can be found here: https://github.com/ChaosEternal/packrat-extended, tested under ChezScheme.
First, I added a (? pred) syntax, a place for user-defined predicates (in packrat.ss); then I added some helper functions such as character-test functions and logic-combination functions (in packrat-utils.ss), some of which (token, json-string) were extracted from the json parser.
Using Packrat
The macro packrat-parser provides syntactic sugar to build complex parser combinators from simpler ones. The input of these combinators is a generator producing pairs of (kind . value); the parser checks the kind and takes the value. Although the generator in the example in Tony's document produces '(kind value), where kind is a symbol and value is a numeric value, (char . char) pairs are what get generated in most use cases.
Parse json step by step
The very beginning
- Generator
This generator reads from a given port and produces a stream of (char . char) pairs.
(import (ext packrat)
        (ext packrat-utils)
        (rnrs (6)))

(define (generator p)
  (let ((ateof #f)
        (pos (top-parse-position "<?>")))
    (lambda ()
      (if ateof
          (values pos #f)
          (let ((x (read-char p)))
            (if (eof-object? x)
                (begin
                  (set! ateof #t)
                  (values pos (cons #\x04 #\x04)))
                (let ((old-pos pos))
                  (set! pos (update-parse-position pos x))
                  (values old-pos (cons x x)))))))))
- Parser for Whitespace
We can start with a very simple parser which just reads an arbitrary number of white-spaces. The parser returns nothing but an eof-object when parsing reaches the end of the file.

syntax
any   = white eof
white = ws*

(define parser
  (packrat-parser
   any
   (any ((white '#\x04) (eof-object)))
   (white (((? char-whitespace?) white) 'whitespace)
          (() 'whitespace))))
- A reader function
A reader function integrates the parser and the generator.
(define (read-any p)
  (let ((result (parser (base-generator->results (generator p)))))
    (if (parse-result-successful? result)
        (parse-result-semantic-value result)
        (error 'json-read "JSON Parse Error"
               (let ((e (parse-result-error result)))
                 (list 'json-parse-error
                       (parse-position->string (parse-error-position e))
                       (parse-error-expected e)
                       (parse-error-messages e)))))))

Now we can test our first parser:

(read-any (open-string-input-port " "))
;;=> #<eof>
Going Further
- Parse comments
Comments are treated as white-spaces, so we can extend the white-space parser with comment support. In json, two kinds of comments are accepted:

- starting with //, ending with a newline, or
- surrounded by a /* and */ pair.
So the rewritten parser is:
(define parser
  (packrat-parser
   any
   (any ((white '#\x04) (eof-object)))
   (white (((? char-whitespace?) white) 'whitespace)
          ((b <- comment) 'whitespace))
   (comment (((token "/*") b <- comment-body) b)
            (((token "//") b <- skip-to-newline) b)
            (() 'whitespace))
   (comment-body (((token "*/") w <- white) w)
                 (((? true) comment-body) 'skipped-comment-char))
   (skip-to-newline (((? (inverse char-newline?)) skip-to-newline) 'whitespace)
                    (((? char-newline?) white) 'whitespace))))
We can test it:
(read-any (open-string-input-port " /*comment*/ "))
;;=> #<eof>
(read-any (open-string-input-port " //comment \n"))
;;=> #<eof>
- Parse tokens
There are three basic json tokens: true, false, null. Now we deal with these real entities.

;; In json, empty list [] and null are distinct objects,
;; but in scheme (null? '()) => #t and (list? '()) => #t;
;; this is the reason why json-null is introduced here.
(define-record-type json-null)

;; token is defined in the packrat-utils module.
(define (token str . comp?)
  (let ((cmp? (if (null? comp?) char=? comp?)))
    (lambda (starting-results)
      (let loop ((pos 0) (results starting-results))
        (if (= pos (string-length str))
            (make-result str results)
            (let ((res-token-value (parse-results-token-value results)))
              (if (and res-token-value
                       (cmp? res-token-value (string-ref str pos)))
                  (loop (+ pos 1) (parse-results-next results))
                  (make-expected-result
                   (parse-results-position starting-results) str))))))))

(define parser
  (packrat-parser
   any
   (any ((white (token "true")) #t)
        ((white (token "false")) #f)
        ((white (token "null")) (make-json-null))
        ((white '#\x04) (eof-object)))
   (white (((? char-whitespace?) white) 'whitespace)
          ((b <- comment) 'whitespace))
   (comment (((token "/*") b <- comment-body) b)
            (((token "//") b <- skip-to-newline) b)
            (() 'whitespace))
   (comment-body (((token "*/") w <- white) w)
                 (((? true) comment-body) 'skipped-comment-char))
   (skip-to-newline (((? (inverse char-newline?)) skip-to-newline) 'whitespace)
                    (((? char-newline?) white) 'whitespace))))
We can test these tokens:
(read-any (open-string-input-port " /*comment*/ true"))
;;=> #t
(read-any (open-string-input-port " //comment \n false"))
;;=> #f
(read-any (open-string-input-port " //comment \n null "))
;;=> #<r6rs:record:json-null>
- Recursive structure: List
A list is a structure which contains valid json values.

syntax
any  = list token(true,false,null)
List = [] [any] [any,+]

- white-spaces are ignored here
(define parser
  (packrat-parser
   any
   ;; list: [ json, json+ ] | []
   (any ((white '#\[ entries <- array-entries white '#\]) entries)
        ((white (token "true")) #t)
        ((white (token "false")) #f)
        ((white (token "null")) (make-json-null))
        ((white '#\x04) (eof-object)))
   (white (((? char-whitespace?) white) 'whitespace)
          ((b <- comment) 'whitespace))
   (comment (((token "/*") b <- comment-body) b)
            (((token "//") b <- skip-to-newline) b)
            (() 'whitespace))
   (comment-body (((token "*/") w <- white) w)
                 (((? true) comment-body) 'skipped-comment-char))
   (skip-to-newline (((? (inverse char-newline?)) skip-to-newline) 'whitespace)
                    (((? char-newline?) white) 'whitespace))
   (array-entries ((a <- array-entries-nonempty) a)
                  (() '()))
   (array-entries-nonempty ((entry <- any white '#\, entries <- array-entries-nonempty)
                            (cons entry entries))
                           ((entry <- any) (list entry)))))
Tests:
(read-any (open-string-input-port " /*comment*/ [true, false, null]"))
;;=> (#t #f #<r6rs:record:json-null>)
(read-any (open-string-input-port " /*comment*/ [true, [false, false, true], null]"))
;;=> (#t (#f #f #t) #<r6rs:record:json-null>)
Complete the job and going further
To complete this json parser, objects, strings and numbers also need to be handled; this can easily be done by extending the parser above. The complete json-parser can be found in my github repository; it originated from the json-parser in Tony Garnock-Jones' blog, but I have rewritten parts of it using my packrat extension and utils lib. The parser can also be extended to parse ruby serialized objects (which are similar to json).
Conclusion
Packrat is a powerful parsing technique; in this post, I sketched its structure and usage by constructing a json parser.
Footnotes:
Scheme Syntax-Case usage
Macros of The Scheme Programming Language
Scheme has a powerful macro system; in most Scheme implementations, only a few primitive forms are implemented and the others are derived using macros. For example, the R5RS standard defines 23 syntactic constructs, 11 of which can be defined using other, more fundamental ones. 1
Many macro systems exist in Scheme's world; define-macro, syntax-rules, and syntax-case are commonly adopted.
The define-macro system
define-macro, AKA defmacro, is the traditional way of defining macros in the lisp world. Macros defined by define-macro are roughly procedures transforming (expanding) an input form into another s-expression; the transformation itself is usually called "applying the macro". Generally speaking, the body of define-macro can be any expression which returns a valid scheme expression on evaluation, and the arguments of define-macro are also the arguments of that expression (if seen as the body of a lambda).
For example, this macro:
(define-macro (when cond exp . rest) `(if ,cond (begin ,exp . ,rest)))
will define a macro named when.
On applying this macro, a form like (when (> n 0) exp1 exp2) will be transformed (expanded) into:
(if (> n 0) (begin exp1 exp2))
This transformation done by the macro when is just filling out a template using the parameters of when: (> n 0), exp1 and exp2. In other words, the macro generates new code.
Because of this nature of macros, code generation, macros are very powerful tools in lisp languages as well as in Scheme.
The hygiene problem of define-macro

The define-macro system is so crude that it does not capture the bindings of symbols appearing in a macro definition; those symbols' meanings can change when the macro is applied, which makes macros behave very weirdly if used carelessly.
Consider the macro when defined in the previous section, in the scenario demonstrated here:
(let ((begin list)) (when #f 1 2))
the code will be equivalent to:

(if #f (list 1 2))
This is because define-macro only defines the form into which when is expanded, not the bindings of symbols; those symbols' meanings can be altered in different contexts. Such a problem is called the hygiene problem.
Scheme implementations which have a define-macro system also provide gensym or similar procedures, which generate fresh symbols on the fly to (partially) work around the hygiene problem of the define-macro system.
For example, this code defines a naive macro swap! which swaps the values of two variables:
(define-macro (swap! o1 o2) `(let ((tmp ,o1)) (set! ,o1 ,o2) (set! ,o2 tmp)))
If one uses this macro like (swap! tmp something) or (swap! something tmp), the symbol tmp will be captured and the result is incorrect.
This macro can be rewritten using gensym:
(define-macro (swap1 o1 o2) (let ((tmp (gensym))) `(let ((,tmp ,o1)) (set! ,o1 ,o2) (set! ,o2 ,tmp))))
In this version of swap!, the danger of unwanted symbol capture no longer exists.
The syntax-rules system
syntax-rules is a solution to the hygiene problem. Since it is adopted in R5RS, it is widely implemented across Scheme implementations. syntax-rules uses pattern matching to construct macros. With syntax-rules, the macro when can be re-written as:
(define-syntax when (syntax-rules () ((when cond exp1 ...) (if cond (begin exp1 ...)))))
The syntax of syntax-rules is (syntax-rules literals (pattern template) ...); the macro replaces every matching pattern with the corresponding template.
The major difference between syntax-rules and define-macro is that syntax-rules guarantees hygiene; in fact, every symbol occurring in a template has a definite meaning: if the symbol is matched in the pattern, the pattern value is captured; otherwise the binding of the symbol in the defining environment is captured. Thus these symbols' meanings remain the same wherever the macro is used; this is macro hygiene.
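To see the contrast with the define-macro version, we can replay the earlier begin-rebinding scenario against the hygienic when (a quick check one can run in any R6RS Scheme):

```scheme
(define-syntax when
  (syntax-rules ()
    ((when cond exp1 ...) (if cond (begin exp1 ...)))))

;; Rebinding `begin` at the use site no longer corrupts the expansion:
;; `begin` inside the template keeps its meaning from the definition
;; environment, so this evaluates (begin 1 2), not (list 1 2).
(let ((begin list))
  (when #t 1 2))
;; => 2
```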
syntax-rules is very useful, but it also has limitations2, 3: it provides no way to compute over the templates, and there exist cases which are harder to write with syntax-rules than with define-macro.
For more details about the syntax-rules system, this article4 is very useful.
The syntax-case system
The syntax-case system is somewhat similar to the gensym approach, but taken to an extreme. It has the same power as define-macro yet retains hygiene.
Using syntax-case, the macro when can be rewritten as:
(define-syntax when (lambda (stx) (syntax-case stx () ((_ cond exp1 . exp2) #'(if cond (begin exp1 . exp2))))))
Here, the form #'(if cond (begin exp1 . exp2)) is the counterpart of `(if ,cond (begin ,exp . ,rest)) in define-macro; they have similar structure. The only difference is that the body of syntax-case should return a syntax object, not a mere list as define-macro returns. But in syntax-case, the symbols if and begin capture their bindings from the defining environment, while cond, exp1 and exp2 are replaced with the arguments when the macro is applied, since they appear in the pattern form of syntax-case. This is how hygiene is achieved.
The macro swap!:
(define-syntax swap! (lambda (stx) (syntax-case stx () ((_ op1 op2) #'(let ((tmp op1)) (set! op1 op2) (set! op2 tmp))))))
The expanded result (using expand of ChezScheme: (expand '(swap! var1 var2))):
(let ([#{tmp cvbrbh0gnpwjvy0dl1m1nb-1} var1]) (set! var1 var2) (set! var2 #{tmp cvbrbh0gnpwjvy0dl1m1nb-1}))
This shows that hygiene is achieved automatically, in the same way as gensym.
Syntax objects are a special type defined in Scheme implementations that support the syntax-case system. They can be converted to ordinary Scheme objects on demand, and when one is returned by syntax-case, the macro expansion is complete.
The following example shows a more complex usage of syntax-case; upon expansion, the macro expands to the n-th fibonacci number:
;; this macro uses an iterative approach
(define-syntax fib
  (lambda (x)
    (syntax-case x ()
      ((fib n) #'(fib n 1 0))
      ((fib 0 num2 num3) (syntax->datum #'num2))
      ((fib num1 num2 num3)
       (with-syntax ((n (datum->syntax #'fib (- (syntax->datum #'num1) 1)))
                     (n2 (datum->syntax #'fib (+ (syntax->datum #'num2)
                                                 (syntax->datum #'num3))))
                     (n3 (syntax num2)))
         #'(fib n n2 n3))))))

;; this macro is recursive and much slower
(define-syntax fib2
  (lambda (x)
    (syntax-case x ()
      ((fib2 0) #'1)
      ((fib2 1) #'1)
      ((fib2 n)
       (with-syntax ((n2 (datum->syntax #'fib2 (- (syntax->datum #'n) 1)))
                     (n3 (datum->syntax #'fib2 (- (syntax->datum #'n) 2))))
         #'(+ (fib2 n2) (fib2 n3)))))))
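As a usage sketch (the values follow from tracing the expansion by hand; both are computed entirely at macro-expansion time):

```scheme
(fib 5)   ;; expands through (fib 5 1 0), (fib 4 1 1), ... to the literal 8
(fib2 5)  ;; also 8, via recursive expansion into nested (+ ...) forms
```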
The syntax-case system is so powerful that even the syntax-rules system is just a derivation of it5.
In the recent Elixir6 release (1.2), a with7 clause was introduced to provide guarded execution of code; the same effect can easily be achieved in Scheme using macros:
(define-syntax and-let*-values-check
  (lambda (stx)
    (syntax-case stx (else)
      ((kw (?bind0 ?bind1 ...) ?body0 ... (else ?expr0 ...))
       (syntax-case #'?bind0 ()
         (((?b0 ?b1 ...) ?exp)
          (let* ((?check-b0 (car (generate-temporaries (list #'?b0)))))
            #`(let-values (((#,?check-b0 ?b1 ...) ?exp))
                (if (equal? #,?check-b0 ?b0)
                    (kw (?bind1 ...) ?body0 ...)
                    (begin ?expr0 ...)))))))
      ((_ () ?body0 ... (else ?expr0 ...))
       #'(begin ?body0 ...))
      ((kw (?bind0 ...) ?body0 ...)
       #'(kw (?bind0 ...) ?body0 ... (else #f))))))
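A hypothetical usage sketch (my own example, not from the original; by tracing the expansion I expect it to evaluate to 3): each binding compares the first returned value against an expected value, falling through to the else branch on the first mismatch:

```scheme
(and-let*-values-check
    ((('ok a) (values 'ok 1))
     (('ok b) (values 'ok 2)))
  (+ a b)
  (else 'failed))
;; 'failed would be returned instead if any expression's first
;; value were not equal? to 'ok
```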
Conclusions and more
In Scheme's world, the word 'syntax' mostly refers to two things: one is the lexical structure of a program, defined by the reader (usually s-expressions, or converted to s-expressions by the reader); the other is the semantic structure of a program, which defines how the program is executed. Due to the power of s-expressions, the first meaning of syntax is insignificant compared to the second. And with the macro system, one can easily extend the syntax of Scheme, just as shown in 5. The last example of the previous section even shows how to borrow syntactic sugar from other languages. Yes, syntactic sugar: as the name implies, it is not essential, just sugar that makes life happier. Even type systems can be defined using macros8.
Footnotes:
My new blog
After keeping silent for many years, I am starting a new blog.
Hello, twisted spacetime
This is my new blog system, which is still under construction. In this space I will talk about Scheme, Linux, and related topics; these posts will record my adventure into Schemer's land.
Build of this site
To integrate all of these, the following tools are used:
- emacs
- org-mode
- org-static-blog
- language-tool
- emacs langtool
- git
and more.