Let's be friends?

Sep 10, 2008 at 12:24 AM
Hi there.

I have been exploring how to make an ambitious compiler framework of my own for the past year or two. I haven't really come to a conclusion yet but I would like to reach out and meet people doing similar things. Well, what I'd really like to do is find someone that wants to do the same thing as me, and combine my project with theirs, or share code. There is a chance that you might be that someone, depending.

My framework is called Loyc--Language Of Your Choice. I envision Loyc to be an extensible .NET compiler framework that will support multiple languages simultaneously: for example, you would be able to compile a Foo.vb, Bar.cs and Baz.boo into Fubar.dll. I want it to be a basis for language extensions. This means that people will be able to modify the syntax and semantics of C#, boo, or other supported languages with new constructs and features, such as

- an embedded DSL
- new statements (e.g. you could add "until" and "unless" statements to C#)
- "virtual" data types (which would be transformed into something else at compile time)
- code analysis tools

Ideally it will be easy to combine new language features written by different people.

I imagine that Irony could be a basis for Loyc by providing one possible parsing mechanism (I would like people to be able to choose any parsing system they like), a standard AST model, and a standard back-end. However, I may have some concerns about Irony's design and I wonder if you'd be open to design changes at this point.

Have a good day, sir!
Coordinator
Sep 10, 2008 at 4:57 AM

Well, I don't mind being friends with anybody. Nice! - you already have a name for your framework... then what's the problem? I don't quite understand what is your reservations about Irony. It is an open source project, so if you don't like something, you can always take the source and change it to your liking. I can't promise I would quickly respond to any request for new feature or algorithm, but I welcome everybody to contribute any pieces that can be useful for Irony's target audience. I specifically don't insist on exclusive use of LALR parsing algorithm, I'm planning to implement some alternative myself, and would welcome other parser implementation. In fact, the main purpose of latest refactoring of Irony was to enable using different type of parsers.

So, make your suggestions/requests for improvements, and we'll see...

thank you

Roman

Sep 12, 2008 at 5:22 AM
I know I can change the source to my liking, but I'm simply saying I wonder if you'd consider design changes, so that I wouldn't have to fork the framework.

I haven't reviewed the framework in detail; in fact for the most part I have no idea what I'm looking at... but one of the most important things to get right is the AST. Because working with an AST is central to any compiler, it deserves a lot of attention. I would like to see an AST design that is very general, so that it would be appropriate for just about any language and parser. Indeed, I would like an AST design that you take out of the framework and use almost by itself. But your AstNode class has references to some things that are Irony-specific. There's a reference to BnfTerm, and mentions of EvaluationContext, which in turn references LanguageRuntime. What are those last two things? I'd also like to understand the rationale for NodeArgs.

I'm concerned about the size of the AST. I would like to make a framework that is quite scalable--specifically, that can scale down to small devices like OLPCs and Smartphones. ASTs can easily suck up huge amounts of memory (from the perspective of a low-power device), and it looks like the AST in Irony is far from frugal. Each node contains a List<AstNode>, for example, even leaf nodes, and it uses 4 integers to describe its location in the source code when two could suffice.

Actually, although I picked the same name for my AstNode class, the APIs are so vastly different I wonder if they can be reconciled.

I would be interested to learn about how you store type information and manage scope information, two problems I haven't thought of solutions for yet. I guess it would be easier if I was designing a compiler specifically for a well-defined, closed language... but when I try to think about what is needed for an extensible compiler, my mind draws a blank. A related issue is the basic question of what kind of type system to have, and I don't even know where to look for ideas.

By the way, I can't compile the unit tests in SharpDevelop because the unit test framework is Microsoft-specific; do you know if that framework comes with Visual C# Express?

P.S. Is CodePlex's email notification of replies working for you? Cuz it didn't for me :(
Coordinator
Sep 12, 2008 at 4:25 PM
Edited Sep 12, 2008 at 4:52 PM
About AST nodes. AstNode class references a lot of stuff, and this information is used for implementing its intended functionality. You can try making it slimmer, but the problem with this - the slimmest class of all, "Object", is the most universal and slimmest thing, but it is almost useless, except as universal reference type, blackbox, for temporary holding a ref to an object which ultimately will be downcasted to be useful for its intended purpose.
If you have an alternative AST node hierarchy - you can transform Irony's tree into your tree using simple node visitor.
The size of AST - as latest trends show, saving a byte here and there is not a proper way to optimize the app. Micro-optimizations bring almost no effect compared to macro- (large-scale) efficiency of the code. App performance suxx most of the time not because of micro-inefficiencies, but because it does many things over and over again, on a big scale, and replicates data again and again in different parts of the code jungle. This is my experience. Memory becomes cheaper and more available even on handhelds and embedded devices, so worrying about single bytes produces only negligible effects. Worse, it can complicate the code so much that it may result in large-scale inefficiencies.
Storing type information, scopes, etc - not much there for now. Code analysis is in beginning phase.
Unit tests - I switched from NUnit to VS test framework, just for convenience of the majority of downloaders - no extra installs, all in VS. Sorry, have no information on Express versions or SharpDevelop.
CodePlex notifications work fine for me, I"ve got notified about your post.
Overall - Irony is just a tool, very powerful indeed, but it is not solution to all world problems. If you try to build all-world-problems-solver, Irony can help but only with the subset of things that it is built to solve. The rest is on your to-do list.
Respectfully
Roman
Sep 27, 2008 at 8:43 PM
Hi. Sorry about the slow reply; I've been busy catching up with work (at work).

About AST node size. Of course I'm not worried about single bytes. I'm worried about how big an AST is to represent a moderately large program--say, 1 MB of source code. If such a program consists of 300,000 tokens and 300,000 AST nodes, and each of each requires 150 bytes total, then holding all that in memory would require nearly 90MB. Each integer or reference you remove from an individual node would save 2,400,000 bytes, and an empty List<object> costs about 28*600000=16,800,000 bytes.

Careful AST design might cut the node size in half, and I wouldn't call 45 MB a micro-inefficiency from the perspective of a cheap laptop. I think that Moore's law is changing, as a global median, from technology getting faster and higher-capacity to merely getting cheaper, and I'd like, eventually at least, to target cheap devices.

Mind you, I don't actually know the node density of a typical program... maybe it's less... but in case it's large, I want a design that can be optimized if necessary. They say premature optimization is evil, but unfortunately some optimizations become difficult or impossible as a code base goes into widespread use. For example, the .NET framework people decided in their design of IEnumerable to require two relatively expensive interface calls (MoveNext and Current) for every iteration, when one call would have been enough. Of course, this optimization is out of the question for Microsoft; it could only have been done near the beginning. Likewise if Irony or Loyc becomes popular, changing any public interface becomes painful, so any optimization that you might like to implement "someday" should be supported by the public interface now.

Mind you, unfortunately it has to be a trade-off--to possibly optimize space you may have to sacrifice some speed. For example, I think that it should be possible for somebody to implement a fixed-size AST node class, that has, for example, exactly two children, or even no children. The current design doesn't allow this, as ChildNodes is a field of type List<AstNode> (well, AstNodeList, splitting hairs). It would be more flexible if ChildNodes were a property that returned IList<AstNode>; a hypothetical fixed-size node class would implement IList<AstNode> directly and its ChildNodes property would simply return this. This is somewhat inefficient speed-wise though; better speed would be obtained by accessing a virtual indexer and Count property on AstNode itself. I guess it sounds like I'm focusing on micro-opimizations again, but I hypothesize that the AST interface (like IEnumerable) is one of the most heavily-used interfaces in any compiler, and it could even become an ad-hoc standard across many compilers--but only if it is well designed. Plus, it's one of the hardest things to change once it is in widespread use. Therefore, in my view, it deserves to be designed more carefully than almost everything else.
If you have an alternative AST node hierarchy - you can transform Irony's tree into your tree using simple node visitor.
I realize that, but it would be nice to avoid that complication, don't you think?

My original AST design actually called for nodes with multiple child lists. For example, to represent a function definition node like

void Foo<T> (int x, int y) { ... }

The node would actually have three child lists: a list of generic type parameters, a list of regular parameters and a list of statements in the function. I'm already happy to dump that design in favor of Irony's one-list-per-node approach, so that the function above would be represented instead by an AstNode with three children (or four--one more child for the return type), and each of the three children would be another AstNode with its own list of children.

The point is, maybe we have substantially different ideas for what an AST node should have in it, but maybe we could compromise and come to an agreement. At least we could use a common base class or something. And perhaps we could put out a small library with core features that are common to most compiler tools.

I just tried loading Irony in VC# Express 2008 and the unit test project would not load ("unsupported project type"). The project loads in SharpDevelop but the unit test library dependency is not available.
Coordinator
Sep 30, 2008 at 4:55 AM
Edited Sep 30, 2008 at 4:57 AM

Hi
You write:
   ...maybe we have substantially different ideas for what an AST node should have in it, but maybe we could compromise and come to an agreement. At least we could use a common base class or something. And perhaps we could put out a small library with core features that are common to most compiler tools.

I don't mind improving Irony's base classes, and making it better for use in compiler implementations. But it's not enough to simply say "let's remove this property in this object" - you need to see the changes and (possibly) complications in the code, both in the Irony core and in solutions based on it which are result of the change. What I would like to see is a solution or sample that shows deficciencies of current Irony's features.
I think your worries are just the case of "premature optimization". I've looked at many other compilers and language implementations, and I think Irony does many things much better than many of them. So I suggest you to start with a real sample, and lets go from there.
thank you
Roman
 

Coordinator
Sep 30, 2008 at 5:13 AM
Edited Sep 30, 2008 at 5:14 AM
On another subject, about child nodes. There is a good reason to always keep all sub-nodes in ChildNodes list. Even if you have dedicated fields for specific parts (like in your example for proc node - parameters, statements, etc) - even in this case ChildNodes should contain all of them. The reason is because it is very convenient for iterators/visitors, used by code analysis routines - all child nodes of the node are in one place, so visiting the entire tree in analysis phase is trivial. Look how specific nodes implemented in Irony (in AST namespace) - the specific nodes have dedicated fields for each specific child, but still put all children into ChildNodes. So code analysis visits all nodes using simple algorithm. 
Roman