Plan for dynamic types in HamsterSpeak

Jump to: navigation, search

Plan: a massive new version of HamsterSpeak which is a superset of the existing language. Discussed here specifically, to turn HamsterSpeak from a typeless to a dynamically typed language and make use of this architectural improvement by introducing various data structures.

You should be able to store objects of any type in any variable, including local variables, global variables and special data fields like NPC extra data.

Requires Plan for new save format to be completed first.

Some possible types (a kind of goals overview)[edit]

  • integer
  • float
  • string
  • array
    • A sequence type. 0- or 1-based? Most probably 0-based: most other things except strings are. Allowing a flexible lower bound in BASIC has been useful in the OHR source, but seems like a bad habit to be kicked.
    • Dynamically expanding would be nice, but what about out of bounds checking?
      • Expand-on-write was only suggested because of Moogle's PHP snippet. Maybe dynamically expand on write, bounds-check on read?
      • Maybe it is best to always bounds check arrays, and leave dynamic expanding for associative arrays
        • Even if arrays cannot be dynamically (automatically?) resized, there could still be a "append" command, or a hamsterspeak equivalent of basic's "redim preserve"
      • If arrays don't expand on write, that implies notation to specify initial size should be given
        • MATLAB has 'a = zeros(5)' and 'b = ones(10, 10)' or 'b = ones([10 10])' and 'c = zeros(size(b))' to create a new array of same size as existing
          • MATLAB size function would have been useful, but it would only handle multidimensional arrays, not an array of arrays.
        • Could use 'a := [10, 10]' for new two element array and 'a := array(10, 10)' for new 100 element blank array
    • Would also like efficient multidimensional arrays
      • Ability to access tiles using a special 2D array, and maybe pixel data for sprites in the same way?
        • 2D array access to graphics would probably require a copy-on-write optimization for the sprite cache.
  • associative array
    • ie (hash)map or dictionary. Could allow strings and integers as keys. Using composite types as keys would have some nice uses but might overcomplicate implementation.
  • opaque type
    • (like PHP resource type)
    • (To the interpreter, there would only be one 'opaque type', each object would carry its own subtype information)
    • For things like menu item handles, to improve type checking. Invalid handles could not be created by the users (though they could easily become invalid later), and resources could actually be automatically freed when no more handles to them exist in any variable!
    • Would also be great for debugging
    • Only want to support :=, == and <> operators
      • How do you implement == and <> anyway?
    • Each object should also be usable as a truth value for statements like 'while (handle := next sibling (handle))'. In fact, all opaques could count as true, and integer 0 returned for false.
      • Alternative ideas: Or perhaps we could have a single special false/null/error (opaque) object for all of these, or each object may be either true or false
  • weak type
    • An integer which knows what it's meant to mean, similiar to opaque type. Mostly to assist debugging. Could be returned from NPCreference and so on
    • Supports all operations an integer does, and should be considered an integer for type checking.
    • We could optionally throw a warning if a weak type of unexpected type is recieved, or an integer instead of a weak type. Optional because this will occur legitimately.
    • Math operations on it should probably 'clean' it into an integer?
      • And throw an (optional) warning
  • user defined type/struct/class
    • What on earth should the syntax for this be?
    • Use of sequences indexed with constants gets pretty messy in Euphoria
      • Is it worthwhile to keep working in euphoria? A reimplementation in another language (FreeBasic? Python?) might not be as much work as it first appears.
    • Python of course implements objects as dictionaries (with easier syntax)
  • If references to other variables are implemented, they might need a type
  • Temporary objects created during array manipulation might need a type

Language Design Details[edit]

What to do about strings[edit]

Having to keep supporting existing strings (and "plotstrings"), we will (probably!) want to rethink existing string commands to support both types. Of course it might be an option to decide that old style strings and new style strings are not compatible, so-dont-expect-them-to-do-the-same-thing: either by duplicating commands or giving each command two different behaviours.

It might help to think of the integers 0 - 31 as string references instead of strings, depending on how references will work, if at all.

Existing commands returning strings take the string ID of the output, if there is one, as an argument. With typing, they should obviously return a string directly

get hero name (0, hero:Bob)
name := get hero name (hero:Bob)

How do we fix this?! We could create a second version of each command, such as

name := get hero name$ (hero:Bob)

Making the first argument for each commands optional probably wounldn't work, since some of them already have optional arguments

Existing string commands take string IDs as input. We could change these to accept either old style integer handles or strings as arguments. Consider a command which modifies its argument, like delete char (ID, position). A few thoughts:

a := "foo"
$5 = "bar"
#first possibility
delete char (5, 1)
delete char (a, 1)

Here a is passed by reference because it is a string, but passed by value if it's an integer.

#second possibility
delete char (5, 1)
a := delete char (a, 1)

Here string commands return a new string if they are passed a string instead of a 'string reference'...

delete char (5, 1)
delete char (@a, 1)

Now we are supposing that it's possible to explicitly take a reference to a string. (And you may or may not allow 'a := delete char (a, 1)')

#third possibility
delete char (5, 1)
a := delete char$ (a, 1)

Duplication of each existing string returning command.

Modifying objects vs. rebinding variables[edit]

References to objects and passing by reference[edit]


Arrays and associative arrays will be indexable using a [] operator and iterable using a new for each construct.

We could provide an array command for constructing arrays (eg a := array(1, 2.0, array(3), array()), but it would be a good idea to provide quicker syntax, such as

a := [1, 2.0, [3], []]
g([a, a[2]])[1]
  • Would this create ambiguities for HSpeak?
    • Bob the Hamster: Hspeak is currently horribly over-lenient about allowing punctuation characters in names. a script name or variable name is currently allowed to contain brackets-- fortunately I am pretty sure that nobody actually does such a silly thing just because I was silly enough to make it possible.
    • The Mad Cacti: It is ambiguous!! Consider f(a[1]) and f(a,[1]). Both would be processed by HSpeak (assuming [ and ] are split like ( and ) ) to
It looks like we're going to have to change the way lexemes are split, see down the page at #Lexer changes for arrays. Or we could change to some other syntax for constructing arrays, though I don't see much point.

(Details of multidimensional arrays and much else to be added here)

Array operators[edit]

A membership test operator is required. What to use?

Multiple indexing?[edit]

Could allow multiple indexing to build a new array from an old one, eg.:

a := [10, 11, 12]
a[1, 1, 2] == [11, 11, 12]

However, this might be best accomplished with a built-in function instead of a new way to use the index operator:

pick from (a, [1, 1, 2]) == [11, 11, 12]

IIRC, in MATLAB this is done by indexing a matrix with a matrix:

a := [[10, 11], [12, 13]]
b := [[0, 0], [1, 0]]
a[b] == [10, 12]

But that's incompatible with the first proposal, unless the ability to do multidimensional indexing is dropped:

[10, 11, 12][1, 1, 2] != [10, 11, 12][[1, 1, 2]] == [11, 11, 12]

Element-wise arrays operator?[edit]

Another low priority feature.

In Euphoria and matrix based languages like MATLAB, you can perform all the normal arithmetic operations on each element of two arrays, producing an array.

Eg. in Euphoria:

{1, 2, {3, 4}} + {5, 6, {7, 8}} = {6, 8, {10, 12}}

MATLAB has a special version of each operator, eg matrix ^ integer is matrix exponentiation, matrix .^ integer is element-wise exponentiation.

We could follow numpy's and euphoria's approach of not having special element-wise operators. Would need to use some other operator to, for example, concatenate two arrays instead of using +.

Associative arrays[edit]

What about a quick syntax for associative arrays? Say, something similiar to Python's syntax

a := {}
b := {3 = "3", "4" = 4, 5.0 = a}

Also, what about using associative arrays as sets? And multimaps and multisets?

Getting closer and closer to Python...

User data structures[edit]

Possibly! Would like to fit into RELOAD somehow.

I like the idea of prototype-based instead of class-based objects, it might actually simplify things.

New language constructs[edit]

We can support many of our favourite features found in other languages!

for each[edit]

A way to loop over arrays/associative arrays. Unfortunately it would be a unreasonable to add 'in' as an operator as in python, so we might have to resort to the same ugly syntax that for uses:

for each (variable, array) do ( ... )


variable (x, i)
for each (i, [1, 2, 3, 4]) do (x += i)

Not sure about supporting an iterable interface on objects. For simplicity, perhaps we could add no such feature to the language, yet possibly use iterators internally for efficiency.

Returning multiple values[edit]

Returning multiple values can be achieved by returning an array and unpacking the values by assigning to an l-value array (where an array of l-values is an l-value). Maybe associative arrays too!:

return([1, 2, [3, 4]])
[ foo, bar, [qux[1], qux[2]] ] := myscript()

Replace runscriptbyid with function call operator syntax?[edit]

We could conceivably add a function call operator to replace runscriptbyid, and have it support both scripts and built in commands (and methods on objects?). In languages like Python there is no difference between calling a function and a function reference, but in HS these will have to be subtly different due to the language's weirdnesses.

Obtain a function reference with the @ 'operator' as normal, but extend it to built-ins:

funcref := @walk npc to x
funcref2 := @current map

The function call operator appears as you would expect:

funcref(npc, 20)
foo := funcref2()

However, brackets when calling a script/function without arguments are optional (and were not allowed at all in old HSpeak versions!), so the function call operator syntax needs to be handled carefully. Brackets are always required. In addition, to call a function reference returned from a function call without arguments, you would have to write foo()() or (foo)(). See #Lexer and parser changes for function calls for details.

For backwards compatibility, function references might have to be weak types (so that script id numbers can still be retrieved as integers using @)

Variable number of arguments?[edit]

Originally intended years ago! A way to have extra arguments placed in an array bound to a special argument. Don't know what syntax to use for this. Maybe best to copy Python!

script, foo, a, b, c = 0, *d ()
...inside foo we have: a=1, b=2, c=3, d=[4,5]

Array unpacking?[edit]

As a dual to the above feature, Python lets you unpack tuples into argument lists. We could allow the same (again, syntax undecided) but it probably doesn't have enough utility to bother with, over other possible features.

foo(a, b, *[c, d, e])
equivalent to
foo(a, b, c, d, e)

Implementation Details[edit]

I will be happy to work these out myself to achieve the design. The Euphoria source is most helpful.

C with GNU extensions is probably the best language to use for this sort of project, so that's what I've been using.

Storing dynamic types in variables[edit]

After reading through the Euphoria interpreter sources, I've written some code for storing an object of arbitrary type in a 32-bit integer. You may have a look at the messy proof-of-concept source (SVN link below), mostly interpreter.h and datatypes.c. However, I'm a bit unhappy with the encoding; may move to using tagged pointers instead (this might require nasty 8 or 16 byte alignment!)

The scheme implemented supports 32 bit integers, yet stores most integers in a single 32 bit word without memory overhead, which I could not find in other VMs.

Garbage collecting (via reference counting?)[edit]

CPython implements garbage collection using reference counting. In CPython 1.x, circular references cause memory leaks, and Python programmers are meant to know how to avoid them by manual destruction or use of weakrefs. CPython 2.x adds a generational garbage collector, enabled by default, which periodically scans for dead cycles and attempts to delete them. However it still can't free all dead cycles. See [1], [2]. Note that we might be able to, if required, use a similiar scheme to CPython except maintain 'gc_refs' at all times instead of sweeping the set of containers.

Euphoria also uses reference counting and does not leak memory because reference cycles are not possible. How clever.

Whether circular references will be possible in HS depends on the answers to the above design questions.

Allowing dynamic types in Custom[edit]

To let users pass values of any type in Custom as script arguments we need an 'value' editor and a lump to store them so that they don't need to be accomodated in existing lumps.

Currently you can only pass 16-bit integers to scripts. We would either need to change existing lump formats to change this to 32-bit values, or use a different format for packing values than in-game.

This is low priority and can be completed later.

Saving dynamic types in Save files[edit]

Plan for new save format

Associative array implementation[edit]

B-tree/B+tree vs. hashtable. PHP apparently uses a hashtable, yet manages to record order

Iterators/Builtin Arrays[edit]

In addition to user defined arrays, global variables could be treated as a special array, and proxy array objects could be returned by commands, to allow direct access to things such as tilemaps, and sprite slice pixel data. These would usually be of restricted type and not extendible. This might turn out to be a lot of extra work

To implement these, the interpreter could have an internal concept of an iterator, even if we do not add iterator objects to the language, or much simpler, something like a numpy array which uses strides to calculate element address.

Changes to HSpeak[edit]

Lexer changes for arrays[edit]

In order to be able to distinguish indexing from array construction (see #Arrays) HSpeak might have to become picky about commas. It could explicitly keep track of whether a comma or newline occurs before a [. If one doesn't, it inserts a special indexing operator. [ and ] can then be grouped in the same way as begin and end, and the indexing operator will behave as others.

However, HSpeak also automatically inserts commas around (, ) and operators. Operators would have to be treated differently from parentheses due to statements like a := [3]. So implicit commas due to operators prevent indexing, those due to parentheses don't.

-> a,[,4,]

   a := [1, 1, 1]
-> a,:=,[,1,1,1,]

   f(b)[4, 5][3]  #multiple indexing in the air
-> f,(,b,),<index operator>,[,4,5,],<index operator>,[,3,]

-> f,(,b,),<index operator>,[,4,]

However, when user defined arrays are designed, the requirements for HSpeak might change again.

Lexer and parser changes for function calls[edit]

See #Replace runscriptbyid with function call operator syntax?

Currently, using parentheses to build the AST is the next thing HSpeak does after tokenising and scanning for variables. When HSpeak finds an empty argument list it ignores it, and an empty pair of parentheses are only allowed after something that takes_args. That would have to change:

The easiest solution could be to have the lexer insert <call operator> where it finds parentheses after something without a comma inbetween, but possibly a newline, or where ',begin' is used instead of '('. Then have the parser automatically absorb those for normal script/function calls and flow contructs (in get_script_cmd). Currently, f,() is identical to f(). We should kill this leniency in the parser while we're at it (but continue to allow newlines and ', begin'). Therefore, <call operator> would be required even for normal calls, or raise an 'ugly code' error.

Any <call operator>'s not absorbed after normal takes_args tokens would be left in, and then later converted from infix operator to a flow control construct, and checked for correctness and processed in normalize_flow_control as normal.

-> f,<call operator>,(,b,)
-> {f, {b, {}}}   # AST

   then, begin
-> then,<call operator>,(

-> f,<call operator>,(,),(,)
   # error during parsing: "found empty parentheses not associated with a function call"

-> f,<call operator>,(,),<call operator>,(,)
-> {f, {}}, {<call operator>, {}} {"", {}}  # AST, parentheses are translated into commands with the empty string as the name
-> {<call operator>, { {f, {}}, {"", {}} } } # convert operators
-> {<call operator>, { {f, {}} } } # flow control logic: do a cons, extending argument list with second argument

However, this is pretty hacky, it would be better to implement function calls in the following proposal, in which case gather_local_vars, compile_commands, get_script_cmd, get_cmd_depth would probably be rewritten:

Architecture improvements[edit]

As a more extreme proposal, HSpeak could be largely rewritten to include a proper tokenizer. Instead of throwing strings around, it would translate them as soon as possible to kind,value pairs. Keywords, operators, integers, (floats?) and strings would be handled quickly, while variable/script/function names would be decoded during parsing. I expect this would dramatically improve compile time and speed implementation of features, though I think we should be able to do everything by continuing to hack around the problem. Euphoria compiles HSpeak quicker than HSpeak compiles plotscr.hsd!

Providing a debugger[edit]


Some ideas about including debug information in .hs files was discussed at Talk:Henceforth VM#Moving Forward

See especially Plan for sane script error reporting

Implementation Plan[edit]

It might be worth having another look at the core of Euphoria's interpreter, written in C and now GPL, to see what can be adapted. It has many similarities, but features will have to be removed and added, and in the end probably every opcode would have to be modified anyway for the differing list of types (especially to use 32 bit instead of 31 bit integers!).

The roughest of outlines (and we should use a branch for once instead of the hopeless goal of keeping the trunk stable throughout the whole implementation):

  • Design the language! (this is where work has been hung for a year)
  • Design a bytecode and write an interpreter for it (in C) (well progressed)
    • Use FB strings for the string types
  • Add support to HSpeak for new syntax
  • Write an HSZ to bytecode translator (in C) (in progress)
  • Combine interpreter and Game
    • Write wrapping code to let Game call the interpreter, and let the interpreter call builtin functions in Game
    • Game's script loading code could be preserved
  • Replace the existing interpreter, and disable the script debugger
  • Make changes to existing builtin commands as required
  • Add new commands to make use of types
  • Write a new graphical debugger (though there'll surely be a crude one added earlier)
  • Testing and merging

SVN repository for the interpreter until it is actually used by the OHR and migrates into a branch of the OHR SVN repository or git: svn://

Possible Code Examples[edit]

variable (myVar)
myVar := 1
myVar := "Now it's a string"
myVar := 0.5
myVar := {}
myVar["key"] := "Now it's an associative array!"
myVar[781369172] := "(integers and strings as keys)"
myVar := []
myVar += "Now it's a dynamic array!"

Related plans[edit]