Talk:Plan for dynamic types in HamsterSpeak


The Mad Cacti: Replies to some comments which don't really need to clog up the plan:

  • 2D array access to graphics would probably require a copy-on-write optimization for the sprite cache.

I expect copy-on-write will be used for most/all objects, but copying the sprite buffer to create a new sprite would be different from this (a special case?). I suspect function pointers will be needed...

  • Use of sequences indexed with constants gets pretty messy in Euphoria
    • Is it worthwhile to keep working in euphoria? A reimplementation in another language (FreeBasic? Python?) might not be as much work as it first appears.

I didn't mean that it would be annoying to implement in Euphoria, I meant that Euphoria shows that syntax for real structs is much nicer than just using arrays and constants. I don't actually think any of this would be too difficult to add to HSpeak as long as we choose a simple way to deal with [] (but it will surely mean HSpeak getting even messier). I am annoyed with how long it takes to compile though, and couldn't see any obvious improvements with the tracer.

  • Bob the Hamster: Hspeak is currently horribly over-lenient about allowing punctuation characters in names. a script name or variable name is currently allowed to contain brackets-- fortunately I am pretty sure that nobody actually does such a silly thing just because I was silly enough to make it possible.

Though Fyre likes to point out that when exporting .hsi files [ and ] are stripped from names

Bob the Hamster: I think changing the way that symbols are lexed would be great. The only reason that hspeak currently inserts commas all the heck over the place is because back when I originally wrote it, I didn't know any better. Making [brackets] and operators not do that would be great. The only place that comma-flexibility is really important for backcompat is for (parenthesis) (er, I mean ,(,parenthesis,),)

The Mad Cacti: Backcompat? I'm confused, what are you talking about? Not improving the code, it seems, but reducing what HSpeak allows? Would you like to see a difference between foo(a,b) and foo(a,,b)? I guess I don't remember seeing anyone use too many commas.

After observing other people leave out commas around parentheses, I'm guilty of sometimes doing the same as a joke.

Bob the Hamster: I guess what I mean is that making foo(a,b) differ from foo(a,,b) would probably break amazingly few scripts since nobody ever really does that even though it is allowed. But making if(condition), then, begin differ from if(condition) then, begin would break a huge number of scripts, since we have widely circulated example code with both comma styles. Of course, it is not like we need to change the way commas are parsed. [brackets] can be parsed in a smarter way without any modifications to the way commas are parsed.

NeoTA: This is getting to be so ambitious, I must suggest you use git somehow, because it makes branching cheap and easy, versus svn (in which branching is slightly less painful than walking across hot coals barefoot). IMO developing major features in a separate branch is important for avoiding build brokenness and inappropriate side-effects on people who aren't developing on it. Git also addresses the problem which has come up for OHR before with committing things 'to have them recorded' before they were really properly tested, and needing to revert those changes a few days later. (All commits are local, since git is a distributed version system; you then have plenty of time to review and revise them before pushing to the 'master' repository (if you're an authorised developer) or making a patch with 'git format-patch' and attaching it to a bug report.) provides a really good synopsis. One thing I have really appreciated in my development of NOHRIO (release due, btw, once normal internet service is reconnected here and I can update the gitorious repo with local changes) is the ability to commit very selectively using git add -i (interactive), i.e. select chunks of a changed file to commit, so you can carefully divide commits meaningfully rather than messily finding you committed some changes that are completely unrelated to your log message.

I'm willing to set up a git mirror of ohrrpgce SVN on if people think this idea is worth looking at... that would make it easy to play around and see how it would work without expending any significant effort. (I've already got a working local import of ohr SVN on my normal PC, that's why I'm confident here.)

Bob the Hamster: While I would like to keep the master repository in subversion as it is now, I think it would be great to provide a svn<->git bridge. If you do that, add it to the source page.

NeoTA: done. Notably the git page does not yet describe how to branch (git branch mybranch;git checkout mybranch) or merge (usually you would 'git checkout master; git merge mybranch'). I wasn't sure whether this was at the same complexity level as the other commands described. Anyway describes basic branching and merging.

Array assignment / data copying

Neo: I've just noticed a gap in the specification:

assigning a whole array. That is, in NumPy terms I don't mean

a = b

(which just makes a refer to b; a:= b in current HSpeak terms)

but rather

a[:] = b

(which overwrites the entire contents of a with all the data from b, assuming they are of compatible shapes.)
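The difference between the two NumPy forms above can be sketched as follows. This is only an illustration of the NumPy behaviour being referred to, not a proposal for HSpeak semantics; the variable name alias is made up for the example.

```python
import numpy as np

a = np.zeros(4, dtype=int)
b = np.array([1, 2, 3, 4])

alias = b      # plain assignment: alias and b are the same array object
a[:] = b       # slice assignment: a keeps its own buffer, b's data is copied in

b[0] = 99      # a later write to b...
print(alias[0])  # 99  (visible through the alias)
print(a[0])      # 1   (the overwritten array is unaffected)
```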

IMO this functionality is something you will need for sure if you introduce arrays, more than, say, things like multi-indexing (or the presumable consequent ability to do partial overwrites quickly). However, we shouldn't necessarily introduce syntax for this; maybe just a copy (a, b) function ( is so relevant.)

The Mad Cacti: You're quite right. I've left most of the basic semantics of arrays unspecified because I'm uncertain about just how copying vs. referencing should work in general (which should be the default?), although what I'm actually worried about is strings. Another question is whether to have special support for multidimensional arrays.

But I did want to add array splicing pretty much just as Python does it (though I'd rather use .. or ... instead of :) which would include that notation for array overwriting. (And of course, unlike NumPy, Python doesn't require writing a slice onto one of the same shape.)
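For reference, a small sketch of the Python behaviour mentioned here: assigning to a list slice may change the list's length, and assigning to the full slice overwrites the list in place, while NumPy rejects a mismatched length. The names seq, whole and alias are just for illustration.

```python
import numpy as np

seq = [0, 1, 2, 3, 4]
seq[1:3] = [9, 9, 9, 9]   # two items replaced by four; the list grows
print(seq)                # [0, 9, 9, 9, 9, 3, 4]

whole = [1, 2, 3]
alias = whole
whole[:] = [7, 8]         # overwrite in place; every reference sees the change
print(alias)              # [7, 8]

arr = np.arange(5)
try:
    arr[1:3] = [9, 9, 9, 9]   # NumPy cannot fit 4 values into a 2-item slice
    numpy_accepted = True
except ValueError:
    numpy_accepted = False
print(numpy_accepted)     # False
```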

I'm wavering on multi-indexing.

NeoTA: Why did you change my user-thingy? It's just the result of writing three tildes, that's the right way to do it isn't it? FWIW, NumPy has broadcasting, so 'compatible shape' does not mean 'same shape': a scalar gets broadcast to any shape (as in a[:] = 1), a 4x1 array broadcasts to any 4xN shape, a 1x4 array broadcasts to any Nx4 shape, etc.
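A minimal sketch of the three broadcasting cases just listed, using a 4x3 target array. The array names are illustrative only.

```python
import numpy as np

grid = np.zeros((4, 3), dtype=int)

grid[:] = 1                              # a scalar fills any shape
filled_scalar = grid.copy()

column = np.array([[1], [2], [3], [4]])  # shape (4, 1)
grid[:] = column                         # repeated across the 3 columns
filled_column = grid.copy()

row = np.array([[10, 20, 30]])           # shape (1, 3)
grid[:] = row                            # repeated down the 4 rows
filled_row = grid.copy()

print(filled_column)
print(filled_row)
```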

Anyway, I reason that referencing should be the default, for a) consistency with current things (slice handles, etc.. are effectively references) and dynamic languages in general, and b) reflection of the magnitude of the operation (depending on array size, copy (a, b) may be expensive; a:= b should always be cheap)

This may mean that sharing of memory between arrays is required though: consider a:= b[1] (according to my rationale above, a should now be an array of length 1 which is a view of b[1] in the numpy sense) which would also mean that b could go out of scope but we'd have to keep its data around until a goes out of scope. That's

Also, splicing != slicing. Slicing is taking a subset of the items from a sequence (as in seq[1:4] or seq[::2]); splicing is concatenation (seq + otherseq; dunno how you'd do it with numpy -- hjoin() comes to mind, but I think it's something differently named than that).

With numpy or python, .. is a valid part of slicing too -- it means 'all' (ie Ellipsis), which is useful for dynamic or multidimensional slicing. In basic python, .. is basically equivalent to : or slice (None, None, None); in numpy, .. expands across dimensions as in arr[.., 4] which takes [:,:,:,:,....., 4] ie. is equivalent to however many dimensions remain in the array shape. (I use this in my 'pel' library to handle multidimensional images in the exact same way I handle normal 2d images.) May be worth considering before deciding slice syntax.

('..' may be wrong, '...' could be correct; Haven't got my machine running to check yet.)
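For what it's worth, a quick check in Python 3 with NumPy: the literal is spelled with three dots, and in an index it stands for as many full slices as are needed. The array shape below is arbitrary.

```python
import numpy as np

arr = np.arange(2 * 3 * 5).reshape(2, 3, 5)

last = arr[..., 4]       # Ellipsis expands to cover the leading dimensions
same = arr[:, :, 4]      # the explicit spelling for a 3-dimensional array

print(... is Ellipsis)   # True
print(last.shape)        # (2, 3)
print(np.array_equal(last, same))  # True
```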

The Mad Cacti: Oh, I didn't change it, you never signed.

I stumbled across Ellipsis (...). Thanks for the explanation, I thought it was something deep and complicated.

The use of ... in numpy is so minor that I think it can easily be done without. Multidimensional arrays are unlikely to have much use besides dealing with maps and sprites. This is after all an ambitious plan, and needs to be kept in check! It was never discussed exactly what our goals are for the new language (since the goal was initially just 'add arrays'), but I'd like it to be simple enough for any serious users to learn and understand in its entirety, quickly. In reality I know that many users don't even understand all of the current language, but they're not trying to! However I don't think that I should get to be the one to decide everything just because I'm implementing everything.

The slowness of copies isn't really an issue, since I'll use copy-on-write anyway, with reference counting (and even code analysis during bytecode generation (what I'm meant to be working on at the moment) to spot variables which can be deleted early).
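To make the idea concrete, here is a rough sketch of copy-on-write with a reference count, written in Python purely for illustration. The classes _Buffer and CowArray and the method assign are hypothetical and say nothing about how the HSpeak interpreter would actually be structured.

```python
class _Buffer:
    """Shared storage plus a count of the handles pointing at it (hypothetical)."""
    def __init__(self, data):
        self.data = data
        self.refs = 1


class CowArray:
    """Array handle with value semantics; data is copied only on first write."""
    def __init__(self, items):
        self._buf = _Buffer(list(items))

    def assign(self):
        # "b := a" under value semantics: O(1), both handles share one buffer.
        other = CowArray.__new__(CowArray)
        other._buf = self._buf
        self._buf.refs += 1
        return other

    def __getitem__(self, index):
        return self._buf.data[index]

    def __setitem__(self, index, value):
        if self._buf.refs > 1:
            # Someone else still uses this buffer: detach to a private copy.
            self._buf.refs -= 1
            self._buf = _Buffer(list(self._buf.data))
        self._buf.data[index] = value


a = CowArray([1, 2, 3])
b = a.assign()
shared_before = a._buf is b._buf   # True: nothing copied yet
b[0] = 9                           # first write triggers the copy
shared_after = a._buf is b._buf    # False
print(a[0], b[0])                  # 1 9
```

Once the count drops back to one, further writes go straight to the buffer without copying, which is why the cost of a "copy" is only paid when both sides are still alive and one of them is modified.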

It's true that pretty much everything in HamsterSpeak operates on handles, which have reference semantics. But this is all about adding types to the language, and the only type currently existing is the integer, which has pass-by-value semantics.

Strings are also being added, and perhaps you will argue that strings should be passed by value. I think I will, and we should think twice before having strings and arrays act differently. (BTW, I now think that totally ignoring the current 'plotstrings' is to be preferred when designing real strings.) I think you could consider arrays as much more like strings and integers than objects which exist in the game engine and are referenced through handles. Which is unlike languages in general which are primarily concerned with native objects.

I'd disagree that passing arrays around by reference is more natural. (And besides, I think few of even the advanced plotscripters have had experience with dynamic languages.) MATLAB, R, and other matrix based or mathematical languages pass arrays by value: that is, they don't distinguish between integers and arrays, which from a mathematical point of view is clearly the only logical thing to do. (Also, slices are copies, not views.)

In particular, having a totally dereferenced array (I mean your a:= b[1] example) be a reference to a single element of the array is probably going to trip up nearly everyone. And Numpy doesn't even work that way (aside from those weird scalar arrays), though it is the behaviour you would conclude.

Though I might sound convinced, I am definitely not. Also, having just slices create references would be a possible middle ground.

concatenate() is the general case, you're thinking of hstack() and vstack(). And there's also r_ and c_, wow, hacky!
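A short sketch of the slicing versus concatenation distinction and of the NumPy functions named here. The arrays x and y are arbitrary examples.

```python
import numpy as np

seq = [0, 1, 2, 3, 4, 5]
print(seq[1:4])        # [1, 2, 3]    slicing: a subset of the items
print(seq[::2])        # [0, 2, 4]
print(seq + [6, 7])    # concatenation of two lists

x = np.array([[1, 2], [3, 4]])
y = np.array([[5, 6], [7, 8]])

stacked_rows = np.concatenate((x, y), axis=0)   # shape (4, 2)
stacked_cols = np.concatenate((x, y), axis=1)   # shape (2, 4)

# For 2-D inputs, vstack/hstack and r_/c_ are shorthands for the same results.
print(np.array_equal(stacked_rows, np.vstack((x, y))))  # True
print(np.array_equal(stacked_cols, np.hstack((x, y))))  # True
print(np.array_equal(stacked_rows, np.r_[x, y]))        # True
print(np.array_equal(stacked_cols, np.c_[x, y]))        # True
```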

Aside, I am now shocked to discover that the python list is just an array: I thought it did some magic to enable fast insertions/deletions/extension. Now I remember: it was the dict which had all kinds of magic optimisations. And I can no longer remember why I was thinking of optimising the special case of lists of length two in order to have nice fast lisp-style linked lists...

NeoTA: Oops, you're right, I never signed. Sorry.

+1 on making strings be passed by reference (and immutable)

'dynamic languages' are relevant because they are the implementers of dynamic types, hence they probably encode some good knowledge about them.

numpy does work that way, and scalar arrays are not a special case:

>>> import numpy as np
>>> a = np.zeros(shape=(4, 4), dtype='B')
>>> b = a[1]
>>> b[:] = [0, 1, 2, 3]
>>> a

this will print out showing that the second row of a has changed (and reads 0,1,2,3!)

AFAICS, MatLab has (very!) similar semantics to NumPy (1 is a valid array, as is 1+2j, or 1.2e10, etc.), and NumPy passes references to arrays. Which is not to say that MatLab doesn't pass by value, cause it does, just that I think you are conflating the way values are passed with the end result. After all, whether you refer to values directly or indirectly, the sense of equivalence is separate. The only important difference here, I think, is that array == integer always returns a boolean array (OTOH, this matches NumPy's general 'expand it to fit the highest-dimensionality object' policy).

whether to follow MatLab or NumPy might best depend on the amount of expected pixel manipulation. For pixel manipulation, having slices be views is tremendously convenient. If you are allocating large arrays, it is also convenient (with the possibility of packing multiple 'sub-arrays' into a large array for serialization and memory saving). If your expected usage patterns are more like everyday Python list usage, slices being copies is more sensible IMO.

The Mad Cacti: What's with immutable strings? It seems to me like it's mostly a memory use optimisation. But I think it more natural to treat them as arrays of characters, and be able to write to slices or individual elements of them easily. Honestly, passing immutable strings by reference is almost identical to passing mutable strings by value, except with fewer operations possible on them: so what's the point! Maybe I'm doing something wrong, or it's for my own good, but I sometimes get annoyed by immutable objects in Python.

Thinking about it, I realised that the majority of dynamically typed languages that I've used store objects in variables by value, not reference: PHP, MATLAB, R, Maple, Magma, GAP, Euphoria. So "they probably encode some good knowledge about them" doesn't convince me. It's been a while since I used MATLAB but I played around with it today, and was quite surprised at how similar matrix operations are to NumPy. I also saw the 'NumPy for MATLAB users' page. You seem to be right that references versus values does not make all that much difference to semantics; but it's what I'm focused on because it makes more difference than anything else to implementation.

Re: scalar arrays, where you show "b = a[1]" is a view of a: yes, but the original example was with a 1-dimensional array, in which case you don't get a view but the value of that element. The special case I was talking about was the one where I was getting handed a memmap(48, dtype=INT), but never mind that.
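The 1-dimensional case can be checked directly. In this sketch, indexing a 1-D array with an integer yields a value that does not track later writes, whereas a length-1 slice is a view sharing memory with the original. The names elem and view are illustrative.

```python
import numpy as np

a = np.zeros(4, dtype=int)

elem = a[1]      # integer index on a 1-D array: a scalar value, not a view
view = a[1:2]    # length-1 slice: a view onto the same memory

a[1] = 7
print(elem)      # 0  (the scalar did not follow the write)
print(view[0])   # 7  (the view did)

view[0] = 3      # writing through the view changes the original
print(a[1])      # 3
print(np.shares_memory(a, view))  # True
```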

I agree that it's essential that we have references to views into arrays, somehow. But this could be accomplished with a reference operator (eg. b := &a[1..4]), which would of course create something identical to a copy of a other than sharing memory. Both reference operators and the proposed copy overwriter function seem a little bit ugly to me. I'd definitely like to hear whether anyone else is off-put.

I'd expect pixel manipulation to be a very rare use of arrays, but for direct tilemap access to be semi-common: I could probably optimise all {read/write}{map/pass}block calls to array indexing. Maybe we could let users create their own fixed type multidimensional numpy style arrays too, but I think the typical array would be used like a Python list.

Bob the Hamster: I like "by reference" by default. I support the existence of a copy-assignment operator of some kind. I am neutral about mutable vs immutable strings, since I have rarely written code that cared if strings were mutable or not. Speaking of dynamically typed languages that store objects by value, have you ever tried to write a complex PHP script that doesn't leak memory? Oh. My. Goodness. The language leaks memory by design as a performance optimization! I had to do it once and it was excruciating. Even when I re-wrote the whole program to use the reference operator explicitly, I found that even the references themselves were leaking and building up to crash-worthy levels after a few hours runtime.... I am getting off topic. I apologize :P


TMC, I understand your annoyance with immutable objects; however, what they are is a safety feature, more than an opportunity for optimization. Supposing I could use mutable objects as dictionary keys:

{[1,2] : 3, [4,5]:9}

This potentially 'loses' values: ie. a value in the hashmap for that dictionary is no longer accessible as soon as I accidentally modify the key object, because the key's hash has changed despite ostensibly no change to the dictionary occurring (I guess this is the opposite of a hash collision); the only way to avoid this is explicitly monitoring every change to dictionary keys (or having separate types for keys, but that seems just as strange and awkward.) And if you COW a dictionary/associative-array key, doesn't this mean it essentially gets lost after the COW?
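A small Python sketch of the 'lost value' problem. Python itself refuses list keys for exactly this reason, so the example uses a hypothetical MutableKey class whose hash depends on its mutable contents.

```python
class MutableKey:
    """Hypothetical key type that hashes by its (mutable) contents."""
    def __init__(self, items):
        self.items = list(items)

    def __hash__(self):
        return hash(tuple(self.items))

    def __eq__(self, other):
        return isinstance(other, MutableKey) and self.items == other.items


# Built-in lists are unhashable, so {[1, 2]: 3} is rejected outright.
try:
    {[1, 2]: 3}
    list_key_allowed = True
except TypeError:
    list_key_allowed = False

key = MutableKey([1, 2])
table = {key: 3}

key.items[0] = 100   # mutate the key after it was inserted

# The entry was filed under the old hash but now compares equal only to the
# new contents, so neither an "old" nor a "new" equal key can reach it.
found_by_old_value = MutableKey([1, 2]) in table
found_by_new_value = MutableKey([100, 2]) in table
print(found_by_old_value, found_by_new_value)  # False False
print(list(table.values()))                    # [3]  (still stored, just unreachable)
```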

For this reason, if you support associative arrays (or any other hashing-based function) I contend that immutable strings are required. (or at least allow strings to become immutable; but this particular flexibility seems a rather dangerous one to me).

There seems to be a separate issue, 'soft' (or duck) typing versus hard (or static). I'd argue if you want soft typing then passing by reference is appropriate, whereas if you want hard typing (I'm guessing you do) then passing by value is appropriate. The difference being in how much pain is required for copying operations; if we can guarantee copies are not complex (static typing does this), passing by value and COWing works ok.

Pixel manipulation is virtually the same in terms of functionality vs RWing a tilemap area; I can see being able to get a handle on a subset of a tilemap being VERY useful.

I'll have to do some checking up on 1d array indexing.

Neither referencing syntax nor copy() 'gives me a good feeling'. It doesn't seem terrible, it just suggests to me that there is something obvious we're missing.