dalvikvm & jit - G1 Android Development

dalvikvm & jit - G1 Android Development

Hi all,
As we know dalvikvm only interpret bytecode thus we have bad performance in games and apps. If we can add jit it should speed up performance up to 10 times! I think we can look at icedtea (and shark especially) to port it to dalvikvm. What do you think about it?

Related

C#, VB, Ruby, F#, and Python on Android (via Mono and DLR)

This all works on stock/nonroot phones
I got Mono running on Android.
http://www.koushikdutta.com/2008/12/mono-on-android-success-at-last.html
Started working on Java/C# interop and found out that DLR works on Mono:
http://www.koushikdutta.com/2009/01/microsoft-dlr-and-mono-bring-python-and.html
As a result, you can write applications in Python and Ruby on Android too now.
Anyhow, if anyone else is interested in working on this project with me, please let me know! I've already gotten all the relevent source hosted at Google Code: http://code.google.com/p/androidmono. Basically the next bit of work involves implemeting a Java interop using the DLR.

Nifty.
As for Dalvik & JIT, I think dexopt already replaces some heavy usage calls with inline native code. Hopefully dalvik vm will get full JIT in the future?

This makes me very happy!!!
Thank you for your work!!

jashsu said:
Nifty.
As for Dalvik & JIT, I think dexopt already replaces some heavy usage calls with inline native code. Hopefully dalvik vm will get full JIT in the future?
Click to expand...
Click to collapse
It doesn't do any inline native replacements: dexopt optimizes the dex file. Which includes inline dex byte code replacements. There is no JIT at all, but Google said it would be something definitely on the horizon. Personally I think DEX is a pretty stupid move on Google's part; they could have just gone with CIL-- and write a Java compiler for that instead. The Mono JIT compiler works on many platforms; so V1 of Android could have been running JIT compiled native code with that route... which is an order of magnitude better in performance.
I'm going to be doing some performance comparisons of Mono vs Dalvik; Mono will obviously win, but it will be interesting to see the margin. I'll also experiment with binding Mono to the Android runtime to create Android applications in C#.
Optimization
Virtual machine interpreters typically perform certain optimizations the first time a piece of code is used. Constant pool references are replaced with pointers to internal data structures, operations that always succeed or always work a certain way are replaced with simpler forms. Some of these require information only available at runtime, others can be inferred statically when certain assumptions are made.
The Dalvik optimizer does the following:
For virtual method calls, replace the method index with a vtable index.
For instance field get/put, replace the field index with a byte offset. Also, merge the boolean / byte / char / short variants into a single 32-bit form (less code in the interpreter means more room in the CPU I-cache).
Replace a handful of high-volume calls, like String.length(), with "inline" replacements. This skips the usual method call overhead, directly switching from the interpreter to a native implementation.
Prune empty methods. The simplest example is Object.<init>, which does nothing, but must be called whenever any object is allocated. The instruction is replaced with a new version that acts as a no-op unless a debugger is attached.
Append pre-computed data. For example, the VM wants to have a hash table for lookups on class name. Instead of computing this when the DEX file is loaded, we can compute it now, saving heap space and computation time in every VM where the DEX is loaded.
All of the instruction modifications involve replacing the opcode with one not defined by the Dalvik specification. This allows us to freely mix optimized and unoptimized instructions. The set of optimized instructions, and their exact representation, is tied closely to the VM version.
Most of the optimizations are obvious "wins". The use of raw indices and offsets not only allows us to execute more quickly, we can also skip the initial symbolic resolution. Pre-computation eats up disk space, and so must be done in moderation.
There are a couple of potential sources of trouble with these optimizations. First, vtable indices and byte offsets are subject to change if the VM is updated. Second, if a superclass is in a different DEX, and that other DEX is updated, we need to ensure that our optimized indices and offsets are updated as well. A similar but more subtle problem emerges when user-defined class loaders are employed: the class we actually call may not be the one we expected to call.
These problems are addressed with dependency lists and some limitations on what can be optimized.
Click to expand...
Click to collapse

Koush said:
Personally I think DEX is a pretty stupid move on Google's part; they could have just gone with CIL-- and write a Java compiler for that instead. The Mono JIT compiler works on many platforms; so V1 of Android could have been running JIT compiled native code with that route... which is an order of magnitude better in performance.
I'm going to be doing some performance comparisons of Mono vs Dalvik; Mono will obviously win, but it will be interesting to see the margin. I'll also experiment with binding Mono to the Android runtime to create Android applications in C#.
Click to expand...
Click to collapse
Google had different objectives, they didn't go after maximum performance. Remember, that handsets have different constrains than desktops and laptops. So they went after minimizing RAM usage (byte code interpreter => maximum possible sharing of read-only memory pages among processes) and battery life. Performance had to be acceptable, not priority.
You would not be able to fit everything into RAM if you used Mono and you would get the patent problems with Net/Mono/etc as a bonus.

lu_tze said:
Google had different objectives, they didn't go after maximum performance. Remember, that handsets have different constrains than desktops and laptops. So they went after minimizing RAM usage (byte code interpreter => maximum possible sharing of read-only memory pages among processes) and battery life. Performance had to be acceptable, not priority.
You would not be able to fit everything into RAM if you used Mono and you would get the patent problems with Net/Mono/etc as a bonus.
Click to expand...
Click to collapse
.NET, C#, IL, et al are all ECMA standards. Mono is LGPL/GPL. There are no patent or licensing issues with it that is unfamiliar to the OHA. They reuse plenty of open source projects.
An interpreter is not power efficient OR performant, simply due to the fact is it doing the 10 times as much work to do the same thing as native code. In addition, Mono features an Ahead Of Time compiler (AOT) that would let you compile everything to native code before it even hits the phone (or just once, and cache it). Most of Android's power and memory optimizations currently comes from Google's application life cycle (activities can be killed and resumed at the system's whim) -- that has nothing to do with Dalvik. I'm not criticizing the API or the implementation, just the runtime.
They could have spent their time making the Mono runtime play nicely with the shared memory subsystem.
I'm rebuilding mono with a minimal configuration to check out the disk and memory footprint.

Koush said:
Not that most of you will care, but I got Mono running on Android.
Click to expand...
Click to collapse
I'm a noob.... how can I install this? Very good Job Kush!

pic.micro23 said:
I'm a noob.... how can I install this? Very good Job Kush!
Click to expand...
Click to collapse
I haven't released anything yet. I'm trying to figure out how to statically link all it's dependencies, minimize the size, bind to the Android runtime, convert DEX to CIL and then CIL to ARM, and all sorts of other goodness. Basically a lot of experimenting to do before anything is "released". It's just in proof of concept phase right now.

Dalvik sucks
http://www.koushikdutta.com/2009/01/dalvik-vs-mono.html

Koush said:
Dalvik sucks
http://www.koushikdutta.com/2009/01/dalvik-vs-mono.html
Click to expand...
Click to collapse
Lol. nice article Koush. It's surprising mono is that much faster

Koush said:
I haven't released anything yet. I'm trying to figure out how to statically link all it's dependencies, minimize the size, bind to the Android runtime, convert DEX to CIL and then CIL to ARM, and all sorts of other goodness. Basically a lot of experimenting to do before anything is "released". It's just in proof of concept phase right now.
Click to expand...
Click to collapse
Just read the dalvik vs mono article. It's certainly interesting work. I agree that by ignoring JIT they're certainly not going after the most beneficial optimizations. That said, I don't think it's something they've completely excluded from future implementation.
I think you should consider trying to get Mono added as an external project. If nothing, having unofficial support for a vm which supports C#/CIL could bring in a significant amount of developer interest from the WinMo dev community. The coretech team would be the folks to set up a new project.

I've now gotten mono working on all G1s. You don't need Debian OR root. Still a couple kinks to work out, but I have it on the market for anyone interested in playing with it. More information at the link below.
http://www.koushikdutta.com/2009/01/mono-for-android-now-available-on.html

I thought this was only a platform for development but its made my g1 much faster and reduced memory from steel and stock browsers as well. My market is still 12 mb and mono is about 11 mb. Is this normal?
Also everytime, I run mono, does it do the same thing it does the first time it was installed and opened?
Sorry, i know nothing about mono but i can tell its definitely optimizing the performance on the g1 though.
great work koushe,
hbguy

This awesome, I have been waiting to see how this all turned out after reading your first post about it a few days ago.

hbguy said:
I thought this was only a platform for development but its made my g1 much faster and reduced memory from steel and stock browsers as well. My market is still 12 mb and mono is about 11 mb. Is this normal?
Also everytime, I run mono, does it do the same thing it does the first time it was installed and opened?
Sorry, i know nothing about mono but i can tell its definitely optimizing the performance on the g1 though.
great work koushe,
hbguy
Click to expand...
Click to collapse
Koush would know more than I would, but installing mono shouldn't affect everything else on Android. It's not like everything is suddenly using mono instead of dalvik. I suspect you have a strong case of the placebo effect

hbguy said:
I thought this was only a platform for development but its made my g1 much faster and reduced memory from steel and stock browsers as well. My market is still 12 mb and mono is about 11 mb. Is this normal?
Also everytime, I run mono, does it do the same thing it does the first time it was installed and opened?
Sorry, i know nothing about mono but i can tell its definitely optimizing the performance on the g1 though.
great work koushe,
hbguy
Click to expand...
Click to collapse
Yeah, you're just imagining things I haven't even attempted like DEX->CIL yet.
The first APK of Mono is quite large though. I've updated it with a number of bug fixes and also am making it use eglib now. This trimmed the size by a few MB. Getting Mono to work with Bionic might not be possible... (that would trim off another 2MB).
Once again, the APK is just a developers release... something to play with and test.

I have been messing with mono on my G1.
Is it safe to say it will only work with command line apps?
I got my own hello world and a few other things running, but if I try and run any sort of gui I get errors.

Yes, only command line stuff will work. WinForms will not work on Android.
However, you should be able to get OpenGL ES working via PInvoke. I haven't tried it, but it should work just fine.

Koush said:
Yes, only command line stuff will work. WinForms will not work on Android.
Click to expand...
Click to collapse
That is what I thought, I think PInvoke is just a bit out of my skill set.
Thanks for the work none the less.

I got mono building in the Android build environment, and the Mono team accepted a patch to make it work on Android. There's also some changes external to Mono which can be found at the androidmono google code repository:
http://code.google.com/p/androidmono/

Graphics APIs for WM6

Hello,
I've spent the last month trying desperately to find a free 2D (or 3D, but not required) graphics API I can use for high performance games on Windows Mobile. I initially set about trying to find a managed API to use, but now I've broadened my search to include any API (that I can call from C++ or .NET), and I'm still struggling to find anything.
The options seem to be:
- GDI: not nearly fast enough for high performance games
- DirectDraw: probably OK, but doesn't seem possible to use this on my HTC Touch Pro 2 due to memory problems (see http://blogs.msdn.com/windowsmobile/archive/2009/04/17/twisted-pixels-3-memory-mysteries.aspx -- I've got the same problem and have not yet found any way to work around this)
- Direct3D: no hardware driver on my Touch Pro 2, this renders about 0.2 frames per second in the samples, which is not good enough
- OpenGL: I've tried and tried, and can't get any samples working for this. The closest I've found is the tutorials here: http://www.zeuscmd.com/tutorials/opengles/index.php, but these all fail with an error, "Unable to create OpenGL|ES context" as soon as I run them (or alternatively using the "Ug" version, I get no window appearing at all).
Does anyone have any suggestions as to how I can progress from here? I really want to write some Windows Mobile games, but I can't even get started. :-(
Thanks,
Adam.

For 'better' graphics performance have a look at the 2D/3D Driver Development for MSM720X devices!
After installation of this driver pack try my OpenGL test app (source also available) found in my sig! Installing this driver pack will increase d3d performance, too!

Hi heliosdev,
Many thanks for posting the sourcecode for your OpenGL test -- I have that compiling very nicely here and producing very promising results too. I think this may finally be the answer I've been looking for.
Do you have any idea how much of a performance hit using PInvoke to interact with OGLES is likely to be? I don't know whether PInvoke is slow to use or not, but it strikes me that it may be slower use it hundreds of times per frame compared to coding directly in C++ and not needing to PInvoke at all..?
Thanks again,
Adam.

For comparison NuShrike implemented torus test in C++. As you can see the difference is 'minimal' even NuShrike optimized it using vertex buffer objects (I'm using standard vertex arrays and just triangles, i.e. no triangle strips).

[WARNING] Compiler optimizations!

I just noticed this again and it worries me alot! Devs here are usually posting their changes as being different than others, more even about compiler optimizations and yet most of you make huge mistakes.
Personally I've been running Gentoo linux for about ten years now and what I learned from running different setups with space limitations is that one should NEVER use -Os and -O3 together as this will not only take much more time to compile it also GROWS executables.
-Os is basically the same as -O3, but it adds some flags that won't benefit executable size vs memory usage.
In general executables compiled with only -Os are nearly as optimized as -O3/2 and resulting in smaller executable. While compiled with -O3 uses more memory allocation for extra speed. While using both together will result in larger executables, more memory usage AND much longer compile times vs just -O2.
So devs can you all please remove -Os from your "optimizations" and leave just -O2? Bragging about "more compiler optimizations" really fails if you fail to understand why the -Ox optimizations levels have been added in the first place... Which is to NOT use them together!
Also -O3 might introduce more IO operations, which might have a negative effect on our snappies.

The choice: Summarised
-O2 is the default optimization level. It is ideal for desktop systems because of small binary size which results in faster load from HDD, lower RAM usage and less cache misses on modern CPUs. Servers should also use -O2 since it is considered to be the highest reliable optimization level.
-O3 only degrades application performance as it produces huge binaries which use high amount of HDD, RAM and CPU cache space. Only specific applications such as python and sqlite gain improved performance. This optimization level greatly increases compilation time.
-Os should be used on systems with limited CPU cache (old CPUs, Atom ..). Large executables such as firefox may become more responsive using -Os.
Click to expand...
Click to collapse
From: http://en.gentoo-wiki.com/wiki/CFLAGS

Quote from the wiki articale you provided:
"Note that only one of the above flags may be chosen. If you choose more than one, the last one specified will have the effect."

tincmulc said:
Quote from the wiki articale you provided:
"Note that only one of the above flags may be chosen. If you choose more than one, the last one specified will have the effect."
Click to expand...
Click to collapse
My point is that in general people here seem to use "-Os -O3", which is dumb really as it had no benefits. -O2 should be better performance wise on our phones.

great post, thank you man!

Now silently pray for our devs to change their -O3 -Os to -O2.
Or I might need to do my own branch of "optimized" nightlies.
For example what you declare with -march=A -mcpu=B will intact be -march=A -mtune=B.
edit: I think the gcc compiler even throws a warning about l2p etc.

Wouldn't the best choice be just -Os? I would have thought that optimising for systems with limited cache would be appropriate for portable devices.

Pipe Merchant said:
Wouldn't the best choice be just -Os? I would have thought that optimising for systems with limited cache would be appropriate for portable devices.
Click to expand...
Click to collapse
I highly doubt executable sizes on our phones are big enough to benefit from -Os.
In the case of my router at home -Os doesn't give any benefit at all, in fact packet inspection is slightly slower vs -O2.

KSM, does it really improves performance ?

Well sadly i don't have an answer for that question yet...
I'm trying to think of a way to put KSM to the test on my android device.
As far as i understand it is possible that the kernel actually causes high CPU usage trying to map and unmap memory pages over and over again.
This issue is known for linux and other virtual machines so it is possible that the Same effect will be on the android vm
Testings that i found are not relevant to android.
For example:
The result is a dramatic decrease in memory usage in virtualization environments. In a virtualization server, Red Hat found that thanks to KSM, KVM can run as many as 52 Windows XP VMs with 1 GB of RAM each on a server with just 16 GB of RAM. Because KSM works transparently to userspace apps, it can be adopted very easily, and provides huge memory savings for free to current production systems. It was originally developed for use with KVM, but it can be also used with any other virtualization system - or even in non virtualization workloads, for example applications that for some reason have several processes using lots of memory that could be shared.
Click to expand...
Click to collapse
http://kernelnewbies.org/Linux_2_6_32
What i would really want to know is what would happen if each of these VMs Would run a different application/game/audio/graphics software at the same time ? or what if the same vm will run many different apps ? and also to compare cpu usage with and without KSM
Guess i'll need a tool for that. something like 'iostat' but for memory diagnostic and another tool to see a per process CPU usage but 'top' is not good enough for that.
Any way, the best test should present clear results with precised data.
I'll keep looking for legit way to put it to the test.
If you can think of a way to test KSM with android, please let me know.

This is a technique that relates mostly to processes like virtualisation. For example, when you load 5 windows XP VMs, you'll have a good 10 - 20 services that are practically the same in memory in each VM. Instead of each service using 10mb (ie, 10mb x 5 = 50mb), you only need say 15 or 20mb using KSM. If you use different applications, it is very unlikely that anything would be saved FOR THAT APPLICATION. However, the main elements of a Windows XP System would still be there (drivers, explorer, firewall, logon, search and so on). Means little in one setup, but when you have several VMs it is shown to be a huge advantage. As we know a simple XP install can use 500mb of RAM actively, and this is fairly uniform across instals.
With android, i don't know if there are specific RAM savings to be had. Don't know enough about the inner workings and the sandbox android puts its apps in or how apps interact with system services. Sadly, i can't think of a good way to test it out either, but i'll be keeping an eye on this topic for someone (much) more knowledgeable to come along.

Harbb said:
Sadly, i can't think of a good way to test it out either, but i'll be keeping an eye on this topic for someone (much) more knowledgeable to come along.
Click to expand...
Click to collapse
Enter bedalus, stands there with a vacant expression on his face. Harbb looks disappointed.
kernels ; battery ; ROM ; gov/sched

That entire paragraph was dedicated to you bedalus, we both know that.

Lol
I hope someone can answer this though.
kernels ; battery ; ROM ; gov/sched

Wait for someone............
Sent from my Nexus S using xda premium

KSM does not improve performance on Android just like that - all enabling KSM does, is enable SUPPORT for the Feature but Applications would have to make use of the feature, which they don't.
You can easily verify this like that :
echo 1 > /sys/kernel/mm/ksm/run
<wait and/or run the Applications of your choice>
cat /sys/kernel/mm/ksm/pages_sharing
IF the above shows a value > 0 then you are making use of KSM else it's just available, without anyone using the feature.
Here's an interesting Article that gives a little more insight :
http://www.linux-kvm.com/content/using-ksm-kernel-samepage-merging-kvm
By the way, the same is true for ZCACHE. If you really want to make better use of your Memory (RAM) then using ZRAM as a Swapdevice does work (and may often make sense, too).
That all said : There appear to be efforts to make use of KSM http://forum.xda-developers.com/showthread.php?t=1464758 - so things may well change ...

any update on this...?

Optimize Options using the GNU Compiler Collection

I haven't found a good source of discussion on this topic but has anyone experimented with gcc optimization flags when compiling the kernel or a ROM? I dug around and found sparse information on the subject and rarely is Android specifically mentioned. Just curious if anyone has 2 cents on the subject.
I've been trying to figure out how to implement using these flags in kernel compilation but I'm confused if you're supposed to put these in your makefile (both main one and the arch\arm\... makefile?)
is ffast-math preferred over some of the Linaro blog post flags ("We’ve switched from a base of -O2 -fno-strict-aliasing to -O3 -fmodulo-sched -fmodulo-sched-allow-regmoves.")
I've tried sifting through Linaro git repos to see anyone actually using any flags they mention but have yet to find anyone using them yet they say they do somewhere.
GCC Optimization Flags Reference http://gcc.gnu.org/onlinedocs/gcc-4.0.4/gcc/Optimize-Options.html
Some background information on the Linaro blogs about -o3 make flag
Compiler flags used to speed up Linaro Android 2011.10, and future optimizations
People trying out the latest Linaro Android builds may notice they’re faster than the older versions. One of the reasons for this is that we’re using a new set of compiler flags for this build.
We’ve switched from a base of -O2 -fno-strict-aliasing to -O3 -fmodulo-sched -fmodulo-sched-allow-regmoves.
To optimize application startup, we’ve also switched the linker hash style to GNU (ld --hash-style=gnu), and patched the Android dynamic linker to deal with those hashes.
Getting rid of -fno-strict-aliasing enables rather efficient additional optimizations – but requires that the code being compiled follows some rules traditionally not enforced by compilers.
Given the traditional lack of enforcement, there’s lots of code out there (including, unfortunately, in the AOSP codebase) that violates those rules.
Fortunately, gcc can help us detect those violations: With -fstrict-aliasing -Werror=strict-aliasing, most aliasing violations can be turned into built time errors instead of random crashes at runtime. This allowed us to detect many aliasing violations in the code, and fix them (where doable with reasonable effort), or work around them by enabling -fno-strict-aliasing just for a particular subdirectory containing code not ready for dropping it.
-O3 enables further optimizations not yet enabled at -O2, including -finline-functions, -funswitch-loops, -fpredictive-commoning, -fgcse-after-reload, -ftree-vectorize and -fipa-cp-clone.
While some of them are probably not ideal (e.g. -finline-functions tends to increase code size, thereby also increasing memory usage and, in a bad case, reducing cache efficiency), overall -O3 has shown to increase performance.
-fmodulo-sched and -fmodulo-sched-allow-regmoves are fairly new optimizations not currently automatically enabled at any -O level. These options improve loop scheduling – more information can be found here.
We optimized further by adding link-time optimizations in some relevant libraries, and by using -ffast-math in selected parts of the code. -ffast-math is a bit dangerous because it can cause math functions to return incorrect values by ignoring parts of the ISO and IEEE rules for math functions (parts that make optimizations harder, or that simply require additional checks that slow down things considerably), so it should usually not be used for an entire build – but enabling it for parts of the code verified to not rely on exact ISO/IEEE rules can produce quite a speedup.
We also added the option to specify extra optimizations in a board specific config – so now our Panda builds are optimized specifically for Cortex-A9 CPUs while the iMX53 builds optimize for Cortex-A8 instead of relying on the least common denominator.
We also experimented with Graphite related optimizations, such as -fgraphite-identity, -floop-block, -floop-interchage, -floop-strip-mine, -ftree-loop-distribution and -ftree-loop-linear – those optimizations can rearrange loops to allow further optimizations. We’ve turned them back off for the release because they introduced some stability problems, and the benefit seemed minimal.
However, chances are they will come back in a future build. They can help make more efficient use of multi-core CPUs (with the addition of -ftree-parallelize-loops) once the compiler knows how to to threading (currently, Android is built using a generic arm-eabi targeted compiler that has no knowledge of the underlying OS).
Other possible future improvements – though probably not as efficient as the ones already made, or the switch to -ftree-parallelize-loops for multi-core boards – include seeing what effect a switch between ARM and Thumb2 instructions has on performance (enabling the right mode in the right modules), identifying places where the increased code size of -O3 actually hurts, and add the likes of -fno-inline-functions there, identifying further performance critical parts that can handle -ffast-math, fixing the aliasing violations we’ve worked around this time, and – of course – switching to gcc 4.7 when it becomes usable.

make -o3 -o -fmodulo-sched -o -fmodulo-sched-allow-regmoves -Wl,--hash-style=gnu -Werror=strict-aliasing ARCH=arm CROSS_COMPILE=~/(toolchain path)/bin/arm-linux-gnueabihf- zImage modules -j 4
is this how you'd do a make according to the instructions on the linaro blog? This is the only way I could do it for it to actually run something but it doesn't tell me that it's passing these flags

Database Info

welcome