[Troubleshooting] No More Random Freezes & Reboots - Galaxy Note 10.1 Q&A, Help & Troubleshooting

Many of you (myself included) have experienced random freezes and reboots without finding a solution, even after flashing many different ROMs.
Hopefully I've found the cause and the SOLUTION.
I've been flashing custom ROMs ever since I got my first gadget, but recently, after flashing Gnabo 8, I experienced something strange: random freezes and reboots.
In my case, the random reboots occurred after 31 hours of use.
After analyzing the problem, I concluded that the cause was rogue apps running in the background and hogging the CPU to the point that the device could no longer be used.
Then the freezes and reboots occur.
After another month of analysis, I identified three processes on my N8000 as the prime suspects:
System Server
dhd_dcp
kswapd0
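The post doesn't say which tool was used to spot these processes. As an illustration only, the Python sketch below ranks processes by CPU time consumed over an interval by reading the standard Linux /proc/<pid>/stat files, assuming a Linux-style /proc (as Android has) and somewhere to run Python that can read it. `parse_stat`, `snapshot`, and `top_cpu` are hypothetical helper names, and the percentages are relative to a single CPU core.

```python
from __future__ import annotations

import os
import time
from pathlib import Path


def parse_stat(text: str) -> tuple[int, str, int]:
    """Return (pid, command name, utime + stime in clock ticks) from a /proc/<pid>/stat line."""
    # The command name is in parentheses and may itself contain spaces or ')',
    # so split on the first '(' and the LAST ')'.
    lpar = text.index("(")
    rpar = text.rindex(")")
    pid = int(text[:lpar])
    comm = text[lpar + 1:rpar]
    rest = text[rpar + 1:].split()  # rest[0] is field 3 (state) in proc(5)
    utime, stime = int(rest[11]), int(rest[12])  # fields 14 and 15
    return pid, comm, utime + stime


def snapshot(proc: Path = Path("/proc")) -> dict[int, tuple[str, int]]:
    """Map pid -> (command name, cumulative CPU ticks) for every readable process."""
    result = {}
    for entry in proc.iterdir():
        if not entry.name.isdigit():
            continue
        try:
            pid, comm, ticks = parse_stat((entry / "stat").read_text())
        except (OSError, ValueError, IndexError):
            continue  # the process exited or is unreadable; skip it
        result[pid] = (comm, ticks)
    return result


def top_cpu(interval: float = 5.0, count: int = 5) -> list[tuple[float, int, str]]:
    """Return the `count` busiest processes as (percent of one core, pid, name)."""
    before = snapshot()
    time.sleep(interval)
    after = snapshot()
    hz = os.sysconf("SC_CLK_TCK")  # clock ticks per second
    usage = []
    for pid, (comm, ticks) in after.items():
        if pid in before:
            delta = ticks - before[pid][1]
            usage.append((100.0 * delta / hz / interval, pid, comm))
    usage.sort(reverse=True)
    return usage[:count]


if __name__ == "__main__":
    for pct, pid, comm in top_cpu():
        print(f"{pct:6.1f}%  {pid:>6}  {comm}")
```

Sampling twice and diffing is what makes a background hog such as kswapd0 stand out; a single reading of cumulative CPU time would mostly reflect how long each process has been running.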
I consulted Andi (Lord Boeffla) and Mikyno about this before narrowing the prime suspects down to a single culprit.
My Linux knowledge plus Google came in handy here, and I concluded that kswapd0 is (probably) the cause.
kswapd0 is the kernel's page-reclaim daemon: when free RAM runs low, it frees memory by dropping cached pages and, if swap (a.k.a. virtual memory) is available, by writing memory pages out to swap.
On desktop Linux there is usually a separate partition dedicated to virtual memory (the swap partition), but on Android, and on our N80XX in particular, I can't find any swap partition.
So how can kswapd0 be active, consume so much CPU time, and cause freezes and reboots when there is no swap partition?
Further research turned up the following:
In my current ROM (ARHD 21.0), my sysctl.conf is empty, BUT...
using Android Tuner, I found that vm.swappiness is set to 130!!
vm.swappiness controls how readily the kernel (through kswapd0) writes RAM contents out to disk/internal memory instead of dropping cached file data.
vm.swappiness=0 tells it to avoid writing to disk/internal memory as much as possible.
vm.swappiness=100 tells it to write to disk/internal memory as aggressively as possible (to keep RAM free).
I don't know how vm.swappiness ends up at 130 when sysctl.conf is empty, but I changed it to vm.swappiness=0.
Then I rebooted my N8000.
It has now been running without incident for 3 days and 20 hours straight.
I hope this is the final solution, and I also hope this information helps others with the same problem.
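For anyone who wants to check the same setting without Android Tuner: on Linux kernels the value is exposed as /proc/sys/vm/swappiness, and writing to that file as root changes it immediately but only until the next reboot (the same effect as `sysctl -w vm.swappiness=0` where a sysctl binary is available). The sketch below uses hypothetical helper names and takes the path as a parameter so it can be pointed at a test file; it is not part of any tool mentioned in this thread.

```python
from pathlib import Path

# Standard Linux location of the vm.swappiness knob.
SWAPPINESS_PATH = Path("/proc/sys/vm/swappiness")


def read_swappiness(path: Path = SWAPPINESS_PATH) -> int:
    """Return the current vm.swappiness value."""
    return int(path.read_text().strip())


def set_swappiness(value: int, path: Path = SWAPPINESS_PATH) -> None:
    """Write a new vm.swappiness value (needs root on a real device).

    The change lasts only until reboot. To persist it, the setting has to be
    reapplied at boot, e.g. from sysctl.conf (if the ROM loads it) or an init script.
    """
    if value < 0:
        raise ValueError("swappiness cannot be negative")
    path.write_text(f"{value}\n")


if __name__ == "__main__":
    print("vm.swappiness =", read_swappiness())
```

The kernel itself enforces the allowed range, so the helper only rejects obviously invalid input and lets a failed write surface as an OSError.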

Hi Darkknight,
After some time now, can you tell us about your long-term experience? Has this solved your rebooting issues, or did they come back after a few days?
Thanks in advance!

Related

shell32.exe CPU load vs. SMS usage

Hi @all,
over the last few weeks with my HTC Touch Dual, I've noticed performance degrading badly over 1-2 days, to the point that a soft reset is needed. Unlike some of you, I won't accept a daily soft reset as a "must" for a WM device.
Looking at the processes with SKTools, I noticed a steadily rising CPU load from the process "shell32.exe", which can reach 40% or more while idle, making the phone really laggy. I've read about this on the web and in some forums, but it doesn't seem that everybody suffers from this issue.
So what's going on? I have a flat-rate SMS plan and exchange many SMS per day with my wife, and this seems to be the crux of the matter. After an hour of playing with my phone, opening and closing apps, entering appointments and contacts, shell32.exe is still at 0% when idle. After writing and receiving 40-50 SMS, I'm at 20% or more.
I just want to start a discussion about this; maybe some of you have the same problem, and maybe some of the clever folks here have an idea how to fix it. The problem exists in WM6.0 and, unfortunately, still in 6.1.
Any discussion / ideas appreciated ...

I have the exact same problem and found your post while googling it. When I first boot my device, CPU usage is 0% when idle, but after sending/receiving around 50-60 SMS, shell32.exe uses up to 25% while idle. I wonder what SMS has to do with CPU load...

It's "nice" to read that someone else has this problem. It seems that most people either restart every day or don't send many SMS.
On holiday, my HTC ran for two weeks without a restart or any problems... but abroad you have to pay for SMS, so I only sent 5-10 of them. With a flat rate, though...
It seems Microsoft still hasn't noticed these problems...

I actually do restart my HTC Prophet every day, because having the CPU at 20-30% reduces battery life.
I guess we could narrow this down by comparing the software we use to find what the problem could be. Since not many people seem to suffer from this, it might be a software issue.
Well, I heard that shell32.exe also hosts Today plug-ins; my only Today plug-in is bat.status. Maybe I should disable it and try again... Actually, I'm a bit lazy.

You can spend that time on other things... I've tried all of that. A "naked" Touch Dual with no plug-ins behaves the same way; it's an issue with the operating system and/or the SMS app, which is part of it.

Well, then these PDAs are just not for SMS lovers. Or shall we beg Microsoft for a fix...

I'm experiencing the same problem, and I'm not a heavy SMS user.
I think it happens even when not a single SMS has been sent or received.
It seems to have started when I flashed a 6.1 ROM.
I noticed it because my battery is draining faster than usual.

I have the same problem on my new Palm Treo Pro, even without sending SMS. CPU usage is not that high, but sometimes the battery is almost empty at the end of the day without my doing anything special.
Monitoring power consumption with BatteryStatus (now HomeScreen++) shows a best case of 45 mA (still almost double my P3600's) and, more often than I'd like, around 150-250 mA while idle with moderate backlight.
If I remember correctly, the P3600 had a similar issue that was linked to the radio, and it was solved by a newer radio ROM. But that may be coincidence or something totally different.

I'm having the same problem with my Treo 800w. If I reset my phone at the start of the day, about halfway through the day, without using the phone at all, I notice the battery dropping ridiculously fast. Looking at the processes reveals that shell32.exe is constantly using 20% CPU. After a reset, shell32.exe hovers at 0-2%.
I'm positive this is what is causing my battery to die so fast.

Hi
I've got a similar problem with my FSC Pocket Loox N560.
I installed a RAM extension, which additionally required an upgrade to WM6.1 (at least in my case).
When I reset the Loox, everything is fine.
But as soon as the device is switched off into suspend mode and then back on, shell32.exe takes between 30 and 70% of CPU power, and another process, filesys.exe, uses another 20 to 30%. So in the worst case, the CPU is at roughly full load and, of course, the processor is constantly in turbo mode. The system then responds very poorly: the busy indicator (the colored, segmented circle) appears every 2 seconds on the Today screen and in some other applications.
I haven't found a reason for this. For now I'd be quite happy if I only needed to reset the device once per day, but in fact I have to reset it every time I switch it on.
Does anybody know a reason for this yet?
Greetings, Martin

I have the exact same problem on the same device (FSC LOOX N560). I upgraded it to WM6.1 about two weeks ago. Everything was running OK until yesterday, when I installed and then uninstalled the latest version of Pocket Informant from WebIS and MyMobiler remote desktop (which I uninstalled right after this problem started, because I thought it was the cause).
My guess is that it has to do with filesystem errors on the storage card (after all, it's a FAT file system, which fragments its data). I'll try formatting my storage card later today to see if it helps.
EDIT: Okay, it seems it was caused by misbehaviour of the SPB Pocket Plus v4 program. I uninstalled it, turned my PDA off and on several times, and the shell32.exe process stayed at 0-1%.
EDIT2: Well, I did a hard reset and installed everything again, including SPB PP, and everything is running OK.

Was there any particular installation order?
1. SPB Pocket Plus v4
2. other software

Yep, I installed SPB PP as the first application, then the rest.

Hey everyone,
I'm having the same problem on my HTC Diamond, but on a smaller scale. I've installed a few programs recently and noticed that shell32.exe is continuously taking around 0.37% CPU, but I clearly remember that before installing them, and in every ROM I've tried (and I've tried many), it stayed at 0% when idle. Now it constantly takes about 0-1%, which bothers me because I'd like to know the reason. Has anyone found the cause and a way to fix it without a hard reset? I know that after a hard reset it will be normal again, but I install a lot of software, and putting it all back is a pain...

Hi guys. I get the same problem when I connect the USB cable to use the device as a mass storage disk and then disconnect, whether or not I use "safely remove" on my PC.
Sorry for my English!!

I have this problem too!
When my phone is first turned on: shell32.exe uses about 2 MB of RAM and 0-1% CPU.
After receiving several SMS: shell32.exe uses about 10 MB of RAM and 40-60% CPU!!
I installed "BestTaskMan", found shell32.exe in the process list, selected it, and chose "Windows list" from the menu button.
In the "Windows list", I found an invisible window named "Clock Window"!!
I checked: each time I receive an SMS, another "Clock Window" is added to the "Windows list"!!
I think this is the main problem!
My questions: What is "Clock Window"?
Why is it invisible?
Why is a "Clock Window" created when I receive a new SMS?!!
How do I close it?!
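The diagnosis above (one more hidden window per received SMS) amounts to comparing window lists over time and looking for a title whose count only goes up. A purely illustrative sketch of that check follows: it assumes you have already copied down the window titles owned by shell32.exe at several moments (for example from BestTaskMan's "Windows list"), and `growing_titles` is a hypothetical helper, not a Windows Mobile API.

```python
from __future__ import annotations

from collections import Counter


def growing_titles(snapshots: list[list[str]]) -> dict[str, list[int]]:
    """Return titles whose window count never decreases and ends higher than it started.

    Each snapshot is the list of window titles seen at one moment, in time order.
    A title that keeps accumulating (e.g. one new window per received SMS) is a leak suspect.
    """
    counts = [Counter(snap) for snap in snapshots]
    titles = set().union(*counts)
    suspects = {}
    for title in sorted(titles):
        series = [c[title] for c in counts]  # Counter returns 0 for missing titles
        non_decreasing = all(a <= b for a, b in zip(series, series[1:]))
        if non_decreasing and series[-1] > series[0]:
            suspects[title] = series
    return suspects
```

For example, snapshots taken after 0, 1, and 2 received SMS that contain one, two, and three "Clock Window" entries would flag "Clock Window" with the series [1, 2, 3], while a stable "Desktop" window would not be reported.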

[Q] Need help tracking down resource leak in WM6

At least I think it must be a resource leak. I have multiple WM6 devices that all behave the same way when my application is run. After a while, maybe 15 hours, they gradually deteriorate and refuse to start other applications. At first, it will just be an application like Opera that will not start. Eventually, things like File Explorer will also not start. When I say they won't start, I mean that I get the "not signed with a trusted certificate or one of its components cannot be found" message.
My first thought was that there was a memory leak, but according to the output of GlobalMemoryStatus, the system's memory use is not increasing over time. Then I thought that it might be storage space since I'm generating a big log file. But the storage space still sits above 60MB when this happens.
Restarting the device gets everything back to normal. So far, this is what I know:
1. I'm using the RIL. I noticed today that after about 12 hours, I stop getting RIL notifications.
2. I'm monitoring memory with GlobalMemoryStatus, but the available physical memory doesn't seem to be decreasing over time
3. The thread count remains constant for the life of the application
4. My storage space is decreasing, but there is still over 60MB available when everything starts to go wrong
5. In the end, the device winds up with the screen lock on, even though it is not configured.
It seems that there must be some kind of resource leak. The only other thing I can think of is kernel resources. I tried to rule out things like event handles through static code inspection, but maybe there's something I'm missing.
Does anyone have any suggestions as to how I would troubleshoot this further? I'm using VS2008 and a Tilt2 and an HTC Imagio (it happens on other devices as well).

I tracked this down to a registry handle leak. What bothers me is that I had to do it by static code inspection: I just looked for calls like CreateEvent, RegOpenKeyEx, etc.
Since this didn't seem to show up as consumed physical memory, does anyone have methods for inspecting kernel resource consumption on Windows Mobile? Do I have to rely on KITL and Platform Builder with the emulator? I'm hoping there's some way to diagnose this kind of problem on a real device. From my perspective, the device just started to fail, and there were no external indicators to warn me of the impending failure.
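The same pairing that the static inspection relied on (every CreateEvent or RegOpenKeyEx matched by its close) can also be checked at runtime by counting what is still open. The sketch below shows the idea in Python with hypothetical names; in the native application, the equivalent would be thin wrappers around the open/close calls (e.g. RegOpenKeyEx/RegCloseKey) that update a table like this one, and its contents would be logged periodically.

```python
from __future__ import annotations

import traceback
from collections import Counter


class HandleTracker:
    """Counts live handles per kind and remembers where each one was opened."""

    def __init__(self) -> None:
        # handle value -> (kind such as "regkey" or "event", "file:line" of the opener)
        self._live: dict[int, tuple[str, str]] = {}

    def opened(self, kind: str, handle: int) -> int:
        """Call right after a successful open/create; returns the handle unchanged."""
        caller = traceback.extract_stack(limit=2)[0]  # the frame that called opened()
        self._live[handle] = (kind, f"{caller.filename}:{caller.lineno}")
        return handle

    def closed(self, handle: int) -> None:
        """Call right before the matching close; flags double or unknown closes."""
        if self._live.pop(handle, None) is None:
            raise KeyError(f"close of unknown or already-closed handle {handle:#x}")

    def live_counts(self) -> Counter:
        """How many handles of each kind are currently open."""
        return Counter(kind for kind, _ in self._live.values())

    def report(self) -> list[str]:
        """One line per handle that was opened but never closed."""
        return [f"{kind} {h:#x} opened at {site}" for h, (kind, site) in self._live.items()]
```

Logging `live_counts()` every so often gives the kind of external indicator the post is missing: a registry-key count that climbs steadily over hours points at the leak long before applications start failing to launch, and `report()` names the call sites.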

Hi! Can somebody help me?
I have an Asus P535. I reset it via "Start" > "Settings" > "Default Settings" and lost everything on the device. After that, the align screen appears, and it stays like this.
I think I have to reinstall Windows Mobile. Can you tell me how, step by step?
Thanks a million

Sorry, accidental re-post. Don't think I can delete it altogether...

Observations as to why Market breaks / force closes, and other anomalies

As I suspected early on, the issues boil down to corruption of the User Data or Cache partitions (less often the system partition) caused by unexpected shutdowns of the device. Shutdowns on these devices need to follow the proper shutdown routine, as in any Linux environment. Following this best practice ensures that all data is written out to its file system by flushing all caches, unmounting the file systems, etc.
Here are the culprits behind the frequent random force closes, Market resets, etc.; each can result in an unclean shutdown that corrupts some data:
1. The power button we use is also a forced-off button. If you hold it down too long, you power off the device.
2. Sometimes, when waking from sleep mode, you see the ViewSonic logo on startup; that means the system shut down (most likely it crashed).
3. If you're running Vegan and hitting its reboot option: I don't know for sure, but I suspect this is NOT performing a clean shutdown (I don't have a copy of the source).
Anyway, I wanted to pass this on, because last night my data partition became corrupt after I used the Reboot function on the power-off menu of Vegan 5.1.

You shouldn't need the source code to debug a dirty shutdown. Can't you just run adb logcat? Maybe run the shutdown command in a terminal on the device and pipe the output into a text file for later viewing.
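A minimal sketch of the logcat part of that suggestion, run from a PC with the Android SDK's adb on the PATH: `adb logcat -d` dumps the current log buffer and exits, `-v time` adds timestamps, and `adb -s <serial>` selects one device when several are connected. The helper names are hypothetical. The logcat buffer is kept in RAM, so it has to be captured while the device is still up.

```python
from __future__ import annotations

import subprocess
from pathlib import Path


def build_logcat_cmd(serial: str | None = None) -> list[str]:
    """Build an `adb logcat` command that dumps the buffer once (-d) with timestamps."""
    cmd = ["adb"]
    if serial:
        cmd += ["-s", serial]  # pick one device when several are connected
    cmd += ["logcat", "-v", "time", "-d"]
    return cmd


def save_logcat(out_file: Path, serial: str | None = None) -> None:
    """Run adb and write the log dump to out_file (raises CalledProcessError if adb fails)."""
    with out_file.open("wb") as fh:
        subprocess.run(build_logcat_cmd(serial), stdout=fh, check=True)


if __name__ == "__main__":
    save_logcat(Path("logcat.txt"))
```

Writing the raw bytes avoids any encoding issues with device log text, and the saved file can be attached to a post as-is.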

My internal memory has to be repartitioned every few weeks; I'm certain that something is corrupting it over time. I had massive FCs just a week or so ago, and redoing the SD partition was the only fix.
I suspect that this happens on stock as well. The problem, of course, is that there is no fix for a stock user other than a return or exchange.

I have been on stock since I got the device, just moving to the newer versions when they come as OTAs, and I have never had to mess with my partitions, so I don't THINK the issue is in the stock software. In fact, the only problems I've ever encountered were when I used the enhancement pack: my screen started to become unresponsive, and the calibration.ini I was told to try did not work. Since then I went back to 3389, and the device has been perfect ever since.
I could be wrong, though, and just very, very lucky... here's hoping. Another thing to consider is that maybe the memory is going bonkers for some reason. I've had flash memory that lasted forever and flash memory that went wacky over a period of 6 months; even a wipe by the utility designed to do it doesn't fix it properly. I don't know how CWM wipes or partitions the memory, but I do know there's supposed to be a special way to do it.
If it's not faulty memory off the bat, then that leaves something in the 'extras' being put into these ROMs: maybe some of the newer Tegra drivers or some code to make the ROMs faster. I'm just saying we can't leave any stone unturned.
Has anyone who has stayed loyal to stock encountered these issues? I think we have to ask that question. Then we ask how many of the people playing with ROMs are seeing the issues, including people who initially used CWM to partition and mess with their mounts.
I can say I've never seen data disappear from my internal memory or my SD card, and I've never seen multiple FCs except after installing the enhancement pack (keep in mind I got my tab in late December, so I had 3053 and then 3389 soon after).
At the first sign of anything being 'corrupted' on its own at stock, I'll be sending mine back. I've owned Android devices since Android has been around, and my G1 and MT4G (and every smartphone before them) never became corrupted from not being shut down or rebooted properly; while this is a tablet, I think the fundamentals should be the same. Pampering 'faulty' memory is a risk. You can wipe and redo all you want, but if it's faulty, it's going to stay that way.

I've done that, but unfortunately, you could say, I've had only clean shutdowns since then... After the last corruption, I formatted my data and cache partitions before running logcat. Of course, I thought of that afterward...
Generally, if anyone has FCs etc., run a logcat and post it here; that will let us confirm this.
We could change the way the partitions are created and add a sync, which would further reduce the chances, BUT at a performance cost.
I am very surprised, though, as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
I reviewed the stock framework source on Google's Git, and technically, if a reboot command is given, the framework performs a clean shutdown... but I suspect the widget on the shutdown screen is not calling the method properly, or the method is not being called at all. All speculation at this point, but corruption is definitely occurring.
Since the last corruption, I've switched over to pershoot's kernel. Even though his kernel seems a little slower, he seems to have included the latest drivers, some of which relate to data integrity (I'm reading into the release notes).
NEO: The first thing I did when I got my device was install CW and Vegan, and I also updated kernels. I never had an issue until the first time (yes, about a day ago) I used Vegan's reboot feature. That corrupted my user data. I suspect that if you have not been performing clean shutdowns, you are just lucky. Linux, like any other OS, even with journaling, will surely encounter SOME corruption if you do not perform a clean shutdown. Typically the corruption is remediated by the file system's integrity controls, and you don't even know it happened. In maybe 1 in 1000 cases, the integrity controls cannot overcome the significant loss of data, which results in crashes, etc. Sometimes the corruption happens in lightly used areas, which is why you would get a Market reset: that data is easily replaced on the fly. Core components that subsystems need in order to run are not replaceable, which is why I had to reformat. What upsets me is that this failsafe is most likely not working properly, since it happens far too frequently... I too suspect it has something to do with CW.
But again, between the badly placed power switch, the unprovoked reboots (i.e., the ViewSonic screen showing when trying to wake up the device), and the reboot button possibly not performing a proper shutdown, the chances of corruption across a wider base of users surely increase. So it may not be a CW issue at all, just poor design.
When I have time today, I will verify whether the reboot function performs a clean shutdown. If anyone has the time, please post a logcat. I'm going to be running around today and will try to get to it.

stanglx said:
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
AFAIK they're running YAFFS at the moment. The next move is to ext4...
I read some articles about this several weeks ago; apparently many apps do not properly flush file caches. One of the articles was a Google developer post about file corruption, along with their API method that does a cache flush prior to a close; a bit later came Google's indication that they were planning to move to the ext4 file system to further help alleviate the problem.
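The point behind that advice is that an app's write only hands data to the OS cache; it reaches flash once the cache is flushed, so an unclean shutdown in between can lose or tear it. A common pattern, sketched here in Python rather than the Java API the article refers to, is to write a temporary file, fsync it, atomically rename it over the original, and then fsync the directory so the rename itself is persisted. This is a general POSIX technique, not necessarily what the Google post described, and `durable_write` is a hypothetical name.

```python
import os
from pathlib import Path


def durable_write(path: Path, data: bytes) -> None:
    """Replace `path` with `data` so that a crash leaves either the old or the new file."""
    tmp = path.with_name(path.name + ".tmp")
    with open(tmp, "wb") as fh:
        fh.write(data)
        fh.flush()             # move Python's buffer into the OS page cache
        os.fsync(fh.fileno())  # force the page cache out to storage
    os.replace(tmp, path)      # atomic rename on POSIX file systems
    # Persist the directory entry too; otherwise the rename itself may be lost.
    dir_fd = os.open(path.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

Each fsync costs time, which is the same trade-off mentioned earlier in the thread about adding a sync to the partitions: better integrity at some performance cost.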

stanglx said:
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
I suspect that if you have not been performing clean shutdowns, you are just lucky. Linux, like any other OS, even with journaling, will surely encounter SOME corruption if you do not perform a clean shutdown. Typically the corruption is remediated by the file system's integrity controls, and you don't even know it happened. In maybe 1 in 1000 cases, the integrity controls cannot overcome the significant loss of data, which results in crashes, etc. Sometimes the corruption happens in lightly used areas, which is why you would get a Market reset: that data is easily replaced on the fly. Core components that subsystems need in order to run are not replaceable, which is why I had to reformat. What upsets me is that this failsafe is most likely not working properly, since it happens far too frequently... I too suspect it has something to do with CW.
That's my point. How many times since we've had our Android and smart phones have they been turned off or rebooted without the proper procedure? The battery drains until they die, they get dropped and reboot, we clog them up with stuff, or some app drives them nuts and they reboot or shut off... Yet you rarely, if ever, hear about a phone's data being 'corrupted' on stock software. Sure, it may happen with official OTAs etc., but never just off the bat like what's happening with the G-Tab. But it's not happening to everyone either, so I'm just looking to see if there's a pattern.
Ever since the G1 and newer phones, you don't really hear about or see file corruption issues on stock software. It's when users start moving to ROMs that you hear of issues cropping up. That's not to say it doesn't happen at all on stock; I just think we're seeing it in a more concentrated fashion here because of all the formatting, re-partitioning, etc. First you hear that 4GB is a great partition size, then you hear there are problems so move to 2048, then you hear 256MB swap, then no swap since Android doesn't use it. Then dataloop for speed, then no dataloop because of critical issues. Rules and instructions change almost daily. I think it's more than these poor flash drives can take; sometimes it's good to keep it simple.
I owned a Vibrant for a while... and decided it was a PoS when I was seeing bad lag at stock (because of Samsung's terrible file system). People said: do the speed hack, it'll be fast! But what was the caveat? Having to reboot the phone almost weekly, sometimes several times a week, and what were people seeing? Data corruption. That's not for me. Give me something that is lag-free (it doesn't have to be a bullet train; just don't skip on video or audio, and keep my live wallpaper and drawer animation fluid, and I'm happy!). Point being... keeping it simple may help alleviate some of the issues. If people are seeing these problems on stock, then you're absolutely right, and it would be a point of contention that the failsafe isn't working right.
Otherwise, it seems the stock OS on these things is able to self-correct in most situations, and it may just be some of the many tweaked features in these ROMs doing something they shouldn't. Or I may just be very lucky indeed.
I'm still dying to get the OTA; I haven't seen one since 3899.

eMMC sudden death research

Update from Feb 17th:
Samsung has started to upgrade eMMC firmwares on the field - only for GT-I9100 for now.
See post #79 for additional details.
Update from Feb 13th:
If you want to dump the eMMC's RAM yourself, go ahead to post #72.
I'm looking for a dump of firmware revision 0xf7 if you've got one.
-----------------------
Since it's very likely that the recent eMMC firmware patch by Samsung is their patch for the "sudden death" issue, it would be very nice to understand what is really going on there.
According to a leaked moviNAND datasheet, it seems that MMC CMD62 is a vendor-specific command that moviNAND implements.
If you issue CMD62(0xEFAC62EC), then CMD62(0xCCEE) - you can read a "Smart report". To exit this mode, issue CMD62(0xEFAC62EC), then CMD62(0xDECCEE).
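The patch described next keys off a single field of this Smart report (step 1b: a little-endian DWORD at Smart[324:328] compared with 0x20120413). A minimal sketch of that selection logic, assuming `smart` holds the raw report bytes and that [324:328] means bytes 324 through 327; `report_date` and `needs_patch` are hypothetical names, and the model/revision values are simply those quoted in the post.

```python
VULNERABLE_DATE = 0x20120413  # the DWORD read as a date: 2012/04/13


def report_date(smart: bytes) -> int:
    """Little-endian DWORD at Smart[324:328] (bytes 324..327)."""
    return int.from_bytes(smart[324:328], "little")


def needs_patch(product_name: str, revision: int, smart: bytes) -> bool:
    """Steps 1a-1b: only "VTU00M" rev 0xf1 chips whose report date matches get patched."""
    if product_name != "VTU00M" or revision != 0xF1:
        return False  # per step 1a, no Smart report is read for other chips
    return report_date(smart) == VULNERABLE_DATE
```

Because the field is little-endian, a hex dump of the report shows the bytes 13 04 12 20 rather than "20120413" literally.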
So what are they doing in their patch?
1. Whenever an MMC is attached:
   a. If it is "VTU00M", revision 0xf1, they read a Smart report.
   b. The DWORD at Smart[324:328] represents a date (little-endian); if it is not 0x20120413, they don't patch the firmware. (Maybe only chips from 2012/04/13 are buggy?)
2. If the chip is buggy, whenever an MMC is attached or the device is resumed:
   a. Issue CMD62(0xEFAC62EC) CMD62(0x10210000) to enter RAM write mode. Now you can write to RAM by issuing MMC_ERASE_GROUP_START(Address to write) MMC_ERASE_GROUP_END(Value to be written) MMC_ERASE(0).
   b. *(0x40300) = 10 B5 03 4A 90 47 00 28 00 D1 FE E7 10 BD 00 00 73 9D 05 00
   c. *(0x5C7EA) = E3 F7 89 FD
   d. Exit RAM write mode by issuing CMD62(0xEFAC62EC) CMD62(0xDECCEE).
10 B5 looks like a common Thumb push (in ARM architecture). Disassembling the bytes that they write to 0x40300 yields the following code:
Code:
ROM:00040300 PUSH {R4,LR}
ROM:00040302 LDR R2, =0x59D73
ROM:00040304 BLX R2
ROM:00040306 CMP R0, #0
ROM:00040308 BNE locret_4030C
ROM:0004030A
ROM:0004030A loc_4030A ; CODE XREF: ROM:loc_4030Aj
ROM:0004030A B loc_4030A
ROM:0004030C ; ---------------------------------------------------------------------------
ROM:0004030C
ROM:0004030C locret_4030C ; CODE XREF: ROM:00040308j
ROM:0004030C POP {R4,PC}
ROM:0004030C ; ---------------------------------------------------------------------
Disassembling what they write to 0x5C7EA yields this:
Code:
ROM:0005C7EA BL 0x40300
Looks like it is indeed Thumb code.
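The BL interpretation can be double-checked by decoding the 32-bit Thumb-2 BL encoding directly. The instruction is two little-endian halfwords, `11110 S imm10` and `11 J1 1 J2 imm11`; the offset is SignExtend(S:I1:I2:imm10:imm11:0) with I1 = NOT(J1 XOR S) and I2 = NOT(J2 XOR S), added to the instruction address + 4. The decoder below is a sketch that handles only BL (not BLX or other branch forms); `decode_thumb_bl` is a hypothetical name.

```python
def decode_thumb_bl(raw: bytes, address: int) -> int:
    """Return the target of a 4-byte Thumb-2 BL instruction located at `address`."""
    hw1 = int.from_bytes(raw[0:2], "little")
    hw2 = int.from_bytes(raw[2:4], "little")
    # BL (encoding T1): first halfword 11110..., second halfword 11x1...
    if hw1 >> 11 != 0b11110 or hw2 >> 14 != 0b11 or not (hw2 >> 12) & 1:
        raise ValueError("not a Thumb-2 BL instruction")
    s = (hw1 >> 10) & 1
    imm10 = hw1 & 0x3FF
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF
    i1 = 1 - (j1 ^ s)
    i2 = 1 - (j2 ^ s)
    offset = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:
        offset -= 1 << 25  # sign-extend the 25-bit value
    return address + 4 + offset


print(hex(decode_thumb_bl(bytes.fromhex("E3F789FD"), 0x5C7EA)))  # 0x40300
```

Fed the pre-patch bytes reported later in the thread (FD F7 C2 FA) at the same address, the decoder returns 0x59D72. That is consistent with the new function's `LDR R2, =0x59D73` / `BLX R2`: 0x59D73 is 0x59D72 with the Thumb bit set, so the new code still calls the original function first.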
If we could dump the eMMC RAM, we would understand what has been changed.
By inspecting some code, it seems that we know how to dump the eMMC RAM:
Look at the function mmc_set_wearlevel_page in line 206. It patches the RAM (using the method mentioned before), then validates what it has written (in lines 255-290). It seems the procedure to read the RAM is as follows:
1. CMD62(0xEFAC62EC) CMD62(0x10210002) to enter RAM reading mode
2. MMC_ERASE_GROUP_START(Address to read) MMC_ERASE_GROUP_END(Length to read) MMC_ERASE(0)
3. MMC_READ_SINGLE_BLOCK to read the data
4. CMD62(0xEFAC62EC) CMD62(0xDECCEE) to exit RAM reading mode
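To make the ordering explicit, the sketch below only builds the command list for one RAM read, as (command name, argument) pairs taken from steps 1-4. It does not send anything: issuing raw MMC commands needs kernel-side code that isn't shown, and the helper names are hypothetical. The argument for MMC_READ_SINGLE_BLOCK isn't given in the post, so 0 is used as a labeled placeholder, and the unit of "length" is likewise unspecified.

```python
from __future__ import annotations

VENDOR_KEY = 0xEFAC62EC  # first CMD62 argument of every mode switch quoted above
MODE_SMART_REPORT = 0xCCEE  # other modes quoted in this post, for reference
MODE_RAM_WRITE = 0x10210000
MODE_RAM_READ = 0x10210002
MODE_EXIT = 0xDECCEE


def vendor_mode(mode: int) -> list[tuple[str, int]]:
    """CMD62(key) followed by CMD62(mode) switches the moviNAND into a vendor mode."""
    return [("CMD62", VENDOR_KEY), ("CMD62", mode)]


def ram_read_sequence(address: int, length: int) -> list[tuple[str, int]]:
    """Commands for one RAM read, following steps 1-4 above."""
    return (
        vendor_mode(MODE_RAM_READ)
        + [
            ("MMC_ERASE_GROUP_START", address),
            ("MMC_ERASE_GROUP_END", length),
            ("MMC_ERASE", 0),
            ("MMC_READ_SINGLE_BLOCK", 0),  # placeholder: block argument not given in the post
        ]
        + vendor_mode(MODE_EXIT)
    )
```

Keeping the enter/exit pair in one helper makes it harder to leave the chip stuck in a vendor mode, which matters given the risk described below.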
I don't want to run this on my own device, because I'm afraid: messing with the eMMC doesn't sound like a very good idea, and I don't have a spare one.
Does anyone have a development device they don't mind risking and want to dump the eMMC firmware from it?

:crying: --> **Ultimate GS3 sudden death thread** :crying:
Just wanted to link to a prior thread with some information/testing that has been done. Completely understand if you nuke it because it doesn't meet the proper criteria or is way too noobish to be posted here. Anyway, I just thought it _might_ help, so I'm giving it a shot.

So I decided to do a small RAM dump after all.
Before the patch, 0x5C7EA reads FD F7 C2 FA, which is "BL 0x59D72".
As I thought, they replace a function call with a call to the new one.
I will dump function 0x59D72 later this week.
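The decoding of FD F7 C2 FA can be double-checked with the standard Thumb-2 BL encoding (T1, ARMv7-M): the offset is the 25-bit value S:I1:I2:imm10:imm11:0, with I1 = NOT(J1 XOR S) and I2 = NOT(J2 XOR S), added to the instruction address + 4. This is a minimal Python sketch with hypothetical helper names; the encoder is included only so the round trip can be tested, for example to compute the bytes of the patched `BL 0x40300` at 0x5C7EA.

```python
"""Decode/encode a 32-bit Thumb-2 BL instruction (encoding T1)."""


def decode_thumb_bl(raw: bytes, address: int) -> int:
    """Return the branch target of the 4-byte BL stored in `raw` at `address`."""
    hw1 = int.from_bytes(raw[0:2], "little")  # first halfword: 11110 S imm10
    hw2 = int.from_bytes(raw[2:4], "little")  # second halfword: 11 J1 1 J2 imm11
    if hw1 >> 11 != 0b11110 or hw2 & 0xD000 != 0xD000:
        raise ValueError("not a Thumb BL instruction")
    s = (hw1 >> 10) & 1
    imm10 = hw1 & 0x3FF
    j1 = (hw2 >> 13) & 1
    j2 = (hw2 >> 11) & 1
    imm11 = hw2 & 0x7FF
    i1 = 1 - (j1 ^ s)  # I1 = NOT(J1 XOR S)
    i2 = 1 - (j2 ^ s)  # I2 = NOT(J2 XOR S)
    offset = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:
        offset -= 1 << 25  # sign-extend the 25-bit offset
    return address + 4 + offset  # PC reads as the instruction address + 4


def encode_thumb_bl(address: int, target: int) -> bytes:
    """Inverse of decode_thumb_bl: the 4 bytes of `BL target` placed at `address`."""
    offset = target - (address + 4)
    if offset % 2 or not -(1 << 24) <= offset < (1 << 24):
        raise ValueError("target misaligned or out of BL range")
    value = offset & 0x1FFFFFF  # 25-bit two's complement
    s = value >> 24
    i1 = (value >> 23) & 1
    i2 = (value >> 22) & 1
    j1 = 1 - (i1 ^ s)  # J1 = NOT(I1 XOR S)
    j2 = 1 - (i2 ^ s)  # J2 = NOT(I2 XOR S)
    hw1 = 0xF000 | (s << 10) | ((value >> 12) & 0x3FF)
    hw2 = 0xD000 | (j1 << 13) | (j2 << 11) | ((value >> 1) & 0x7FF)
    return hw1.to_bytes(2, "little") + hw2.to_bytes(2, "little")


print(hex(decode_thumb_bl(bytes.fromhex("FDF7C2FA"), 0x5C7EA)))  # 0x59d72
```

The `hw2 & 0xD000` check requires bit 12 to be set, which is what distinguishes BL from BLX.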

Oranav said:
So I decided to do a small RAM dump after all.
Before the patch, 0x5C7EA reads FD F7 C2 FA, which is "BL 0x59D72".
As I thought, they replace a function call with a call to the new one.
I will dump function 0x59D72 later this week.
So it looks like the new function calls the old function, and then, if that returns ZERO, it goes into an INFINITE loop?!?
Seems like an odd fix, maybe self-preservation?
Oli
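Reading the disassembly at 0x40300 as a small behavioural model may make that control flow easier to follow. This is an illustration in Python, not firmware code: the real chip spins forever in `B loc_4030A`, which the sketch replaces with a hypothetical `EmmcHang` exception so that it can be tested. What a zero return from 0x59D72 actually means is unknown.

```python
"""Behavioural model of the patched wrapper at 0x40300 (illustration only)."""
from typing import Callable


class EmmcHang(Exception):
    """Hypothetical stand-in for the `B loc_4030A` self-loop, which never returns."""


def patched_wrapper(original: Callable[[], int]) -> int:
    """Mirror the disassembly: call the original function, hang if it returns 0."""
    result = original()  # LDR R2, =0x59D73 / BLX R2 (bit 0 set selects Thumb state)
    if result != 0:      # CMP R0, #0 / BNE locret_4030C
        return result    # POP {R4,PC}: R0 is passed back to the caller unchanged
    raise EmmcHang("original returned 0; the firmware would spin here forever")


print(patched_wrapper(lambda: 1))  # prints 1: a non-zero result passes straight through
```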

odewdney said:
So it looks like the new function calls the old function, and then, if that returns ZERO, it goes into an INFINITE loop?!?
Seems like an odd fix, maybe self-preservation?
Oli
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13, my 06/12 SGS3 had its _first freeze ever_ on XXELLA. Maybe it's completely unrelated and was only a random thing.
But could it be that this fix temporarily (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
In that case you would get a phone freeze, because the eMMC is temporarily unavailable, so the phone is stuck until you reboot it. But eMMC data structure damage would be avoided.
It doesn't sound very logical, but when you have to fix a problem with only a few bytes of patch (because it must be applied on every eMMC start), and the problem only occurs on a handful of devices (out of millions), then it may be acceptable to have a freeze instead of a dead eMMC in the rare cases where it occurs.
But this is only an idea... I don't know if it is like that.
BR
Rob
PS: Oranav, thank you very much for your effort.

odewdney said:
So it looks like the new function calls the old function, and then, if that returns ZERO, it goes into an INFINITE loop?!?
Seems like an odd fix, maybe self-preservation?
Oli
Right, haven't spotted this. Thanks for the observation.
Self preservation sounds possible.
Rob2222 said:
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13, my 06/12 SGS3 had its _first freeze ever_ on XXELLA. Maybe it's completely unrelated and was only a random thing.
But could it be that this fix temporarily (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
In that case you would get a phone freeze, because the eMMC is temporarily unavailable, so the phone is stuck until you reboot it. But eMMC data structure damage would be avoided.
It doesn't sound very logical, but when you have to fix a problem with only a few bytes of patch (because it must be applied on every eMMC start), and the problem only occurs on a handful of devices (out of millions), then it may be acceptable to have a freeze instead of a dead eMMC in the rare cases where it occurs.
But this is only an idea... I don't know if it is like that.
BR
Rob
PS: Oranav, thank you very much for your effort.
This is possible: the patch looks like a quick-and-dirty fix, so maybe they didn't have time to fix the bug properly. Instead, they just avoid the bug entirely (at the cost of data corruption).
But I don't think this would cause lockups - I believe the chip has a watchdog...
All in all, I think the best thing we can do right now is to dump the whole firmware out of it. I will do it soon.

There is also a chance that this is just a temporary workaround to prevent further bricking until there is a final fix.
As of now we assume that this is the fix, as it directly addresses the eMMC in question, but all of this is based on assumptions.

Rob2222 said:
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13, my 06/12 SGS3 had its _first freeze ever_ on XXELLA. Maybe it's completely unrelated and was only a random thing.
But could it be that this fix temporarily (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
In that case you would get a phone freeze, because the eMMC is temporarily unavailable, so the phone is stuck until you reboot it. But eMMC data structure damage would be avoided.
It doesn't sound very logical, but when you have to fix a problem with only a few bytes of patch (because it must be applied on every eMMC start), and the problem only occurs on a handful of devices (out of millions), then it may be acceptable to have a freeze instead of a dead eMMC in the rare cases where it occurs.
But this is only an idea... I don't know if it is like that.
BR
Rob
PS: Oranav, thank you very much for your effort.
I think we can prove that the fix actually locks into the loop, but you must risk your phone :/ . If you are in ->
1) Flash back to an older version without the fix
2) Wait and pray
a) If your phone dies --> the guys were right about the fix and the loop
b) If it stays alive, then ...
We won't know for sure, but maybe your phone is in the "perfect" condition for the test :/ .
(Sorry if this makes no sense)

@ivan:
Sorry, I can't do that. Because of the high air humidity, my humidity indicator is already a little soaked, and given the warranty-repair reports in our local forums, I'm not sure I would get warranty service. I think there's a fair chance they would deny it, so I don't want to take any extra risk. That's also why I am on unrooted stock at the moment.
@Oranav:
In our local forum we have recently been getting reports of a rising number of lockups and restarts on S3s, some like my freeze.
It also seems that after a while these problems get better and even disappear completely.
Because of that, I wonder whether the fix might lock the eMMC when it finds a bad data structure. That lock could cause a phone freeze (as already stated), while at the same time the fix repairs the bad data structure in that block.
At least this would explain both the rising number of freezes with the fix and the fact that the freezes become less frequent over time.
I have no idea if it's that way, I just wanted to post it as theory to think about.
BTW, do you think that when the watchdog restarts the eMMC, it happens fast enough that the phone isn't affected?
BR
Robert

Rob2222 said:
I have no idea if it's that way, I just wanted to post it as theory to think about.
The problem is that too many theories are imaginable, and I can't think of any way to prove them other than reverse engineering the MoviNAND firmware.
Rob2222 said:
BTW, do you think that when the watchdog restarts the eMMC, it happens fast enough that the phone isn't affected?
Certainly not. Watchdogs are slow; drivers running on a Cortex-A9 are blazing fast.
But I do think Linux's MMC driver can handle device restarts during an MMC operation.

Rob2222 said:
@ivan:
Sorry, I can't do that. Because of the high air humidity, my humidity indicator is already a little soaked, and given the warranty-repair reports in our local forums, I'm not sure I would get warranty service. I think there's a fair chance they would deny it, so I don't want to take any extra risk. That's also why I am on unrooted stock at the moment.
@Oranav:
In our local forum we have recently been getting reports of a rising number of lockups and restarts on S3s, some like my freeze.
It also seems that after a while these problems get better and even disappear completely.
Because of that, I wonder whether the fix might lock the eMMC when it finds a bad data structure. That lock could cause a phone freeze (as already stated), while at the same time the fix repairs the bad data structure in that block.
At least this would explain both the rising number of freezes with the fix and the fact that the freezes become less frequent over time.
I have no idea if it's that way, I just wanted to post it as theory to think about.
BTW, do you think that when the watchdog restarts the eMMC, it happens fast enough that the phone isn't affected?
BR
Robert
If those freezes are really caused by the firmware fix, I don't see how they would disappear over time...
I mean if it really is the case that the fix trades data corruption for eMMC survival, it would make sense to see freezes... but depending on what data is affected, they should only be treatable by reinstalling the affected app or deleting its data/cache.
Updated theory:
For all we know, the error condition in which the eMMC dies is quite rare, since most devices had been used for months before they died. So, assuming that the error condition appears randomly and that there is a chance of data corruption every time it appears with the fixed kernel, we could expect to see freezes and other problems some time after the fix was applied. That would explain the rising number of reported freezes. Furthermore, I'd assume that people getting freezes would try to do something about it, like reinstalling/deleting apps, wiping caches and/or data... or even reflashing, thus repairing the corrupted data. So the freezes would disappear.
Wait, doesn't most evidence point to the fact that the error condition does NOT appear randomly, since there were no cases in the beginning and then a lot all of a sudden? Well, it might be that this is just the way we perceive the issue. Maybe there were cases before, but they weren't reported... phones died, people sent them in, got new ones and went on with their lives. But after some time the issue became known... bloggers wrote about it... and so on... people realized their phones died because of a wider problem... voila, a steep rise in reported cases. Also, the number of dying S3s would simply rise with the rising number of S3s overall; I mean, Samsung kept selling phones, right?
But even under the assumption the bug is related to wear-levelling and not random, here is another idea: I have no clue how the algorithms work, but maybe it uses some sort of pseudo-random data to do whatever, with the same seed on all eMMCs... and thus all of them go through the same series of numbers. And now imagine the error condition is only triggered by a specific number or number set (say someone screwed up a boundary condition). Under this theory the error condition wouldn't appear randomly, but after a certain amount of write ops (or something).
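The "same seed on every eMMC" idea can be illustrated with a toy model. Everything in it is hypothetical (the PRNG, the 16-bit value range and the "bad" value); nothing is known about how the firmware's wear levelling really works. The only point is that with a shared seed, the trigger depends on how many writes a chip has done, not on chance.

```python
"""Toy model of the 'same seed on every eMMC' theory (purely hypothetical)."""
import random
from typing import Optional


def first_bad_write(seed: int, bad_value: int, max_writes: int = 1_000_000) -> Optional[int]:
    """Return the write count at which a seeded PRNG first produces `bad_value`."""
    rng = random.Random(seed)  # same seed -> same sequence of numbers on every chip
    for write_count in range(1, max_writes + 1):
        if rng.randrange(1 << 16) == bad_value:  # hypothetical broken boundary value
            return write_count
    return None  # never triggered within max_writes


# Three "chips" sharing a seed all hit the bad value after the same number of writes.
print({first_bad_write(seed=1234, bad_value=0xFFFF) for _ in range(3)})
```

Under this model, heavily used devices would simply reach the fatal write count first, which would look like a sudden wave of deaths months after launch.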
Another question I asked myself: shouldn't there be cases where data corruption does damage beyond all repair except for reflashing?
Well, it might be, but it seems reasonable to assume that it is a lot less likely than user-data corruption, since most critical files on the phone shouldn't be opened writeable (or are on a read-only mounted partition in the first place), hence shouldn't be affected by ****-ups during writes.
Like the previous poster I want to add that this is most likely all bull****... but it is what came to my mind looking for a theory that supports the data we got.

Okay, got a RAM dump
I won't post it here (or anywhere else for that matter) because I don't want to get sued by Samsung.
I might release a kernel which allows you to dump the RAM yourself if there's enough demand, but I don't want to right now, because:
1. The code is ugly as hell, not implemented as a kernel module, not thread-safe etc.
2. It is highly dangerous (messing with the eMMC chip; I really don't know how stable this thing is), so if you want to do it on your device, you should be an expert. In that case, you can write the code yourself (with little effort).
Anyway, I hope the FTL is Whimory, since I'm familiar with it. That would make things easier.
I'll let you know if I find anything interesting.
PS I've attached a little teaser. (Yes, this is the patched function. 0x40300 is red because I've opened a partial RAM dump.)
EDIT - Some initial results:
0. The CPU is a Cortex-M3.
1. No strings at all, just some uninteresting release asserts ("REL_ASSERT").
2. Found the Smart Report generator function -> found the MMC command handlers.
3. Most MMC command handlers are stored in a function table. There are 3 special commands: MMC60, MMC62 and MMC64. Depending on the arguments these special commands are given, they modify the function table (this is the so-called "vendor mode").
4. There are a lot of possible arguments for MMC62, not just the ones we know.
5. If you trace the function they patch all the way up the call stack, you get to the MMC24 and MMC25 handlers. These commands are MMC_WRITE_BLOCK and MMC_WRITE_MULTIPLE_BLOCK. Since the function they patch is deep down the call stack, it's very likely that it is part of the wear leveling.
Anyway, because of the lack of strings, I guess it would be very hard to truly understand the SDS bug we're facing.
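The vendor-mode mechanism in point 3 can be pictured as a dispatcher whose live handler table is swapped by CMD62. This is a hypothetical Python model, not the firmware's code. The CMD62 arguments are the ones used in the kernel patch discussed earlier in the thread; the behaviour where the magic value arms exactly one following CMD62 is an assumption inferred from the command pairs; and the reinterpretation of MMC_ERASE_GROUP_START follows the RAM-read procedure described earlier.

```python
"""Hypothetical model of a vendor-mode switch that swaps the MMC handler table."""
from typing import Callable, Dict

Handler = Callable[[int], str]

MMC_ERASE_GROUP_START = 35  # standard opcode (include/linux/mmc/mmc.h)
VENDOR_CMD62 = 62
MAGIC = 0xEFAC62EC          # CMD62 arguments as seen in the kernel patch
ENTER_RAM_READ = 0x10210002
EXIT_RAM_MODE = 0xDECCEE

STANDARD: Dict[int, Handler] = {
    MMC_ERASE_GROUP_START: lambda arg: f"set erase start to {arg:#x}",
}
# In vendor mode the same opcode is reinterpreted (as in the RAM-read procedure).
RAM_READ: Dict[int, Handler] = {
    MMC_ERASE_GROUP_START: lambda arg: f"set RAM read address to {arg:#x}",
}


class HandlerTable:
    def __init__(self) -> None:
        self.table = STANDARD  # live function table used by the dispatcher
        self.armed = False     # set by the magic CMD62 argument

    def dispatch(self, opcode: int, arg: int) -> str:
        if opcode == VENDOR_CMD62:
            return self._cmd62(arg)
        handler = self.table.get(opcode)
        return handler(arg) if handler else "illegal command"

    def _cmd62(self, arg: int) -> str:
        if arg == MAGIC:
            self.armed = True
            return "armed"
        was_armed, self.armed = self.armed, False  # magic arms one command only
        if was_armed and arg == ENTER_RAM_READ:
            self.table = RAM_READ
            return "entered RAM read mode"
        if was_armed and arg == EXIT_RAM_MODE:
            self.table = STANDARD
            return "left RAM mode"
        return "illegal command"
```

Because only the table pointer changes, the same opcodes can mean something completely different in vendor mode, which fits the observation that MMC62 accepts many more arguments than the ones we know.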

Re: eMMC sudden death research
I can't say I have an idea what's going on inside the eMMC, but usually with this kind of mistake or failure, debug or diagnostic code has ended up in the release build.
Maybe some debug info that is written repeatedly triggers a wear-levelling failure,
so the fix simply has to disable it.

Awesome research.
So we're dealing with a bug in exactly the same eMMC subsystem as in the faulty SGS2 eMMC chips, but in a device that was released after the SGS2 eMMCs had been proven faulty.
Oranav, for some reason I cannot send you PMs. Could you send me your dump? Does your eMMC come from the faulty series?

Hi all, after reading this thread, I am now scared....
I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3
All of the above include the fix for SDS and the Exynos hole.
I think I have been running the device for nearly 1 week. Last night I fully charged my phone and used it for 3 minutes to browse the forum (Chrome) over Wi-Fi. After that, I left the phone on the table for 3 hours; the only running app was Viber... When I tried to wake the phone up, it did wake up but then froze... No button worked, and I tried for about 2 minutes... Nothing worked except the power button, which rebooted the device.
This is weird, never experienced this before... I am now scared the phone will die unexpectedly.

owl74 said:
Hi all, after reading this thread, I am now scared....
I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3
All of the above include the fix for SDS and the Exynos hole.
I think I have been running the device for nearly 1 week. Last night I fully charged my phone and used it for 3 minutes to browse the forum (Chrome) over Wi-Fi. After that, I left the phone on the table for 3 hours; the only running app was Viber... When I tried to wake the phone up, it did wake up but then froze... No button worked, and I tried for about 2 minutes... Nothing worked except the power button, which rebooted the device.
This is weird, never experienced this before... I am now scared the phone will die unexpectedly.
How long were you using the system before you updated to the 'fix'? Nonetheless, it does not necessarily mean that the phone is getting close to SDS. I have had a few Android phones that would sometimes reboot or hang for other reasons.

I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Rebellos, I have enabled private messaging on my account. I do have the faulty chip.
Sent from my GT-I9300 using xda app-developers app

Oranav said:
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Damn Oranav, nice work!
So does it mean you get a total screen freeze? Every time?
BR
Robert

thealgorithm said:
How long were you using the system before you updated to the 'fix'? Nonetheless, it does not necessarily mean that the phone is getting close to SDS. I have had a few Android phones that would sometimes reboot or hang for other reasons.
About 2 weeks. I forgot to mention that I rebooted the phone before I charged it. I will return this phone before it dies on me... I think I will get an S3, but I will check whether it has a new chip... otherwise I will return that one too and stick to my Desire HD, which is already running an S3 ROM...
Oranav said:
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Rebellos, I have enabled private messaging on my account. I do have the faulty chip.
So everyone's speculation about the fix causing the freeze is right...

Re: eMMC sudden death research
Suppose this fix addresses wear leveling. If firmware without this fix wears out the eMMC, wouldn't the device still boot and then crash? As far as I know, worn-out flash is still readable but no longer writable. Could it be that the wear-leveling algorithm has a problem, so that after some time it replaces cells belonging to the bootloader, and that causes the death?
In short: I want to know whether using the old firmware for some time had a lasting negative effect, i.e. whether that old software caused extreme aging of the eMMC.

[Q] RK3188-based N90FHDRK: Corrupt Flash Memory???

Dear experts
I am very new to Android, and this N90FHDRK is my very first Android device ….
So for my N90 I am using Glassrom V2a, and I have tried several others as well.
For the last 2-3 weeks I have been facing strange problems:
- Tablet does not boot
- Tablet does not switch off (goes off only after pressing on/off for 30sec)
- Invalid unlock-code
- Battery-status unknown
- Disappearing applications and wallpapers
- Apps crashing
- ‘cannot mount /cache or /data’ in CWM-mode
Of course, reinstalling and wiping everything helps, but the problems come back once a day… Once in a while, just reformatting /data helps.
What could be the root cause?
- Some application corrupting parts of the (internal 16GB) flash-memory – but which app? I might even restore the same app again and again
- Corrupt flash memory (with cells losing data, or not writable)
- Memory-controller
Looking at the internal 16GB ‘corrupt flash-memory’: in the ‘PC-world’ I would run some ‘chkdsk’, and I also know that during low-level formatting, bad blocks are recognized and marked as ‘bad’.
Is something similar available for Android or the underlying Linux operating system?
Even if I manage to identify a bad-block, can I mark it as ‘bad’, and would this marking survive a restore?
Of course I could simply send the tablet back to China, but that brings other problems and potential risks…
Thank you for your help
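A read-only surface scan is one way to look for unreadable areas. The following is a minimal Python sketch, assuming root access and a Linux/Android environment where Python is available; it only reports blocks where the read itself fails (for example with EIO), and reads may be served from the page cache. Note that an eMMC has its own controller that remaps bad NAND blocks internally, and a host-side bad-block list such as the one `e2fsck -c` records is stored inside the ext filesystem, so it would not survive reformatting.

```python
"""Read-only surface scan: report blocks of a file or block device that fail to read."""
import os
from typing import List


def scan_unreadable_blocks(path: str, block_size: int = 4096) -> List[int]:
    """Return byte offsets of blocks whose read raised OSError (e.g. EIO)."""
    bad: List[int] = []
    fd = os.open(path, os.O_RDONLY)  # read-only: nothing is ever written
    try:
        size = os.lseek(fd, 0, os.SEEK_END)  # works for regular files and block devices
        for offset in range(0, size, block_size):
            try:
                os.pread(fd, block_size, offset)  # positional read, no seek needed
            except OSError:
                bad.append(offset)  # the device reported a read error here
    finally:
        os.close(fd)
    return bad
```

Usage (the device path is a placeholder; check your own partition layout first): `scan_unreadable_blocks("/dev/block/mmcblk0")`.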

Tiemichael said:
Dear experts
I am very new to Android, and this N90FHDRK is my very first Android device ….
So for my N90 I am using Glassrom V2a, and I have tried several others as well.
For the last 2-3 weeks I have been facing strange problems:
- Tablet does not boot
- Tablet does not switch off (goes off only after pressing on/off for 30sec)
- Invalid unlock-code
- Battery-status unknown
- Disappearing applications and wallpapers
- Apps crashing
- ‘cannot mount /cache or /data’ in CWM-mode
Of course, reinstalling and wiping everything helps, but the problems come back once a day… Once in a while, just reformatting /data helps.
What could be the root cause?
- Some application corrupting parts of the (internal 16GB) flash-memory – but which app? I might even restore the same app again and again
- Corrupt flash memory (with cells losing data, or not writable)
- Memory-controller
Looking at the internal 16GB ‘corrupt flash-memory’: in the ‘PC-world’ I would run some ‘chkdsk’, and I also know that during low-level formatting, bad blocks are recognized and marked as ‘bad’.
Is something similar available for Android or the underlying Linux operating system?
Even if I manage to identify a bad-block, can I mark it as ‘bad’, and would this marking survive a restore?
Of course I could simply send the tablet back to China, but that brings other problems and potential risks…
Thank you for your help
Too bad, it seems I am quite alone with this problem.

+1. If anyone has expertise with this specific tablet, that would be greatly appreciated. I think the problem we are facing is similar: I want to go back to stock, and the .img file has been released; it's just a matter of using it. With CyanogenMod on my Nexus 4, I would simply have an already-signed file and flash it from recovery mode. But this .img file is quite different, and if someone could point me in the right direction on how to do it, that would be wicked. P.S. It's so unlike my Nexus 4, where you actually see the Android screen open and pick the bootloader and all that; here it's just a black screen, and the computer recognizes the device.
