eMMC sudden death research - Galaxy Note II, Galaxy S III Developer Discussion

Update from Feb 17th:
Samsung has started to upgrade eMMC firmware in the field - only for the GT-I9100 for now.
See post #79 for additional details.
Update from Feb 13th:
If you want to dump the eMMC's RAM yourself, go to post #72.
I'm looking for a dump of firmware revision 0xf7 if you've got one.
-----------------------
Since it's very likely that the recent eMMC firmware patch by Samsung is their patch for the "sudden death" issue, it would be very nice to understand what is really going on there.
According to a leaked moviNAND datasheet, it seems that MMC CMD62 is a vendor-specific command that moviNAND implements.
If you issue CMD62(0xEFAC62EC), then CMD62(0xCCEE) - you can read a "Smart report". To exit this mode, issue CMD62(0xEFAC62EC), then CMD62(0xDECCEE).
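As a rough illustration of working with such a report: the patch analysis that follows keys off a little-endian DWORD at Smart[324:328]. The sketch below shows how that field could be pulled out of a report buffer. The 512-byte buffer size and the function names are assumptions made for the example, and the buffer contents are synthetic, not real chip data.

```python
import struct

SMART_DATE_OFFSET = 324      # Smart[324:328], per the patch analysis
BUGGY_DATE = 0x20120413      # value the patch compares against


def smart_date(report: bytes) -> int:
    """Return the little-endian DWORD stored at Smart[324:328]."""
    (value,) = struct.unpack_from("<I", report, SMART_DATE_OFFSET)
    return value


def format_date(value: int) -> str:
    """Render 0x20120413 as '2012/04/13' (the hex digits read as a date)."""
    text = f"{value:08x}"
    return f"{text[0:4]}/{text[4:6]}/{text[6:8]}"


# Synthetic report for demonstration only (size is an assumption).
fake_report = bytearray(512)
fake_report[324:328] = bytes([0x13, 0x04, 0x12, 0x20])

date = smart_date(bytes(fake_report))
print(hex(date), format_date(date), date == BUGGY_DATE)
```

Note that the bytes appear reversed in the buffer (13 04 12 20) because the field is little-endian.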
So what are they doing in their patch?
1. Whenever an MMC is attached:
a. If it is "VTU00M", revision 0xf1, they read a Smart report.
b. The DWORD at Smart[324:328] represents a date (little-endian); if it is not 0x20120413, they don't patch the firmware. (Maybe only chips from 2012/04/13 are buggy?)
2. If the chip is buggy, whenever an MMC is attached or the device is resumed:
a. Issue CMD62(0xEFAC62EC) CMD62(0x10210000) to enter RAM write mode. Now you can write to RAM by issuing MMC_ERASE_GROUP_START(Address to write) MMC_ERASE_GROUP_END(Value to be written) MMC_ERASE(0).
b. *(0x40300) = 10 B5 03 4A 90 47 00 28 00 D1 FE E7 10 BD 00 00 73 9D 05 00
c. *(0x5C7EA) = E3 F7 89 FD
d. Exit RAM write mode by issuing CMD62(0xEFAC62EC) CMD62(0xDECCEE).
10 B5 looks like a common Thumb push (in ARM architecture). Disassembling the bytes that they write to 0x40300 yields the following code:
Code:
ROM:00040300 PUSH {R4,LR}
ROM:00040302 LDR R2, =0x59D73
ROM:00040304 BLX R2
ROM:00040306 CMP R0, #0
ROM:00040308 BNE locret_4030C
ROM:0004030A
ROM:0004030A loc_4030A ; CODE XREF: ROM:loc_4030Aj
ROM:0004030A B loc_4030A
ROM:0004030C ; ---------------------------------------------------------------------------
ROM:0004030C
ROM:0004030C locret_4030C ; CODE XREF: ROM:00040308j
ROM:0004030C POP {R4,PC}
ROM:0004030C ; ---------------------------------------------------------------------
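The listing above can be cross-checked by hand. The sketch below is not a disassembler; it only decodes the three PC-relative instructions in the 20-byte blob (the LDR literal load, the BNE and the B) using the standard 16-bit Thumb encodings, to confirm the literal value and the branch targets. The variable names are mine.

```python
import struct

BASE = 0x40300
BLOB = bytes.fromhex(
    "10 B5 03 4A 90 47 00 28 00 D1 FE E7 10 BD 00 00 73 9D 05 00"
)


def halfword(offset: int) -> int:
    """Read one little-endian 16-bit Thumb instruction from the blob."""
    return struct.unpack_from("<H", BLOB, offset)[0]


# LDR Rt, [PC, #imm8*4]   encoding: 01001 ttt iiiiiiii
ldr = halfword(2)
assert ldr >> 11 == 0b01001
rt = (ldr >> 8) & 7
ldr_pc = (BASE + 2 + 4) & ~3            # PC reads as address + 4, word-aligned
literal_addr = ldr_pc + (ldr & 0xFF) * 4
literal = struct.unpack_from("<I", BLOB, literal_addr - BASE)[0]

# BNE label               encoding: 1101 0001 iiiiiiii (signed, in halfwords)
bne = halfword(8)
assert bne >> 8 == 0xD1
imm8 = bne & 0xFF
if imm8 >= 0x80:
    imm8 -= 0x100
bne_target = BASE + 8 + 4 + imm8 * 2

# B label                 encoding: 11100 iiiiiiiiiii (signed, in halfwords)
b = halfword(10)
assert b >> 11 == 0b11100
imm11 = b & 0x7FF
if imm11 >= 0x400:
    imm11 -= 0x800
b_target = BASE + 10 + 4 + imm11 * 2

print(f"LDR R{rt}, =0x{literal:X} (literal stored at 0x{literal_addr:X})")
print(f"BNE 0x{bne_target:X}")
print(f"B   0x{b_target:X}  (branches to itself)")
```

The low bit of the literal (0x59D73) is set, which is how BLX to a register selects Thumb state; the function actually called starts at 0x59D72.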
Disassembling what they write to 0x5C7EA yields this:
Code:
ROM:0005C7EA BL 0x40300
Looks like it is indeed Thumb code.
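To double-check the branch target without a disassembler, here is a small sketch of a decoder for the 32-bit Thumb BL encoding (two little-endian halfwords, with the J1/J2 bits handled the Thumb-2 way). The function name is mine; the bytes and address are the ones from the patch.

```python
import struct


def decode_thumb_bl(code: bytes, address: int) -> int:
    """Return the target of a 4-byte Thumb BL located at `address`."""
    hi, lo = struct.unpack("<HH", code)
    # First halfword: 11110 S imm10; second halfword: 11 J1 1 J2 imm11
    if hi >> 11 != 0b11110 or lo >> 14 != 0b11 or not (lo >> 12) & 1:
        raise ValueError("not a Thumb BL instruction")
    s = (hi >> 10) & 1
    imm10 = hi & 0x3FF
    j1 = (lo >> 13) & 1
    j2 = (lo >> 11) & 1
    imm11 = lo & 0x7FF
    i1 = 1 - (j1 ^ s)                    # I1 = NOT(J1 XOR S)
    i2 = 1 - (j2 ^ s)                    # I2 = NOT(J2 XOR S)
    offset = (s << 24) | (i1 << 23) | (i2 << 22) | (imm10 << 12) | (imm11 << 1)
    if s:
        offset -= 1 << 25                # sign-extend the 25-bit offset
    return address + 4 + offset          # PC reads as instruction address + 4


patched = bytes.fromhex("E3 F7 89 FD")
print(hex(decode_thumb_bl(patched, 0x5C7EA)))
```

Running it on the four patched bytes gives the address of the new stub, which matches the disassembly.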
If we could dump the eMMC RAM, we would understand what has been changed.
By inspecting some code, it seems that we know how to dump the eMMC RAM:
Look at the function mmc_set_wearlevel_page in line 206. It patches the RAM (using the method mentioned before), then it validates what it has written (in lines 255-290). It seems that the procedure to read the RAM is as follows:
1. CMD62(0xEFAC62EC) CMD62(0x10210002) to enter RAM reading mode
2. MMC_ERASE_GROUP_START(Address to read) MMC_ERASE_GROUP_END(Length to read) MMC_ERASE(0)
3. MMC_READ_SINGLE_BLOCK to read the data
4. CMD62(0xEFAC62EC) CMD62(0xDECCEE) to exit RAM reading mode
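The four steps above can be written down as a plain command list. This sketch only builds the (opcode, argument) sequence and prints it; it does not talk to any hardware. The opcode numbers are the standard MMC ones as named in the Linux kernel headers. The argument passed to MMC_READ_SINGLE_BLOCK is not stated above, so the 0 used here is an assumption, as are the function name and the example address and length.

```python
from typing import List, Tuple

# Standard MMC opcodes
MMC_READ_SINGLE_BLOCK = 17
MMC_ERASE_GROUP_START = 35
MMC_ERASE_GROUP_END = 36
MMC_ERASE = 38
CMD62 = 62                       # vendor-specific on moviNAND

VENDOR_KEY = 0xEFAC62EC
ENTER_RAM_READ = 0x10210002
EXIT_VENDOR_MODE = 0xDECCEE

Command = Tuple[int, int]


def ram_read_sequence(address: int, length: int) -> List[Command]:
    """Build the command sequence for one RAM read, as described in the post."""
    return [
        (CMD62, VENDOR_KEY),
        (CMD62, ENTER_RAM_READ),             # step 1: enter RAM reading mode
        (MMC_ERASE_GROUP_START, address),    # step 2: address to read
        (MMC_ERASE_GROUP_END, length),       #         length to read
        (MMC_ERASE, 0),
        (MMC_READ_SINGLE_BLOCK, 0),          # step 3: argument is an assumption
        (CMD62, VENDOR_KEY),
        (CMD62, EXIT_VENDOR_MODE),           # step 4: leave vendor mode
    ]


for opcode, arg in ram_read_sequence(0x40300, 512):
    print(f"CMD{opcode}(0x{arg:X})")
```

Keeping the sequence as data makes it easy to review before anything is sent to a chip, which matters given how risky these commands are.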
I don't want to run this on my device, because I'm afraid - messing with the eMMC doesn't sound like a very good idea on my device (I don't have a spare one).
Does anyone have a development device they don't mind risking, and want to dump the eMMC firmware from it?

--> **Ultimate GS3 sudden death thread**
Just wanted to link to a prior thread with some information/testing that has been done. Completely understand if you nuke it because it doesn't meet the proper criteria or is way too noobish to be posted here. Anyway, just thought it _might_ help, so giving it a shot.

So I decided to do a small RAM dump after all.
Before the patch, 0x5C7EA reads FD F7 C2 FA, which is "BL 0x59D72".
As I thought, they replace a call to the old function with a call to the new one.
I will dump function 0x59D72 later this week.
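Both byte sequences can be reproduced by encoding the branch by hand. The sketch below builds a Thumb BL from the address of the call site and the target, and should yield the pre-patch bytes for 0x59D72 and the patched bytes for 0x40300. The function name is mine.

```python
import struct


def encode_thumb_bl(address: int, target: int) -> bytes:
    """Encode a 4-byte Thumb BL at `address` that calls `target`."""
    offset = target - (address + 4)          # PC reads as address + 4
    if offset % 2 or not -(1 << 24) <= offset < (1 << 24):
        raise ValueError("target out of range for Thumb BL")
    offset &= (1 << 25) - 1                  # 25-bit two's complement
    s = (offset >> 24) & 1
    i1 = (offset >> 23) & 1
    i2 = (offset >> 22) & 1
    imm10 = (offset >> 12) & 0x3FF
    imm11 = (offset >> 1) & 0x7FF
    j1 = (1 - i1) ^ s                        # J1 = NOT(I1) XOR S
    j2 = (1 - i2) ^ s                        # J2 = NOT(I2) XOR S
    hi = 0xF000 | (s << 10) | imm10          # 11110 S imm10
    lo = 0xD000 | (j1 << 13) | (j2 << 11) | imm11   # 11 J1 1 J2 imm11
    return struct.pack("<HH", hi, lo)


CALL_SITE = 0x5C7EA
print(encode_thumb_bl(CALL_SITE, 0x59D72).hex(" ").upper())  # before the patch
print(encode_thumb_bl(CALL_SITE, 0x40300).hex(" ").upper())  # after the patch
```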

So it looks like the new function calls the old function, and then if it returns ZERO it goes into an INFINITE loop?!?
Seems like an odd fix, maybe self-preservation?
Oli

WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13, my 06/12 SGS3 had its _first freeze ever_ on XXELLA. Maybe it's completely unrelated and was only a random thing.
But could it be that this fix temporarily (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
In those cases you would get a phone freeze, because the eMMC is temporarily unavailable, so the phone crashes until you reboot it. But it would have avoided eMMC data structure damage.
It doesn't sound very logical, but when you have to fix a problem and only have a few bytes to patch it (because the patch must be applied on every eMMC start), and the problem only occurs on a handful of devices (out of millions), then maybe it is acceptable to have a freeze instead of a dead eMMC in the rare cases where it occurs.
But this is only an idea... I don't know if it is like that.
BR
Rob
PS: Oranav, thank you very much for your effort.

@odewdney:
Right, I hadn't spotted this. Thanks for the observation.
Self preservation sounds possible.
@Rob2222:
This could be possible - this patch looks like a quick and dirty fix, so maybe they didn't have the time to fix this properly. Instead, they just avoid the bug entirely (at the cost of data corruption).
But I don't think this would cause lockups - I believe the chip has a watchdog...
All in all, I think the best thing we can do right now is to dump the whole firmware out of it. I will do it soon.

There is also a chance that this is just a temporary workaround to prevent further bricking - until there is a final fix.
As of now we assume that this is the fix, as it directly addresses the eMMC in question, but all this is just based on assumptions.

@Rob2222:
I think we can prove that the fix is actually locking into a loop, but you must risk your phone :/ . If you are in:
1) Flash back to an older version without the fix
2) Wait and pray
a) If your phone dies --> the guys were right about the fix and the loop
b) If it stays alive, then ...
We won't know for sure, but your phone is maybe in the "perfect" condition for the test :/ .
(Sorry if this makes no sense)

@ivan:
Sorry, can't do that. Because of high air humidity, my humidity indicator is already a little soaked. Given the warranty-repair reports in our local forums, I am not sure I would get warranty service; I think there's a fair chance that they would deny it. So I don't want to take any extra risk, and I am on unrooted stock at the moment because of that.
@Oranav:
In our local forum we have been getting reports of a rising number of lockups and restarts on S3s lately. Some are like my freeze.
It also seems that after a while these problems get better and even disappear completely.
Because of that I am wondering whether the fix maybe locks the eMMC if it finds a bad data structure, that lock could then cause a phone freeze (as already stated), and at the same time it repairs the bad data structure in that block.
At least this would explain a rising number of freezes with the fix, and the fact that the freezes become less frequent over time.
I have no idea if it's that way, I just wanted to post it as a theory to think about.
BTW, do you think that when the watchdog restarts the eMMC, it happens so fast that the phone isn't affected?
BR
Robert

Rob2222 said: "I have no idea if it's that way, I just wanted to post it as a theory to think about."
The problem is that there are too many theories imaginable, and I can't think of any way to prove them other than reverse engineering the moviNAND firmware.
Rob2222 said: "BTW, do you think when the watchdog restarts the eMMC that it goes that fast that the phone isn't affected?"
Certainly not. Watchdogs are slow, drivers running on a Cortex-A9 are blazing fast.
But I do think Linux's MMC driver can handle device restarts during an MMC operation.

@Rob2222:
If those freezes are really caused by the firmware fix, I don't see how they would disappear over time...
I mean, if it really is the case that the fix trades data corruption for eMMC survival, it would make sense to see freezes... but depending on what data is affected, they should only be treatable by reinstalling the affected app or deleting its data/cache.
Updated theory:
For all we know, the error condition where the eMMC dies is quite rare, since most devices were used for months before they passed away. So under the assumption that the error condition appears randomly, and that there is a chance of data corruption every time the condition appears with a fixed kernel, we could expect to see freezes and other problems some time after the fix was applied. That would explain the rising number of freezes reported. Furthermore, I'd assume that people getting freezes would try to do something about it, like reinstalling/deleting apps, wiping caches and/or data... or even reflashing, thus repairing the corrupted data. So the freezes would disappear.
Wait, doesn't most evidence suggest that the error condition does NOT appear in a random fashion, since there were no cases in the beginning, and then a lot all of a sudden? Well, it might be that this is just the way we perceive the issue. Maybe there were cases before, but they weren't reported... phones died, people sent them in, got new ones and went on with their lives. But after some time the issue became known... bloggers wrote about it... and so on... people realized their phones died because of a wider problem... voila, a steep rise in reported cases. Also, the number of dying S3s would simply rise with the rising number of S3s overall; I mean, Samsung kept selling phones, right?
But even under the assumption that the bug is related to wear-levelling and not random, here is another idea: I have no clue how the algorithms work, but maybe they use some sort of pseudo-random data to do whatever, with the same seed on all eMMCs... and thus all of them go through the same series of numbers. Now imagine the error condition is only triggered by a specific number or set of numbers (say someone screwed up a boundary condition). Under this theory the error condition wouldn't appear randomly, but after a certain amount of write ops (or something).
Another question I asked myself is: shouldn't there be cases where data corruption does damage beyond all repair except for reflashing?
Well, there might be, but it seems reasonable to assume that this is a lot less likely than user-data corruption, since most critical files on the phone shouldn't be opened writeable (or are on a read-only mounted partition in the first place), and hence shouldn't be affected by ****-ups during writes.
Like the previous poster, I want to add that this is most likely all bull****... but it is what came to my mind while looking for a theory that supports the data we have.

Okay, got a RAM dump.
I won't post it here (or anywhere else for that matter) because I don't want to get sued by Samsung.
I might release a kernel which allows you to dump the RAM yourself if there's enough demand, but I don't want to right now, because:
1. The code is ugly as hell, not implemented as a kernel module, not thread-safe etc.
2. It is highly dangerous (messing with the eMMC chip - I really don't know how stable this thing is), so if you want to do it on your device, you should be an expert. In that case, you can write the code yourself (with little effort).
Anyway, I hope the FTL is Whimory, since I'm familiar with it. Would be easier.
I'll let you know if I find anything interesting.
PS I've attached a little teaser. (Yes, this is the patched function. 0x40300 is red because I've opened a partial RAM dump.)
EDIT - Some initial results:
0. The CPU is a Cortex-M3.
1. No strings at all. Just some uninteresting release asserts ("REL_ASSERT").
2. Found the Smart Report generator function -> found the MMC command handlers.
3. Most MMC command handlers are stored in a function table. There are 3 special commands: MMC60, MMC62, MMC64. Depending on the arguments these special commands are given, they modify the function table (this is the so-called "vendor mode").
4. There are a lot of possible arguments for MMC62, not only the ones we know.
5. If you trace the function they patch all the way up the call stack, you get to the MMC24 and MMC25 handlers. These commands are MMC_WRITE_BLOCK and MMC_WRITE_MULTIPLE_BLOCK. Since the function they patch is deep down the call stack, it's very likely that it is part of the wear-leveling code.
Anyway, because of the lack of strings, I guess it would be very hard to truly understand the SDS bug we're facing.
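To make point 3 more concrete, here is a toy model of a command dispatcher whose function table is rewritten by CMD62. It is purely illustrative: the class, the handler strings and the choice of which opcodes get swapped are all invented for the example, and the real firmware's table layout and behaviour are not known from this thread. Only the three CMD62 argument values come from the posts above.

```python
from typing import Callable, Dict

VENDOR_KEY = 0xEFAC62EC
ENTER_RAM_READ = 0x10210002
EXIT_VENDOR_MODE = 0xDECCEE


class ToyEmmc:
    """Toy model only; not a description of the actual moviNAND firmware."""

    def __init__(self) -> None:
        self.unlocked = False
        self.table: Dict[int, Callable[[int], str]] = {
            17: lambda arg: "normal read",
            35: lambda arg: "normal erase-group start",
            62: self.cmd62,
        }
        self.normal = dict(self.table)       # saved copy of the default table

    def cmd62(self, arg: int) -> str:
        if arg == VENDOR_KEY:                # first CMD62 carries the key
            self.unlocked = True
            return "key accepted"
        if not self.unlocked:
            return "ignored"
        self.unlocked = False
        if arg == ENTER_RAM_READ:            # swap handlers: "vendor mode"
            self.table[17] = lambda a: "vendor: return RAM contents"
            self.table[35] = lambda a: "vendor: latch RAM address"
            return "vendor mode on"
        if arg == EXIT_VENDOR_MODE:          # restore the default handlers
            self.table = dict(self.normal)
            return "vendor mode off"
        return "unknown vendor argument"

    def dispatch(self, opcode: int, arg: int = 0) -> str:
        return self.table[opcode](arg)


chip = ToyEmmc()
print(chip.dispatch(17))
chip.dispatch(62, VENDOR_KEY)
print(chip.dispatch(62, ENTER_RAM_READ))
print(chip.dispatch(17))
```

This would explain why ordinary commands such as the erase-group ones can carry completely different meanings while the chip is in vendor mode.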

Re: eMMC sudden death research
I can't say I have an idea what's going on inside the eMMC, but usually with this kind of mistake/failure, debug or diagnostics code has been left in the release build.
Maybe some debug info being written repeatedly triggers the wear-levelling failure,
so the fix simply has to disable it.

Awesome research.
So we're dealing with a bug in exactly the same eMMC subsystem as in the faulty SGS2 eMMC chips, but in a device that was released after the SGS2 eMMCs had been proven faulty.
Oranav, for some reason I cannot send you PMs. Could you send me your dump? Does your eMMC come from the faulty series?

Hi all, after reading this thread, I am now scared....
I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3
The above all include the fix for SDS and Exynos hole.
I have been running the device for nearly 1 week I think. Last night I fully charged my phone, used it for 3 minutes surfing the forum (chrome) via wifi connection. After 3 minutes, I left the phone on the table for 3 hours. The only running app is Viber... when I tried to wake the phone up, it did wake up but it froze... no button worked, I tried for about 2 minutes... nothing worked except the power button which booted the device.
This is weird, never experienced this before... I am now scared the phone will die unexpectedly.

@owl74:
How long were you using the system before you updated to the 'fix'? Nonetheless, it does not necessarily mean that the phone is getting near to the SDS. I have had a few Android phones which would sometimes reboot or hang for other reasons.

I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Rebellos, I enabled the private messaging system for me. I do have the faulty chip.

Damn Oranav, nice work!
So does it mean you get a total screen freeze? Every time?
BR
Robert

@thealgorithm:
About 2 weeks. Forgot to mention I rebooted the phone before I charged it. I will return this phone before it dies on me.... I think I will get an S3, but I will check if it has a new chip... otherwise I will return it again and stick to my Desire HD, which is already running an S3 ROM....
@Oranav:
So everyone's speculation about the fix causing the freeze is right...

Re: eMMC sudden death research
Suppose this fix addresses wear leveling. If firmware without this fix wears out the eMMC, wouldn't the device then still boot and only crash afterwards? As far as I know, worn-out flash is still readable but no longer writable. Could it be that the wear leveling algorithm has a problem, so that after some time it replaces cells belonging to the bootloader, and that causes the death?
In short: I want to know whether using the old firmware for some time had a negative effect, because that old software caused extreme aging of the eMMC.

Related

[Q] Need help tracking down resource leak in WM6

At least I think it must be a resource leak. I have multiple WM6 devices that all behave the same way when my application is run. After a while, maybe 15 hours, they gradually deteriorate and refuse to start other applications. At first, it will just be an application like Opera that will not start. Eventually, things like File Explorer will also not start. When I say they won't start, I mean that I get the "not signed with a trusted certificate or one of its components cannot be found" message.
My first thought was that there was a memory leak, but according to the output of GlobalMemoryStatus, the system's memory use is not increasing over time. Then I thought that it might be storage space since I'm generating a big log file. But the storage space still sits above 60MB when this happens.
Restarting the device gets everything back to normal. So far, this is what I know:
1. I'm using the RIL. I noticed today that after about 12 hours, I stop getting RIL notifications.
2. I'm monitoring memory with GlobalMemoryStatus, but the available physical memory doesn't seem to be decreasing over time
3. The thread count remains constant for the life of the application
4. My storage space is decreasing, but there is still over 60MB available when everything starts to go wrong
5. In the end, the device winds up with the screen lock on, even though it is not configured.
It seems that there must be some kind of resource leak. The only other thing I can think of is kernel resources. I tried to rule out things like event handles through static code inspection, but maybe there's something I'm missing.
Does anyone have any suggestions as to how I would troubleshoot this further? I'm using VS2008 and a Tilt2 and an HTC Imagio (it happens on other devices as well).

I tracked this down to a registry handle leak. What bothers me is that I had to do it by static code inspection. I just looked for things like CreateEvent, RegOpenKeyEx, etc.
Since this didn't seem to show up as consumed physical memory, does anyone have any methods for inspecting kernel resource consumption on Windows Mobile? Do I have to rely on KITL and Platform Builder with the emulator? I'm hoping that there's some way I can diagnose this kind of problem with a real device. From my perspective, the device just started to fail and there were no external indicators to warn me of the impending failure.

Hi! Can somebody help me?
I have an Asus P535. I did a reset via "Start" > "Settings" > "Default settings" and lost everything on the device. Afterwards the align-screen prompt appeared and it stays like this.
I think I have to install Windows Mobile. Can you tell me how, step by step?
10000 thanks

Sorry, accidental re-post. Don't think I can delete it altogether...

Observations as to why market breaks / force close, and other anomalies

As I suspected early on, the issues boil down to corruption within the User Data or Cache partitions (less often the system partition) due to an unexpected shutdown of the device. Shutdown on these devices needs to follow the proper shutdown routine, as in any Linux environment. Following this best practice will ensure that all data is written out to its corresponding file system by flushing all caches, unmounting the file system, etc.
Here are the culprits behind the frequent random Force Closes, Market resets, etc. Each ultimately results in an unclean shutdown that corrupts some data.
1. The button we use is also a forced-off button. Typically, if you hold it down too long you are powering off the device.
2. Sometimes, when waking from sleep mode, you see the Viewsonic logo on startup - that means that the system shut down (most likely crashed).
3. If you're running Vegan and hitting the reboot option, I don't know for sure, but I suspect this is NOT performing a clean shutdown... (I don't have a copy of the source)
Anyway... wanted to pass this on, as last night my data partition became corrupt after using the Reboot function on the Poweroff menu of Vega 5.1.

You shouldn't need source code to debug a dirty shutdown. Can't you just run an adb logcat? Maybe run the shutdown command in a terminal on the device and pipe the output into a text file for later viewing.

My internal memory has to be repartitioned every few weeks - I'm certain that something is corrupting it over time. I had massive FC's just a week or so back where the SD partition re-do was the only fix.
I suspect that this happens in stock, as well - the problem of course is that there is no fix for a stock user, other than a return / exchange.

@roebeet:
I have been on stock since I got the device just moving to the newer versions when they come as OTAs and have never ever had to mess with my partition, so I don't THINK the issue is in the stock software. In fact, the only problems I've ever encountered were when I used the enhancement pack, in which case my screen started to become unresponsive and the calibration.ini I was told to try did not work. Since then I went back to 3389 and the device has been perfect ever since.
I could be wrong though and just very, very lucky....here's to hoping. Another thing to consider is maybe the memory is going bonkers for some reason. I've had flash memory that lasted forever and I've had flash memory that has gone wacky over a period of 6 months....even a wipe by the utility designed to do it doesn't fix it properly. I don't know how CWM wipes or partitions the memory, I do know there's supposed to be a special way to do it.
If it's not faulty memory off the bat, then that leaves something in the 'extras' being put into these ROMs. Maybe some of the newer tegra drivers or some coding to make the ROMs faster - I'm just saying, can't leave any stone unturned.
Has anyone that has stayed loyal to stock encountered these issues? We have to ask that question I think. Then we ask how many of the people playing with ROMs are seeing the issues, this would include people that have used CWM to partition and mess with their mounts initially.
I can say I've never seen data disappear from my internal memory or my SD and I can also say I've never seen multiple FCs except after putting in the enh. pack (keep in mind I got my tab on Dec 20something, so I had 3053 and then 3389 soon after).
The first sign of anything being 'corrupted' on its own at stock and I'll be sending mine back. As an owner of Android devices since Android's been around, I've never had my G1 or MT4G (or any smartphone before them) become corrupted due to not being shut down or rebooted properly, and while this is a tablet I think the fundamentals should be the same. Pampering 'faulty' memory is a risk. You can wipe and re-do all you want, but if it's faulty it's going to stay that way.

I've done that, but I guess you can say that unfortunately I have had only clean shutdowns since then... After the last corruption I formatted my data and cache partitions before I ran logcat.... Of course I thought of that afterward....
Generally, if anyone has FCs etc., run a logcat and post it here... we will be able to confirm this...
We could change the way the partitions are created and add a sync which will further reduce chances BUT will take a performance hit...
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
I reviewed the out of the box framework source on the google GIT and technically if a reboot command is given a clean shutdown is performed via the framework... but the widget on the shutdown screen I suspect is not calling the method properly or is not being called at all... All speculation at this point... But for sure there is corruption occurring..
Since the last corruption I switched over to pershoot's kernel... Even though his kernel seems to be a little slower, he seems to have included the latest drivers, among other items that relate to data integrity (I'm reading into the release notes).
NEO: The first thing I did when I got my device was install CW and Vegan... Updated kernels also... Never had an issue until the first time (yes, about a day ago) I used the reboot feature of Vegan. That corrupted my user data. I suspect if you have not been performing clean shutdowns then you are just lucky. With Linux, like any other OS, even with journaling, if you do not perform a clean shutdown you will surely encounter SOME corruption. Typically the corruption is remediated by the file system's integrity controls. You don't even know it happened... 1 time in 1000 the integrity controls cannot overcome the significant loss of data, which results in crashes, etc. Sometimes the corruption happens in areas which are lightly used, which is why you would get a Market reset... that data is easily replaceable on the fly. Core components that the subsystems require to run are not replaceable, which is why I had to reformat. What upsets me is that this failsafe is most likely not working properly, as it's far too frequent.... I too suspect it has something to do with CW.
But again.. between the wrongly placed power switch, the unprovoked reboots (i.e. the Viewsonic screen showing when trying to wake up the device) and the reboot button possibly not performing a proper shutdown, the chances will surely increase across a wider distribution of users. So it may not be a CW issue and just some poor design.
When I have time today I will verify if the reboot function performs a clean shutdown... if anyone has the time please post the logcat... I'm going to be running around today and will try to get to it..
watson540 said:
Shouldn't need source code to debug a dirty shutdown. Can't you just run an adb logcat? Maybe run the shutdown command in a terminal on the device and pipe the output into a text file for later viewing.
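For anyone who wants to script the logcat capture suggested above, here is a minimal sketch. It assumes `adb` is on the PATH and the tablet is connected with USB debugging on; the function names and the output file name are made up for illustration. `-d` dumps the current log buffer and exits, `-v time` adds timestamps.

```python
import subprocess

def build_logcat_command(serial=None):
    """Build the adb command line for a one-shot logcat dump."""
    cmd = ["adb"]
    if serial:
        # -s selects one device when several are attached
        cmd += ["-s", serial]
    cmd += ["logcat", "-d", "-v", "time"]
    return cmd

def save_logcat(path, serial=None):
    """Run the dump and write it to a text file for later viewing."""
    with open(path, "wb") as out:
        subprocess.run(build_logcat_command(serial), stdout=out, check=True)

if __name__ == "__main__":
    # Hypothetical file name; run this right after reproducing the problem.
    save_logcat("gtab_logcat.txt")
```

Grab the dump before rebooting if you can, since the log buffer lives in RAM.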

stanglx said:
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
AFAIK they're running yaffs ATM. Next move is to ext4...
Read some articles about this several weeks ago, apparently many apps do not properly flush file caches. One of the articles was a Google developer post about file corruption along with their API method which did a cache flush prior to a close, then a bit later was the Google indication that they were planning to move to ext4 FS to further help alleviate the problem.
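To show what "flush before close" means in practice, here is a rough sketch of the general flush + fsync + rename pattern. This is not the Google API from the article (I don't have it at hand), just the generic POSIX-style idea written in Python; `durable_write` is a made-up name.

```python
import os

def durable_write(path, data):
    """Write bytes so that a crash leaves either the old or the new file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        f.write(data)
        f.flush()             # push the userspace buffer down to the kernel
        os.fsync(f.fileno())  # ask the kernel to push its cache to the device
    # Rename over the old file only after the new contents are on storage.
    os.replace(tmp_path, path)
```

An app that just calls write() and close() leaves the data sitting in caches for a while, which is exactly the window a dirty shutdown hits. The price of the fsync is speed, which is the same trade-off mentioned earlier about adding a sync.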

stanglx said:
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
I suspect if you have not been performing clean shutdowns then you are just lucky. With Linux, like any other OS, even with journaling, if you do not perform a clean shutdown you will surely encounter SOME corruption. Typically the corruption is remediated by the file system's integrity controls. You don't even know it happened... 1 time in 1000 the integrity controls cannot overcome the significant loss of data, which results in crashes, etc. Sometimes the corruption happens in areas which are lightly used, which is why you would get a Market reset... that data is easily replaceable on the fly. Core components that the subsystems require to run are not replaceable, which is why I had to reformat. What upsets me is that this failsafe is most likely not working properly, as it's far too frequent.... I too suspect it has something to do with CW.
That's my point. How many times since we've had our Android and smart phones have we had situations where they are turned off or rebooted without the proper procedures? Power drains till they die, they drop and reboot, we clog them up with stuff or some app drives them nuts and they reboot or shut off....Yet you rarely if ever hear about a phone's data being 'corrupted' with stock software. Sure it may happen with official OTAs etc, but never just off-the-bat like what's happening with the G-Tab. But it's not happening to everyone either so I'm just looking to see if there's a pattern.
Even since the G1 and newer phones, you don't really hear about or see file corruption issues on stock software with these phones. It's when users start going to ROMs that you hear of issues cropping up. That's not to say it doesn't happen at all at stock, I just think we're seeing it in a more concentrated fashion here because of all the formatting, re-partitioning, etc. At first you hear, 4GB is a great partition size, then you hear there are problems so move to 2048, then you hear 256MB swap, then no swap since Android doesn't use it. Then dataloop for speed, then no dataloop because of critical issues. Rules and instructions change almost on a daily basis. I think it's more than these poor flash drives can take. I find sometimes it's good to keep it simple.
I owned a Vibrant for a while...decided it was a PoS when at stock I was seeing bad lag (because of Sam's terrible FS). People said...do the speedhack, it'll be fast!, but what was the caveat? Having to reboot the phone almost weekly, sometimes several times a week, and people were seeing what? Data corruption. That's not for me. Give me something that is lag free (doesn't have to be a bullet train, just don't skip on video or audio and make sure my live wallpaper and drawer animation is fluid and I'm happy!). Point being....keeping it simple may help to alleviate some of the issues. If people are seeing these problems with stock, then you're absolutely right and it would be a point of contention that the failsafe isn't working right.
Otherwise it seems the stock OS on these things are able to self correct in most situations and it may just be some of the many tweaked features in these ROMs doing something it shouldn't - or, I may just be very lucky indeed.
I'm still dying to get the OTA - I haven't seen one since 3389 yet.

Kernel panic - not syncing: Fatal exception in interrupt

I am one of the many who have been experiencing the random reboots. I have seen talk about it, but have not seen anyone really looking into why this is happening. Some people claim it happens only when docked, or when SD card is in etc. Yet others post that they still get the reboots without doing those things either.
I have been monitoring my reboot problem very closely. I have yet to determine the cause other than it only happens when the device is put into sleep mode manually or automatically, and I am looking for some help from some of the DEV's around here.
When our TF's do this reboot, it is a system crash. When this happens, a ROMDUMP file is placed on the internal "sd card".
These can be viewed with a simple text editor, like Windows Notepad. I myself cannot read the code and understand what info it is revealing to me. According to an Asus tech on the phone, this file can tell you what went wrong and made the device reboot. However the buggers won't tell you crap over the phone and want me to send the device in with the ROMDUMP files.
When I try and read the files, I do see one thing in common, in 99% of them, right near the end of the file, or the very last line before the crash, this line is present,
Kernel panic - not syncing: Fatal exception in interrupt
<2>[ 162.985309] CPU1: stopping
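If you have a pile of these dump files, a small script can pull out the panic line and the few lines before it. This is only a sketch: it assumes the dumps are plain text and contain the "Kernel panic - not syncing:" string shown above; `find_panics` is a name I made up.

```python
import re

PANIC_RE = re.compile(r"Kernel panic - not syncing: (?P<reason>.*)")

def find_panics(log_text, context=3):
    """Return each panic line with its reason and the lines leading up to it."""
    lines = log_text.splitlines()
    hits = []
    for i, line in enumerate(lines):
        m = PANIC_RE.search(line)
        if m:
            start = max(0, i - context)
            hits.append({
                "line": i + 1,                       # 1-based line number
                "reason": m.group("reason").strip(),
                "before": lines[start:i],            # what ran just before
            })
    return hits
```

The "before" lines are usually the interesting part, since they show what the kernel was doing right before it gave up.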
If our reboot issue is kernel based, which would indicate it's a firmware issue,
I was thinking one of the talented DEVs around here could fix us up.
Hell, maybe even just a reflash of the current firmware would fix the issue.
Anyway, if a DEV around here wants to or is willing to look into this, I have some ROM dump files they can look at, just send me a PM.
For reference,
I have a B60K model
Stock 3.1
GPS 1.3.1
Wifi 5.1.42
BluT 6.17
Kernel 2.6.36.3-00001-gf377a2b [email protected] #1
Build HMJ37.US_epad-8.4.4.5.2-20110603
Thanks.

I don't have any more dumps recently, deleted them so I can't pull up and see what mine said to give you, but wanted to just say I was having these multiple times a day every day and it started once I bought an AData 16GB SDCard for the dock. Then I ended up removing that card and bought a MicroSD 16GB card instead and it has quit doing the random reboots, so definitely seemed to be something with my SDCard in the dock.

Post your whole log here (as .txt or .zip) and I will look at it.
I've had these once or twice but have always deleted the file.
The Kernel Panic is the kernel's way of telling you that something unrecoverable has happened and the integrity of the whole OS is in question. Think of a kernel panic like a BSOD on Windows.
I've never seen that specific one before, but a quick Google search indicates it may be a problem with I/O operations - like bad RAM or a bad SD card.
sassafras

Thanks for the response. I have included 4 RAMDUMP files. I find these 4 special because they all happened in quick succession. Four separate reboots, all within 8 minutes of each other, without any interaction with the device on my part. I never touched the device, I just sat there staring at it rebooting 4 times in 8 minutes. On the final reboot the device never came back on. At this point I picked up the device and had to hold the power button down for over 10 seconds for the device to come back on to an Asus splash screen. This was minutes after I did a fresh factory reset via the OS options internally and then a hard reset using the hardware buttons.

...It's a bug alright...
It doesn't seem to be caused by the same problem though, just that the watchdog program invokes a kernel panic and reboots. Weird. I'll backtrace it later and see what's up.
sassafras

Went a whole day without a reboot. I did have an odd lock up/freeze at the lock screen where I couldn't unlock the device or get it to rotate the screen. It was locked up tight. Held the power button down for 20 secs before it shut down. Rebooted, no new RAMdump created. No issues since.
sassafras_, Did you have any luck reading those ramdumps?

I did - sort of.
They're all related to the watchdog program assuming it's soft locked up. Which it may very well have been, but since you weren't using the device at the time, it's hard to know for sure.
The functions that were called immediately prior to the fault were different, which to me indicates that it's just buggy software. Honestly, without doing a backtrace I wouldn't know, but I can't without a system.map from around the time of the lockup. I'm going to assume it's just buggy code from 3.1 and wait and see if the 3.2 release lowers the rate of these. If not, then maybe I'll do some more digging.
sassafras

sassafras_ said:
I did - sort of.
They're all related to the watchdog program assuming it's soft locked up. Which it may very well have been, but since you weren't using the device at the time, it's hard to know for sure.
The functions that were called immediately prior to the fault were different, which to me indicates that it's just buggy software. Honestly, without doing a backtrace I wouldn't know, but I can't without a system.map from around the time of the lockup. I'm going to assume it's just buggy code from 3.1 and wait and see if the 3.2 release lowers the rate of these. If not, then maybe I'll do some more digging.
sassafras
Is there any progress on this issue? I bought a brand new TF and it randomly reboots maybe 50 times during the day. And those romdumps have appeared on my internal storage. I don't have an external SD by the way. I'm stuck.

Hi.
I'm having the same problem with my Transformer. It's a week old B60 and it reboots probably 50 times a day and gives me log files.
Also I'm using Honeycomb 3.2.
I really want to find out what is going on.

I guess it's a hardware issue or something.
I'm going to return my TF today and get a new one.
If I get the same errors, I'll let you know.

I posted a workaround that helps immensely for rooted tablets somewhere around here. I can't find it tonight, but it's in one of the other 'random reboot' threads.
sassafras

sassafras_ -
Did you ever find anything with this issue? I am on my second TF and it is exhibiting the same random reboot while sleeping issue as the first. I know you have a post on another thread indicating how to tell the kernel to ignore "oops" conditions - have you received any feedback on how that is working? I assume this requires root access, I haven't yet rooted my device.
I have collected a few ramdump log files, but as of now only one out of 6 shows a kernel panic. I am new to Android, and I am trying to make sense of the dump logs. It appears that these dumps are maintained in a ring buffer, so the last entries are usually somewhere in the middle, is that correct? All of them also have some garbage at the end, but I assume that is just another effect of the ring buffer strategy.
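If the ring buffer guess is right, the dump can be put back in time order by finding the spot where the kernel timestamps jump backwards and rotating the file there. A sketch under that assumption (unconfirmed for these ramdumps), which also assumes the lines carry `[ seconds.micros]` timestamps like the panic lines quoted earlier in the thread, all from a single boot; `unwrap_ring_log` is a made-up name.

```python
import re

# Matches kernel timestamps such as "[ 162.985309]"
TS_RE = re.compile(r"\[\s*(\d+\.\d+)\]")

def unwrap_ring_log(lines):
    """Rotate a wrapped log so the oldest entry comes first."""
    prev = None
    for i, line in enumerate(lines):
        m = TS_RE.search(line)
        if not m:
            continue  # garbage or untimestamped line, keep it where it is
        ts = float(m.group(1))
        if prev is not None and ts < prev:
            # Time went backwards: line i is the oldest surviving entry.
            return lines[i:] + lines[:i]
        prev = ts
    return list(lines)  # no wrap point found, already in order
```

After rotating, the last timestamped lines should be the ones written right before the reboot.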
Like I said, I am new to Android, but I am a long time embedded and real-time programmer, and pretty handy in Linux. It seems to me that the log files aren't providing enough information, but I'm not sure how to debug kernel/system crashes in Android. If anyone could point me in the right direction of where I should look next to get more information on these crashes, perhaps we could get to the bottom of this problem.
From what I can tell via the logs, when the TF is sleeping, it wakes up from time to time for various reasons, then suspends when it is done. It looks like it is during this wake/suspend cycle that something occasionally goes wrong and causes the tablet to reboot.
I am hoping that this is a software/firmware issue (or a hardware issue that can be worked around with software), because I really like the TF platform and this issue makes keeping apps like IM or email running while the device sleeps kinda iffy.
Any help from the awesome experts here at XDA would be greatly appreciated, and I look forward to learning more of the gory details and inner workings of Android.

I have had the same issues. Configuring the kernel to ignore oops only helped a bit. The tf would still freeze in standby eventually (once a day or so). My supplier (i.e. not Asus) replaced it and my new tf (a SBK v2 one, unfortunately) has not rebooted once in 2+ weeks. So my guess is that it was a hardware issue (memory, something not coming out of backup mode properly, ...?). Not sure if one could work-around it in software.
Now, this was probably not very helpful but I thought I'd share my experience here. And possibly my tf suffered from an entirely different defect, although the symptoms were the same (ramdump logs from random reboots in standby, independent from wifi on/off, sync on/off, and lots of other settings I tried).

flipflipflip -
Thanks for your reply! I was hoping that it wasn't a hardware issue, and since I got two in a row with the identical problem I was thinking that maybe a software fix could get around it. After reading about your experience, I went ahead and returned it and ordered another one from a different source. Hopefully the third time's a charm!
I'm keeping my fingers crossed that this one is not an SBK v2, but I'll be happy just to have one without sleep-apnea!
This did give me a chance to load up ADB and poke around a bit under the hood of the last one, so if nothing else it is a learning experience. Hopefully I will have something to contribute to the community once I get my hands on a working device.

I know it's been a while (had a big work-related headache), but just wanted to post and let people know that I finally received a TF101 (B50!!) that seems to be working just fine - so I guess it was just a combination of bad luck and a hardware issue after all.
The only issue I have now is that sometimes when it is sleeping, it loses its internet connection (it still seems to be connected to the AP) - but I think I can work around that.
Cheers!

[Q] eMMC crash - possible reasons and solutions

Hello everyone.
I've been looking around here for some time, reading all that stuff about eMMC chips burning on the Desire S. That fact disappoints me, as I was aiming to buy the gadget myself. However, I didn't find any general solution or even an investigation of the case, so I'm trying to work something out. Let me summarize the main points that we have so far.
1) The faulty guy is usually the Samsung eMMC-type BGA chip KLM4G2DE (2 Gb NAND flash), however SanDisk chips were also found to burn.
2) The problem is hardware rather than software dependent, as it is observed without any correlation to the hboot/flash installed.
3) It was noticed that in many cases the eMMC fault followed extraction and re-insertion of the battery after a phone freeze.
4) HTC doesn't recognize this as a defect and no improvements to the hardware are made in new revisions of the motherboard (MB), as there have been cases (at least one) when the same phone crashed again some time after warranty repair, with the same eMMC chip installed.
5) Other phones with the same eMMC installed (e.g. Sensation) don't experience the same problems.
What can I deduce out of all this stuff and my own experience?
Since the case seems to be non-software dependent, it should be the chip or some other hardware that drives it wrong. Since the chip itself seems to be OK (see 5), I believe that it is poor motherboard design that burns the chip down. eMMC is a rather bomb-proof architecture combining the memory itself and the memory controller in the same package. Two major ways to drive it wrong are:
1) Supply incorrect clock pulses to clock bus
2) Supply incorrect power (current/voltage/voltage slope) to memory and/or controller
The first assumption seems not very likely, as the clock usually comes from a centralized source controlled by an oscillator. If the clock were wrong, the eMMC fault wouldn't be the only problem.
The second point seems rather reasonable, as the Samsung eMMC power-up guide (see file attached) directly points out the importance of an accurate power supply (especially the power-on slope!), otherwise memory faults are inevitable.
That's all I can deduce so far, unfortunately there are no photos/schematics of the Desire S on the web to analyze the connection of the eMMC chip to the MB. What can I suggest to prove/disprove all the stuff I wrote:
1) Can someone brave disassemble his Desire S and make high resolution photos of both sides of motherboard? This may help in further analysis.
2) Can someone even more brave and on close terms with an oscilloscope try to measure the power-up voltage slope on the Vcc and VccQ inputs of the eMMC chip (see document attached)? Maybe we are just having one of the issues described in the document.
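If someone does capture the power-up ramp, the rise time can be pulled out of the exported scope samples with a few lines. A minimal sketch, assuming the capture is a time-sorted list of (time, voltage) pairs; the 10%/90% thresholds are the usual rise-time convention, not numbers from the Samsung guide, and the allowed limits have to be read from the attached document since I am not reproducing them here. `rise_time` is a made-up name.

```python
def rise_time(samples, v_final, low=0.1, high=0.9):
    """Time between first crossing low*v_final and first crossing high*v_final.

    samples: list of (time, voltage) pairs sorted by time, any consistent units.
    Returns None if the ramp never reaches one of the thresholds.
    """
    t_low = t_high = None
    for t, v in samples:
        if t_low is None and v >= low * v_final:
            t_low = t
        if v >= high * v_final:
            t_high = t
            break
    if t_low is None or t_high is None:
        return None
    return t_high - t_low

if __name__ == "__main__":
    # Made-up capture: microseconds and millivolts on a 3300 mV rail.
    capture = [(0, 0), (100, 500), (200, 1500), (300, 2500), (400, 3100), (500, 3300)]
    print(rise_time(capture, v_final=3300))
```

The same function works for both Vcc and VccQ, just pass the nominal rail voltage as v_final.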
UPDATE
I found the datasheet for our chip, find it attached to this post.

Also, strangely, a common way to trigger the 'dead eMMC' is when you hit 'update all' apps in the Android Market, regardless of what ROM/Market version you're on..
Surely it must be a combination of a software fault/problem too?

As far as I understand it, some kind of software problem causes the phone to hang during a Market update. Lots of users tend to solve this by removing and re-inserting the battery, which leads to a burned eMMC.

Yeah.. that would be the case. It kind of sucks knowing you could potentially brick your phone by just updating apps from the market.
And if you're on a custom ROM, you're screwed
Sent from my HTC Desire S using xda premium

From my experience the eMMC in the "fried" cases is not actually faulty but simply does not allow write access. Usually it is accompanied by /cache or /cache + /data corruption. When only /cache is the problem it is fixable, but when /data is affected there is no way to write a bit to the internal memory.
Unfortunately I have no confirmed explanation to this...
Just in theory: when updating several apps from the Market (or during other activities that use the /cache partition), it is possible that /cache fills up and the device gets stuck at a point where it has no more space available to write. Rebooting to recovery and wiping /cache solves the problem. But if at that moment one app is downloading to /cache while another is being written from /cache to the /data partition, disconnecting the power source (battery pull) can interrupt the process and make this partition unavailable (example: if you pull your USB flash drive out of the PC while writing data to it, there is a great chance of destroying it - tested myself). The ext4 file system provides protection for such cases through the way it manages the writing process - reference here.
In my opinion all of the bricked devices (famous "fried eMMC") reported in this forum are easily repairable with JTAG and a skilled technician, but unfortunately there are no such cases reported here. Personally I do not have the knowledge, equipment and intention to do such experiments myself.
This is my logic based on my observations while trying to assist people in this forum to solve this issue. For some of them it was successful, for others - not.
I hope that my post will make a contribution to the general picture.
Regards,
Stefan

amidabuddha said:
The ext4 file system provides a protection for such cases by the way it is managing the writing process.
Hi, thanks for sharing your thoughts. So you say those who use EXT4 should be safe? Here's a screenie from my wife's DS.
Also, my SGSII got broken and now I'm thinking of getting a Desire S for myself. How's the situation with those fried eMMCs, were there a lot of reports over here? What are the chances of getting it "fried"? Sorry, I don't have much time to go through all the DS forums.

Isn't Gingerbread ext4?

al89nut said:
Isn't Gingerbread ext4?
At the present moment I think that HTC has switched to ext4, at least judging by the update_script of the new OTA 2.10.401.8:
Code:
# Script Version: G2.3
mount("ext4", "EMMC", "system", "/system");
mount("ext4", "EMMC", "/dev/block/mmcblk0p28", "/system/lib");
...
mount("ext4", "EMMC", "userdata", "/data");
EDIT: after checking my current file system (Stock 2.10.401.8) it appeared that all partitions (system, data, cache and devlog) are ext4
but when I purchased my device with version 1.28.401.1 and got the guts to S-OFF it (at the end of July) all my partitions were ext3 and I converted them manually using 4EXT Recovery. Maybe flashing a custom ROM converted the partitions from ext4 to ext3 I do not know...
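For anyone who wants to check their own partitions the same way, here is a rough sketch that reads the filesystem types out of /proc/mounts (for example from `adb shell cat /proc/mounts`). The device names in the sample are invented for illustration and will differ on a real phone; `fs_types` is a made-up helper name.

```python
def fs_types(mounts_text, mount_points=("/system", "/data", "/cache")):
    """Map each mount point of interest to its filesystem type.

    /proc/mounts lines look like: device mountpoint fstype options dump pass
    """
    found = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) >= 3 and fields[1] in mount_points:
            found[fields[1]] = fields[2]
    return found

if __name__ == "__main__":
    # Illustrative sample only, not taken from a real Desire S.
    sample = (
        "/dev/block/mmcblk0p25 /system ext4 ro,relatime 0 0\n"
        "/dev/block/mmcblk0p26 /data ext3 rw,nosuid,nodev 0 0\n"
        "/dev/block/mmcblk0p27 /cache ext3 rw,nosuid,nodev 0 0\n"
    )
    for mount_point, fs in sorted(fs_types(sample).items()):
        print(mount_point, fs)
```

Anything that still shows ext3 is a candidate for conversion in a recovery that supports it.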
I am not claiming 100% accuracy in this information, but Revolutionary supports only hboot 0.98.0000 (from version 1.28.401.1) and hboot 0.98.0002 (version 1.47.401.4), so I suppose most of the fault cases around are users like me who S-OFFed with ext3. Most of those who faced this problem had the ClockworkMod recovery (flashed by the Revolutionary exploit), which was not offering file system conversion at that time (not sure about the latest versions), and they never converted as I did.
I also cannot say for sure about the Stock users who prefer to send the device for repair after a failure with Market updates instead of making a mess with custom recoveries and RUU flashing. Maybe in this case a /cache wipe will do the trick (BTW the Stock recovery has an option "wipe cache")... or like this guy that fixed it with a hard reset and a new SD card.
More or less, the fact that the new version of the Market updates the apps one by one and has a button to stop all updates at once was done for a reason...

al89nut said:
Isn't Gingerbread ext4?
My wife's DS originally had:
system - EXT4
data - EXT3
cache - EXT3
The phone was purchased 3-4 months ago and had the 1.28.401.1 firmware on it, if I recall correctly. BTW, she got the famous M4G2DE chip...

So does the ext4 update mean the problem is fixed? Not that I want to experiment

amidabuddha said:
Just in theory: when updating several apps from the Market (or during other activities that use the /cache partition), it is possible that /cache fills up and the device gets stuck at a point where it has no more space available to write. Rebooting to recovery and wiping /cache solves the problem. But if at that moment one app is downloading to /cache while another is being written from /cache to the /data partition, disconnecting the power source (battery pull) can interrupt the process and make this partition unavailable (example: if you pull your USB flash drive out of the PC while writing data to it, there is a great chance of destroying it - tested myself). The ext4 file system provides protection for such cases through the way it manages the writing process - reference here.
I agree with that. Most plausible theory I've read about eMMC fries/corruptions. You've been helping people get out of this crap (I call it crap because it's HTC's fault, cheaping out on us) for a long time now.
What I don't understand is this - when downloading multiple apps from the Market (I had 2 - Angry Birds, ES File Explorer), the phone goes into sleep mode and NEVER wakes up. No ADB, no 3 key combo, NOTHING. This leads to an unavoidable battery pull, which results in corruption like you said above. Why does the phone enter this "Sleep of Death", if you may call it that? What the hell is the problem? Also, why not other HTC devices? (For that, I guess the answer is the unique slide-out battery, found in only 2 other devices - Bliss and Radar, whose battery can't be removed.) If we can solve the "Sleep of Death" mystery, we'll get this issue out.
(More info - http://bit.ly/rXhDRR and http://bit.ly/v3lsS6 )
Also, the DS (and all new HTC devices) are EXT4 by default. Flashing (not S-OFFing, but flashing a ROM after that) changes it back to EXT3. That is probably why my phone survived the battery pull I did as said above. (It was freshly S-OFFed by XTC Clip, stock ROM.)
Finally, I think this only happens to devices with already screwed eMMCs. Mine survived many battery pulls after the first one. The screwy ones are region specific. I never read that anyone from India had the issue.

shrome99 said:
...Also, why not other HTC devices?
I am not quite sure...
...Next I got on the phone with the HTC help center. I got friendly with the lady technician on the call. After some nice chat I started probing for information on the Desire S. After a long conversation she told me that the Desire S, Incredible S and Desire HD all have the problem of frying the eMMC chip if the battery is disconnected while power is on. She said she gets calls every day from people who have fried their eMMC chip...
Also
The Desire S does not have a force shutdown keystroke combo as my old Desire did.
...
And because of a design flaw in the way the battery door closes, and because HTC did not include a force shutdown key combination to shut the phone off properly when locked.
The 3 key combo on Desire S performs a hard reset, 2 key combo boots to bootloader from a shut down state, but no combination for a force shutdown when powered on.
Taken from here
So from all the above it is clear that the battery pull is the reason for this fault, and I think that the deduction of the OP (post #1) is in the right direction, but unfortunately I do not have the required technical knowledge to comment.
For me the following is a must to avoid this problem:
always keep "USB Debugging" ON in Settings
always disable the "fastboot" option in Settings to prevent "hibernation" mode, i.e. running processes
keep your /cache clean (always wipe before installing anything)
stick to the ext4 file system
if the device hangs anyway, never pull out the battery, better to wait for it to drain completely

The hard reset sucks. The one on my iPod touch ALWAYS works. You are stuck anywhere, hold home+power for 30 seconds, and there's a hard reset. This one rarely works

The second time my phone got spoilt was when I accidentally pressed update all in the Market..
Then the phone became sluggish and finally hung.
As I'd had that experience and didn't dare to pull the battery, I used the hold-buttons-to-restart method. I had to hold them for a longer time than usual (like 15 seconds++).
After the phone's screen went off, the phone never booted up again.

tcchuin said:
The second time my phone got spoilt was when I accidentally pressed update all in the Market..
Then the phone became sluggish and finally hung.
As I'd had that experience and didn't dare to pull the battery, I used the hold-buttons-to-restart method. I had to hold them for a longer time than usual (like 15 seconds++).
After the phone's screen went off, the phone never booted up again.
Wait, so you "fried" the eMMC chip just by pressing Vol up+Vol down+Power?

Hard reset while downloading apps = Cache corruption.
Sent from my iPod touch using Tapatalk

OMG And i just got Desire S for myself. Now S-Off in progress, can't wait to see what eMMC chip i've got...
EDIT:
Feelin' lucky Going to format all partitions as EXT4 and flash CM7.

levantine said:
Hello everyone.
1) The faulty guy is usually the Samsung eMMC-type BGA chip KLM4G2DE (2 Gb NAND flash), however SanDisk chips were also found to burn.
2) The problem is hardware rather than software dependent, as it is observed without any correlation to the hboot/flash installed.
3) It was noticed that in many cases eMMC fault followed extraction-insertion of battery after phone freeze.
Just leaving this here for Google'ability. I've experienced a similar case on a Samsung device, equipped with a MAG8DD moviNAND (manuf. date ~10/2009). After pulling the battery the moviNAND died.
Code:
mmcblk1: mmc1:0001 MAG8DD 15.3 GiB
mmcblk1: error -110 sending status comand
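For reference, the -110 in that message is a negated Linux errno, and 110 is ETIMEDOUT, i.e. the card stopped answering. A small sketch for decoding such lines; the errno table is hard-coded with the Linux values so it gives the same answer on any host OS, and `decode_mmc_error` is a made-up name.

```python
import re

# Linux errno numbers commonly seen in MMC messages (Linux values, not portable)
LINUX_ERRNO = {5: "EIO", 84: "EILSEQ", 110: "ETIMEDOUT"}

MMC_ERR_RE = re.compile(r"^(?P<dev>mmcblk\d+): error (?P<code>-?\d+) (?P<what>.*)$")

def decode_mmc_error(line):
    """Return (device, errno name, action) for an mmcblk error line, else None."""
    m = MMC_ERR_RE.match(line.strip())
    if not m:
        return None
    code = abs(int(m.group("code")))  # the kernel prints the errno negated
    return m.group("dev"), LINUX_ERRNO.get(code, "unknown"), m.group("what")
```

A timeout on a status command fits a controller that no longer responds at all, rather than one that returns bad data.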

bumping for more attention.

One more theory:
eMMC has an internal voltage regulator that requires an external decoupling capacitor.
From the SanDisk iNAND datasheet:
VDDi Connections
The VDDi (K2) ball must only be connected to an external capacitor that is connected to VSS. This signal may not be left floating. The capacitor’s specifications and its placement instructions are detailed below.
The capacitor is part of an internal voltage regulator that provides power to the controller.
Caution: Failure to follow the guidelines below, or connecting the VDDi ball to any external signal or power supply, may cause the device to malfunction.
The trace requirements for the VDDi (K2) ball to the capacitor are as follows:
• Resistance: <2 ohm
• Inductance: <5 nH
The capacitor requirements are as follows:
• Capacitance: >=0.1 uF
• Voltage Rating: >=6.3 V
• Dielectric: X7R or X5R
Maybe there are PCB design problems (trace inductance), or the decoupling capacitor in the DS has too little capacitance?
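To make the quoted limits easier to check against a real board, here is a minimal Python sketch. The function name `vddi_violations` and its parameters are made up for illustration; only the numeric limits come from the datasheet excerpt above, and the example values at the bottom are invented, not measurements from any phone.

```python
# Limits copied from the iNAND datasheet excerpt quoted above.
MAX_TRACE_RESISTANCE_OHM = 2.0   # trace resistance must be < 2 ohm
MAX_TRACE_INDUCTANCE_NH = 5.0    # trace inductance must be < 5 nH
MIN_CAPACITANCE_UF = 0.1         # capacitance must be >= 0.1 uF
MIN_VOLTAGE_RATING_V = 6.3       # voltage rating must be >= 6.3 V
ALLOWED_DIELECTRICS = {"X7R", "X5R"}


def vddi_violations(resistance_ohm, inductance_nh, capacitance_uf,
                    voltage_rating_v, dielectric):
    """Return a list of human-readable violations (empty if all limits are met).

    Hypothetical helper: the name and signature are illustrative only.
    """
    problems = []
    if not resistance_ohm < MAX_TRACE_RESISTANCE_OHM:
        problems.append("trace resistance too high")
    if not inductance_nh < MAX_TRACE_INDUCTANCE_NH:
        problems.append("trace inductance too high")
    if not capacitance_uf >= MIN_CAPACITANCE_UF:
        problems.append("capacitance too small")
    if not voltage_rating_v >= MIN_VOLTAGE_RATING_V:
        problems.append("voltage rating too low")
    if dielectric.upper() not in ALLOWED_DIELECTRICS:
        problems.append("wrong dielectric")
    return problems


# Invented example values: a long trace and an undersized capacitor.
print(vddi_violations(1.0, 8.0, 0.047, 6.3, "X5R"))
```

An empty list means the layout meets every limit in the excerpt; it says nothing about whether this is actually what kills the chips.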

Ghosts in the (Odin) Machine

Heh.
A tale of tears.
I use an old Win XP craptop (x32, 2 GB RAM) for anything dodgy - for instance running Odin or any other code of "uncertain provenance".
I leave it air-gapped and have a ghost image backup so the whole OS can be nuked and put back to square zero. (see "ntfsclone") Anything I need for it is just sneaker-netted.
So I'm trying to use it with Odin the other day, and I get irreproducible errors. With pure stock flashes, sometimes success, sometimes failures. Sometimes in the MD5 check in Odin, sometimes an "Auth Fail" on the phone.
WTF?
So I start doing MD5 checks manually. OK, bad checksums, there's the trouble. MD5s are OK on the sneaker-net USB stick, but sometimes not OK on the craptop HDD. No hardware complaints in the Event manager.
I temporarily conclude I have a dodgy USB port.
Use a different port, recopy all files. Check MD5s. All OK. Problem solved, right?
Run the MD5 check using Odin on 8 different stock firmwares (2.5 GB each, this is slow work). One of eight is bad. What? No event log hardware troubles evident.
Re-check MD5 on the bad one; it's correct. WHAT?
Out of frustration, I write a script that repeatedly loops over all eight blobs, computing MD5 values and comparing to past results. Let it run for 50 loops: 8 files * 2.5 GB * 50 loops = 1000 GB, about 1 TB of data reads. No errors. WHAT?
Now I let the script run and let Odin also do a MD5 check, making sure that both Odin and the (cygwin) md5sum proggie are simultaneously reading the same file.
They both fail their checks. SERIOUSLY? Independent **read** operations interfering with each other? WTF?
So finally I do what should have been done hours earlier: I reboot craptop lappy.
And it POSTs with a memory error at 0x00035648CE4 - approx 854 MB.
Ahhhh, it now all makes sense: the erratic nature of the problem depended on whether the file data traversed through read cache in the affected memory area. The files themselves are bigger than physical memory, but the exact pattern of memory usage depends on activity on the laptop. One checker running reads, read cache usage is one pattern; two running reads and it's a different pattern.
But that's not the end of the story, oh no!
I remember that craptop lappy has two SO-DIMMs of 1 GB each. One is under a door in the back, one is under the keyboard. Some disassembly required!
The idea is this: that memory error is in the first stick (854 out of 1024 MB). If I swap the sticks, the memory error will move to 1878 MB (854 + 1024). So long as the BIOS catches the error and "shortens" memory, I'll have a 1.8GB craptop. If BIOS won't reliably detect the problem, I'll chuck the second SO-DIMM, and have a 1GB Win XP craptop.
Before tearing anything down, I bust out an old copy of Knoppix (it has memtest86+ on it as an alternate boot), boot it up, and verify that yeah verily I seem to have a hard memory fault at that exact address reported by BIOS - 0x00035648CE4.
All things considered, it could be worse (e.g. massive random errors all over the place). At least it's only in a single fixed location, right?
So I tear down craptop lappy and swap the two SO-DIMMs; reassemble and boot memtest86+, and get an error at a single location.
0x00035648CE4
[Edit]
I thought that this meant that I had a problem with the memory controller, rather than one of the SO-DIMMs.
But it turned out that if I put only one of the SO-DIMMs in the first slot, one at a time, in one case I get no errors, and in the other case I get a memory error at exactly
0x0001AB60664 - 427 MB. Almost exactly *half* the prior value.
So I suppose that the memory controller is interleaving banks between slots A and B. A little bit odd that the exact same address would show up after a slot A <--> slot B DIMM swap.
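The address arithmetic is easy to double-check. A quick Python sketch (the variable names are mine; the two addresses are the ones reported above):

```python
MIB = 1 << 20  # bytes per MiB

both_dimms = 0x35648CE4   # fault address with both SO-DIMMs installed
single_dimm = 0x1AB60664  # fault address with only the bad SO-DIMM installed


def addr_to_mib(addr):
    """Return which MiB of physical memory an address falls in."""
    return addr // MIB


print(addr_to_mib(both_dimms))    # 854
print(addr_to_mib(single_dimm))   # 427

# Close to one half, but not bit-for-bit: the single-DIMM address is a
# couple of hundred KiB above exactly half of the two-DIMM address.
print(single_dimm / both_dimms)
print(single_dimm - both_dimms // 2)
```

A ratio this close to 0.5 is at least consistent with the interleaving guess, though it does not prove it.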
Well, I guess that's good news. The old craptop can keep on chugging away, but with only half its former memory. Or I can spend $13 to buy a replacement SO-DIMM.
Hope you enjoyed the read. (Misery loves company)
