Ghosts in the (Odin) Machine - General Questions and Answers

Heh.
A tale of tears.
I use an old Win Xp craptop (x32, 2 GB RAM) for anything dodgy - for instance running Odin or any other code of "uncertain provenance".
I leave it air-gapped and have a ghost image backup so the whole OS can be nuked and put back to square zero. (see "ntfsclone") Anything I need for it is just sneaker-netted.
So I'm trying to use it with Odin the other day, and I get irreproducible errors. With pure stock flashes, sometimes success, sometimes failures. Sometimes in the MD5 check in Odin, sometimes an "Auth Fail" on the phone.
WTF?
So I start doing MD5 checks manually. OK, bad checksums, there's the trouble. MD5s are OK on the sneaker-net USB stick, but sometimes not OK on the craptop HDD. No hardware complaints in the Event Viewer.
I temporarily conclude I have a dodgy USB port.
Use a different port, recopy all files. Check MD5s. All OK. Problem solved, right?
Run the MD5 check using Odin on 8 different stock firmwares (2.5 GB each, this is slow work). One of eight is bad. What? No event log hardware troubles evident.
Re-check MD5 on the bad one; it's correct. WHAT?
Out of frustration, I write a script that repetitively loops over all eight blobs, computing MD5 values and comparing to past results. Let it run for 50 loops: 8 * 2.5 * 50 = 1 TB of data reads. No Errors. WHAT?
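For anyone who wants to repeat the experiment, here is a minimal sketch of that kind of loop. It is not the original script (that one drove the cygwin md5sum); the `firmware` directory and the `*.tar.md5` pattern are made-up placeholders.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def recheck(paths, loops):
    """Hash every file `loops` times; return (loop, path, expected, got) mismatches."""
    baseline = {}
    mismatches = []
    for loop in range(loops):
        for path in paths:
            got = md5_of(path)
            # The first pass records the baseline; later passes compare against it.
            expected = baseline.setdefault(path, got)
            if got != expected:
                mismatches.append((loop, str(path), expected, got))
    return mismatches


if __name__ == "__main__":
    # Hypothetical location of the firmware blobs; adjust to taste.
    blobs = sorted(Path("firmware").glob("*.tar.md5"))
    for loop, path, expected, got in recheck(blobs, loops=50):
        print(f"loop {loop}: {path} expected {expected} got {got}")
```

Note that the baseline is simply whatever the first pass reads, so a file that is already bad on disk will look "consistent"; comparing against a known-good checksum list would be the stricter variant.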
Now I let the script run and let Odin also do an MD5 check, making sure that both Odin and the (cygwin) md5sum proggie are simultaneously reading the same file.
They both fail their checks. SERIOUSLY? Independent **read** operations interfering with each other? WTF?
So finally I do what should have been done hours earlier: I reboot craptop lappy.
And it POSTs with a memory error at 0x00035648CE4 - approx 854 MB.
Ahhhh, it now all makes sense: the erratic nature of the problem depended on whether the file data traversed through read cache in the affected memory area. The files themselves are bigger than physical memory, but the exact pattern of memory usage depends on activity on the laptop. One checker running reads, read cache usage is one pattern; two running reads and it's a different pattern.
But that's not the end of the story, oh no!
I remember that craptop lappy has two SO-DIMMs of 1 GB each. One is under a door in the back, one is under the keyboard. Some disassembly required!
The idea is this: that memory error is in the first stick (854 out of 1024 MB). If I swap the sticks, the memory error will move to 1878 MB (854 + 1024). So long as the BIOS catches the error and "shortens" memory, I'll have a 1.8 GB craptop. If the BIOS won't reliably detect the problem, I'll chuck the second SO-DIMM, and have a 1 GB WinXP craptop.
Before tearing anything down, I bust out an old copy of Knoppix (it has memtest86+ on it as an alternate boot), boot it up, and verify that yeah verily I seem to have a hard memory fault at that exact address reported by BIOS - 0x00035648CE4.
All things considered, it could be worse (e.g. massive random errors all over the place). At least it's only in a single fixed location, right?
So I tear down craptop lappy and swap the two SO-DIMMs; reassemble and boot memtest86+, and get an error at a single location.
0x00035648CE4
[Edit]
I thought that this meant that I had a problem with the memory controller, rather than one of the SO-DIMMs.
But it turned out that if I put only one of the SO-DIMMs in the first slot, one at a time, in one case I get no errors, and in the other case I get a memory error at exactly
0x0001AB60664 - 427 MB. Almost exactly *half* the prior value.
So I suppose that the memory controller is interleaving banks between slots A and B. A little bit odd that the exact same address would show up after a slot A <--> slot B DIMM swap.
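A quick sanity check of the two addresses, as a throwaway sketch (the variable names are mine, the addresses are the ones reported above):

```python
MIB = 1024 * 1024

both_sticks = 0x35648CE4   # fault reported with both SO-DIMMs fitted
one_stick = 0x1AB60664     # fault reported with only the bad SO-DIMM fitted


def to_mib(address):
    """Convert a byte address to mebibytes."""
    return address / MIB


print(f"both sticks: {to_mib(both_sticks):.1f} MiB")
print(f"one stick:   {to_mib(one_stick):.1f} MiB")
print(f"ratio:       {both_sticks / one_stick:.4f}")
print(f"offset from exact half: {one_stick - both_sticks // 2:#x} bytes")
```

The single-stick address lands roughly 240 KiB above an exact half of the two-stick address, so "almost exactly half" holds to within about 0.1%. That is consistent with the interleaving guess, but it does not prove it.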
Well, I guess that's good news. The old craptop can keep on chugging away, but with only half its former memory. Or I can spend $13 to buy a replacement SO-DIMM.
Hope you enjoyed the read. (Misery loves company)

Related

[Q] Need help tracking down resource leak in WM6

At least I think it must be a resource leak. I have multiple WM6 devices that all behave the same way when my application is run. After a while, maybe 15 hours, they gradually deteriorate and refuse to start other applications. At first, it will just be an application like Opera that will not start. Eventually, things like File Explorer will also not start. When I say they won't start, I mean that I get the "not signed with a trusted certificate or one of its components cannot be found" message.
My first thought was that there was a memory leak, but according to the output of GlobalMemoryStatus, the system's memory use is not increasing over time. Then I thought that it might be storage space since I'm generating a big log file. But the storage space still sits above 60MB when this happens.
Restarting the device gets everything back to normal. So far, this is what I know:
1. I'm using the RIL. I noticed today that after about 12 hours, I stop getting RIL notifications.
2. I'm monitoring memory with GlobalMemoryStatus, but the available physical memory doesn't seem to be decreasing over time
3. The thread count remains constant for the life of the application
4. My storage space is decreasing, but there is still over 60MB available when everything starts to go wrong
5. In the end, the device winds up with the screen lock on, even though it is not configured.
It seems that there must be some kind of resource leak. The only other thing I can think of is kernel resources. I tried to rule out things like event handles through static code inspection, but maybe there's something I'm missing.
Does anyone have any suggestions as to how I would troubleshoot this further? I'm using VS2008 and a Tilt2 and an HTC Imagio (it happens on other devices as well).
I tracked this down to a registry handle leak. What bothers me is that I had to do it by static code inspection. I just looked for things like CreateEvent, RegOpenKeyEx, etc.
Since this didn't seem to show up as consumed physical memory, does anyone have any methods for inspecting kernel resource consumption on Windows Mobile? Do I have to rely on KITL and Platform Builder with the emulator? I'm hoping that there's some way I diagnose this kind of problem with a real device. From my perspective, the device just started to fail and there were no external indicators to warn me of the impending failure.
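The static inspection described above can be partly mechanised. The following is only a rough sketch of that idea, not a tool anyone in this thread used: it counts calls by name in a source string and flags release functions that appear less often than the calls that hand out handles. The pairing table is my own choice and deliberately short, and a count mismatch is only a hint about where to look, since a single close call may sit in a loop or a shared cleanup path.

```python
import re
from collections import Counter

# Win32 calls that hand out a handle, mapped to the call that releases it.
PAIRS = {
    "RegOpenKeyEx": "RegCloseKey",
    "RegCreateKeyEx": "RegCloseKey",
    "CreateEvent": "CloseHandle",
}


def count_calls(source):
    """Count how often each acquire/release function name is called in source."""
    names = set(PAIRS) | set(PAIRS.values())
    pattern = re.compile(r"\b(" + "|".join(sorted(names)) + r")\s*\(")
    return Counter(match.group(1) for match in pattern.finditer(source))


def suspicious(source):
    """Return release functions that are called less often than their acquirers."""
    counts = count_calls(source)
    acquired = Counter()
    for acquire, release in PAIRS.items():
        acquired[release] += counts[acquire]
    # Map each under-used release call to (acquire count, release count).
    return {rel: (n, counts[rel]) for rel, n in acquired.items() if counts[rel] < n}
```

Feeding it the text of each source file and printing any non-empty result narrows down which files deserve a manual read. Calls written with an explicit suffix (for example `RegOpenKeyExW`) would need to be added to the table.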
Hi! Can somebody help me?
I have an Asus P535 and I reset it via Start > Settings > Default Settings, and I lost everything on the device. Afterwards the align-screen prompt appeared, and it stays stuck there.
I think I have to install Windows Mobile. Can you tell me how, step by step?
10000 thanks
Sorry, accidental re-post. Don't think I can delete it altogether...

Observations as to why market breaks / force close, and other anomalies

As I suspected early on, the issues boil down to corruption within the User Data or Cache partitions (less often on the system partition) due to an unexpected shutdown of the device. Shutdown on these devices needs to follow the proper shutdown routine, as in any Linux environment. Following this best practice will ensure that all data is written out to its corresponding file system by flushing all caches, unmounting the file systems, etc.
Here are the culprits behind the frequent random Force Closes, Market resets, etc.; each ultimately results in an unclean shutdown that corrupts some data.
1. The button we use is also a forced-off button. Typically, if you hold it down too long you power off the device.
2. Sometimes when the device is in sleep mode you see the Viewsonic logo upon waking it - that means that the system shut down (most likely crashed).
3. If you're running Vegan and you're hitting Reboot... I don't know for sure, but I suspect this is NOT performing a clean shutdown... (I don't have a copy of the source)
Anyway... wanted to pass this on... as last night my data partition became corrupt after using the Reboot function on the Poweroff menu of Vega 5.1..
Shouldn't need source code to debug a dirty shutdown. Can't you just run an adb logcat? Maybe run the shutdown command in a terminal on the device and pipe the output into a text file for later viewing.
My internal memory has to be repartitioned every few weeks - I'm certain that something is corrupting it over time. I had massive FC's just a week or so back where the SD partition re-do was the only fix.
I suspect that this happens in stock, as well - the problem of course is that there is no fix for a stock user, other than a return / exchange.
I have been on stock since I got the device just moving to the newer versions when they come as OTAs and have never ever had to mess with my partition, so I don't THINK the issue is in the stock software. In fact, the only problems I've ever encountered were when I used the enhancement pack, in which case my screen started to become unresponsive and the calibration.ini I was told to try did not work. Since then I went back to 3389 and the device has been perfect ever since.
I could be wrong though and just very, very lucky....here's to hoping. Another thing to consider is maybe the memory is going bonkers for some reason. I've had flash memory that lasted forever and I've had flash memory that has gone wacky over a period of 6 months....even a wipe by the utility designed to do it doesn't fix it properly. I don't know how CWM wipes or partitions the memory, I do know there's supposed to be a special way to do it.
If it's not faulty memory off the bat, then that leaves something in the 'extras' being put into these ROMs. Maybe some of the newer tegra drivers or some coding to make the ROMs faster - I'm just saying, can't leave any stone unturned.
Has anyone that has stayed loyal to stock encountered these issues? We have to ask that question I think. Then we ask how many of the people playing with ROMs are seeing the issues, this would include people that have used CWM to partition and mess with their mounts initially.
I can say I've never seen data disappear from my internal memory or my SD and I can also say I've never seen multiple FCs except after putting in the enh. pack (keep in mind I got my tab on Dec 20something, so I had 3053 and then 3389 soon after).
The first sign of anything being 'corrupted' on its own at stock and I'll be sending mine back. As an owner of Android since Android's been around, I've never had my G1 or MT4G (or any smartphone before it) become corrupted due to not being shut down or rebooted properly, and while this is a tablet I think the fundamentals should be the same. Pampering 'faulty' memory is a risk. You can wipe and re-do all you want, but if it's faulty it's going to stay that way.
I've done that, but I guess you can say that, unfortunately, I have had only clean shutdowns since then... The last time I had corruption, I formatted my data and cache partitions before I ran logcat.... Of course I thought of that afterward....
Generally, if anyone has FCs, etc., run a logcat and post it here... we will be able to confirm this...
We could change the way the partitions are created and add a sync which will further reduce chances BUT will take a performance hit...
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
I reviewed the out-of-the-box framework source on the Google git and technically, if a reboot command is given, a clean shutdown is performed via the framework... but I suspect the widget on the shutdown screen is not calling the method properly, or the method is not being called at all... All speculation at this point... But for sure there is corruption occurring..
Since the last corruption I switched over to pershoot's kernel... Even though his kernel seems to be a little slower, he seems to have included the latest drivers, some of which relate to data integrity (I'm reading into the release notes).
NEO: The first thing I did when I got my device was install CW and Vegan... Updated kernels also... Never had an issue until the first time (yes, about a day ago) I used the reboot feature of Vegan. That corrupted my user data. I suspect if you have not been performing clean shutdowns then you are just lucky. With Linux, like any other OS, even with journaling, if you do not perform a clean shutdown you will surely encounter SOME corruption. Typically the corruption is remediated by the file system's integrity controls. You don't even know it happened... 1 time in 1000 the integrity controls cannot overcome the significant loss of data, and that results in crashes, etc. Sometimes the corruption happens in areas which are lightly used, which is why you would get a Market reset... that data is easily replaceable on the fly. Core components that a subsystem requires in order to run are not replaceable, which is why I had to reformat. What upsets me is that this failsafe is most likely not working properly, as the corruption is far too frequent.... I too suspect it has something to do with CW.
But again.. between the wrongly placed power switch, the unprovoked reboots (i.e. the Viewsonic screen showing when trying to wake up the device) and the reboot button possibly not performing a proper shutdown, the chances will surely increase across a wider distribution of users. So it may not be a CW issue, just some poor design.
When I have time today I will verify if the reboot function performs a clean shutdown... if anyone has the time please post the logcat... I'm going to be running around today and will try to get to it..
watson540 said:
Shouldn't need source code to debug a dirty shutdown. Can't you just run an adb logcat? Maybe run the shutdown command in a terminal on the device and pipe the output into a text file for later viewing.
stanglx said:
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
AFAIK they're running yaffs ATM. Next move is to ext4...
Read some articles about this several weeks ago, apparently many apps do not properly flush file caches. One of the articles was a Google developer post about file corruption along with their API method which did a cache flush prior to a close, then a bit later was the Google indication that they were planning to move to ext4 FS to further help alleviate the problem.
stanglx said:
I am very surprised though as the EXT3 filesystem is very resilient to dirty shutdowns (more than EXT4)...
I suspect if you have not been performing clean shutdowns then you are just lucky. With Linux, like any other OS, even with journaling, if you do not perform a clean shutdown you will surely encounter SOME corruption. Typically the corruption is remediated by the file system's integrity controls. You don't even know it happened... 1 time in 1000 the integrity controls cannot overcome the significant loss of data, and that results in crashes, etc. Sometimes the corruption happens in areas which are lightly used, which is why you would get a Market reset... that data is easily replaceable on the fly. Core components that a subsystem requires in order to run are not replaceable, which is why I had to reformat. What upsets me is that this failsafe is most likely not working properly, as the corruption is far too frequent.... I too suspect it has something to do with CW.
That's my point. How many times since we've had our Android and smart phones have we had situations where they are turned off or rebooted without the proper procedures? Power drains till they die, they drop and reboot, we clog them up with stuff or some app drives them nuts and they reboot or shut off....Yet you rarely if ever hear about a phone's data being 'corrupted' with stock software. Sure it may happen with official OTAs etc, but never just off-the-bat like what's happening with the G-Tab. But it's not happening to everyone either so I'm just looking to see if there's a pattern.
Ever since the G1 and newer phones, you don't really hear about or see file corruption issues on stock software with these phones. It's when users start going to ROMs that you hear of issues cropping up. That's not to say it doesn't happen at all at stock, I just think we're seeing it in a more concentrated fashion here because of all the formatting, re-partitioning, etc. At first you hear 4GB is a great partition size, then you hear there are problems so move to 2048, then you hear 256MB swap, then no swap since Android doesn't use it. Then dataloop for speed, then no dataloop because of critical issues. Rules and instructions change almost on a daily basis. I think it's more than these poor flash drives can take. I find sometimes it's good to keep it simple.
I owned a Vibrant for a while...decided it was a PoS when at stock I was seeing bad lag (because of Sam's terrible FS). People said...do the speedhack, it'll be fast!, but what was the caveat? Having to reboot the phone almost weekly, sometimes several times a week, and people were seeing what? Data corruption. That's not for me. Give me something that is lag free (doesn't have to be a bullet train, just don't skip on video or audio and make sure my live wallpaper and drawer animation is fluid and I'm happy!). Point being....keeping it simple may help to alleviate some of the issues. If people are seeing these problems with stock, then you're absolutely right and it would be a point of contention that the failsafe isn't working right.
Otherwise it seems the stock OS on these things is able to self-correct in most situations, and it may just be some of the many tweaked features in these ROMs doing something they shouldn't - or, I may just be very lucky indeed.
I'm still dying to get the OTA - I haven't seen one since 3899 yet.

[Q] USB Speed

Just wondering why my phone copies files so slowly. Using my iPod as a flash drive, I can copy a 250+ MB file over in seconds. On my Captivate, transferring a ROM.zip file (around 150 MB) takes at least five minutes. Is this normal?
5 mins is a lot; I can usually do it in 10-15 seconds. It could depend on a lot of things: which USB port is being used, the number of files on the device, the device's write speed.
The iPod probably has about the same write speed. I highly doubt it is much over 30 MB/s, so it is probably another factor.
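To put numbers on the gap, here is a small back-of-the-envelope sketch. It assumes the 10-15 second figure refers to a file of about the same 150 MB size, which the reply does not actually state.

```python
def throughput_mb_per_s(size_mb, seconds):
    """Average transfer rate in MB/s."""
    return size_mb / seconds


captivate = throughput_mb_per_s(150, 5 * 60)   # 150 MB in five minutes
typical_slow = throughput_mb_per_s(150, 15)    # same file in 15 s (assumed size)
typical_fast = throughput_mb_per_s(150, 10)    # same file in 10 s (assumed size)

print(f"Captivate as reported: {captivate:.1f} MB/s")
print(f"Typical, per the reply: {typical_slow:.0f}-{typical_fast:.0f} MB/s")
```

Under that assumption the reported transfer runs at about 0.5 MB/s against 10-15 MB/s, a factor of 20 to 30, which is a lot more than normal variation between ports or cables.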
Alright, I didn't think it should take that long. It must be my phone, because it happens on all my USB ports. Are you using the cable that came with it?
OBatRFan said:
Just wondering why my phone copies files so slowly. Using my iPod as a flash drive, I can copy a 250+ MB file over in seconds. On my Captivate, transferring a ROM.zip file (around 150 MB) takes at least five minutes. Is this normal?
Last time that happened to me with one of my externals, it was because one of the sectors on the hard drive was corrupted. It would copy over, but at severely low speeds. Formatting/partitioning didn't resolve anything; I knew it wouldn't, but there was hope.
Could be that your device is about to fail or that storage has corrupted sectors.
How old is this device?
I got it back in July, about two weeks after it came out. If I remember correctly, it's been copying this slow for me since day one...
All I can say is that if it's not the USB port (tried various ones) and it's not the cable (tried at least 2), then it's probably the device. Sounds like the device's storage is failing.
You can try a program like EaseUS (I'm not sure if that will do exactly what you need) to check the integrity of your phone. Just mount it and select the drive.
I usually use Ubuntu for stuff like this, so I'm not exactly sure which program can help you if EaseUS can't. But I'll look into it a bit more.
EDIT: You can also try a chkdsk to see what populates.
start>run>cmd then chkdsk #:\
# being the drive letter your phone is.
Thank you for your help. I'm not exactly sure what EaseUs and Ubuntu are for, could you elaborate?
And what should happen when I chkdsk? It says "That type of file system is FAT32."
EDIT: So from what I gather, Ubuntu will stream data to your phone rather than over USB? How fast is this?
LOL...
Ubuntu is a different operating system (like Windows / OS X) - get a lot of those guys here... (friendly poke)
My computer > right click your phone storage after mounting on phone > properties > tools tab > check for errors.
And you've tried this with a different cable/computer?
TRusselo said:
LOL...
Ubuntu is a different operating system (like Windows / OS X) - get a lot of those guys here... (friendly poke)
My computer > right click your phone storage after mounting on phone > properties > tools tab > check for errors.
OH, LOL. I thought it was just a really cool piece of software. Thanks for the suggestion, I'll try it.
I'll try to use it on a different computer, and I'll try using my Kindle cable.
EDIT: Right click, properties, etc. showed no errors with phone. Kindle cable won't mount. Maybe it's a problem with earlier builds? Mine is 1006.

eMMC sudden death research

Update from Feb 17th:
Samsung has started to upgrade eMMC firmware in the field - only for GT-I9100 for now.
See post #79 for additional details.
Update from Feb 13th:
If you want to dump the eMMC's RAM yourself, go ahead to post #72.
I'm looking for a dump of firmware revision 0xf7 if you've got one.
-----------------------
Since it's very likely that the recent eMMC firmware patch by Samsung is their patch for the "sudden death" issue, it would be very nice to understand what is really going on there.
According to a leaked moviNAND datasheet, it seems that MMC CMD62 is a vendor-specific command that moviNAND implements.
If you issue CMD62(0xEFAC62EC), then CMD62(0xCCEE) - you can read a "Smart report". To exit this mode, issue CMD62(0xEFAC62EC), then CMD62(0xDECCEE).
So what are they doing in their patch?
1. Whenever an MMC is attached:
   a. If it is "VTU00M", revision 0xf1, they read a Smart report.
   b. The DWORD at Smart[324:328] represents a date (little-endian); if it is not 0x20120413, they don't patch the firmware. (Maybe only chips from 2012/04/13 are buggy?)
2. If the chip is buggy, whenever an MMC is attached or the device is resumed:
   a. Issue CMD62(0xEFAC62EC) CMD62(0x10210000) to enter RAM write mode. Now you can write to RAM by issuing MMC_ERASE_GROUP_START(Address to write) MMC_ERASE_GROUP_END(Value to be written) MMC_ERASE(0).
   b. *(0x40300) = 10 B5 03 4A 90 47 00 28 00 D1 FE E7 10 BD 00 00 73 9D 05 00
   c. *(0x5C7EA) = E3 F7 89 FD
   d. Exit RAM write mode by issuing CMD62(0xEFAC62EC) CMD62(0xDECCEE).
10 B5 looks like a common Thumb push (in ARM architecture). Disassembling the bytes that they write to 0x40300 yields the following code:
Code:
ROM:00040300 PUSH {R4,LR}
ROM:00040302 LDR R2, =0x59D73
ROM:00040304 BLX R2
ROM:00040306 CMP R0, #0
ROM:00040308 BNE locret_4030C
ROM:0004030A
ROM:0004030A loc_4030A ; CODE XREF: ROM:loc_4030Aj
ROM:0004030A B loc_4030A
ROM:0004030C ; ---------------------------------------------------------------------------
ROM:0004030C
ROM:0004030C locret_4030C ; CODE XREF: ROM:00040308j
ROM:0004030C POP {R4,PC}
ROM:0004030C ; ---------------------------------------------------------------------
Disassembling what they write to 0x5C7EA yields this:
Code:
ROM:0005C7EA BL 0x40300
Looks like it is indeed Thumb code.
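The BL decoding can be checked by hand without a disassembler. A minimal sketch, assuming the classic two-halfword Thumb BL encoding (11-bit high part, 11-bit low part, both set J bits); the function name is mine:

```python
import struct


def decode_thumb_bl(address, raw):
    """Decode a classic two-halfword Thumb BL at `address`; return the target."""
    high, low = struct.unpack("<HH", raw)
    if high >> 11 != 0b11110 or low >> 11 != 0b11111:
        raise ValueError("not a classic Thumb BL pair")
    offset = ((high & 0x7FF) << 12) | ((low & 0x7FF) << 1)
    if offset & (1 << 22):          # sign-extend the 23-bit offset
        offset -= 1 << 23
    return address + 4 + offset     # Thumb PC reads as instruction address + 4


target = decode_thumb_bl(0x5C7EA, bytes.fromhex("E3F789FD"))
print(hex(target))  # prints 0x40300
```

The four bytes written to 0x5C7EA therefore branch to the 0x40300 stub, matching the disassembler output above.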
If we could dump the eMMC RAM, we would understand what has been changed.
By inspecting some code, it seems that we know how to dump the eMMC RAM:
Look at the function mmc_set_wearlevel_page in line 206. It patches the RAM (using the method mentioned before), then it validates what it has written (in lines 255-290). It seems that the procedure to read the RAM is as follows:
1. CMD62(0xEFAC62EC) CMD62(0x10210002) to enter RAM reading mode
2. MMC_ERASE_GROUP_START(Address to read) MMC_ERASE_GROUP_END(Length to read) MMC_ERASE(0)
3. MMC_READ_SINGLE_BLOCK to read the data
4. CMD62(0xEFAC62EC) CMD62(0xDECCEE) to exit RAM reading mode
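Written out as data, the four steps above look like this. This is only a sketch that builds the command list; it sends nothing to any hardware. The opcode numbers are the standard ones from the Linux kernel's mmc.h, while the argument of the final read (0 here) is a placeholder of mine, since the steps above do not say what it should be.

```python
# Opcode numbers as defined in the Linux kernel's include/linux/mmc/mmc.h.
MMC_READ_SINGLE_BLOCK = 17
MMC_ERASE_GROUP_START = 35
MMC_ERASE_GROUP_END = 36
MMC_ERASE = 38
CMD62 = 62  # vendor-specific on moviNAND, per the datasheet mentioned above


def ram_read_sequence(address, length):
    """Return the (opcode, argument) pairs for one eMMC RAM read, in order."""
    return [
        (CMD62, 0xEFAC62EC), (CMD62, 0x10210002),   # enter RAM reading mode
        (MMC_ERASE_GROUP_START, address),            # address to read
        (MMC_ERASE_GROUP_END, length),               # length to read
        (MMC_ERASE, 0),
        (MMC_READ_SINGLE_BLOCK, 0),                  # read argument is an assumption
        (CMD62, 0xEFAC62EC), (CMD62, 0xDECCEE),      # exit RAM reading mode
    ]


for opcode, argument in ram_read_sequence(0x40300, 512):
    print(f"CMD{opcode}({argument:#x})")
```

Keeping the sequence as plain data makes it easy to review before anything is wired up to a real MMC host, which matters given how easy it is to brick the chip.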
I don't want to run this on my device, because I'm afraid - messing with the eMMC doesn't sound like a very good idea (I don't have a spare one).
Does someone have a development device they don't mind risking, and want to dump the eMMC firmware from it?
:crying: --> **Ultimate GS3 sudden death thread** :crying:
Just wanted to link to a prior thread with some information/testing that has been done. Completely understand if you nuke it because it doesn't meet the proper criteria or is way too noobish to be posted here. Anyway, just thought it _might_ help, so giving it a shot..
So I decided to do a small RAM dump after all.
Before the patch, 0x5C7EA reads FD F7 C2 FA, which is "BL 0x59D72".
As I thought, they replace a function call to the new one.
I will dump function 0x59D72 later this week.
So it looks like the new function calls the old function, and then if it returns ZERO then it goes into an INFINITE loop?!?
Seems like an odd fix, maybe self-preservation?
Oli
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13, my 06/12 SGS3 had its _first freeze ever_ on XXELLA. Maybe it's completely unrelated and was only a random thing.
But could it be that this fix temporarily (until reboot) locks up the eMMC in a bad situation, to avoid damaging internal data structures?
In those cases you would get a phone freeze, because the eMMC is temporarily unavailable, so the phone crashes until you reboot it. But it would have avoided eMMC data structure damage.
Sounds not very logical, but when you have to fix a problem and only have a few bytes to patch it (because it must be run on every eMMC start), and the problem only occurs on a handful of devices (out of millions), then maybe it is acceptable to have a freeze instead of a dead eMMC in the rare cases where it occurs.
But this is only an idea... don't know if it is like that.
BR
Rob
PS: Oranav, thank you very much for your effort.
odewdney said:
So it looks like the new function calls the old function, and then if it returns ZERO then it goes into an INFINITE loop?!?
Seems like an odd fix, maybe self-preservation?
Oli
Right, haven't spotted this. Thanks for the observation.
Self preservation sounds possible.
This could be possible - this patch looks like a quick and dirty fix, so maybe they didn't have the time to fix this properly. Instead, they just avoid the bug entirely (at the cost of data corruption).
But I don't think this would cause lockups - I believe the chip has a watchdog...
All in all, I think the best thing we can do right now is to dump the whole firmware out of it. I will do it soon.
There is also a chance that this is just a temporary workaround to prevent further bricking, until there is a final fix.
As of now we assume that this is the fix, as it directly addresses the eMMC in question, but all this is just based on assumptions.
Rob2222 said:
WELL... after I changed to XXELLA stock firmware and stock kernel in 01/13, my 06/12 SGS3 had the _first freeze ever_ on XXELLA. Maybe it's completely unrelated and was only a random thing.
But could it be that this fix temporarily (until reboot) locks the eMMC in a bad situation to avoid damaging internal data structures?
In those cases you get a phone freeze, because the eMMC is temporarily unavailable, so the phone crashes until you reboot it. But it avoided eMMC data structure damage.
It doesn't sound very logical, but when you have to fix a problem and only have a few bytes to patch it (because the patch must run on every eMMC start), and the problem only occurs on a handful of devices (out of millions), then it may be acceptable to have a freeze instead of a dead eMMC in the rare cases where it occurs.
But this is only an idea... I don't know if it is like that.
BR
Rob
PS: Oranav, thank you very much for your effort.
I think we can prove that the fix is actually locking into a loop, but you must risk your phone :/ . If you are in ->
1) Flash back to an older version without the fix
2) Wait and pray
a) If your phone dies --> the guys were right about the fix and the loop
b) If it stays alive, then ...
We won't know for sure, but your phone is maybe in the "perfect" condition for the test :/ .
(Sorry if this makes no sense)
@ivan:
Sorry, can't do that. Because of high air humidity my humidity indicator is already a little soaked. Given the warranty-repair reports in our local forums, I am not sure I would get warranty service; I think there's a fair chance they would deny it. Because of that I don't want to take any extra risk, so I am on unrooted stock at the moment.
@Oranav:
In our local forum we have recently been getting a rising number of reports of lockups and restarts on S3s. Some are like my freeze.
It also seems that after a while these problems get better and even disappear completely.
Because of that I am wondering whether the fix locks the eMMC if it finds a bad data structure, whether this lock could cause a phone freeze (as already stated), and whether at the same time it repairs the bad data structure in that block.
At least this would explain both the rising number of freezes with the fix and the fact that the freezes become less frequent over time.
I have no idea if it's that way; I just wanted to post it as a theory to think about.
BTW, do you think that when the watchdog restarts the eMMC, it happens so fast that the phone isn't affected?
BR
Robert
Rob2222 said:
I have no idea if it's that way; I just wanted to post it as a theory to think about.
The problem is that there are too many theories imaginable, and I can't think of any way to prove them other than reverse engineering the MoviNAND firmware.
Rob2222 said:
BTW, do you think that when the watchdog restarts the eMMC, it happens so fast that the phone isn't affected?
Certainly not. Watchdogs are slow, drivers running on a Cortex-A9 are blazing fast.
But I do think Linux's MMC driver can handle device restarts during an MMC operation.
Rob2222 said:
@ivan:
Sorry, can't do that. Because of high air humidity my humidity indicator is already a little soaked. Given the warranty-repair reports in our local forums, I am not sure I would get warranty service; I think there's a fair chance they would deny it. Because of that I don't want to take any extra risk, so I am on unrooted stock at the moment.
@Oranav:
In our local forum we have recently been getting a rising number of reports of lockups and restarts on S3s. Some are like my freeze.
It also seems that after a while these problems get better and even disappear completely.
Because of that I am wondering whether the fix locks the eMMC if it finds a bad data structure, whether this lock could cause a phone freeze (as already stated), and whether at the same time it repairs the bad data structure in that block.
At least this would explain both the rising number of freezes with the fix and the fact that the freezes become less frequent over time.
I have no idea if it's that way; I just wanted to post it as a theory to think about.
BTW, do you think that when the watchdog restarts the eMMC, it happens so fast that the phone isn't affected?
BR
Robert
If those freezes are really caused by the firmware fix, I don't see how they would disappear over time...
I mean, if it really is the case that the fix trades data corruption for eMMC survival, it would make sense to see freezes... but depending on what data is affected, they should only be treatable by reinstalling the affected app or deleting its data/cache.
Updated theory:
For all we know, the error condition where the eMMC dies is quite rare, since most devices were used for months before they passed away. So under the assumption that the error condition appears randomly and that there is a chance of data corruption every time the condition appears with a fixed kernel, we could expect to see freezes and other problems some time after the fix was applied. That would explain the rising number of freezes reported. Furthermore, I'd assume that people getting freezes would try to do something about it, like reinstalling/deleting apps, wiping caches and/or data... or even reflashing, thus repairing the corrupted data. So the freezes would disappear.
Wait, doesn't most evidence point to the fact that the error condition does NOT appear in a random fashion, since there were no cases in the beginning, and then a lot all of a sudden? Well, it might be that this is just the way we perceive the issue. Maybe there were cases before, but they weren't reported... phones died, people sent them in, got new ones and went on with their lives. But after some time the issue got known... bloggers wrote about it... and so on... people realized their phones died because of a wider problem... voila, a steep rise in reported cases. Also, the number of dying S3s would simply rise with the rising number of S3s overall. I mean, Samsung kept selling phones, right?
But even under the assumption the bug is related to wear-levelling and not random, here is another idea: I have no clue how the algorithms work, but maybe it uses some sort of pseudo-random data to do whatever, with the same seed on all eMMCs... and thus all of them go through the same series of numbers. And now imagine the error condition is only triggered by a specific number or number set (say someone screwed up a boundary condition). Under this theory the error condition wouldn't appear randomly, but after a certain amount of write ops (or something).
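The same-seed idea can be illustrated with a tiny sketch. This is purely a toy model of the speculation above and says nothing about how the real wear-levelling code works: the function name, the seed, the trigger value and the modulus are all made up. It only shows that a pseudo-random generator started from the same seed reaches a given "bad" value after the same number of operations on every chip.

```python
import random


def ops_until_trigger(seed, trigger, modulus=1000, limit=100_000):
    """Count operations until the PRNG first emits the trigger value."""
    rng = random.Random(seed)  # same seed -> same sequence on every "chip"
    for op in range(1, limit + 1):
        if rng.randrange(modulus) == trigger:
            return op
    return None  # trigger not seen within the limit


# Two hypothetical chips sharing one seed and one buggy boundary value.
chip_a = ops_until_trigger(seed=1234, trigger=7)
chip_b = ops_until_trigger(seed=1234, trigger=7)
print(chip_a == chip_b)  # True: the sequence is fully determined by the seed
```

Under that assumption the failure would look "sudden" across many devices of similar age and usage, even though nothing about it is random.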
Another question I asked myself is: shouldn't there be cases where data corruption does damage beyond all repair except for reflashing?
Well, it might be, but it seems reasonable to assume that it is a lot less likely than user-data corruption, since most critical files on the phone shouldn't be opened writeable (or are on a read-only mounted partition in the first place), hence shouldn't be affected by ****-ups during writes.
Like the previous poster I want to add that this is most likely all bull****... but it is what came to my mind looking for a theory that supports the data we got.
Okay, got a RAM dump
I won't post it here (or anywhere else for that matter) because I don't want to get sued by Samsung.
I might release a kernel which allows you to dump the RAM yourself if there's enough demand, but I don't want to right now, because:
1. The code is ugly as hell, not implemented as a kernel module, not thread-safe etc.
2. It is highly dangerous (messing with the eMMC chip - I really don't know how stable this thing is), so if you want to do it on your device, you should be an expert. In that case, you can write the code yourself (with little effort).
Anyway, I hope the FTL is Whimory, since I'm familiar with it. Would be easier.
I'll let you know if I find anything interesting.
PS I've attached a little teaser. (Yes, this is the patched function. 0x40300 is red because I've opened a partial RAM dump.)
EDIT - Some initial results:
0. The CPU is a Cortex-M3.
1. No strings at all. Just some uninteresting release asserts ("REL_ASSERT").
2. Found the Smart Report generator function -> found the MMC command handlers.
3. Most MMC command handlers are stored in a function table. There are 3 special commands: MMC60, MMC62, MMC64. Depending on the arguments these special commands are given, they modify the function table (this is the so-called "vendor mode").
4. There are a lot of possible arguments for MMC62, not only the ones we know.
5. If you trace back the function they patch all the way up the call stack, you get to the MMC24 and MMC25 handlers. These commands are MMC_WRITE_BLOCK and MMC_WRITE_MULTIPLE_BLOCK. Since the function they patch is deep down the call stack, it's very likely that it is part of the wear levelling.
Anyway, because of the lack of strings I guess it would be very hard to truly understand the SDS bug we're facing.
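For readers unfamiliar with the pattern in points 3 and 5, here is a rough Python sketch of a command handler table that a special command can rewrite. It is not the real firmware layout: the magic argument values, the `Controller` class, the replacement handler and the choice of which entry gets swapped are all hypothetical. Only the command numbers (24, 25, 62) and the names MMC_WRITE_BLOCK / MMC_WRITE_MULTIPLE_BLOCK come from the findings above.

```python
def write_block(arg):
    return f"WRITE_BLOCK at {arg:#x}"


def write_multiple_block(arg):
    return f"WRITE_MULTIPLE_BLOCK at {arg:#x}"


def vendor_write(arg):
    return f"vendor handler, arg {arg:#x}"


# Hypothetical magic arguments; the real values are not given in this thread.
ENTER_VENDOR_MODE = 0xAAAA
EXIT_VENDOR_MODE = 0x5555


class Controller:
    def __init__(self):
        # Normal handlers, indexed by MMC command number.
        self.normal = {24: write_block, 25: write_multiple_block}
        self.table = dict(self.normal)

    def command(self, cmd, arg=0):
        if cmd == 62:  # special command: may rewrite the handler table
            if arg == ENTER_VENDOR_MODE:
                self.table[24] = vendor_write
            elif arg == EXIT_VENDOR_MODE:
                self.table = dict(self.normal)
            return "ok"
        handler = self.table.get(cmd)
        if handler is None:
            return "illegal command"
        return handler(arg)
```

The point of the sketch is that after a "vendor mode" command, the same command number can reach a completely different handler, which is why the number of undocumented MMC62 arguments matters.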
Re: eMMC sudden death research
I can't say I have an idea what's going on inside the eMMC, but usually with this kind of mistake/failure, debug or diagnostics code has been left in the release build.
Maybe some debug info being written repeatedly triggers the wear-levelling failure,
so the fix simply has to disable it.
Awesome research.
So we're dealing with a bug in exactly the same eMMC subsystem as in the faulty SGS2 eMMC chips, but in a device that was released after the SGS2 eMMCs were proven to be faulty.
Oranav, for some reason I cannot send you PMs. Could you send me your dump? Does your eMMC come from the faulty series?
Hi all, after reading this thread, I am now scared....
I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3
The above all include the fix for SDS and Exynos hole.
I have been running the device for nearly 1 week I think. Last night I fully charged my phone, used it for 3 minutes surfing the forum (chrome) via wifi connection. After 3 minutes, I left the phone on the table for 3 hours. The only running app is Viber... when I tried to wake the phone up, it did wake up but it froze... no button worked, I tried for about 2 minutes... nothing worked except the power button which booted the device.
This is weird, never experienced this before... I am now scared the phone will die unexpectedly.
owl74 said:
Hi all, after reading this thread, I am now scared....
I have a Note 2 N7100 which is running ARHD V8.0 with Perseus Kernel V31.2 and TWRP recovery 3.2.2.3
The above all include the fix for SDS and Exynos hole.
I have been running the device for nearly 1 week I think. Last night I fully charged my phone, used it for 3 minutes surfing the forum (chrome) via wifi connection. After 3 minutes, I left the phone on the table for 3 hours. The only running app is Viber... when I tried to wake the phone up, it did wake up but it froze... no button worked, I tried for about 2 minutes... nothing worked except the power button which booted the device.
This is weird, never experienced this before... I am now scared the phone will die unexpectedly.
How long were you using the system before you updated to the 'fix'? Nonetheless, it does not necessarily mean that the phone is getting near to the SDS. I have had a few Android phones which would sometimes reboot or hang for other reasons.
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Rebellos, I enabled the private messaging system for me. I do have the faulty chip.
Oranav said:
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Damn Oranav, nice work!
So does it mean you get a total screen freeze? Every time?
BR
Robert
thealgorithm said:
How long were you using the system before you updated to the 'fix'? Nonetheless, it does not necessarily mean that the phone is getting near to the SDS. I have had a few Android phones which would sometimes reboot or hang for other reasons.
About 2 weeks. Forgot to mention I rebooted the phone before I charged it. I will return this phone before it dies on me.... I think I will get an S3, but I will check if it has a new chip... otherwise I will return it again and stick to my Desire HD, which is already running an S3 ROM....
Oranav said:
I tried to simulate an eMMC freeze (by forcing it to go into an infinite loop). It behaves exactly as you describe - the phone works for a second, then becomes totally unresponsive. Seems like there is no watchdog.
Rebellos, I enabled the private messaging system for me. I do have the faulty chip.
So everyone's speculation about the fix causing the freeze is right...
Re: eMMC sudden death research
Suppose this fix addresses wear leveling. If firmware without this fix wears out the eMMC, wouldn't the device still boot and then crash? As far as I know, flash is still readable but no longer writable when worn out. Could it be that the wear leveling algorithm has a problem, so that after some time it replaces cells from the bootloader and that causes the death?
In short: I want to know whether using the old firmware for some time had a negative effect, because that old software caused extreme aging of the eMMC.

[Q] Own build PC freezing for over 6+ months

Hello XDA!
I have a big question, and I don't know where to put it, so I'm putting it here. Here goes:
I built my own PC, with everything included, and at first it worked great, until at some point it stopped running well and fast and started freezing randomly.
At first I thought it happened when I was watching a video, but now I know that's not the case, because it freezes at random moments.
I have searched a lot, since December or so, but could not find any similar reports.
I tried dual-booting Windows 8 and Ubuntu to see if Windows 8 was the problem. It wasn't, because the same thing happened under Ubuntu.
I tried a system reset (erasing all of Windows plus all files); it did not work.
I updated all driver software; that did not work either.
In the end I tried chkdsk, because I thought it might be the disk, but every time I run chkdsk, it freezes at 28%.
Last week I tried something else, and it said that it found errors on the disk and that I had to type chkdsk /f to fix them. When I did, it said No Write Permission on disk.
I also erased the whole disk with cmd (all volumes, partitions etc.). It says everything is deleted, but it still shows the disk as 465 GB, which is odd because I have a 500 GB disk (I know the disk itself takes up some space, but that's not 45 GB, right?).
By "freezing" I really mean freezing, by the way (it won't respond again until I press the power button), and when I watch a video or listen to music, I also hear an annoying sound from the speakers.
My specs are:
AMD A6-5400K "Trinity", Black Edition (APU)
ASUS F2A85-M (Motherboard)
Western Digital WD5000AAKX 500 GB (HDD)
Corsair 8 GB DDR3-1600 Kit (RAM)
OCZ ModXStream Pro 700W ATX2 (Power supply)
So... can anyone help? Much appreciated!
Also: if someone knows how to make it writable again, let me know (I tried one command that should do the trick, but cmd said Failed).
PS: Is "not writable" the same as "read only"?
PPS: I overclocked my PC, yes, but that should not be the problem, because I have reset it multiple times.
PPPS: I changed the registry once, so it may be messed up a little, but the registry is reset too when resetting Windows, right?
Thanks!
If you guys need more info, Ask!
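On the 465 GB shown for a 500 GB disk: this is most likely just a units difference rather than missing space. Drive makers count 1 GB as 10^9 bytes, while Windows reports sizes in units of 2^30 bytes. A quick check, assuming the drive holds a nominal 500 × 10^9 bytes (the exact byte count of this particular model is not given in the post, and the function name is just for illustration):

```python
def advertised_gb_to_gib(gb):
    """Convert decimal gigabytes (10**9 bytes) to binary gibibytes (2**30 bytes)."""
    return gb * 10**9 / 2**30


print(f"{advertised_gb_to_gib(500):.2f} GiB")  # 465.66 GiB
```

So a reported size of about 465 is what a healthy 500 GB drive would be expected to show, and by itself it does not point to a disk fault.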
Off-Topic PC thread would be a better place for this.
By freezing, do you mean that the PC shuts down and nothing happens?
I would suggest running a stress test on every part of your hardware.
I found this:
http://www.computerhope.com/issues/ch001088.htm#00
By freezing I mean it does nothing at all... As for the stress test, I will look into it now.
Thanks!
I moved the OP to the Computer Hardware thread: http://forum.xda-developers.com/showpost.php?p=43359268&postcount=5340
Please refer all the discussion in its new home.
Thanks!
