NokiMo
Tropecita Games

Problems in paradise (I)

This is the post no one wants to write when working with computers.

The PC dedicated to rendering that I bought in September is crashing.

Last night, I queued 93 renders to run overnight and, when I woke up this morning, it had finished only four and the computer had restarted. Daz render queues do things like that from time to time, so I didn't think much of it. I restarted the queue and it started rendering. Before it finished that render, the computer shut down. Not restarted. Shut down.

I tried to turn it on and there was no way to start it. I followed the typical steps of disconnecting it from the power, and then it entered the BIOS in safe mode. It did not show any error, so I exited the BIOS and tried to boot again. It does not get past the BIOS POST and restarts. I have spent two hours running several tests without success, so tomorrow I will open a warranty claim with the Spanish manufacturer or, failing that, with the online store where I bought it. They only work Monday to Friday.

How can this affect the update schedule? Right now I have no idea. It could be that tomorrow they will give me a solution (like updating the BIOS, for example) and everything will work as usual. I might have to send the computer back to them for repair or even get it replaced under warranty. But I don't know how long all this might take as this is the first time I'm working with that manufacturer.

An added problem is that the 328 renders already made for 0.18 are on that computer and, since I can't get into Windows, I can't retrieve them until the failure is resolved. So if the computer is replaced, I will have to re-render them.

What am I going to do while it is being resolved? Besides the time lost trying to solve the failure and the possible time spent in re-rendering those 328 renders, it shouldn't affect the development, since I still have my old computer to continue writing and posing renders. Unfortunately, you quickly get used to the good stuff, and the renders (and the story for v.0.18) are designed with the capabilities of the new computer in mind. The old one would not be able to render many of the images that are going to be there, especially the group images.

This post may be too alarmist right now, but I've always tried to be transparent about the successes and the failures, about the problems and about the times when there have been none.

I'll keep you posted as soon as they tell me something tomorrow.

Comments

Thanks, TDS. A manufacturer's tech called me 40 minutes ago and we did a video conference so he could see the issue live. He asked me to disable XMP after seeing the debug lights on the motherboard and we got to Windows (that means I will be able to retrieve the renders later). Now I'm testing the RAM under his instructions. I think it will take around 20 more minutes (it's been running for ~30) and I should call him once it finishes (no error found yet). I'll keep you posted.

Tropecita Games

Alas, those two sensors you name do not measure the bridge chips on the mainboard. The CPU may be perfectly cool (and often is), and the mainboard one measures ambient temperature. Monitoring of individual chips on the mainboard is usually only done on server hardware (and extreme-class expensive workstations). The point about it failing without error (just shutdown) during the POST is exactly why I think it's a thermal issue with the bridges (probably incorrectly applied thermal paste during manufacturing). These will heat up as data starts passing between the CPU and memory/bus peripherals (disks, GPU, etc.), and if the thermal pads or paste aren't applied correctly, that heat will not dissipate into the heatpipes or sinks. I know that several mainboard manufacturers (among them Asus, Supermicro, etc.) had issues some time ago with a subcontractor who supplied heatpads (they were "too dry", meaning their effective lifespan got reduced to a fraction, which will impact heavy-use machines more than light-duty ones). If the shop doesn't help you, feel free to DM me (I don't feel comfy "outing" myself in public, but I can give you a good rundown of diagnostics and why I know these things privately). :)

TDS

Shit happens. What worries me the most is if they ask to send the whole computer. It could take some days to get it back.

Tropecita Games

I have a USB enclosure that accepts M2, so it isn't an issue. The disk is bitlockered, but I have the codes, so no problem with that either. While I was writing the two lines above, technical support called me about the case I opened yesterday. They instructed me to make some changes in the BIOS and run some tests, which I'm doing right now. The technician thinks it could be a RAM thing (he asked me to disable XMP before the test). With XMP disabled I got into Windows, so I will be able to retrieve the renders once the testing is finished. Yihay!!!!

Tropecita Games

The problem with doing that without their permission is that it would void the 3-year warranty. I have an M2 just for Windows and software, another one for Daz libraries, and the last one for finished renders. I only need to retrieve this one. I have an enclosure that accepts M2, so there will be no problem if they allow me to get the disk out.

Tropecita Games

Right now the main candidates are PSU and mainboard. I did a test last night to check the CMOS battery. I left it unplugged for 12h and when I tried to boot this morning, the date and time were correct.

Tropecita Games

I don't think it's thermal. I did some testing and it fails to get past POST (safe mode) even though the BIOS says the temperature is 35 °C for the CPU and 27 °C for the mainboard. I'll ask them for permission to remove the M2 disk to get those renders out. It won't advance the update, but I could start testing the first scenes.

Tropecita Games

I was going to suggest this, definitely faster than potentially losing 328 renders, as long as your install isn't bitlockered. Even if your install is on an NVMe SSD you can get a little USB-C enclosure and slurp the files off super quick.

Draxl

I am sorry that this happened to you Trop. Best of luck.

Magoo Doug

Take out your hard drive, put it in an external enclosure, and use your old PC to copy your files to an external hard drive to save them 🤔🤨 (I use 4 external drives for backup just to play the games, watch videos, etc., because of crashes 😲😢 My problems have been hard drive failures. My Windows is on a 2TB PCIe Gen 4 NVMe M.2 internal SSD by itself, so it always boots up. You can never be too careful) 🤔😅

xpire2000

What TDS said. Also, if you haven't already, you may want to try replacing the CMOS battery. It is possible that it has died, especially if you're leaving your system cranking like that overnight. 🤷🏽‍♂️

KkaosReinz

At first glance, this sounds a lot like a thermal issue with one of the chipset bridges (mainboard components). But unless you're a GOOD DIY person when it comes to electronics, I suggest you don't mess with it yourself (at least not until the shop has had a chance to honor their warranty). As for getting your data out, it shouldn't really be a problem (put the disk in an external enclosure, mount the filesystem on another computer, and copy the files over). Anyways, thanks for keeping us updated, and best of luck dealing with the shop!

TDS

This always happens on weekends.... Take all the time you need. And if you are taking some days off because of it, that is fine.

Donar

Thanks for letting us know Trop. Shit happens and if the worst comes to the worst and you have to slip your schedule to maintain your quality then no problem.

David Burgin

