Posts by Femue

1) Message boards : SZTAKI Desktop Grid : New applications 2.0 /2.03 online - Feedback (Message 4257)
Posted 4029 days ago by Femue
then your wu keeps restarting over and over and never makes progress.
...
If the line numbers are increasing ... the wu is worth crunching to the end.


... It appears that checkpointing is very primitive: it just restarts a line that it has been working on. Not too good. No wonder there is no progress. ...

In my experience, checkpointing is not primitive. I am now processing a WU with 5 lines. After more than 116 hours it is at 80%, which is an average of 29 hours per line. I have to shut down my PC every morning at more or less the same time, so I could never finish this WU if the program fell back to the start of a line. Here is an extract of my file, where you can see that the program had to start twice after the third line. Nevertheless, it later finished the fourth line. For me this is proof that it does not fall back to the start of a line, but continues from where it was at the last checkpoint (unless the processor was much faster during the second attempt to crunch line 4???). Maybe the expression "Restarting line 3" could lead to a different interpretation. My conclusion: I am very happy with the checkpointing system.


Hello Robert,

I would like to be proven wrong. The expression "Restarting line ..." really led me to my opinion.

My conclusion at the moment is: because I keep losing heartbeat about every 2-5 hours, I think I can never reach the next line on a wu where crunching a single line takes longer than that, and in fact I never did :-(
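
The worry above can be made concrete with a small simulation. This is only a sketch under stated assumptions: `hours_per_line`, `uptime_hours` and `checkpoint_interval_hours` are hypothetical parameters, and nothing here reflects how the SZTAKI application really checkpoints. It just compares a model where a restart falls back to the start of the line with a model where it resumes from the last checkpoint inside the line.

```python
def sessions_to_finish_line(hours_per_line, uptime_hours,
                            checkpoint_interval_hours=None):
    """Count the uninterrupted sessions needed to finish one line.

    checkpoint_interval_hours=None models "restart at the start of the line".
    A positive number models "resume from the last checkpoint in the line".
    Returns None if the line can never be finished under the model.
    All parameters are illustrative assumptions, not project settings.
    """
    if checkpoint_interval_hours is None:
        # Every interruption throws away the whole line, so the line only
        # finishes if a single session is long enough.
        return 1 if uptime_hours >= hours_per_line else None

    saved = 0      # hours of work preserved by the last checkpoint
    sessions = 0
    while True:
        sessions += 1
        reached = saved + uptime_hours
        if reached >= hours_per_line:
            return sessions
        # Only work up to the most recent checkpoint survives the restart.
        new_saved = (reached // checkpoint_interval_hours) * checkpoint_interval_hours
        if new_saved <= saved:
            return None  # no new checkpoint reached in this session: stuck
        saved = new_saved
```

With the figures mentioned in this thread (about 29 hours per line, heartbeat lost after roughly 3 hours), the first model never finishes a line, while the second one does as long as at least one checkpoint falls inside every session.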

Maybe we can have an official statement on that? If I am wrong [I have no nerve to test this out atm] I would not abort any more wus after a few lost heartbeats in the same line - so I would not throw away the earned credits either ;-)

Cheers
Femue

[edit]

Just saw: this message changes (the number is sometimes up, sometimes down, sometimes the same even in the same line of a wu):

scan: dc_ckpt_263
...
scan: dc_ckpt_257
...

But what does this mean? Recursion level in that line?

[/edit]
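
To watch how that number moves over time, the `dc_ckpt_<number>` entries can be pulled out of stderr.txt in the order they appear. The sketch below does not explain what the number means; it only assumes that the lines look exactly like the `scan: dc_ckpt_263` examples above. The function name is made up for illustration.

```python
import re

# Matches lines such as "scan: dc_ckpt_263" but not "scan: dc_ckpt_out".
CKPT_RE = re.compile(r"^scan: dc_ckpt_(\d+)\s*$")


def checkpoint_numbers(stderr_text):
    """Return the numeric dc_ckpt suffixes in the order they appear."""
    numbers = []
    for line in stderr_text.splitlines():
        match = CKPT_RE.match(line.strip())
        if match:
            numbers.append(int(match.group(1)))
    return numbers
```

Comparing the resulting list against the "Restarting line" messages around it would show whether the number goes up, down or stays the same within one line.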
2) Message boards : SZTAKI Desktop Grid : New applications 2.0 /2.03 online - Feedback (Message 4249)
Posted 4029 days ago by Femue
I have 11 Search 2.00 WUs of which the first have been going for over 10 hours, progress is 0% and still over 7 hours to go (which was the initial 'To Completion' time). I suspended all WUs now to find out if I should abort? I don't get any error messages.


Tobie,

to find out more about the progress of a wu, you can look in the BOINC directory on your hard disk and find the slot where SZTAKI is running (the one with a search....exe in it). There is a file called stderr.txt in it. Open it with a text editor. If it is not empty you may see something like this:

...
Initialized: Lines to process 2
Starting with line 1
scan: boinc_lockfile
scan: dc-api.conf
scan: dc_ckpt_out
scan: dc_clientlog.txt
scan: dc_stderr.txt
scan: dc_stdout.txt
scan: in.txt
scan: init_data.xml
scan: msvcp71.dll
scan: msvcr71.dll
scan: out.txt
scan: search_2.00_windows_intelx86.exe
scan: stderr.txt
scan: stdout.txt
No heartbeat from core client for 31 sec - exiting
Initialized: Lines to process 2
Restarting from checkpoint: Lines processed so far 1
Restarting line 1
scan: boinc_lockfile
scan: dc-api.conf
scan: dc_ckpt_36
scan: dc_ckpt_out
scan: dc_clientlog.txt
scan: dc_lastckpt
scan: dc_stderr.txt
scan: dc_stdout.txt
scan: in.txt
scan: init_data.xml
scan: msvcp71.dll
scan: msvcr71.dll
scan: out.txt
scan: search_2.00_windows_intelx86.exe
scan: stderr.txt
scan: stdout.txt
Initialized: Lines to process 2
Restarting from checkpoint: Lines processed so far 1
Restarting line 1
scan: boinc_lockfile
scan: dc-api.conf
scan: dc_ckpt_60
scan: dc_ckpt_out
scan: dc_clientlog.txt
scan: dc_lastckpt
scan: dc_stderr.txt
scan: dc_stdout.txt
scan: in.txt
scan: init_data.xml
scan: msvcp71.dll
scan: msvcr71.dll
scan: out.txt
scan: search_2.00_windows_intelx86.exe
scan: stderr.txt
scan: stdout.txt
Initialized: Lines to process 2
Restarting from checkpoint: Lines processed so far 1
Restarting line 1
...

then your wu keeps restarting over and over and never makes progress.
When I see this too often, I abort the wu (makes no sense to keep on crunching the wu on that host).

If the line numbers are increasing

...
Initialized: Lines to process 100
Starting with line 1
scan: boinc_lockfile
scan: dc-api.conf
scan: dc_ckpt_out
scan: dc_clientlog.txt
scan: dc_stderr.txt
scan: dc_stdout.txt
scan: in.txt
scan: init_data.xml
scan: msvcp71.dll
scan: msvcr71.dll
scan: out.txt
scan: search_2.00_windows_intelx86.exe
scan: stderr.txt
scan: stdout.txt
No heartbeat from core client for 31 sec - exiting
Initialized: Lines to process 100
Restarting from checkpoint: Lines processed so far 15
Restarting line 15
scan: boinc_lockfile
scan: dc-api.conf
scan: dc_ckpt_151
scan: dc_ckpt_out
scan: dc_clientlog.txt
scan: dc_lastckpt
scan: dc_stderr.txt
scan: dc_stdout.txt
scan: in.txt
scan: init_data.xml
scan: msvcp71.dll
scan: msvcr71.dll
scan: out.txt
scan: search_2.00_windows_intelx86.exe
scan: stderr.txt
scan: stdout.txt
No heartbeat from core client for 31 sec - exiting
Initialized: Lines to process 100
Restarting from checkpoint: Lines processed so far 40
Restarting line 40
scan: boinc_lockfile
scan: dc-api.conf
scan: dc_ckpt_254
scan: dc_ckpt_out
scan: dc_clientlog.txt
scan: dc_lastckpt
scan: dc_stderr.txt
scan: dc_stdout.txt
scan: in.txt
scan: init_data.xml
scan: msvcp71.dll
scan: msvcr71.dll
scan: out.txt
scan: search_2.00_windows_intelx86.exe
scan: stderr.txt
scan: stdout.txt
Initialized: Lines to process 100
Restarting from checkpoint: Lines processed so far 43
Restarting line 43
scan: boinc_lockfile
...

the wu is worth crunching to the end.

If stderr is empty then I do not know ;-) - maybe then you have caught a very long wu.
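
The check described above can also be done by a small script instead of by eye. This is a minimal sketch that assumes the messages in stderr.txt look exactly like the excerpts in this post. The function names and the rule "the same line reported on three restarts in a row means stuck" are my own assumptions, not anything defined by the project.

```python
import re

TOTAL_RE = re.compile(r"^Initialized: Lines to process (\d+)$")
RESTART_RE = re.compile(
    r"^Restarting from checkpoint: Lines processed so far (\d+)$")


def summarize_stderr(text):
    """Collect the total line count, restart positions and lost heartbeats."""
    total_lines = None
    restarts = []          # "Lines processed so far" value at each restart
    lost_heartbeats = 0
    for raw in text.splitlines():
        line = raw.strip()
        match = TOTAL_RE.match(line)
        if match:
            total_lines = int(match.group(1))
            continue
        match = RESTART_RE.match(line)
        if match:
            restarts.append(int(match.group(1)))
            continue
        if line.startswith("No heartbeat from core client"):
            lost_heartbeats += 1
    return {"total_lines": total_lines,
            "restarts": restarts,
            "lost_heartbeats": lost_heartbeats}


def looks_stuck(restarts, min_repeats=3):
    """True if the last min_repeats restarts all report the same line.

    min_repeats=3 is an arbitrary threshold chosen for this sketch.
    """
    tail = restarts[-min_repeats:]
    return len(tail) == min_repeats and len(set(tail)) == 1
```

To use it, read the stderr.txt from the slot directory into a string and pass it to `summarize_stderr`. For the first excerpt above the restart list is 1, 1, 1, which `looks_stuck` flags; for the second it is 15, 40, 43, which it does not.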

I think this also shows that gwg is right about checkpointing:

... It appears that checkpointing is very primitive: it just restarts a line that it has been working on. Not too good. No wonder there is no progress. ...


If you abort one wu you may be luckier with the others. Of eight wus I crunched recently I had to abort two due to no visible progress (always falling back to line 1 - see above).

Cheers
Femue

[edit]Fixed typos, added information[/edit]
3) Message boards : SZTAKI Desktop Grid : New applications 2.0 /2.03 online - Feedback (Message 4202)
Posted 4035 days ago by Femue
Femue,

your result problem is strange, I'll investigate it some more...


Thank you ... I have a few other wus pending that are waiting for other hosts to return their results. I am curious to see what happens to them (the results, not the hosts :-).

Here is another result where that happened (without linux boxes involved):
http://szdg.lpds.sztaki.hu/szdg/workunit.php?wuid=2866

Maybe one clue I thought about ... my host loses heartbeat fairly often (that is why I have to abort some wus, because they always fall back to the last checkpoint [or line] and never proceed if the time to the next checkpoint [or line] is too long). But then I saw other hosts losing heartbeat as well and they were credited with a high amount even with linux involved, here's one example:
http://szdg.lpds.sztaki.hu/szdg/workunit.php?wuid=6798

In both examples the same host (148648) is involved: http://szdg.lpds.sztaki.hu/szdg/results.php?hostid=148648

Regards
Femue

[edit]
Added information.
[/edit]
4) Message boards : SZTAKI Desktop Grid : New applications 2.0 /2.03 online - Feedback (Message 4164)
Posted 4039 days ago by Femue

you shouldn't abort these WUs. The problem of other platforms failing to validate against linux boxes has been solved by linux version 2.03. Hopefully, all the linux boxes have already downloaded this version.


Hi,

workunit 7152 (http://szdg.lpds.sztaki.hu/szdg/workunit.php?wuid=7152) shows that one windows host validated against a linux box with version 2.03, while my windows result was returned as successful but marked as invalid. I got no credits (sigh) and there is just a quorum of two results now.
What is the reason for this?

Femue
5) Message boards : SZTAKI Desktop Grid : Ad@m's POD (page 7) (Message 3940)
Posted 4061 days ago by Femue

... stop the server temporarily, till next monday (2006.09.11.) During this time, all of you who have WUs that are close to finishing will have time to finish. ...


Erm ... but as the server is down I can't report the ones that I finish now!?

Femue




Copyright © 2017 SZTAKI Desktop Grid