Long running WU threatening to miss deadline


wstomv
Joined: Jan 16 07
Posts: 4
Credit: 603
RAC: 0
Message 5715 - Posted 16 Feb 2007 6:04:40 UTC

    My computer is working on a result and has now spent a little over 200 hours on it. It reports 60% progress and another 85 hours until completion. The deadline is 22 Feb.

    I will be away from the Internet until at least 25 Feb.

    Conclusion: This deadline will be missed (unless it unexpectedly finishes within the next few hours).

    This is very unfortunate because of the amount of time already spent on it.

    How much sense does it make to continue work on this WU? Can the deadline be extended?
    ____________
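
The figures in the opening post can be checked with a few lines of arithmetic. The sketch below is only a rough model: it assumes the machine keeps computing without interruption, that the deadline and the return date fall at midnight UTC (the post gives no times of day), and that a result can only be reported once the poster is back online. The function name `linear_remaining_hours` is made up for this illustration; it compares the client's 85-hour estimate with a plain linear extrapolation from "200 hours at 60%".

```python
from datetime import datetime, timedelta, timezone


def linear_remaining_hours(elapsed_hours, fraction_done):
    """Remaining hours if progress were linear in time (an assumption)."""
    if not 0 < fraction_done <= 1:
        raise ValueError("fraction_done must be in (0, 1]")
    return elapsed_hours / fraction_done - elapsed_hours


# Figures taken from the post. Times of day for the deadline and the
# return date are not given, so midnight UTC is assumed for both.
posted = datetime(2007, 2, 16, 6, 4, 40, tzinfo=timezone.utc)
deadline = datetime(2007, 2, 22, tzinfo=timezone.utc)
back_online = datetime(2007, 2, 25, tzinfo=timezone.utc)

estimates = {
    "client estimate": 85.0,
    "linear extrapolation": linear_remaining_hours(200.0, 0.60),
}

# The result can only be uploaded after the poster returns (assumption).
can_report_by_deadline = back_online <= deadline

for label, hours in estimates.items():
    finish = posted + timedelta(hours=hours)
    print(
        f"{label}: {hours:.1f} h left, "
        f"finishes {finish:%Y-%m-%d %H:%M} UTC, "
        f"computed before deadline: {finish < deadline}, "
        f"reportable by deadline: {can_report_by_deadline}"
    )
```

Under these assumptions both estimates finish computing before 22 Feb, but the result could not be reported until 25 Feb at the earliest, which is why the deadline would still be missed.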

    robert.mouris
    Joined: Nov 3 05
    Posts: 129
    Credit: 4,124,194
    RAC: 0
    Message 5716 - Posted 16 Feb 2007 7:04:08 UTC

      Last modified: 16 Feb 2007 7:08:09 UTC

      You don't really need to report by the deadline in this project. When you miss it, a replacement WU will be created and, after about 4 days, sent to someone else. If you report before that time, your result will be accepted and the replacement WU deleted.

      Even after that day, you may continue. I have the strong feeling that your result will not be enough to get the necessary quorum of 3, as the other 2 results have such short processing times compared to yours. As long as the quorum is not reached, your result is welcome. I'm now crunching a WU with a deadline of 17 January, because it still needs valid results.

      My main concern for your WU is not the deadline, but that somewhere beyond 200 hours many WUs crash with the message "Maximum CPU time exceeded". Sometimes they don't (Odysseus reported one after nearly 700 hours of processing time); it's a gamble.
      ____________

      [B^S] Gamma^Ray
      Joined: Jan 3 07
      Posts: 18
      Credit: 5,977
      RAC: 0
      Message 5717 - Posted 16 Feb 2007 7:31:36 UTC

        Last modified: 16 Feb 2007 7:32:21 UTC

        I suggest you read the following thread to help you decide: "Output Empty". Good luck!

        Regards,
        G^R
        ____________

        kadam
        Project administrator
        Joined: May 25 05
        Posts: 589
        Credit: 38,614
        RAC: 0
        Message 5723 - Posted 19 Feb 2007 11:00:44 UTC

          We are currently testing a program that pre-filters the matrixes and dismisses those which might run too long.
          ____________

          kadam
          Project administrator
          Joined: May 25 05
          Posts: 589
          Credit: 38,614
          RAC: 0
          Message 5724 - Posted 19 Feb 2007 16:20:23 UTC - in response to Message 5723.

            Last modified: 19 Feb 2007 16:20:49 UTC

            Preparation for the upgrade of the server software has begun. I will keep you updated, but prepare for longer stops in the next 2 weeks.
            ____________

            Nightbird
            Forum moderator
            Joined: Jul 12 05
            Posts: 920
            Credit: 114,924
            RAC: 0
            Message 5725 - Posted 19 Feb 2007 21:05:04 UTC - in response to Message 5723.

              Last modified: 19 Feb 2007 21:07:19 UTC

              We are currently testing a program that pre-filters the matrixes and dismisses those which might run too long.

              What's "too long"? Do you plan to reduce the deadline?
              I don't think the most important problem is how long a WU runs (try a Climate Prediction WU), though nobody wants to run a WU for more than 60 hours and end up with the now-famous message "output is empty", and nobody really wants to put a machine in EDF mode or to see the LTD increasing.
              The big problem is the checkpoint. It doesn't work for some WUs. Or, as a question: is a 1-line WU able to checkpoint?


              ____________

              robert.mouris
              Joined: Nov 3 05
              Posts: 129
              Credit: 4,124,194
              RAC: 0
              Message 5727 - Posted 19 Feb 2007 22:01:36 UTC

                I have no problem at all with long WUs, as long as crunching them makes sense. All I need is a working checkpoint system (as I must shut down my system from time to time, or Windows or the electric fuse does it for me), no crash after 200 hours, and a sufficiently high "max # of error/total/success results" to meet the quorum requirements (if the checkpoint system works, the quorum and the max numbers can be low).

                EDF or LTD has never been a problem for me. I crunch virtually nothing but Sztaki (main project) and CPDN (in the background to fill the gaps). As soon as the problems are solved here, I will give full power to Sztaki again. What's wrong with long crunching times or long deadlines or late quorum meetings? Expecting a baby takes 9 months. And elephants wait for 2 years.
                ____________

                Stick
                Joined: Jun 12 06
                Posts: 193
                Credit: 66,271
                RAC: 0
                Message 5732 - Posted 20 Feb 2007 17:15:44 UTC - in response to Message 5725.

                  Last modified: 20 Feb 2007 17:16:42 UTC

                  The big problem is the checkpoint. It doesn't work for some WUs. Or, as a question: is a 1-line WU able to checkpoint?


                  I am beginning to think that Adam doesn't have a clue as to how to fix the checkpoint problem. What else might explain why he chooses to respond on other issues while completely avoiding this one?

                  ____________

                  robert.mouris
                  Joined: Nov 3 05
                  Posts: 129
                  Credit: 4,124,194
                  RAC: 0
                  Message 5733 - Posted 20 Feb 2007 19:03:31 UTC - in response to Message 5732.

                    Last modified: 20 Feb 2007 19:10:47 UTC

                    is a 1-line WU able to checkpoint?

                    In that case I wonder if any WU is able to checkpoint. As far as I know, a 25-liner is just a loop of index 25 where each line behaves like a 1-liner.

                    I am beginning to think that Adam doesn't have a clue as to how to fix the checkpoint problem. What else might explain why he chooses to respond on other issues while completely avoiding this one?

                    I think that he isn't even trying to solve the checkpoint problem. As I read his New Year's speech, the BinSys project will soon finish for us, and they will crunch the remaining problem WUs at their Institute, where they can run them 24/7 on their PCs, exactly the way they will crunch dimension 13 etc. However, if the application crashes after 200 hours, it will crash for them too, so he must find a solution to this.

                    My hope is that they are already looking forward and not backward, and preparing the new application(s).
                    ____________

                    Stick
                    Joined: Jun 12 06
                    Posts: 193
                    Credit: 66,271
                    RAC: 0
                    Message 5736 - Posted 21 Feb 2007 1:11:01 UTC - in response to Message 5733.

                      In that case I wonder if any WU is able to checkpoint. As far as I know, a 25-liner is just a loop of index 25 where each line behaves like a 1-liner.


                      I think it is pretty well established that checkpointing works OK when the program finishes a line and begins another. It's just mid-line checkpoints (mainly on long WUs) that cause problems. If the program is restarted and attempts to revert to one of these faulty checkpoints, it immediately jumps forward to the end of the line, instead of going backward to a previous, valid checkpoint (or to the beginning of the WU). And this is what causes "Empty Outputs" to be generated.
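
The fallback behaviour described here (return to an earlier valid checkpoint, or to the start of the WU, rather than jumping ahead) can be sketched as follows. This is not the project's application code, which is not shown anywhere in this thread. It is a minimal illustration that assumes each checkpoint is a small JSON file carrying a checksum, written atomically so that a crash mid-write cannot leave a half-written file under the final name. The names `save_checkpoint` and `load_latest_valid`, the file naming scheme, and the state fields are all hypothetical.

```python
import hashlib
import json
import os
import tempfile


def save_checkpoint(directory, line_index, position, partial_output):
    """Write a checkpoint atomically: temp file first, then rename."""
    state = {"line": line_index, "position": position, "output": partial_output}
    payload = json.dumps(state, sort_keys=True)
    record = {
        "sha256": hashlib.sha256(payload.encode()).hexdigest(),
        "payload": payload,
    }
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w") as handle:
        json.dump(record, handle)
        handle.flush()
        os.fsync(handle.fileno())
    # Zero-padded names sort in the order the checkpoints were taken,
    # assuming position only grows within a line (hypothetical scheme).
    name = f"ckpt_{line_index:04d}_{position:012d}.json"
    final_path = os.path.join(directory, name)
    os.replace(tmp_path, final_path)
    return final_path


def load_latest_valid(directory):
    """Return the newest checkpoint that passes its checksum, else None."""
    names = sorted(
        (n for n in os.listdir(directory)
         if n.startswith("ckpt_") and n.endswith(".json")),
        reverse=True,
    )
    for name in names:
        try:
            with open(os.path.join(directory, name)) as handle:
                record = json.load(handle)
            payload = record["payload"]
            digest = hashlib.sha256(payload.encode()).hexdigest()
            if digest == record["sha256"]:
                return json.loads(payload)
        except (OSError, ValueError, KeyError, TypeError, AttributeError):
            pass  # damaged checkpoint: fall back to an older one
    return None  # nothing usable: restart from the beginning of the WU
```

On restart, a caller would resume from the returned state, or start the WU from scratch when `None` comes back. Keeping older checkpoints around costs disk space, but it is what makes the fallback possible.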

                      I think that he isn't even trying to solve the checkpoint problem.


                      My previous post (saying that Adam doesn't have a clue) was an attempt to goad him into commenting. Obviously, it didn't work (not yet, at least).

                      As I read his New Year's speech, the BinSys project will soon finish for us, and they will crunch the remaining problem WUs at their Institute, where they can run them 24/7 on their PCs, exactly the way they will crunch dimension 13 etc. However, if the application crashes after 200 hours, it will crash for them too, so he must find a solution to this.

                      My hope is that they are already looking forward and not backward, and preparing the new application(s).


                      I hope so, too!
                      ____________

                      Odysseus
                      Joined: Feb 27 06
                      Posts: 212
                      Credit: 221,397
                      RAC: 0
                      Message 5737 - Posted 21 Feb 2007 7:25:19 UTC - in response to Message 5736.

                        I think it is pretty well established that checkpointing works OK when the program finishes a line and begins another. It's just mid-line checkpoints (mainly on long WUs) that cause problems. If the program is restarted and attempts to revert to one of these faulty checkpoints, it immediately jumps forward to the end of the line, instead of going backward to a previous, valid checkpoint (or to the beginning of the WU). And this is what causes "Empty Outputs" to be generated.

                        But not always. I had to interrupt a five-line task recently; it appears to have resumed where it left off, or at least started the line over, because it validated successfully.
                        ____________

                        Stick
                        Joined: Jun 12 06
                        Posts: 193
                        Credit: 66,271
                        RAC: 0
                        Message 5739 - Posted 21 Feb 2007 13:33:27 UTC - in response to Message 5737.

                          But not always. I had to interrupt a five-line task recently; it appears to have resumed where it left off, or at least started the line over, because it validated successfully.


                          You are right. (I should have been clearer.) At one time, I thought that all mid-line checkpoints were faulty, but apparently some are OK and some are not. However, it appears that the faulty ones are more frequently encountered on 1-, 2- and 5-line WUs. I have previously speculated that the checkpointing bug is an overflow problem. (But, as has been noted fairly often here, Adam has not responded on this issue, so we really don't know what may be causing the problem.)
                          ____________

                          wstomv
                          Joined: Jan 16 07
                          Posts: 4
                          Credit: 603
                          RAC: 0
                          Message 5826 - Posted 4 Mar 2007 22:39:19 UTC - in response to Message 5716.

                            My long running WU that started this thread finally finished today, BUT it got no credit. The result is Success. You can imagine I am disappointed after such a long run.

                            What is the matter?

                            You don't really need to report by the deadline in this project. When you miss it, a replacement WU will be created and, after about 4 days, sent to someone else. If you report before that time, your result will be accepted and the replacement WU deleted.

                            Even after that day, you may continue. I have the strong feeling that your result will not be enough to get the necessary quorum of 3, as the other 2 results have such short processing times compared to yours. As long as the quorum is not reached, your result is welcome. I'm now crunching a WU with a deadline of 17 January, because it still needs valid results.


                            ____________

                            Nightbird
                            Forum moderator
                            Joined: Jul 12 05
                            Posts: 920
                            Credit: 114,924
                            RAC: 0
                            Message 5827 - Posted 4 Mar 2007 23:00:13 UTC

                              I don't see resultid = 432619, only resultid = 432660.
                              Is the result (already) purged?
                              ____________

                              wstomv
                              Joined: Jan 16 07
                              Posts: 4
                              Credit: 603
                              RAC: 0
                              Message 5835 - Posted 5 Mar 2007 6:17:09 UTC - in response to Message 5827.

                                I don't see resultid = 432619, only resultid = 432660.
                                Is the result (already) purged?


                                My apologies. This is my result, and now I see it is listed as Invalid.

                                What a waste of my computer time (and energy).
                                ____________
