I'm not sure whether I need to cancel/pause/resume the BackgroundWorkers as well, or only the recursive loop.
First, this is the class that contains the recursive loop:
public List<string> webCrawler(string mainUrl, int levels)
{
    // TODO: make all variables here (and in the offline function) configurable,
    // handle timeouts when loading URLs (check the site familymediation.co.il),
    // and persist the settings (url, levels, checkboxes) while the program runs.
    List<string> csFiles = new List<string>();
    wc = new System.Net.WebClient();
    HtmlWeb hw = new HtmlWeb();
    List<string> webSites;

    csFiles.Add("temp string to know that something is happening in level = " + levels.ToString());
    csFiles.Add("current site name in this level is : " + mainUrl); // later to be replaced with the real cs file links

    try
    {
        HtmlAgilityPack.HtmlDocument doc = TimeOut.getHtmlDocumentWebClient(mainUrl, false, "", 0, "", "");
        if (doc == null)
        {
            failed = true;
            wccfg.failedUrls++;
            failed = false;
        }
        else
        {
            done = true;

            // report progress to the class that hosts the BackgroundWorkers
            Object[] temp_arr = new Object[8];
            temp_arr[0] = csFiles;
            temp_arr[1] = mainUrl;
            temp_arr[2] = levels;
            temp_arr[3] = currentCrawlingSite;
            temp_arr[4] = sitesToCrawl;
            temp_arr[5] = done;
            temp_arr[6] = wccfg.failedUrls;
            temp_arr[7] = failed;
            OnProgressEvent(temp_arr);

            currentCrawlingSite.Add(mainUrl);
            webSites = getLinks(doc);
            removeDupes(webSites);
            removeDuplicates(webSites, currentCrawlingSite);
            removeDuplicates(webSites, sitesToCrawl);
            // TODO: filter webSites for links to files (jpg, bmp, gif, ...),
            // download those as files, then remove them from the list.

            if (wccfg.removeext == true)
            {
                for (int i = 0; i < webSites.Count; i++)
                {
                    webSites.Remove(removeExternals(webSites, mainUrl, wccfg.localy));
                }
            }
            if (wccfg.downloadcontent == true)
            {
                // TODO: crawling misbehaves when this call is enabled; find out why.
                retwebcontent.retrieveImages(mainUrl);
            }

            if (levels > 0)
                sitesToCrawl.AddRange(webSites); // grow the list, except at the deepest level where we won't dive anyway

            // TODO: webSites = FilterJunkLinks(webSites);
            // keep only links that start with http/https, remove self links and other junk.

            if (levels == 0)
            {
                return csFiles;
            }
            else
            {
                for (int i = 0; i < webSites.Count; i++) // TODO: maybe limit to 20 sites per level or it will take forever
                {
                    if (wccfg.toCancel == true)
                    {
                        return new List<string>();
                    }
                    string t = webSites[i];
                    if (t.StartsWith("http://") || t.StartsWith("https://")) // to be replaced with the future FilterJunkLinks function
                    {
                        csFiles.AddRange(webCrawler(t, levels - 1));
                    }
                }
                return csFiles;
            }
        }
        return csFiles;
    }
    catch (WebException)
    {
        failed = true;
        wccfg.failedUrls++;
        return csFiles;
    }
    catch (Exception)
    {
        failed = true;
        wccfg.failedUrls++;
        throw;
    }
}
This function is recursive; it keeps calling itself until levels reaches zero.
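The only control the recursion currently checks is wccfg.toCancel at the top of the loop. For pausing, I imagine I need to block in the same place on some shared gate. A minimal sketch of what I have in mind, assuming a hypothetical ManualResetEvent called pauseGate that the form would Reset() to pause and Set() to resume:

for (int i = 0; i < webSites.Count; i++)
{
    pauseGate.WaitOne();         // blocks here while paused, passes straight through otherwise
    if (wccfg.toCancel == true)  // the existing cancel check stays as it is
    {
        return new List<string>();
    }
    string t = webSites[i];
    if (t.StartsWith("http://") || t.StartsWith("https://"))
    {
        csFiles.AddRange(webCrawler(t, levels - 1));
    }
}

Since the WaitOne() would sit inside the loop that runs at every level, each recursive call should freeze at its next iteration, wherever it is in the tree. Is that the right approach?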
In Form1 I have a cancel button click event:
private void button3_Click(object sender, EventArgs e)
{
    bgwc.CancelWorker();
    cancel = true;
    wcfg.toCancel = cancel;
}
bgwc is an instance of a class where I hold and start the two BackgroundWorkers:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using HtmlAgilityPack;
using System.Net;
using System.Windows.Forms;
using System.ComponentModel;
using System.Threading;

namespace GatherLinks
{
    class BackgroundWebCrawling
    {
        public string f;
        int counter = 0;
        List<string> WebSitesToCrawl;
        int MaxSimultaneousThreads;
        public BackgroundWorker mainBackGroundWorker;
        BackgroundWorker secondryBackGroundWorker;
        WebcrawlerConfiguration webcrawlerCFG;
        List<WebCrawler> webcrawlers;
        int maxlevels;
        public event EventHandler<BackgroundWebCrawlingProgressEventHandler> ProgressEvent;
        ManualResetEvent _busy = new ManualResetEvent(true);

        // The main worker's DoWork creates a secondary worker (and a new
        // WebCrawler instance) per site, limited to MaxSimultaneousThreads
        // at once; progress is reported back to Form1 through ProgressEvent
        // (Form1 probably needs Invoke to update labels/richtextboxes from it).

        public BackgroundWebCrawling()
        {
            webcrawlers = new List<WebCrawler>();
            mainBackGroundWorker = new BackgroundWorker();
            mainBackGroundWorker.WorkerSupportsCancellation = true;
            mainBackGroundWorker.DoWork += mainBackGroundWorker_DoWork;
        }

        private void mainBackGroundWorker_DoWork(object sender, DoWorkEventArgs e)
        {
            try
            {
                BackgroundWorker worker = sender as BackgroundWorker;
                for (int i = 0; i < WebSitesToCrawl.Count; i++)
                {
                    _busy.WaitOne();
                    if (worker.CancellationPending == true)
                    {
                        e.Cancel = true;
                        break;
                    }
                    while (counter >= MaxSimultaneousThreads)
                    {
                        Thread.Sleep(10);
                    }
                    WebCrawler wc = new WebCrawler(webcrawlerCFG);
                    webcrawlers.Add(wc);
                    counter++;
                    secondryBackGroundWorker = new BackgroundWorker();
                    secondryBackGroundWorker.DoWork += secondryBackGroundWorker_DoWork;
                    object[] args = new object[] { wc, WebSitesToCrawl[i] };
                    secondryBackGroundWorker.RunWorkerAsync(args);
                }
                while (counter > 0)
                {
                    Thread.Sleep(10);
                }
            }
            catch
            {
                MessageBox.Show("err");
            }
        }

        private void secondryBackGroundWorker_DoWork(object sender, DoWorkEventArgs e)
        {
            try
            {
                object[] args = (object[])e.Argument;
                WebCrawler wc = (WebCrawler)args[0];
                string mainUrl = (string)args[1];
                wc.ProgressEvent += new EventHandler<WebCrawler.WebCrawlerProgressEventHandler>(x_ProgressEvent);
                wc.webCrawler(mainUrl, maxlevels);
                counter--;
            }
            catch
            {
                MessageBox.Show("err");
            }
        }

        public void Start(List<string> sitestocrawl, int threadsNumber, int maxlevels, WebcrawlerConfiguration wccfg)
        {
            this.maxlevels = maxlevels;
            webcrawlerCFG = wccfg;
            WebSitesToCrawl = sitestocrawl;
            MaxSimultaneousThreads = threadsNumber;
            mainBackGroundWorker.RunWorkerAsync();
        }

        private void x_ProgressEvent(object sender, WebCrawler.WebCrawlerProgressEventHandler e)
        {
            // forward the crawler's progress data (plus data from this class) to Form1
            Object[] temp_arr = new Object[8];
            temp_arr[0] = e.csFiles;
            temp_arr[1] = e.mainUrl;
            temp_arr[2] = e.levels;
            temp_arr[3] = e.currentCrawlingSite;
            temp_arr[4] = e.sitesToCrawl;
            temp_arr[5] = e.done;
            temp_arr[6] = e.failedUrls;
            temp_arr[7] = e.failed;
            OnProgressEvent(temp_arr);
        }

        private void GetLists(List<string> allWebSites)
        {
        }

        public class BackgroundWebCrawlingProgressEventHandler : EventArgs
        {
            public List<string> csFiles { get; set; }
            public string mainUrl { get; set; }
            public int levels { get; set; }
            public List<string> currentCrawlingSite { get; set; }
            public List<string> sitesToCrawl { get; set; }
            public bool done { get; set; }
            public int failedUrls { get; set; }
            public bool failed { get; set; }
        }

        protected void OnProgressEvent(Object[] some_params)
        {
            if (ProgressEvent != null)
                ProgressEvent(this, new BackgroundWebCrawlingProgressEventHandler()
                {
                    csFiles = (List<string>)some_params[0],
                    mainUrl = (string)some_params[1],
                    levels = (int)some_params[2],
                    currentCrawlingSite = (List<string>)some_params[3],
                    sitesToCrawl = (List<string>)some_params[4],
                    done = (bool)some_params[5],
                    failedUrls = (int)some_params[6],
                    failed = (bool)some_params[7]
                });
        }

        public void PauseWorker()
        {
            if (mainBackGroundWorker.IsBusy)
            {
                _busy.Reset();
            }
        }

        public void ContinueWorker()
        {
            _busy.Set();
        }

        public void CancelWorker()
        {
            ContinueWorker();
            mainBackGroundWorker.CancelAsync();
        }
    }
}
In Form1 a button starts the first BackgroundWorker, which starts the second BackgroundWorker, which in turn calls the recursive loop:
private void button1_Click(object sender, EventArgs e)
{
    List<string> sites = new List<string>();
    List<string> a = (List<string>)listBox1.Tag;
    foreach (var x in listBox1.SelectedIndices)
    {
        sites.Add(a[(int)x]);
    }
    wcfg = new WebcrawlerConfiguration();
    bgwc = new BackgroundWebCrawling();
    wcfg.downloadcontent = downLoadImages;
    wcfg.failedUrls = failedUrls;
    wcfg.localy = LocalyKeyWords;
    wcfg.removeext = removeExt;
    bgwc.Start(sites, 3, (int)numericUpDown1.Value, wcfg);
}
So the chain is: bgwc starts the main worker, the main worker starts the secondary workers, and each secondary worker calls the recursive loop.
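Looking at this chain, I think _busy.WaitOne() only pauses the loop in mainBackGroundWorker_DoWork that launches new secondary workers; the secondary workers, and the recursion inside them, never see the gate. I also wondered whether counter is safe to touch from several threads. A sketch of what I mean (the _busy.WaitOne() call and Interlocked are my additions, not in the current code):

private void secondryBackGroundWorker_DoWork(object sender, DoWorkEventArgs e)
{
    try
    {
        object[] args = (object[])e.Argument;
        WebCrawler wc = (WebCrawler)args[0];
        string mainUrl = (string)args[1];
        _busy.WaitOne(); // don't start a crawl while paused
        wc.ProgressEvent += new EventHandler<WebCrawler.WebCrawlerProgressEventHandler>(x_ProgressEvent);
        wc.webCrawler(mainUrl, maxlevels);
    }
    finally
    {
        Interlocked.Decrement(ref counter); // counter is shared across threads, so counter-- is not atomic
    }
}

But even then the gate is only checked once, before the crawl starts, so a crawl that is already running would not pause; the check really has to be inside webCrawler itself.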
wcfg in Form1 is an instance of another class that holds the configuration:
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;

namespace GatherLinks
{
    class WebcrawlerConfiguration
    {
        public string url;
        public Dictionary<string, List<string>> localy;
        public bool removeext;
        public bool downloadcontent;
        public int failedUrls;
        public bool toCancel;
        public bool offlineonline;

        public WebcrawlerConfiguration()
        {
        }
    }
}
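Since wccfg is already visible inside the recursion (that's how toCancel gets through), maybe the pause gate belongs in this class rather than somewhere new. A sketch of that idea (the pauseGate field is my assumption, not existing code):

using System.Threading;

namespace GatherLinks
{
    class WebcrawlerConfiguration
    {
        public string url;
        public Dictionary<string, List<string>> localy;
        public bool removeext;
        public bool downloadcontent;
        public int failedUrls;
        public bool toCancel;
        public bool offlineonline;

        // hypothetical addition: a gate the recursion can block on;
        // created signaled so crawling runs until somebody calls Reset()
        public ManualResetEvent pauseGate = new ManualResetEvent(true);
    }
}

Then the check inside webCrawler would just be wccfg.pauseGate.WaitOne(), and PauseWorker()/ContinueWorker() would call wccfg.pauseGate.Reset()/Set() instead of (or in addition to) _busy.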
So when I click the cancel button, the recursive loop sees that the toCancel variable in the configuration class is true and returns an empty List. Then in Form1 I have a completed event:
private void backgroundWorker1_RunWorkerCompleted(object sender, RunWorkerCompletedEventArgs e)
{
    button3.Enabled = false;
    checkBox1.Enabled = true;
    checkBox2.Enabled = true;
    numericUpDown1.Enabled = true;
    button1.Enabled = true;
    button2.Enabled = true;
    this.Text = "Web Crawling";
    if (cancel == true)
    {
        label6.Text = "Process Cancelled";
    }
    else
    {
        label6.Text = "Process Completed";
    }
    button6.Enabled = true;
    button4.Enabled = false;
    button5.Enabled = false;
    listBox1.Enabled = true;
}
So clicking the cancel button does cancel/stop the operation.
But the pause and resume button clicks have no effect at all:
private void button4_Click(object sender, EventArgs e)
{
    bgwc.PauseWorker();
    label6.Text = "Process Paused";
    button5.Enabled = true;
    button4.Enabled = false;
}

private void button5_Click(object sender, EventArgs e)
{
    bgwc.ContinueWorker();
    label6.Text = "Process Resumed";
    button4.Enabled = true;
    button5.Enabled = false;
}
button4 and button5 may pause/resume the main BackgroundWorker, but they never pause/resume the recursive loop.
1. How do I pause/resume the recursive loop? I know how to cancel/stop it, but how do I pause and resume it?
2. Do I need to cancel/pause/resume the main BackgroundWorker and the second BackgroundWorker too, or is it enough to cancel/pause/resume the recursive loop? Can the BackgroundWorkers keep running regardless, or does that matter? Maybe I need to cancel/pause/resume all of them, the BackgroundWorkers and the recursive loop, together?
3. How do I do all of this? The code is a bit of a mess. I can cancel the loop and that works, but I'm not sure whether leaving the BackgroundWorkers running is fine, and if not, how do I cancel/pause/resume them all together?
What should the logic around this recursive loop be?
In general, what I need is to be able to cancel/pause/resume the operation of the recursive loop.
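To make the question concrete, this is the overall wiring I'm considering in BackgroundWebCrawling, where one shared gate (the pauseGate from the configuration sketch above) controls both the workers and the recursion. Again, just a sketch of my guess, not working code:

public void PauseWorker()
{
    if (mainBackGroundWorker.IsBusy)
    {
        _busy.Reset();                      // stop launching new secondary workers
        webcrawlerCFG.pauseGate.Reset();    // recursion blocks at its next WaitOne()
    }
}

public void ContinueWorker()
{
    _busy.Set();
    webcrawlerCFG.pauseGate.Set();          // recursion resumes where it stopped
}

public void CancelWorker()
{
    webcrawlerCFG.toCancel = true;          // recursion bails out at its next check
    ContinueWorker();                       // wake any paused crawler so it can see the flag
    mainBackGroundWorker.CancelAsync();     // main worker stops scheduling new sites
}

Would that be enough, or do the secondary BackgroundWorkers need their own cancellation handling as well?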