
The Terminology Task Force recommends two tools to help you locate content that might contain words on the Terminology List:

Searching website content with Sitebulb

Sitebulb is a client-side tool that crawls a website and searches it for terms in the Terminology List. XSEDE has purchased licenses for use by XSEDE staff.

  1. Download Sitebulb from the Sitebulb website (Mac and Windows versions are available) and install it.

  2. Open the Sitebulb app and create an account.

  3. Email or Slack Shava or SusanL to request a Sitebulb license. The license will be issued to the email address you provided in your Sitebulb account. Once you have confirmation that a license has been issued, click the 'My Account' link in the top right corner and verify that you see an entry under Licenses.

  4. Click the green 'Activate License' button. When activation succeeds, the machine name shows up as below.

  5. To get started, click on the "Projects" link in the top left corner and then click on the green "Start a new Project" button.  
    1. Fill in a project name and the URL for your site.  You can use as the Start URL to get familiar with the tool.  
    2. Set the Device type to 'Desktop'.
    3. Uncheck the "Crawl Outside of Directory" checkbox. 
    4. If your site uses JavaScript, set the Crawler Type to Chrome.
    5. Hit the "Save and Continue" button.

  6. Once saved, you can set up the conditions for a Sitebulb "audit". Each project may contain a series of audits. The default settings run a detailed analysis of your web pages. We recommend turning off "Search Engine Optimization" on a first try, since that makes the scan go faster. You can also turn off "Page Resources" for a more efficient scan; it only gives you a summary of the types of content found on your website.
  7. Click on the "Content Search" link in the left menu and click the green "Add Multiple Rules" button.

  8. Copy the list of terms from this list and paste it into the Basic tab text box, select "HTML and Text" in the Search In box, then click the green "Add Rules" button.

  9. From the Project Settings page, click on the "Crawler Settings" link from the left menu to view the crawler settings.  If your machine is powerful, you can increase the Instances of Chrome to a number above 5 and that will make the crawl go faster.

  10. If you want to exclude any URLs from your site, you can use the "URL Exclusions" tab to specify areas of your website that you want Sitebulb to ignore – e.g., /sitebulbtest/testnocrawl/*

  11. Click the "Start Now" button.
  12. Sitebulb will display a status board as it crawls the website. You can see the URLs it is crawling under the "URL Log". If you are crawling a real site and Sitebulb is browsing areas you don't want it to, press the "Stop" button. If the crawl is using too much CPU, press the "Pause" button and reduce the number of instances under "Update Settings".
    1. Note that redirects are not really errors. Sitebulb is only informing you that it was bounced to another page. Some sites have more redirects than others.
  13. Once it's complete, you will see an overview page as below.  You can see some overall stats about the site.
  14. Click the "Content Search" link in the left menu to see the Terminology search results as below. The Overview tab lists each term as a row, along with the number of times it was found on your site. The demo site contained four words from the Terminology List, such as "Webmaster" and "White Paper", so those are listed first.
  15. To see the specific URLs where the terms were found, click the "URLs" tab next to Overview. It shows a table in which each row is a page where terms were found and each column is a term found on that page. For example, in the screenshot below, "Webmaster" and "White Paper" were found on subpage.html. The columns are sortable, so if terms were found on more than one page, you can click a column name to sort the list by the number of times that term was found on each page.
  16. To share the list with others in your group, click the green "Export All Search Data → Export to CSV" button. A screen like the one below pops up so you can save the results to a local file.
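
Once the results are saved as a CSV, a short script can list which pages need attention. The sketch below is only an illustration: it assumes the file follows the layout described for the "URLs" tab (one row per page, one column per term holding a hit count) and that the page column is named "URL". The real export may use different column names, so check the header row of your file and adjust `url_column`. The function name `summarize_export` and the sample data are hypothetical.

```python
import csv
import io


def summarize_export(csv_file, url_column="URL"):
    """Return {url: {term: count}} for every page with at least one hit.

    Assumes one row per page and one numeric column per term; the
    column name "URL" is an assumption, not a documented Sitebulb name.
    """
    hits = {}
    for row in csv.DictReader(csv_file):
        url = row.get(url_column, "")
        found = {}
        for column, value in row.items():
            if column is None or column == url_column:
                continue
            try:
                count = int(value)
            except (TypeError, ValueError):
                continue  # ignore cells that are not whole numbers
            if count > 0:
                found[column] = count
        if found:
            hits[url] = found
    return hits


# Made-up sample standing in for an exported file.
sample = io.StringIO(
    "URL,Webmaster,White Paper\n"
    "https://example.org/index.html,0,0\n"
    "https://example.org/subpage.html,2,1\n"
)
for url, terms in summarize_export(sample).items():
    print(url, terms)
```

To run it against a real export, replace the sample with `open("export.csv", newline="", encoding="utf-8")`, where the file name is whatever you chose when saving.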

Searching a single file within Google Drive

You can use the regular expression feature of Google Drive to search a single document for terminology. Open Edit → Find and Replace, click the checkbox next to "Match using regular expressions", and then paste a "|"-delimited list of search terms into the Find box. For your convenience, the current list of terms (as of 2022-04-13) with "|" delimiters is:

Abort|American|Black Box|Black Mark|Blackball|Blacklist|Blacklisted|Blind Review|Blind Study|Brown Bag|Chief|Dumb|Dummy Value|Dummy|First Class Citizens|First Class|Flow Master|Freshman|Grandfathered|Gray Beard|Gray Hat Hacker|Guys|Illegal Immigrant|Illegal Alien|Indian|Male Connector|Female Connector|Male Fastener|Female Fastener|Man Hours|Master Slave|Master Branch|Master|Slave|Minority|Mob Programming|Mob|Native|Native Feature|Native Speaker|Non-Native Speaker|Red Team|Rule of Thumb|Sanity Check|Sanity|Scrum Master|Submit|Submission|Submitting|Tarball|Tar Ball|Tone Deaf|Tribal Knowledge|User|Webmaster|Web Master|White Hat|Black Hat|White Paper|White Team|Whitelist|Whitelisted|Yellow Team

Make sure there are no leading or trailing spaces in your paste. Click the Next button to step through any occurrences it finds. You can also move the whole "Find and Replace" window if the matched text is hidden underneath it.
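
If the Terminology List changes, the "|"-delimited string can be regenerated rather than edited by hand. The following is a minimal sketch, not an official tool: the helper names `build_pattern` and `find_terms` and the short `TERMS` sample are hypothetical. It trims stray spaces, drops duplicates, and puts longer terms first so that "Dummy Value" is tried before "Dummy". It assumes the terms contain only letters, spaces, and hyphens, so no regular expression escaping is applied.

```python
import re

# Short illustrative sample; substitute the full Terminology List.
TERMS = ["Dummy", "Dummy Value", "White Hat ", "Slave", "Slave", "Webmaster"]


def build_pattern(terms):
    """Join terms with "|" after trimming spaces and removing duplicates."""
    seen = set()
    cleaned = []
    for term in terms:
        term = term.strip()
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            cleaned.append(term)
    # Longest terms first, so multi-word phrases win over their prefixes.
    cleaned.sort(key=len, reverse=True)
    return "|".join(cleaned)


def find_terms(text, pattern):
    """Return (offset, matched text) pairs, ignoring letter case."""
    regex = re.compile(pattern, re.IGNORECASE)
    return [(m.start(), m.group(0)) for m in regex.finditer(text)]


pattern = build_pattern(TERMS)
print(pattern)  # paste this string into the Find box
print(find_terms("Contact the webmaster about the dummy value.", pattern))
```

The same pattern string can be used to check plain-text copies of a document locally before pasting it into the Find box.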

Searching Google Drive content with a Python script (still in progress)

To be added.
