Task #97

Feature #64: inventory current Occupation Directory sources

Create master spreadsheet and import/merge data from all available sources

Added by Andrew Mallis over 1 year ago. Updated about 1 year ago.

Status:Resolved Start date:01/23/2012
Priority:High Due date:02/10/2012
Assignee:Andrew Mallis % Done:

100%

Category:data Spent time: 160.50 hours
Target version:0.6.0 Estimated time:12.00 hours
Difficulty:average

Description

fgadir-master-111231-230AM.ods (359.2 kB) fourth man, 12/31/2011 05:56 am

fgadir-master-120101.ods (169.9 kB) fourth man, 01/01/2012 09:20 pm


Subtasks

Bug #186: Standardize Country and Province CodesResolvedLippe Lippe

Task #214: Merge 4th man data into MASTER sheetResolvedInga Jensen


Related issues

related to #OWSdirectory - Feature #117: Import Occupations List into Drupal Resolved 12/15/2011
related to #OWSdirectory - Feature #111: STORY: Architect and Build a data object (content type) w... Closed 12/14/2011 01/09/2012
related to #OWSdirectory - Task #115: coordinate information interchange with other occupation ... Resolved 12/15/2011
duplicated by #OWSdirectory - Task #110: Collate existing data & sources Closed 12/14/2011

History

Updated by Devin Balkind over 1 year ago

Existing "master" spreadsheet that volunteers are actively adding occupation rss feeds into:

https://docs.google.com/spreadsheet/ccc?key=0AvGeaSVQPEYUdEpYYUpGTnpLUTljQzVqQ3VMeUNxZnc

Updated by Lippe Lippe over 1 year ago

Okeh, to summarize, we have 3 main sources identified now (from the parent feature thread) in addition to the link posted by Devin:

Scraping & diffing these is beyond my rudimentary skillset.
Shall we reach out and ask for the source data, presuming it's not just coded into an HTML table?
:)

Updated by Andrew Mallis over 1 year ago

  • Assignee set to Lippe Lippe
  • % Done changed from 0 to 10

I've opened a subtask for outreach here.

Yes, raw data = ideal. They should be acutely aware of what we're doing also.

Let's collate into a google doc after that. Others can assist with data transforms.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from Lippe Lippe to fourth man

Updated by fourth man over 1 year ago

  • Due date changed from 12/14/2011 to 12/19/2011

Updated by fourth man over 1 year ago

  • Due date changed from 12/19/2011 to 12/29/2011
  • % Done changed from 10 to 20
  • Estimated time changed from 4.00 to 6.00

Updated by fourth man over 1 year ago

  • % Done changed from 20 to 40
  • Estimated time changed from 6.00 to 8.00

Should have some giant source-indexed collated lists, in raw & deduplicated form from these lists & any other interesting ones I can ID before I go to sleep tonight.

At this point have pulled all the sources identified, normalized them to some degree, & have sketched out the data merge on paper.

I'm going in.

Updated by Andrew Mallis over 1 year ago

update please?

The heat is on for this project.

Updated by fourth man over 1 year ago

  • Due date changed from 12/29/2011 to 12/30/2011
  • % Done changed from 40 to 80
  • Estimated time changed from 8.00 to 24.00

I've managed to sufficiently normalize the three biggest / highest quality sources into one master database of ~2750 entries describing US occupations & points of contact. After a scrub for crank entries (WeAllOccupy has a lot), I'll have two giant lists of US occupation entities sorted out:

- List of cities, probably with the OccupyTogether submitted names (best quality source in bunch) Key is a humanreadable "city-ID".
- List of unique locales, where a local is a sub-location, a specific "site of resistance", or other. Key is a human readable "entityID", which is cityID + locale where it has one, to differentiate from purely city-based entities.

These will all have normalized state data and pretty close to normalized city data. The locales (again, representing specific sites or "sites of resistance") I put in manually based on research & inference from the naming (the OccupyTogether sheet winning from this list. That chewed up a lot of time, but it gives us an important base to distinguish cities from actual entities.

Sources aggregated:
- OccupyTogether DB (best of lot, but horribly formatted contact data)
- We All Occupy (good contact data, lots of crap entries)
- Wikipedia (b/c of size data, no real contact data but good point of comparison)

I'll update this ticket with the working spreadsheets in addition for folks to sort through, if you dare. Beware that each is large enough to crash an underpowered machine.

What this drop won't have is contact data, whose horrific formatting and quality issues are why there are 3K entries to sift through. This is tomorrow / this weekend's work.

Next drop will be giant list of Web sites + Twitter feeds I can discover from this set of sources, by city / entity. This I should be able to sort tomorrow.

Next drop will be direct contacts - e-mails, phones, including WG information that qualifies exactly who's there. This should come before end of weekend. This will be hardest based on how mail data is stashed all over DBs.

Final drop - after I process another big motherlode of feed data - should have a bigger set of public feeds, if we need to take it there.

Updated by fourth man over 1 year ago

PS: I have never seen a data set this messy that wasn't hanging on a mother's fridge. My kingdom for a drop-down list...

Updated by fourth man over 1 year ago

Drop 1 in ODS format for now.

First two pivot reports show the master lists of City & Multi-City and Site Specific Occupation entities I've assembled from the initial 3 massive lists of ~3000 entries. Note there is still cruft to sift through, but most obvious crap has been weeded out.

More important are the two "databases" containing every entry normalized from the lists so far, now mapped to common identifiers, & with a system in place that makes sucking in the remaining lists so much faster.

I ended up splitting up the city-based vs. the statewide / multistate / site-specific entities into different databases. It already starts to make sense, if you look at the variance in the data set / naming / etc.

Now, sleep. Tomorrow, I hoover in all the Web sites & Twitters and whatnot I can from the DB sheets. And crank out machine readable delimited formats.

Apologies for any gnarl, circumstances required me to bounce between MS Office & OpenOffice.

Updated by Andrew Mallis over 1 year ago

  • Priority changed from High to Urgent

solid. totally above and beyond.

It is critical that we pull in more then just US entries for this initial launch. I see your logic supports this, but only see US data thus far.
I'd like to make sure this is the vector you are currently on.

Please take another look at the google spreadsheet in the ticket description to get an idea of the fields we're expecting for each occupation.
We do need RSS feed info in there this round. It's important to eliminate duplication from http://news.occupy.net who is maintaining a parallel list

I am (only) slightly concerned by the complexity of your solution, as it is a little difficult to bring our external partners in to help normailze/verify the data before import. I'm willing to defer to your expertise here, and put all our eggs in your basket, as I am totally confident you'll get the job done keeping in mind that the absence of a data merge is currently the biggest blocker to the directory project moving forward.

Updated by fourth man over 1 year ago

Got it. My first step then will be merging my master lists with clean authenticated lists from the original ticket with the latter's fields all captured. I'll cross reference against news site to ensure coverage there as well.

Two goals:

- Get full coverage of clean feeds from Occupation Feeds sheet & news.occupy.net
- Deduplicate & normalize core data

Ancillary goal: to backfill feeds / any data for entities that might not be covered in this list from other collected sources to date.

That will be my next drop.

Updated by fourth man over 1 year ago

  • Due date changed from 12/30/2011 to 01/01/2012
  • % Done changed from 80 to 90

Data set is looking pretty good. I cleaned the hell out of the master spreadsheet last night and are using that as format master now.

Have to run the indexes again to pull in the encampments I found on other sheets, and cross-reference them to plug holes in feed data.

Good news is where we don't have 'feeds' per se in a lot of cases, we have URLs to fb or meetup services which at least serve as a public info resource from one or more sources, so we should be in pretty good shape.

ETA is no later than 5PM PST 1/1.

Updated by fourth man over 1 year ago

Full merge of Occupation Feeds. Note data dictionary in document for complete explanations of each field / value. Remerged everything into one basic entity type again for simplicity's sake. Will upload to Google Doc later.

Contains about 1,947 unique records, which correspond basically to feeds rather than entities (duplication rate is probably ~30%, so maybe 1,500 "occupations" here?) Added a huge amount of international sites (200+ deduplicated) and a few dozen smaller occupations in the US, from the other source lists.

This is IMHO the last major task of the data merge, though this is technically enough to get started: this represents imperfectly a list of feed entries, where each line has 0-5 actual feeds / URLs entered. So it's not quite a unique / deduplicated list of entities, nor it it a representation of feeds.

Also needs a tiny bit more data scrubbing and deduplication, but some of that can happen off the DB perhaps.

I can take a task to do one more merge of this thing so it refers specifically to feeds, then extract the unique entities for a cleaner separation. That would make a bunch of sense to me, so we're at least dealing with one unit of info: record.

Updated by Andrew Mallis over 1 year ago

Under the "Web Site" column, there seem to be a lot of links to feeds, i.e. http://occupyedmonton.org/feed and http://occupydallas.org/rss.xml instead of links to the web site proper.

Can we please eliminate all instances of "(none found)" and "(na)" so that cells are simply blank instead?

I am not sure where you are coming from with "best available URL" and "Best available public contact" as these weren't in the spec and I cannot see how we can implement this inside Drupal. In the end, what is best is not something that we can determine. It's really up to an occupation.

Did we loose all the province/territory information for non-us occupations? This info was in our sources.

In the "data dictionary" I see "FB / Other PublicContact" as a proposed field. I am unclear if this announces the possibility of non-FB URLs appearing in a cell or if it is just shorthand for 2 columns. It should be 2 columns, not 1.

I really appreciate all the time you are putting into this project, but please try and scale back/cut some corners. I see you've added another 12 hours to the deliverable, which seems quite significant to me. We need to get this out there in the wild asap, where the community can help us manage the information.

Updated by Andrew Mallis over 1 year ago

  • % Done changed from 100 to 70

Updated by Andrew Mallis over 1 year ago

Thanks again for your dedication on this project. As I mentioned, our goal right now is to get the data into google docs asap, so we can motor through sanitizing it, and refine our established data import process.

Master google doc

Please look at the FGA Occupation MASTER directory google doc

this is the end-state we are going for as part of this initial merge. We need to migrate your openoffice document into this space as soon as possible. We can, by hand, do a lot in this format collectively, and it is this document that will be imported into Drupal.

Carl's data

Please look at Carl Levinson's curated document

Carl claims to have spent 400 hours sanitizing all the links contained therein. values therein can be considered more definitive than what we have at this time from our other sources.

I've spent some time moving his data set around, and massaging it to match our google doc master. There are a few legacy fields at the end I still have to decide where to contact them to. I have imported the Harmonized version of Carl's spreadsheet to the master document as a tab.

Please move your data, once refactored to match the Master spreadsheet's format, into a tab there also. We'll be bringing the two together in the main space, and manually merging the data. Carl's values for phone #s and facebook pages will take precedence, trusting that he's vetted the info.

cityID

The "cityID" column's purpose, as I understand it, is to help manually de-dupe records. Please do not be concerned with carrying over the 'cityID' column if it takes more than a few minutes, as similar (good enough) sort results can be achieved by a sort on the city name, or by combined sorts (sort by range).

website field

Inside the website field, there are feeds which should be in the feed column (base URL should be in the website column) and facebook links in the FB column. We can very easily manually move this stuff around once in the google doc, but if you wanted to take a few minutes to do some, it would be welcome.

legacy data

There are a number of dubious fields in the Feeds source. From "moreinfo-events/video/articlesplus (From Occupation Feeds List)" we can parse URLs into their respective fields with a simple sort and cut-paste, and, whatever relevant remains can get go into the "legacy notes" field.

I don't think we need much else from the blue-colored data inherited from the feeds document.

Updated by fourth man over 1 year ago

Two primary accomplishments:

1: Prepared the Levinson data set for import in the MASTER LIST spreadsheet. The main known issue with the data remains that State / Province / Territory data has yet to be added to most cities outside the US.

This is theoretically ready to import as it contains very clean data. We know it's not comprehensive yet, but it's in good shape & has a nice big data set...

2: Created the draft data merging / prep template in Google Docs. This should be a useful tool to

(a) paste in a new spreadsheet source
(b) efficiently manually scrub the data from that source
(c) identify new occupations vs. updates to existing in the Master List (those calculated filters to the left)
(d) prepare for direct copy & paste into Master LIst.

There is some formula gnarl here I need to fix for this to be useful; the autocalcs are not currently functioning correctly. GDocs porting factors...

You can view the draft here: https://docs.google.com/spreadsheet/ccc?key=0AkzuwID8P3c8dDdNNzR6WHRWRnRrQ0sxSGQzY3pTQlE&hl=en_US#gid=0
Outline of procedures here: https://docs.google.com/document/d/1riAJhv-gqVoi-mjK1PUw2nC3yejRNAB7ma6nBTO-Uxs/edit?hl=en_US

Updated by Andrew Mallis over 1 year ago

Looking good!

I have a couple questions for you. Please be patient with me!

Can you please confirm that began with the data here (the Merge by Carl Levinson-harmonized tab), then massaged the data.

Can you please elaborate a little on what you mean by prepared?
i.e. states were changed from 2-letter codes to full spellings, "world" was changed to "global"…

What is the relationship between the google doc you've posted here and the one referenced in the original ticket? On that sheet, I now see at tab called "import-master-4th" there. What is its significance in this project.

Which sheet are you proposing as the master?

Can you please elaborate on the status of your data set, which I am assuming is what currently resides in the "Data Prep TEMPLATE". Is this still being worked on?

Updated by fourth man over 1 year ago

Andrew Mallis wrote:

Can you please confirm that began with the data here (the Merge by Carl Levinson-harmonized tab), then massaged the data.

That is correct.

Can you please elaborate a little on what you mean by prepared? i.e. states were changed from 2-letter codes to full spellings, "world" was changed to "global"…

Here is what was massaged:

- Fixed alignment of links to correct columns (moved facebook.com links > FB column; sorted Twitter accts/ hashtags; etc.)
added source tag indicating cl as source

Changes to original data...
- Normalized statewide occupations (tagged "all of" in city column previously) to "Statewide" for more specificity & semantics
- Normalized worldwide / non site specific occupations to "global" (in case of "WORLD" or "All" in all location fields)
- Normalized to "regional" tag where it was an international regional / continental scope ("Africa"), or a region of multiple states in a country ("The Northeast")

These can be easily reverted to the original tags. There are relatively few.

Reverted to full names rather than two letter codes because after testing process a few times, it became clear that
a) changing loc from name -> two letter code was a HUGE timesuck
(b) easier to do the other way from a well formatted name (using some indexes I prepared)
(c) made it actually HARDER to visually scan & fix typos, since the average reader recognizes full names > two letter codes.

After redoing the process of preparing the levinson data by fixing names vs. translating to codes, I found ~25% time savings just getting good names in.

What is the relationship between the google doc you've posted here and the one referenced in the original ticket? On that sheet, I now see at tab called "import-master-4th" there. What is its significance in this project.

Which sheet are you proposing as the master?

The one referenced in the original ticket is deprecated work product. Ignore it.

The only deliverable here is the workbook referenced in the most recent update, at https://docs.google.com/spreadsheet/ccc?key=0AkzuwID8P3c8dDdNNzR6WHRWRnRrQ0sxSGQzY3pTQlE&hl=en_US#gid=0
The "Master List" sheet is the proposed master for import, currently containing very clean Levinson data; the other sheet in the book is the prep template, the bridge from original source -> master list.

Can you please elaborate on the status of your data set, which I am assuming is what currently resides in the "Data Prep TEMPLATE". Is this still being worked on?

The data currently staged in the "data prep" worksheet is a further cleaned up version of the last merge I dropped in just after New Year's. This data set does still require a final scan, and for the actual work of its resources to be merged with the now Levinson-based Master & non duplicates added as new entries. Theoretically, anyone who followed step 6) below could pick this up, except that I have one bit of fixing to do.

I still need to fix the formulas on the far right of the data prep sheet (arrays work somewhat differently in Google Docs.) Those formulas + the filter are the mechanism for comparing the data being prepared to the existing data in the master sheet, critical to the final step below.

The exact workflow / relationship between these spreadsheets is,

1) The user gets a new source in spreadsheet form and uploads it to Google docs.

2) Open this spreadsheet with the master import list + a blank "data prep" sheet.

3) Copies data fields from the original source into the Data Prep spreadsheet.

4) Adapts the data as needed so that full location data is available. This is needed to compare to the Master List. For example: if the original source sheet has one 'location' field that munges all fields together, the user would go through the locations and update the individual city / state / country fields manually.

5) Once the location data is clean, use the filters at left to
a) "exact name matches" indicate when the name == one or more entries on the Master List: in this case, you look at the resources (links etc.) to see if any new ones are in the source, and manually add them to the Master List fields.
b) "city matches" indicate one of two things: same occupation, different naming; or same city, different occupation. In the former case, you would scrape this for new resources, add them to the Master List entries manually.
c) non-city or name matches: these may be entirely new Occupation entries. Once you scan them visually, you may clean them (by making sure resources & links are in right column), then copy & paste the entries into the master.

When finished,
a) You have identified all the new resources in the "data prep" spreadsheet associated with existing occupations, and added each to the appropriate column in the entry in the "master list".
b) You have identified all new occupation entities in the "data prep" spreadsheet, and pasted new entries in the "master list".
c) The Master List is now a full representation of all known entities and resources, and ready for re-import (implying a destructive / full import mode.)

If you had a non-destructive mode or didn't list multiple resources in one cell, we might have a few more options here, but as is the biggest bit in this is manually adding resources to the master list.

Updated by Andrew Mallis over 1 year ago

  • Assignee changed from fourth man to Andrew Mallis

Thanks for all the hard work and normalization.

Reverted to full names rather than two letter codes because after testing process a few times, it became clear that a) changing loc from name -> two letter code was a HUGE timesuck (b) easier to do the other way from a well formatted name (using some indexes I prepared) (c) made it actually HARDER to visually scan & fix typos, since the average reader recognizes full names > two letter codes.

After redoing the process of preparing the levinson data by fixing names vs. translating to codes, I found ~25% time savings just getting good names in.

I can appreciate how this change can help in your process, however for the purposes of our data import, as per this recent note in ticket 117 we need state/prov codes. I should have made this clearer as soon as it was known, and accept full responsibility. I will manually re-do the codes.

Please correct me if I'm misunderstanding, but, aside from reworking the formulae in the grey columns, you're position is that you're currently hands off the data.

At this time, I am re-assigning ownership of this ticket, and the data to myself, so we can coordinate additional manual munging and merging of the data.

I have renamed the document at
https://docs.google.com/spreadsheet/ccc?key=0Aldd4WFocycHdEdCVEtBNzFSZ0NUQlpBR1pBN2xON0E
to FGA Occupation MASTER Directory

and updated this ticket description to point there.

I renamed the original target at https://docs.google.com/spreadsheet/ccc?key=0Aldd4WFocycHdDNXQ25ZRGllajNxWkdWQ1FIR210Wnc
FGA Occupation MASTER Directory -- OLD

Updated by Andrew Mallis over 1 year ago

Had a great inter-occ call with the owners of most of the data listed in this ticket.

Notes from the call (google doc)

Updated by Inga Jensen over 1 year ago

Hi all, just wanted to put myself in to the mix on this aspect of the project. Will await the result of our meeting on Wednesday Jan 25 to start anything. In the mean time will be taking a look at the current MasterList and contemplating :)

Updated by fourth man over 1 year ago

I had to disappear for a while to attend to a personal issue. Couldn't be helped.

I have basically completed the merge utility in Google Docs, bridging to the point where you can have correction capability via the Web interface. Don't see myself able to commit to much day to day manual work given various distractions.

I am more or less done, having stripped the CityID stuff out for simplicity's sake. It was more trouble than it was worth after further testing.

I did get the formulas there working and revised a few to not depend on particular forms of city or state. Using those formulas + Google Docs filters, you should have a slightly easier time sifting through the spreadsheet-merging stage of this.

If to you go to "4th_data_prep_test",

(1) Select the whole sheet
(2) Autofilter the data ("Data" menu, "Filter"); make sure all rows are selected for the filter.

You can now filter each record in this spreadsheet to see whether it matches records already on the MASTER LIST sheet.

(3) To find entries likely already on the Master - with exact name matches - filter by the "Exact Name Matches" column (AB) for all entries where this is >0; these entries already have an entry somewhere on the MASTER LIST, and the manual task from here is to review the links / resources and see if any new ones need to be added to master.

(4) To find entries who match a city already on the Master List but no Name match, filter the "Exact Name Match" to "=0", then filter by the "City Name Matches" (column AC, >0). Some of these results may be dupes with inconsistent names / cities, but it's also possible they are new Occupations. Idea is to filter this list down for manual review, then add occupations and/or resources to the MASTER manually as needed.

(5) To find probable totally new entries, filter both "Exact Name Match" and "City Match" to zero. These results are either quite incomplete records, or they are new entries. Manual review, update MASTER list accordingly.

To help with translation back / from the 2 digit codes, I did upload to Google Docs /data sources/ folder a handy index to the shared directory of (1) all ISO 2 digit country codes and (2) all US state 2 digit codes. hopefully will be useful.

Idea is, any time you get a big munge of data to merge, you can Duplicate-spawn another copy of this sheet, paste it in, clean it up, then compare it to the master for manual review & merging of resources.

The manual aspect of which is unfortunately necessary since you did choose to place all resources where there are multiple of a type in comma-separated lists within a single cell. A programmatic approach would have been more achievable if you had separated (1) an index of resources (2) an index of entities as was suggested at one point; but it's marginally more difficult to automate something in this type of format that works on values within a cell vs. contents of a cell in their entirety.

Updated by Andrew Mallis over 1 year ago

  • Category set to data

Updated by Andrew Mallis over 1 year ago

this is all really, really great. Thanks so much for the extra push.

Going to spawn these instructions into a wiki page and a screen cast

looking also to create more discrete issues for data merges, and for refining the MASTER sheet, based on inter-occ collaboration workflow discussions

Updated by Andrew Mallis over 1 year ago

  • Status changed from New to In Progress

Updated by Andrew Mallis over 1 year ago

  • Target version changed from 0.5.0 to 0.6.0

The Guardian data has been fully merges, which provides start dates to a number of occupations and enriched location data.

The 4th man set is 80% done!

Updated by Andrew Mallis about 1 year ago

  • Status changed from In Progress to Resolved

Also available in: Atom PDF